# Mathematical Proof: You Should Just Start Reading in Your Target Language

The goal is to find a function that approximates the ease of reading as a function of progress through a book, assuming that no words are known at the start of reading.

To create a model to simulate one's mind when reading a book, let's first set up some variables.

Let \(n=\)the number of unique words in the book

Let \(L=\)the total number of words in the book

We will consider the list of unique words in the book to be ordered from most frequent to least frequent and numbered \(1\) to \(n\).

Let \(p(w)\) be the function that returns the frequency of the \(w\)th word in that list.

According to Zipf's Law, the frequency of the \(w\)th word will be equal to \(p(w)=\frac{f_1}{w}\), where \(f_1\) is the frequency of the most frequent word (\(w=1\)).

The sum of the frequencies of all the words, \(\sum_{w=1}^{n}p(w)\), must equal \(100\%\), simply because the next word read is always one of the \(n\) words in the list.

Therefore, to find \(f_1\) so that we can fully define \(p(w)\), we need to find what value of \(f_1\) will make the equality \(\sum_{w=1}^{n}p(w)=\sum_{w=1}^{n}\frac{f_1}{w}=100\%\) true.

\(\sum_{w=1}^{n}\frac{f_1}{w}=100\%\)

is equivalent to

\(f_1*\sum_{w=1}^{n}\frac{1}{w}=100*\frac{1}{100}\)

because multiplying each term of the sum by \(f_1\) is the same as multiplying the whole sum by \(f_1\), and \(\%=\frac{1}{100}\) by definition. Simplifying the right side to \(1\) and then dividing both sides by \(\sum_{w=1}^{n}\frac{1}{w}\) to isolate \(f_1\) yields:

\(f_1=\frac{1}{\sum_{w=1}^{n}\frac{1}{w}}\)

This cannot be simplified any further because \(\sum_{w=1}^{n}\frac{1}{w}\) is a partial sum of the harmonic series, which has no closed form, but it is easy to evaluate numerically. Since \(\sum_{w=1}^{n}\frac{1}{w}\approx\ln(n)+\gamma>\ln(n)\), \(f_1\) can also be approximated by \(\frac{1}{\ln(n)}\) (a slight overestimate).
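As a quick numerical check of this step (a sketch, not part of the original interactive page), here is \(f_1\) computed exactly from the harmonic sum and via the \(\frac{1}{\ln(n)}\) approximation, using \(n=7519\), the vocabulary size of the book used as an example later in the post:

```python
import math

def f1(n):
    """Zipf normalization: the frequency of the most frequent word
    is the reciprocal of the n-th harmonic number."""
    return 1 / sum(1 / w for w in range(1, n + 1))

n = 7519                  # example vocabulary size, reused later in the post
exact = f1(n)
approx = 1 / math.log(n)  # slight overestimate: the harmonic sum exceeds ln(n)
print(exact, approx)
```

The two values agree to within about 6%, with the approximation on the high side as claimed.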

Continuing to expand the model:

Let \(r=\) the chance of learning a word after seeing it

Let \(k(w,i)=\)the probability of remembering the \(w\)th most frequent word after reading \(i\) words

taking the complement:

\(k(w,i)=100\%-\)the probability of not remembering the \(w\)th most frequent word after reading \(i\) words

treating the \(i\) readings as repeated independent events (not strictly independent, but this only makes the estimate a slight underestimate):

\(k(w,i)=100\%-(\)the probability of not learning the \(w\)th most frequent word after reading a word\()^i\)

taking the complement:

\(k(w,i)=100\%-(100\%-\)the probability of learning the \(w\)th most frequent word after reading a word\()^i\)

\(k(w,i)=100\%-(100\%-\)the probability of reading the \(w\)th most frequent word\(*\)the probability of learning a word after seeing it\()^i\)

\(k(w,i)=100\%-(100\%-p(w)*r)^i\)

converting from percentages (\(100\%=1\)):

\(k(w,i)=1-(1-p(w)*r)^i\)

This completes the definition of \(k(w,i)\) because \(r\) is a variable we manually manipulate, \(p(w)\) is already defined, and \(w\) and \(i\) are inputs.
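The finished formula translates directly into code (a sketch; the values \(n=7519\) and \(r=0.5\) below are just the illustrative numbers used later in the post):

```python
def p(w, n):
    """Zipf frequency of the w-th most frequent word among n unique words."""
    f1 = 1 / sum(1 / v for v in range(1, n + 1))
    return f1 / w

def k(w, i, n, r):
    """Chance of remembering the w-th most frequent word after reading
    i words, given a chance r of learning a word per exposure."""
    return 1 - (1 - p(w, n) * r) ** i

# Frequent words saturate quickly; rare words barely move.
n, r = 7519, 0.5
print(k(1, 1000, n, r), k(5000, 1000, n, r))
```

After a thousand words, the most frequent word is essentially certain to be known while a rare word is still almost certainly unknown, which is the behavior the model needs.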

Let \(e_w(i)=\)the probability of remembering the \(i\)th word read

\(e_w(i)=\)the sum of the probabilities of remembering each word after reading \(i\) words weighted by the probability of seeing each word

\(e_w(i)=\sum_{w=1}^{n}(\)the probability of remembering the \(w\)th most frequent word after reading \(i\) words\(*\)the probability of seeing the \(w\)th most frequent word\()\)

\(e_w(i)=\sum_{w=1}^{n}(k(w,i)*p(w))\)

This completes the definition of \(e_w(i)\).
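Under the definitions above, \(e_w(i)\) is a direct sum. A self-contained sketch (the arguments \(673\), \(7519\), and \(0.5\) in the example call are the Re:Zero numbers from the end of the post):

```python
def e_w(i, n, r):
    """Chance of remembering the next word after reading i of the book's
    words, with n unique words and learning chance r per exposure."""
    f1 = 1 / sum(1 / w for w in range(1, n + 1))
    return sum((1 - (1 - (f1 / w) * r) ** i) * (f1 / w)
               for w in range(1, n + 1))

print(e_w(673, 7519, 0.5))
```

As expected, \(e_w(0)=0\) (nothing is known at the start) and \(e_w\) increases monotonically with \(i\).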

Let \(a=\)the average sentence length

Let \(e_s(i)=\)the probability of remembering all the words in the next sentence after reading \(i\) words

\(e_s(i)=\)the probability of remembering the next \(a\) words after reading \(i\) words

treating the next \(a\) words as \(a\) repeated independent events (and assuming nothing is learned from those \(a\) words themselves):

\(e_s(i)=(\)the probability of remembering a word after reading \(i\) words\()^a\)

\(e_s(i)=(e_w(i))^a\)

Let \(E_w(t)=\)the chance of remembering the next word after reading \(t\) percent of the book

\(E_w(t)=e_w(\)how many words read after reading \(t\) percent of the book\()\)

\(E_w(t)=e_w(\)the number of words in the book\(*\)the percent of words read\(*\frac{1}{100\%})\)

\(E_w(t)=e_w(\frac{L*t}{100\%})\)

Let \(E_s(t)=\)the chance of remembering the next sentence after reading \(t\) percent of the book

\(E_s(t)=e_s(\)how many words read after reading \(t\) percent of the book\()\)

\(E_s(t)=e_s(\)the number of words in the book\(*\)the percent of words read\(*\frac{1}{100\%})\)

\(E_s(t)=e_s(\frac{L*t}{100\%})\)

This completes the model: we have found functions \((e_w(i), e_s(i), E_w(t), E_s(t))\) that estimate the ease of reading as a function of progress. Below is an implementation of the model you can manipulate (note: depending on the calculation, not all of the input variables are used).
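A minimal, non-interactive sketch of that implementation, with names following the derivation above (the interactive version on the original page may differ in detail):

```python
def make_model(n, L, r, a):
    """Build the model's four functions.
    n: unique words, L: total words in the book,
    r: chance of learning a word per exposure, a: average sentence length."""
    f1 = 1 / sum(1 / w for w in range(1, n + 1))  # Zipf normalization

    def p(w):          # frequency of the w-th most frequent word
        return f1 / w

    def k(w, i):       # chance of remembering word w after i words read
        return 1 - (1 - p(w) * r) ** i

    def e_w(i):        # chance of remembering the next word
        return sum(k(w, i) * p(w) for w in range(1, n + 1))

    def e_s(i):        # chance of remembering the next sentence
        return e_w(i) ** a

    def E_w(t):        # same, with progress t given as a percent of the book
        return e_w(round(L * t / 100))

    def E_s(t):
        return e_s(round(L * t / 100))

    return e_w, e_s, E_w, E_s
```

With the Re:Zero figures from the next paragraph (\(n=7519\), \(L=52997\), \(r=50\%\)), `E_w(1.27)` comes out close to 50%, matching the claim below.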

Based on the model, take the first volume of Re:Zero, which has 7519 unique words and 52997 words total (according to jpdb.io: https://jpdb.io/novel/1611/re-zero-starting-life-in-another-world). Assuming I knew no Japanese before reading and had a 50% chance of learning a word each time I saw it (probably assisted by Anki), I would only have to read about 1.27% of the book (about 4 pages) to have about a 50% chance of knowing the next word, which is much less than I thought.
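That figure can be reproduced by searching for the smallest \(i\) with \(e_w(i)\ge 50\%\) (a sketch; stepping \(i\) in batches of 10 words keeps the search fast, so the threshold is only located to within 10 words):

```python
def e_w(i, n, r):
    """Chance of remembering the next word after reading i words."""
    f1 = 1 / sum(1 / w for w in range(1, n + 1))
    return sum((1 - (1 - (f1 / w) * r) ** i) * (f1 / w)
               for w in range(1, n + 1))

n, L, r = 7519, 52997, 0.5  # jpdb.io's counts for Re:Zero volume 1

i = 0
while e_w(i, n, r) < 0.5:   # find the first word count with >= 50% comprehension
    i += 10
print(i, 100 * i / L)       # words needed, and the percent of the book
```

The search lands in the neighborhood of 1.3% of the book, in line with the roughly 1.27% quoted above.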

By using Yomitan for reading on the web, Takoboto for dictionary lookups on mobile (and for creating Anki cards), and Google's Japanese handwriting OCR, I have greatly reduced the friction of looking up new words. That, combined with the model's promising predictions, has made pushing through difficult material a lot more encouraging.

Last updated: Mon, 01 Apr 2024 16:02:40 EDT

Hickey, C. L. (2024, March 25). *Mathematical Proof: You Should Just Start Reading in Your Target Language*. Clayton Hickey. https://claytonhickey.me/blog/mathematical-proof-you-should-just-start-reading-in-your-target-language/.