Mathematical Proof: You Should Just Start Reading in Your Target Language
The goal is to find a function that approximates the ease of reading as a function of progress through a book, assuming nothing is known at the beginning of reading.
To create a model to simulate one's mind when reading a book, let's first set up some variables.
Let \(n=\)the number of unique words in the book
Let \(L=\)the total number of words in the book
We will consider the list of unique words in the book to be ordered from most frequent to least frequent and numbered 1 to \(n\)
Let \(p(w)\) be a function that gives the frequency of the \(w\)th word in the list
According to Zipf's Law, the frequency of each word will be \(p(w)=\frac{f_1}{w}\), where \(f_1\) is the frequency of the most frequent word (\(w=1\)).
The sum of the frequencies of the words, \(\sum_{w=1}^{n}p(w)\), will be equal to \(100\%\) simply because any word encountered must be one of the \(n\) words in the list, so the frequencies must total \(100\%\).
Therefore, to find \(f_1\) so that we can fully define \(p(w)\), we need to find what value of \(f_1\) will make the equality \(\sum_{w=1}^{n}p(w)=\sum_{w=1}^{n}\frac{f_1}{w}=100\%\) true.
\(\sum_{w=1}^{n}\frac{f_1}{w}=100\%\)
is equivalent to
\(f_1*\sum_{w=1}^{n}\frac{1}{w}=100*\frac{1}{100}\)
because multiplying each term of the sum by \(f_1\) is the same as multiplying the whole sum by \(f_1\) and \(\%=\frac{1}{100}\) by definition. Simplifying the right side and then dividing each side by \(\sum_{w=1}^{n}\frac{1}{w}\) to isolate \(f_1\) then yields:
\(f_1=\frac{1}{\sum_{w=1}^{n}\frac{1}{w}}\)
This cannot be simplified any further because \(\sum_{w=1}^{n}\frac{1}{w}\) is a partial sum of the harmonic series, which has no closed form, but a calculator can evaluate it. \(f_1\) can also be approximated by \(\frac{1}{\ln(n)}\) (a slight overestimate, since \(\ln(n)\) slightly underestimates the sum).
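For example, here is a quick Python check of this normalization, using \(n=7519\) (the unique-word count of the Re:Zero volume discussed later) purely as an illustrative input:

```python
import math

n = 7519  # unique words; illustrative value (Re:Zero vol. 1 per jpdb.io)

# Partial sum of the harmonic series: H_n = 1/1 + 1/2 + ... + 1/n
H_n = sum(1 / w for w in range(1, n + 1))

f_1 = 1 / H_n                 # exact normalization under Zipf's law
f_1_approx = 1 / math.log(n)  # slight overestimate, since ln(n) < H_n

print(f_1, f_1_approx)
```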
Continuing to expand the model:
Let \(r=\) the chance of learning a word after seeing it
Let \(k(w,i)=\)the probability of remembering the \(w\)th most frequent word after reading \(i\) words
by the complement:
\(k(w,i)=100\%-\)the probability of not remembering the \(w\)th most frequent word after reading \(i\) words
treating the \(i\) words read as repeated independent events (not truly independent, but only a slight underestimate):
\(k(w,i)=100\%-(\)the probability of not learning the \(w\)th most frequent word from reading one word\()^i\)
by the complement again:
\(k(w,i)=100\%-(100\%-\)the probability of learning the \(w\)th most frequent word from reading one word\()^i\)
\(k(w,i)=100\%-(100\%-\)the probability of reading the \(w\)th most frequent word\(*\)the probability of learning a word after seeing it\()^i\)
\(k(w,i)=100\%-(100\%-p(w)*r)^i\)
converting from percents (\(100\%=1\)):
\(k(w,i)=1-(1-p(w)*r)^i\)
This completes the definition of \(k(w,i)\) because \(r\) is a variable we manually manipulate, \(p(w)\) is already defined, and \(w\) and \(i\) are inputs.
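As a minimal sketch, \(k(w,i)\) translates directly into code (the function and parameter names here are my own):

```python
def p(w: int, f_1: float) -> float:
    """Zipf frequency of the w-th most frequent word."""
    return f_1 / w

def k(w: int, i: int, r: float, f_1: float) -> float:
    """Probability of knowing the w-th most frequent word after reading i words."""
    return 1 - (1 - p(w, f_1) * r) ** i
```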
Let \(e_w(i)=\)the probability of remembering the next word after reading \(i\) words
\(e_w(i)=\)the sum of the probabilities of remembering each word after reading \(i\) words weighted by the probability of seeing each word
\(e_w(i)=\sum_{w=1}^{n}(\)the probability of remembering the \(w\)th most frequent word\(*\)the probability of seeing the \(w\)th most frequent word\()\)
\(e_w(i)=\sum_{w=1}^{n}(k(w,i)*p(w))\)
This completes the definition of \(e_w(i)\).
Let \(a=\)the average sentence length
Let \(e_s(i)=\)the probability of remembering all the words in the next sentence after reading \(i\) words
\(e_s(i)=\)the probability of remembering the next \(a\) words after reading \(i\) words
treating the next \(a\) words as \(a\) repeated independent events (assuming nothing is learned from those \(a\) words):
\(e_s(i)=(\)the probability of remembering a word after reading \(i\) words\()^a\)
\(e_s(i)=(e_w(i))^a\)
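Both expected-knowledge functions can be sketched on top of the `p` and `k` functions above (again, names are mine):

```python
def e_w(i: int, n: int, r: float, f_1: float) -> float:
    """Probability of knowing the next word after reading i words."""
    return sum(k(w, i, r, f_1) * p(w, f_1) for w in range(1, n + 1))

def e_s(i: int, a: int, n: int, r: float, f_1: float) -> float:
    """Probability of knowing every word in the next sentence (average length a)."""
    return e_w(i, n, r, f_1) ** a
```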
Let \(E_w(t)=\)the chance of remembering the next word after reading \(t\) percent of the book
\(E_w(t)=e_w(\)how many words read after reading \(t\) percent of the book\()\)
\(E_w(t)=e_w(\)the number of words in the book\(*\)the percent of words read\(*\frac{1}{100\%})\)
\(E_w(t)=e_w(\frac{L*t}{100\%})\)
Let \(E_s(t)=\)the chance of remembering the next sentence after reading \(t\) percent of the book
\(E_s(t)=e_s(\)how many words read after reading \(t\) percent of the book\()\)
\(E_s(t)=e_s(\)the number of words in the book\(*\)the percent of words read\(*\frac{1}{100\%})\)
\(E_s(t)=e_s(\frac{L*t}{100\%})\)
This completes the model, as we have found a few functions \((e_w(i), e_s(i), E_w(t), E_s(t))\) to estimate the ease of reading as a function of progress. Below is a Python sketch of the model that you can manipulate by editing the inputs (note: not all the input variables are used in every calculation).
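This is a consolidated, runnable version of the sketches above; the value \(a=10\) is an assumed average sentence length, not a measured one:

```python
# --- Inputs (edit these) ---
n = 7519    # unique words in the book (Re:Zero vol. 1, per jpdb.io)
L = 52997   # total words in the book (same source)
r = 0.5     # chance of learning a word each time it is seen
a = 10      # average sentence length in words (assumed placeholder)

# Zipf normalization: f_1 = 1 / (partial sum of the harmonic series)
f_1 = 1 / sum(1 / w for w in range(1, n + 1))

def p(w):
    """Frequency of the w-th most frequent word (Zipf's law)."""
    return f_1 / w

def k(w, i):
    """Probability of knowing the w-th most frequent word after reading i words."""
    return 1 - (1 - p(w) * r) ** i

def e_w(i):
    """Probability of knowing the next word after reading i words."""
    return sum(k(w, i) * p(w) for w in range(1, n + 1))

def e_s(i):
    """Probability of knowing every word in the next sentence."""
    return e_w(i) ** a

def E_w(t):
    """Probability of knowing the next word after reading t percent of the book."""
    return e_w(L * t / 100)

def E_s(t):
    """Probability of knowing the next sentence after reading t percent of the book."""
    return e_s(L * t / 100)

for t in (1, 5, 25, 50, 100):
    print(f"{t:>3}% read: next word {E_w(t):.1%}, next sentence {E_s(t):.1%}")
```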
Based on the model, when reading the first volume of Re:Zero, which has 7519 unique words and 52997 words total (according to jpdb.io: https://jpdb.io/novel/1611/re-zero-starting-life-in-another-world), and assuming I knew zero Japanese before reading and had a 50% chance of learning a word each time I saw it (probably assisted by Anki), I would only have to read about 1.27% of the book (about 4 pages) to have about a 50% chance of knowing the next word, which is much less than I thought.
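As a rough sanity check on that figure, a coarse scan with the script above (same inputs) finds where \(E_w(t)\) first crosses 50%:

```python
# Find the reading progress (in percent) at which E_w first reaches 50%,
# reusing the script above with n=7519, L=52997, r=0.5.
t = 0.0
while E_w(t) < 0.5:
    t += 0.01
print(f"~{t:.2f}% of the book")  # lands near the ~1.27% figure
```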
By using Yomitan for reading on the web, Takoboto for dictionary lookups on mobile (and for creating Anki cards), and Google's Japanese handwriting OCR, I have greatly reduced the friction of looking up new words. That, combined with the model's promising predictions, has made pushing through difficult material much more encouraging.