Note: Not yet updated. The format is also going to change.

Please read these warnings. For a more extensive explanation, see this legacy page: https://vnscripts.neocities.org/freq.html

1) The word segmentation tool used gives an approximate segmentation.

2) The word segmentation tool doesn't output a clean "this word from this dictionary" record; it outputs a bunch of arbitrary morphological information that has to be interpreted (see the sketch below).

3) The word segmentation tool is pre-trained with a model of which readings are more common for which words. This leads to problems like 私 always being read as わたくし. I've fixed the most obvious offenders, but there are certainly still several common words with the wrong readings attached.

4) The small size of the vnscripts corpus presents a lot of unique challenges, and solving them makes the statistical properties of the frequency list as a whole more unnatural.

5) Many scripts have duplicated sections or scenes because of how VNs are structured internally, which inflates the frequencies of the words in those scenes. This cannot be 100% fixed, only made less of a problem. The analysis here deduplicates individual lines when they're longer than five UTF-16 code units, as sketched just below.
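
A minimal sketch of that line-level dedup, assuming duplicates are detected by exact string match (the real pipeline may differ):

{{{#!python
def dedupe_lines(lines, min_units=5):
    """Drop repeated lines, but only ones longer than min_units UTF-16
    code units; very short lines repeat naturally and are always kept."""
    seen = set()
    kept = []
    for line in lines:
        # Length in UTF-16 code units (characters outside the BMP count as 2).
        utf16_units = len(line.encode("utf-16-le")) // 2
        if utf16_units > min_units and line in seen:
            continue  # a long line we've already seen: drop the duplicate
        seen.add(line)
        kept.append(line)
    return kept
}}}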

This is only a summary of the potential quality problems this frequency list may have. However, I think it's still probably better than using frequency lists built from almost nothing but nonfiction.
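
To make warnings 2) and 3) concrete, here is a rough sketch of what consuming the segmenter's output can look like. The MeCab/IPADIC-style feature layout and the override table are assumptions for illustration only; the actual tool and dictionary used here may format things differently.

{{{#!python
# A segmenter in the MeCab/Kuromoji family emits lines roughly like
#   surface<TAB>POS,POS-sub1,POS-sub2,POS-sub3,conj-type,conj-form,base,reading,pronunciation
# rather than a clean "this word from this dictionary" record, so the base
# form and reading have to be dug out of a CSV feature string.

# Hypothetical override table for words whose model-preferred reading is wrong,
# e.g. forcing 私 to ワタシ instead of ワタクシ.
READING_OVERRIDES = {
    ("私", "ワタクシ"): "ワタシ",
}

def lemma_and_reading(output_line):
    surface, feature_csv = output_line.split("\t", 1)
    features = feature_csv.split(",")
    base, reading = features[6], features[7]
    reading = READING_OVERRIDES.get((base, reading), reading)
    return base, reading
}}}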

Here's an example of what it looks like when you take five trials of 30 million words each from a corpus of over 100 million words (using the [[Narou]] frequency list): https://i.imgur.com/LyL92NJ.png

The vnscripts corpus has around 30-40 million words in it (at the time of writing), so in that plot the x axis is the frequency and the y axis is an estimate of the maximum positional error you'd expect to see at that frequency when only 30 million words are sampled. Of course, the stats here sample each script individually and normalize them before averaging, so the actual maximum error is a bit higher. Finally, this isn't the actual error of any given word at that frequency, it's the maximum you'd expect; the actual positional error of a given word will probably average around 25~35% of that maximum (locally) until you get into the ten-thousands.
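
The kind of experiment behind that plot can be approximated like the sketch below. This is an illustration, not the actual analysis: the token list, trial count, and sample size are placeholders, and the real stats normalize per script before averaging.

{{{#!python
import random
from collections import Counter

def ranks(tokens):
    """Map each word to its position in a frequency list built from tokens."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: pos for pos, word in enumerate(ordered, start=1)}

def max_positional_error(tokens, sample_size=30_000_000, trials=5, seed=0):
    """For each word, the largest shift in list position observed across
    trials when ranking only sample_size tokens instead of the full corpus."""
    rng = random.Random(seed)
    full = ranks(tokens)
    worst = {word: 0 for word in full}
    for _ in range(trials):
        sample_ranks = ranks(rng.sample(tokens, sample_size))
        for word, pos in sample_ranks.items():
            worst[word] = max(worst[word], abs(pos - full[word]))
    return worst
}}}

Words that happen to miss a sample entirely get no position in that trial and are simply skipped here; the real analysis presumably has to deal with that case as well.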

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

One of the fields was moved from the lemma section to the lexeme sections in the new frequency lists.

New frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

Old frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

For more information on word segmentation, see this talk about Kuromoji.
