Diff for "Frequency lists"

Differences between revisions 25 and 26

Please read these warnings. For a more extensive explanation, see this legacy page: https://vnscripts.neocities.org/freq.html

1) The word segmentation tool used gives an approximate segmentation.

2) The word segmentation tool doesn't output "this word from this dictionary"; it outputs a bunch of arbitrary information.

3) The word segmentation tool is pre-trained with a model of which readings are more common for which words. This leads to problems like 私 always being read as わたくし. I've fixed the most obvious offenders, but there are certainly still several common words with the wrong readings attached.

4) The small size of the vn scripts corpus presents a lot of unique challenges. Approaching these challenges causes the statistical properties of the frequency list as a whole to be more unnatural.

5) Many scripts have duplicated sections or scenes because of how VNs are structured internally, which inflates the frequencies of the words in those scenes. This cannot be 100% fixed, only made less of a problem. The analysis here deduplicates by individual lines if they're longer than a certain short length.
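The line-level deduplication described in item 5 could look roughly like the following Python sketch. The function name `dedupe_lines` and the `min_length` cutoff of 5 characters are assumptions for illustration; the actual tool and threshold used for these lists are not stated here.

```python
def dedupe_lines(lines, min_length=5):
    """Drop repeated lines within one script, leaving short lines alone.

    min_length is a hypothetical cutoff: lines shorter than this are kept
    even when they repeat, since short lines repeat naturally in dialogue.
    """
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) >= min_length:
            if text in seen:
                continue  # this long line was already counted once
            seen.add(text)
        kept.append(line)
    return kept
```

Keeping short lines even when they repeat avoids deflating the counts of very common short utterances, at the cost of leaving some duplicated scene content in place.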

This is only a summary of the potential quality problems this frequency list may have. However, I think it's still probably better than using frequency lists built from almost nothing but nonfiction.

Here's an example of what it looks like when you take five trials of 30 million words from around 180 million words (using the Narou frequency list): https://i.imgur.com/LyL92NJ.png

The vnscripts corpus has around 30-40 million words in it (at the time I wrote this). In the plot, the x axis is the frequency and the y axis is an estimate of the maximum positional error you'd expect to see at that frequency when taking only 30 million words as a sample. Of course, the stats here sample each script individually and normalize them before averaging them, so the actual maximum error is a bit higher. Finally, this isn't the actual error of any given word at that frequency; it's the maximum you'd expect. The actual positional error of a given word is probably going to be around 25-35% of the maximum (on average, locally) until you get into the ten-thousands.
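The "normalize each script, then average" step mentioned above can be sketched like this, assuming each script has already been segmented into a list of words. `averaged_frequencies` is a hypothetical helper for illustration, not the code used to build these lists.

```python
from collections import Counter

def averaged_frequencies(scripts):
    """Average per-script relative frequencies so that long scripts
    do not dominate the result.

    scripts: list of word lists, one per script (assumed pre-segmented).
    Returns a dict mapping word -> mean relative frequency across scripts.
    """
    totals = Counter()
    non_empty = [words for words in scripts if words]
    for words in non_empty:
        counts = Counter(words)
        for word, count in counts.items():
            # Normalize within the script before accumulating.
            totals[word] += count / len(words)
    return {word: total / len(non_empty) for word, total in totals.items()}
```

With this scheme a word that is frequent in one very long script counts no more than a word that is equally frequent in a short one, which is also why each script behaves like a smaller, noisier sample.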

Also, doubling the number of sampled words decreases the average absolute deviation to about 70% (probably sqrt(0.5)) of what it was, on average (https://i.imgur.com/oAPliDd.png - left: 60m samples, 1 trial; right: 30m samples, 5 trials; local average absolute deviation).
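If the deviation really does scale with the inverse square root of the sample size, which is what the sqrt(0.5) guess assumes, the expected ratio is easy to compute. This is only a sketch under that assumption; `deviation_ratio` is a hypothetical name.

```python
import math

def deviation_ratio(old_samples, new_samples):
    """Expected ratio of new to old average deviation, assuming the
    deviation is proportional to 1 / sqrt(sample size)."""
    return math.sqrt(old_samples / new_samples)

# Going from 30 million to 60 million sampled words:
print(round(deviation_ratio(30_000_000, 60_000_000), 3))  # 0.707
```

By the same assumption, getting the deviation down to half would take four times as many words, not twice as many.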

tl;dr: The useful resolution of these frequency lists is probably something like +/-2 around 100, +/-4 around 500, +/-12 around 1000, +/-30 around 2000, +/-100 around 5000, and so on. (There will be words that deviate even more than this; those are just averages.)
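One way to use those numbers in practice is a small lookup that tells you roughly how much slack to allow at a given position. The sketch below is an assumption on my part: it linearly interpolates between the listed points and clamps outside 100..5000, and the rule in `ranks_distinguishable` (gap larger than both tolerances combined) is just one conservative choice. Neither function name comes from the actual tooling.

```python
import bisect

# (position, approximate +/- resolution) pairs taken from the tl;dr above.
RESOLUTION_POINTS = [(100, 2), (500, 4), (1000, 12), (2000, 30), (5000, 100)]

def approx_resolution(rank):
    """Interpolate the +/- resolution at a list position.

    Linear interpolation between the listed points is an assumption;
    values outside 100..5000 are clamped, which understates the
    resolution past 5000.
    """
    ranks = [r for r, _ in RESOLUTION_POINTS]
    if rank <= ranks[0]:
        return float(RESOLUTION_POINTS[0][1])
    if rank >= ranks[-1]:
        return float(RESOLUTION_POINTS[-1][1])
    i = bisect.bisect_right(ranks, rank)
    (r0, e0), (r1, e1) = RESOLUTION_POINTS[i - 1], RESOLUTION_POINTS[i]
    return e0 + (e1 - e0) * (rank - r0) / (r1 - r0)

def ranks_distinguishable(rank_a, rank_b):
    """True if the gap between two positions exceeds both tolerances combined."""
    return abs(rank_a - rank_b) > approx_resolution(rank_a) + approx_resolution(rank_b)
```

For example, positions 1000 and 1010 would not count as meaningfully different under this rule, while positions 100 and 5000 obviously would.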

This is relative only to the work being sampled, not to the objective frequency of the word among all genres of Japanese. There are words that are objectively common here that are not common in general Japanese, and vice versa.

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

VN frequency lists: https://github.com/wareya/jpstats/tree/master/workspace

Narou frequency lists: https://github.com/wareya/jpstats/tree/master/narou

Old frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

Headers (copy and paste into a spreadsheet program): https://pastebin.com/raw/5B2bHfJD

Older frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

For more information on word segmentation, see this talk about Kuromoji.

- Revision 25 as of 2019-06-29 19:26:11
  Size: 3922
  Editor: weh
  Comment:
+ Revision 26 as of 2019-07-02 23:15:27
  Size: 3930
  Editor: weh
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
-) The small size of the vn scripts corpus presents a lot of unique challenges. Solving these challenges makes the statistical properties of the frequency list as a whole be more unnatural.
+) The small size of the vn scripts corpus presents a lot of unique challenges. Approaching these challenges causes the statistical properties of the frequency list as a whole to be more unnatural.