Note: Not updated yet. Also the format is going to change.

Please read these warnings. For a more extensive explanation, see this legacy page: https://vnscripts.neocities.org/freq.html

1) The word segmentation tool used gives an approximate segmentation.

2) The word segmentation tool doesn't output "this word from this dictionary"; it outputs a bunch of somewhat arbitrary analysis fields that have to be interpreted.

3) The word segmentation tool is pre-trained with a model of which readings are more common for which words. This leads to problems like 私 always being read as わたくし. I've fixed the most obvious offenders, but there are certainly still several common words with the wrong readings attached.

4) The small size of the vnscripts corpus presents a lot of unique challenges. Working around these challenges makes the statistical properties of the frequency list as a whole less natural.

5) Many scripts have duplicated sections or scenes because of how VNs are structured internally, which inflates the frequencies of the words in those scenes. This cannot be 100% fixed, only made less of a problem. The analysis here deduplicates individual lines, but only lines longer than five UTF-16 code units (a rough sketch follows after this list).
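
As an illustration of the deduplication in (5), here is a minimal sketch in Python. The function name and the surrounding details are made up for the example; the only part taken from the description above is the rule that a line is only dropped as a duplicate if it has already been seen and is longer than five UTF-16 code units.

    def dedupe_lines(lines):
        # Drop repeated lines, but only lines longer than five UTF-16 code
        # units; shorter lines are always kept.
        seen = set()
        kept = []
        for line in lines:
            # utf-16-le has no BOM, so byte length / 2 = UTF-16 code units.
            code_units = len(line.encode("utf-16-le")) // 2
            if code_units > 5 and line in seen:
                continue  # duplicate long line: skip it
            seen.add(line)
            kept.append(line)
        return kept

The length threshold presumably exists so that very short lines that legitimately repeat (interjections, punctuation-only lines) don't get dropped.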

This is only a summary of the potential quality problems this frequency list may have. However, I think it's still probably better than using frequency lists built from almost nothing but nonfiction.

Here's an example of what it looks like when you take five trials of 30 million words each from a corpus of around 180 million words (using the Narou frequency list): https://i.imgur.com/LyL92NJ.png

The vnscripts corpus has around 30-40 million words in it (at the time of writing), so in that image, where the x axis is the frequency rank, the y axis is an estimate of the maximum positional error you'd expect to see at that rank from taking only 30 million words as samples. Of course, the stats here sample each script individually and normalize them before averaging, so the actual maximum error is a bit higher. Finally, this isn't the actual error of any given word at that rank; it's the maximum you'd expect. The actual positional error of a given word is probably going to be around 25-35% of that maximum (on average, locally) until you get into the ten-thousands.
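
For concreteness, here is a hedged sketch (Python with numpy) of the kind of trial described above. This is not the actual analysis code: it assumes the corpus has already been reduced to a word-to-count mapping, and it simulates a sample by drawing from a multinomial instead of rereading raw text, but the idea is the same: rank the full corpus, rank a sample of a given size, and measure how far each word's position moves.

    import numpy as np

    def rank_map(counts):
        # word -> rank (1 = most frequent); ties are broken arbitrarily.
        ordered = sorted(counts, key=counts.get, reverse=True)
        return {word: i + 1 for i, word in enumerate(ordered)}

    def positional_error_trial(full_counts, sample_size, rng):
        # Draw sample_size tokens according to the full corpus distribution,
        # then measure each word's rank displacement versus the full ranking.
        words = list(full_counts)
        probs = np.array([full_counts[w] for w in words], dtype=float)
        probs /= probs.sum()
        sampled = rng.multinomial(sample_size, probs)
        sample_counts = {w: int(c) for w, c in zip(words, sampled) if c > 0}
        full_ranks = rank_map(full_counts)
        sample_ranks = rank_map(sample_counts)
        return {w: abs(sample_ranks[w] - full_ranks[w]) for w in sample_counts}

    # e.g.: errors = positional_error_trial(full_counts, 30_000_000,
    #                                        np.random.default_rng(0))

Running several such trials and keeping, around each full-corpus rank, the largest displacement seen gives a curve like the one in the linked image; averaging the displacements instead gives something closer to the typical per-word error mentioned above.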

Also, doubling the number of sampled words decreases the average absolute deviation to about 70% (probably sqrt(0.5) ≈ 0.707) of what it used to be, on average (https://i.imgur.com/oAPliDd.png - left: 60m samples, 1 trial; right: 30m samples, 5 trials; local average absolute deviation).

tl;dr: The useful resolution of these frequency lists is probably something like +/-2 around rank 100, +/-4 around 500, +/-12 around 1000, +/-30 around 2000, +/-100 around 5000, and so on. (There will be words that deviate even more than this; those are just averages.)
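
If you need a rough figure for a rank between those points, the quoted values grow roughly like a power law in rank, so interpolating between them on a log-log scale is a reasonable guess. The sketch below only restates the numbers above as a table; the function and the power-law assumption are mine, not part of the analysis.

    import math

    # (rank, approximate +/- resolution) pairs from the tl;dr above.
    RESOLUTION_POINTS = [(100, 2), (500, 4), (1000, 12), (2000, 30), (5000, 100)]

    def approx_resolution(rank):
        # Log-log interpolation between the quoted points; clamped outside
        # the quoted range (the real error keeps growing past rank 5000).
        points = RESOLUTION_POINTS
        if rank <= points[0][0]:
            return points[0][1]
        if rank >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= rank <= x1:
                t = (math.log(rank) - math.log(x0)) / (math.log(x1) - math.log(x0))
                return math.exp(math.log(y0) + t * (math.log(y1) - math.log(y0)))

    # e.g. approx_resolution(3000) comes out to roughly +/-50.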

This is all relative only to the works being sampled, not to the objective frequency of a word across all genres of Japanese.

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

One of the fields was moved from the lemma section to the lexeme sections in the new frequency lists.

New frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

Old frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

For more information on word segmentation, see this talk about Kuromoji.