Note: Not updated yet. Also the format is going to change.

Please read these warnings. For a more extensive explanation, see this legacy page: https://vnscripts.neocities.org/freq.html

1) The word segmentation tool used gives an approximate segmentation.

2) The word segmentation tool doesn't output "this word, from this dictionary"; it outputs a bunch of arbitrary information that has to be interpreted afterwards (a rough sketch of what that post-processing can look like is given after this list).

3) The word segmentation tool comes with a pre-trained model of which readings are more common for which words. This leads to problems like 私 always being read as わたくし. I've fixed the most obvious offenders (see the sketch after this list), but there are certainly several common words with the wrong readings attached.

4) The small size of the VN script corpus presents a lot of unique challenges, and working around them makes the statistical properties of the frequency list as a whole less natural.

5) Many scripts have duplicated sections or scenes because of how VNs are structured internally, which inflates the frequencies of the words in those scenes. This cannot be 100% fixed, only made less of a problem. The analysis here deduplicates individual lines if they're longer than five UTF-16 code units (a sketch of this is given after this list).
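
Warnings 2 and 3 come down to the same post-processing problem: the raw segmenter output has to be reinterpreted before anything can be counted. As a rough illustration only (not the actual pipeline), here is a minimal Python sketch that assumes MeCab/IPAdic-style output lines and uses a hypothetical override table for readings the model gets wrong; the field layout and the 私 entry are assumptions, not a description of the real tool.

```python
# Rough sketch: turn segmenter output into (surface, lemma, reading) tuples
# and patch known-bad readings. Assumes MeCab/IPAdic-style lines
# ("surface<TAB>comma-separated features", base form at index 6, reading at
# index 7); the actual tool and dictionary may differ.

# Hypothetical override table for readings the pre-trained model gets wrong,
# e.g. 私 coming out as ワタクシ (わたくし) instead of ワタシ (わたし).
READING_OVERRIDES = {
    ("私", "私"): "ワタシ",
}

def parse_token(line: str):
    """Turn one segmenter output line into a (surface, lemma, reading) tuple."""
    surface, features = line.split("\t", 1)
    fields = features.split(",")
    lemma = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
    reading = fields[7] if len(fields) > 7 else ""
    return surface, lemma, READING_OVERRIDES.get((surface, lemma), reading)

print(parse_token("私\t名詞,代名詞,一般,*,*,*,私,ワタクシ,ワタクシ"))
# -> ('私', '私', 'ワタシ')
```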

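To make warning 5 concrete, here is a minimal sketch of line-level deduplication under the stated five-code-unit threshold. It assumes a script is just a list of text lines and that duplicates are tracked within one script; the real analysis may group lines differently.

```python
# Minimal sketch of the line-level deduplication from warning 5: repeated
# lines are only counted once when they're longer than five UTF-16 code
# units, while short lines are always kept.

def utf16_length(line: str) -> int:
    """Length of the line in UTF-16 code units (not Unicode code points)."""
    return len(line.encode("utf-16-le")) // 2

def deduplicate_lines(lines):
    seen = set()
    for line in lines:
        if utf16_length(line) <= 5:
            yield line                # short lines are never deduplicated
        elif line not in seen:
            seen.add(line)
            yield line

script = ["「おはよう」", "「おはよう」", "ん", "ん"]
print(list(deduplicate_lines(script)))
# -> ['「おはよう」', 'ん', 'ん']
```
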
This is only a summary of the potential quality problems with this frequency list. Even so, I think it's probably still better than using frequency lists built from almost nothing but nonfiction.

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

One of the fields was moved from the lemma section to the lexeme sections in the new frequency lists.

New frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

Old frequency lists:

Frequency list of non-grammatical terms

Frequency list including grammatical terms

For more information on word segmentation, see this talk about Kuromoji.