Diff for "Frequency lists"

Differences between revisions 4 and 5

Please read these warnings. For a more extensive explanation, see this legacy page: https://vnscripts.neocities.org/freq.html

1) The word segmentation tool used gives an approximate segmentation.

2) The word segmentation tool doesn't output "this word from this dictionary", it outputs a bunch of arbitrary information.

3) The word segmentation tool is pre-trained with a model of which readings are more common for which words. This leads to problems like 私 always being read as わたくし. I've fixed the most obvious offenders, but there are certainly several common words that still have the wrong readings attached.

4) The small size of the VN scripts corpus presents a lot of unique challenges. Working around these challenges makes the statistical properties of the frequency list as a whole less natural.

5) Many scripts have duplicated sections or scenes because of how VNs are structured internally. This cannot be fixed programmatically, only made less of a problem.
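
To make warnings 3 and 5 more concrete, here is a minimal sketch of what such workarounds could look like. It is not the code used to build this list. It assumes the scripts have already been segmented into (surface, reading) pairs, and every name in it (`READING_OVERRIDES`, `fix_reading`, `count_terms`, `min_tokens`) is hypothetical. Readings are patched with a hand-written override table, and a longer line that repeats within one script is counted only once, which reduces the effect of duplicated scenes without removing it.

```python
from collections import Counter

# Hypothetical override table: (surface, reading emitted by the tool) -> preferred reading.
READING_OVERRIDES = {
    ("私", "わたくし"): "わたし",
}


def fix_reading(surface, reading):
    """Return the overridden reading if one is listed, else the tool's reading."""
    return READING_OVERRIDES.get((surface, reading), reading)


def count_terms(scripts, min_tokens=3):
    """Count (surface, reading) pairs across scripts.

    scripts: iterable of scripts; each script is a list of lines;
             each line is a list of (surface, reading) tuples.
    min_tokens: assumed threshold; only lines at least this long are
                deduplicated, so short lines that genuinely repeat
                (e.g. a one-word reply) are still counted every time.
    """
    counts = Counter()
    for script in scripts:
        seen_lines = set()  # reset per script: duplicates are only tracked within one script
        for line in script:
            key = tuple(line)
            if len(line) >= min_tokens:
                if key in seen_lines:
                    continue  # likely a duplicated section, skip it
                seen_lines.add(key)
            for surface, reading in line:
                counts[(surface, fix_reading(surface, reading))] += 1
    return counts
```

The threshold is a trade-off: set too low, it discards real repetition; set too high, it lets duplicated scenes through. That is one reason the duplication problem can only be reduced, not fixed.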

This is only a summary of the quality problems this frequency list may have. Even so, I think it's still probably better than using frequency lists built almost entirely from nonfiction.

This frequency list is subject to arbitrary changes at any time as I find new ways to try to make up for the above problems.

Frequency list of non-grammatical terms

Frequency list including grammatical terms

For more information on word segmentation, see this talk about Kuromoji.

-  Revision 4 as of 2017-08-27 01:28:54
  Size: 1456
  Editor: weh
  Comment:
+  Revision 5 as of 2017-08-27 01:30:17
  Size: 1624
  Editor: weh
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 11:
+5) Many scripts have duplicated sections or scenes because of how VNs are structured internally. This cannot be fixed programmatically, only made less of a problem.