These are the tools used to generate the stats. All VN scripts must be encoded as UTF-8.

http://wareya.moe/extra/workspace.zip
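
Since every script has to be UTF-8, a script in another encoding needs converting first. Below is a minimal sketch of that step, not part of the workspace. The function name to_utf8 and the Shift_JIS fallback are assumptions made for illustration; the real source encoding of a given script has to be checked by hand.

```python
def to_utf8(raw: bytes, fallback: str = "shift_jis") -> bytes:
    """Return raw re-encoded as UTF-8.

    Hypothetical helper: the Shift_JIS fallback is an assumption,
    not something the tools in the workspace do for you.
    """
    try:
        # Already valid UTF-8: return the bytes unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not UTF-8: decode with the assumed encoding and re-encode.
        return raw.decode(fallback).encode("utf-8")
```

A script would be read with open(path, "rb"), passed through this function, and written back out in binary mode. A successful UTF-8 decode does not prove the file was UTF-8 to begin with, so the result is still worth checking by eye.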

analyzer.jar: The core of the stats generation. Creates a lemmatized frequency list from a given VN script. Uses kuromoji-unidic, which uses a Viterbi lattice and a pre-trained Markov model of how words connect to each other and how common each lexeme is. The VN script must be in UTF-8. Invoked by a bash script. GitHub
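
To illustrate the idea behind the Viterbi lattice, here is a toy sketch. It is not kuromoji's code or API. The lexicon, the word costs, the connection costs and the name viterbi_lemmas are all made up for the example, and real analyzers also handle unknown words, which this skips.

```python
# Hypothetical toy lexicon: surface form -> [(lemma, part of speech, word cost)].
# A lower cost means a more common lexeme.
LEXICON = {
    "食べ": [("食べる", "verb", 2)],
    "た": [("た", "aux", 1)],
    "食": [("食", "noun", 4)],
    "べた": [("べた", "noun", 6)],
}

# Hypothetical connection costs between adjacent parts of speech.
CONNECTION = {
    ("BOS", "verb"): 1,
    ("BOS", "noun"): 1,
    ("verb", "aux"): 0,
    ("noun", "noun"): 3,
}
DEFAULT_CONNECTION = 5  # assumed cost for pairs not listed above


def viterbi_lemmas(text):
    """Return the lemmas on the cheapest path through the token lattice."""
    # best[i] maps the POS of the last token ending at position i
    # to (total cost, lemmas so far) for the cheapest such path.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        if not best[start]:
            continue  # no segmentation reaches this position
        for end in range(start + 1, len(text) + 1):
            for lemma, pos, word_cost in LEXICON.get(text[start:end], []):
                for prev_pos, (prev_cost, lemmas) in best[start].items():
                    cost = (prev_cost + word_cost
                            + CONNECTION.get((prev_pos, pos), DEFAULT_CONNECTION))
                    if pos not in best[end] or cost < best[end][pos][0]:
                        best[end][pos] = (cost, lemmas + [lemma])
    if not best[-1]:
        raise ValueError("no segmentation found for: " + text)
    return min(best[-1].values(), key=lambda entry: entry[0])[1]
```

With these toy costs, "食べた" comes out as 食べる followed by た, because that path costs less than 食 followed by べた. Counting the lemmas returned for every line of a script gives a lemmatized frequency list.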

normalizer.jar: Merges frequency lists in the format that analyzer.jar outputs. Invoked manually on most of the frequency lists generated under the count/ directory. Used to create the frequency list for the 5k columns. GitHub
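
Merging amounts to summing the counts for each lemma across lists. The sketch below shows that idea only and is not normalizer.jar's code. It assumes one "count<TAB>lemma" entry per line, which is a hypothetical format; the real output of analyzer.jar may differ. The function names are made up as well.

```python
from collections import Counter


def parse_frequency_list(lines):
    """Parse lines of the assumed form 'count<TAB>lemma' into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        count, lemma = line.split("\t", 1)
        counts[lemma] += int(count)
    return counts


def merge_frequency_lists(lists):
    """Sum the per-lemma counts of several frequency lists."""
    total = Counter()
    for lines in lists:
        # Counter.update adds counts instead of replacing them.
        total.update(parse_frequency_list(lines))
    return total
```

Calling total.most_common() on the result lists the merged lemmas from most to least frequent.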

dowork.sh: Generates the main frequency lists for each script in workspace/, placing the lists under count/. These lists exclude grammatical lexemes.

altwork.sh: Same as dowork.sh, but places the lists under altcount/ and does not exclude grammatical lexemes.
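
The difference between the two kinds of list can be sketched as a filter on part of speech. This is an illustration only. Treating particles (助詞) and auxiliary verbs (助動詞) as the grammatical lexemes is an assumption, since the exact set the scripts exclude is not stated here, and count_lemmas is a hypothetical name.

```python
from collections import Counter

# Assumed set of grammatical parts of speech (UniDic-style labels).
GRAMMATICAL_POS = {"助詞", "助動詞"}


def count_lemmas(tokens, exclude_grammatical):
    """Count lemmas from (lemma, part of speech) pairs.

    exclude_grammatical=True mimics the count/ lists,
    exclude_grammatical=False mimics the altcount/ lists.
    """
    counts = Counter()
    for lemma, pos in tokens:
        if exclude_grammatical and pos in GRAMMATICAL_POS:
            continue  # drop particles and auxiliaries
        counts[lemma] += 1
    return counts
```

For the tokens of 猫が食べた, the filtered count keeps only 猫 and 食べる, while the unfiltered count also includes が and た.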