Making stats - VN Stats

This page is out of date. See the Stats page and read the readme in the tools linked there.

These are the tools used to generate the stats. See Engines for information on script ripping for particular engines.

These tools require Python 3, 64-bit Java, and a bash prompt.

Place scripts under workspace/. All VN scripts must be in UTF-8. See Ripping for information on script formatting. Unix line endings (\n, not \r\n) are preferred.
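A ripped script that is not yet in UTF-8 or still has Windows line endings can be converted before analysis. The following is a minimal sketch, not part of the tools themselves. The function name normalize_script and the cp932 fallback encoding are assumptions for illustration; pass whatever encoding the script was actually ripped in.

```python
from pathlib import Path


def normalize_script(path, fallback_encoding="cp932"):
    """Rewrite a script file in place as UTF-8 with Unix line endings.

    fallback_encoding is an assumption: it is only used when the file
    is not already valid UTF-8.
    """
    raw = Path(path).read_bytes()
    try:
        # "utf-8-sig" also strips a leading byte order mark if present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)
    # Convert \r\n (Windows) and stray \r line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    Path(path).write_bytes(text.encode("utf-8"))
    return text
```

Trying UTF-8 first is only a heuristic: text in another encoding almost never decodes cleanly as UTF-8, but it is worth spot-checking the converted file by eye.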

analyzer.jar: The core of the stats generation. Creates a lemmatized frequency list from a given VN script. Uses kuromoji-unidic, which runs a Viterbi search over a graph of candidate words, using a pre-trained Markov model of how words connect to each other and how common each lexeme is. The VN script must be in UTF-8. Invoked by a bash script. GitHub
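To illustrate the idea behind this kind of segmentation, here is a toy Viterbi search in Python. It is a sketch only and is not how analyzer.jar or kuromoji is implemented: the function name, the dictionaries, and all costs are made up, and connection costs are keyed on word pairs for simplicity, whereas real morphological analyzers typically key them on context classes from the dictionary. Lower cost means more likely.

```python
def viterbi_segment(text, word_cost, connection_cost, default_connection=0):
    """Return the lowest-cost split of text into known words, or None.

    word_cost: hypothetical {word: cost} table (how common each word is).
    connection_cost: hypothetical {(previous_word, word): cost} table.
    """
    n = len(text)
    # best[i] maps a word ending at position i to
    # (cheapest total cost, previous word, start index of this word).
    best = [dict() for _ in range(n + 1)]
    best[0]["<BOS>"] = (0, None, 0)
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word not in word_cost:
                continue
            for prev, (cost, _, _) in best[start].items():
                total = (cost + word_cost[word]
                         + connection_cost.get((prev, word), default_connection))
                if word not in best[end] or total < best[end][word][0]:
                    best[end][word] = (total, prev, start)
    if not best[n]:
        return None  # no path through the graph reaches the end of the text
    # Walk back from the cheapest final word to the beginning marker.
    word = min(best[n], key=lambda w: best[n][w][0])
    pos, words = n, []
    while word != "<BOS>":
        _, prev, start = best[pos][word]
        words.append(word)
        word, pos = prev, start
    return words[::-1]


costs = {"東京": 1, "都": 2, "東": 3, "京都": 1}
print(viterbi_segment("東京都", costs, {}))
print(viterbi_segment("東京都", costs, {("東", "京都"): -3}))
```

With the made-up costs above, the first call picks 東京 + 都 on word costs alone, and the second call flips to 東 + 京都 once that pair is given a favorable connection cost.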

normalizer.jar: Merges frequency lists in the format that analyzer.jar outputs. Invoked manually on most of the frequency lists generated under the count/ directory. Used to create the frequency list for the 5k columns. GitHub
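Merging frequency lists amounts to summing the counts for each lexeme across lists. The sketch below shows the idea in Python. It is not the normalizer.jar implementation, and the line format it reads (a count, a tab, then the lexeme) is an assumption for illustration; check the real output of analyzer.jar before adapting it.

```python
from collections import Counter


def merge_frequency_lists(lists):
    """Sum per-lexeme counts across several frequency lists.

    Each list is an iterable of lines in the assumed format
    "count<TAB>lexeme". Returns (lexeme, count) pairs, most frequent first.
    """
    merged = Counter()
    for lines in lists:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            count, lexeme = line.split("\t", 1)
            merged[lexeme] += int(count)
    return merged.most_common()
```

For example, merging ["3\t見る", "1\t猫"] with ["2\t猫", "1\t犬"] gives 見る and 猫 a count of 3 each and 犬 a count of 1.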

dowork.sh: Generates the main frequency lists for each script in workspace/, placing the lists under count/. These lists exclude grammatical lexemes.

altwork.sh: Like dowork.sh, but places the lists under altcount/ and does not exclude grammatical lexemes.

refresh.sh: Calculates the Hayashi score, coverages, and other stats from every frequency list and script.
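As a rough illustration of what a coverage figure means, the sketch below computes the share of a script's word occurrences that belong to a given set of known lexemes, such as the top entries of a merged frequency list. This definition and the function name are assumptions; the exact calculation in refresh.sh may differ, and the Hayashi score is not covered here.

```python
def coverage(script_counts, known_lexemes):
    """Fraction of word occurrences in a script that are in known_lexemes.

    script_counts: {lexeme: count} for one script.
    known_lexemes: a set of lexemes assumed to be known to the reader.
    """
    total = sum(script_counts.values())
    if total == 0:
        return 0.0  # an empty script has nothing to cover
    covered = sum(count for lexeme, count in script_counts.items()
                  if lexeme in known_lexemes)
    return covered / total
```

With counts of 6 for 見る, 3 for 猫, and 1 for 犬, knowing 見る and 猫 covers 9 of the 10 occurrences, or 90%.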

newscript.sh: Generates/regenerates the frequency list in count/ and the frequency list in altcount/ for a single script in workspace/.

fullredo.sh: Runs dowork.sh, altwork.sh, and refresh.sh. Preferable for the first run, but when adding a single script, run newscript.sh and then refresh.sh manually instead.
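The full run described above can be pictured as three steps executed in order, stopping at the first failure. The Python sketch below mirrors that sequence. It is not the contents of fullredo.sh; it assumes the scripts take no arguments and are run from the directory that contains them, and the names FULL_REDO_STEPS and run_steps are made up for illustration.

```python
import subprocess

# Order taken from the description of fullredo.sh.
FULL_REDO_STEPS = ["dowork.sh", "altwork.sh", "refresh.sh"]


def run_steps(steps, tools_dir=".", run=subprocess.run):
    """Run each shell script with bash, in order.

    check=True makes subprocess.run raise CalledProcessError if a step
    exits with a non-zero status, so later steps are not run.
    """
    for step in steps:
        run(["bash", step], cwd=tools_dir, check=True)
```

Calling run_steps(FULL_REDO_STEPS) from the tools directory would then be the rough equivalent of a full redo under these assumptions.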