VNStats is the successor to the vnscripts stats project, because I no longer have the focus to rip many scripts.

In this wiki, you'll find a wealth of information on how to rip visual novel scripts, how particular engines handle their scripts, how corpus linguistics on Japanese text works, and the statistics we can create based on those scripts. Everything here is strictly for analytical purposes. Nothing here enables or condones the piracy of visual novels or the circumvention of their DRM.

Previous projects for collecting stats about VNs generally used engine-specific analysis tools. This project instead dumps every script to a common format, so that the analysis process differs as little as possible between games and engines. Everything after getting the script into that common format is automated.
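
To make the design concrete, here is a minimal sketch of the "engine-specific dumper, shared analysis" split. Everything in it is an assumption for illustration: the two engine syntaxes are made up, the common format is reduced to a plain list of text lines, and the names `dump_engine_a`, `dump_engine_b`, `DUMPERS`, `analyze`, and `run` are hypothetical. It is not the project's actual format or tooling.

```python
from typing import Callable, Dict, List

# Hypothetical common format: one displayed line of text per list entry.
CommonScript = List[str]


def dump_engine_a(raw: str) -> CommonScript:
    """Made-up engine A: plain text lines mixed with '@' command lines."""
    return [line for line in raw.splitlines() if line and not line.startswith("@")]


def dump_engine_b(raw: str) -> CommonScript:
    """Made-up engine B: text appears only inside  say "..."  commands."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith('say "') and line.endswith('"'):
            lines.append(line[5:-1])  # drop the 'say "' prefix and closing quote
    return lines


# Only this table knows about engines; nothing after it does.
DUMPERS: Dict[str, Callable[[str], CommonScript]] = {
    "engine_a": dump_engine_a,
    "engine_b": dump_engine_b,
}


def analyze(script: CommonScript) -> Dict[str, int]:
    """Engine-independent analysis: runs identically on every dumped script."""
    return {"lines": len(script), "characters": sum(len(line) for line in script)}


def run(engine: str, raw: str) -> Dict[str, int]:
    """Dump with the engine-specific step, then hand off to the shared step."""
    return analyze(DUMPERS[engine](raw))
```

The point of the split is that two scripts with the same text but different engines produce identical stats, because `analyze` never sees anything engine-specific.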

This project covers 31.5 million lexemes. Compressed with LZMA2 (7z), the scripts come to 37.8MB (counting 1MB as 1024^2 bytes, so circa 40 million bytes), which serves as an upper bound on the entropy of the corpus.
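
For readers unfamiliar with the idea: the size of a losslessly compressed file is an upper bound on how much information the original contains, since the original can be rebuilt from it. The sketch below shows this kind of measurement with Python's standard `lzma` module. It is an assumption-laden stand-in, not the procedure used for the figure above: it writes an `.xz` stream (which also uses the LZMA2 filter) instead of a 7z archive, and the function names are hypothetical.

```python
import lzma


def compressed_size_bytes(text: str) -> int:
    """Return the size in bytes of `text` after LZMA2 compression.

    Any lossless compressed size is an upper bound on the information
    content of the input; a better compressor could only lower it.
    """
    raw = text.encode("utf-8")
    # FORMAT_XZ uses the LZMA2 filter; preset 9 | PRESET_EXTREME trades
    # speed for the smallest output the module can produce by preset.
    compressed = lzma.compress(
        raw, format=lzma.FORMAT_XZ, preset=9 | lzma.PRESET_EXTREME
    )
    return len(compressed)


def bytes_to_mib(n: int) -> float:
    """Convert a byte count to units of 1024^2 bytes."""
    return n / 1024**2


def mib_to_bytes(mib: float) -> float:
    """Convert units of 1024^2 bytes back to a byte count."""
    return mib * 1024**2
```

As a sanity check on the units, `mib_to_bytes(37.8)` is about 39.6 million, which matches the "circa 40 million bytes" figure.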

VNStats is a project to maintain objective, useful, up-to-date statistics on visual novel scripts, so that readers and learners can meaningfully talk about the length, difficulty, and complexity of different works.

While traditional media like books and cinema have obvious metrics for length (pages/words, runtime), visual novels "hide" this information by mixing content text with program logic. Unlike gameplay-heavy video games, where the playtime of an average playthrough is meaningful, visual novels have a lot of quirks that give different consumption styles radically different playtimes, depending on what the particular visual novel does. Some people skip anything erotic, some use guides, some try every choice, and so on.

Additionally, while you can meaningfully compare the difficulty of different video games, writing doesn't have an obvious metric for reading difficulty except in the most extreme cases. A normal person can't tell you with any certainty how hard a particular story is, whether it's a novel, a visual novel, or anything else. This is compounded by second-language learners having completely different standards than native readers for which quirks make something difficult to read.

Visual novels are interesting because they're a culturally meaningful but non-obvious source of corpus linguistics material. Normally, mainstream culture only accepts certain perverse story themes in the context of high literature, which directly affects the language with which they're presented. Visual novels are seen as a niche thing by default, and have less of a stigma against presenting certain themes in certain ways. This makes the writing less colored by societal norms.

