Ripping scripts properly is a complicated process, and it is specific to the engine, and sometimes to the individual visual novel, that you want to rip.
My gists: various ripping tools for specific engines, some hacked together from existing tools. They are mixed in with my other gists.
I've made several tools that I never put on gists, and I don't know how many I've made. If you need a specific one, ask for it and I'll put it up.
VNs contain narration, dialogue, names, ruby text, etc. This is all indicated in the original scripts using commands or some kind of markup. For the purpose of corpus linguistics, the following formatting rules are enforced:
Newlines in the middle of sentences must be removed. This is a hard rule. If you can't enforce it, the script is almost worthless. Linebreaks between or after sentences are completely okay.
If pagebreaks are different from newlines, pagebreaks should be indicated by two consecutive newlines. Otherwise it's not important. This is not a hard rule; tools will still work if it is broken.
Ruby text must be removed or placed inside 《》, because the analysis tools need to know how to ignore it. Analysis will still work if this rule is not followed, but the results will be crappy.
Scenes do not have to be in order at all.
All scripts must be in utf-8, without a BOM.
Dialogue should be in normal quotes: 「」. This is not a hard rule; it doesn't change the analysis.
Speaker names must not be included. This is a hard rule: the analysis can't ignore speaker names, so they simply must not be there.
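
As a rough illustration of the rules above, here is a minimal Python sketch of a post-processing pass over an already-ripped script. It is not one of the actual tools; the function names (`decode_script`, `join_broken_sentences`, `strip_ruby`) are made up for this example. It assumes that a line is complete when it ends with sentence-final punctuation or a closing quote, which is only a heuristic and will need adjusting per visual novel. It also assumes Japanese text, so broken lines are joined without inserting a space.

```python
import re

# Assumption: a line that ends with one of these characters is a finished
# sentence (or a finished line of dialogue). Adjust per script.
LINE_ENDERS = "。！？!?…」』）)"


def decode_script(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a BOM if one is present."""
    # "utf-8-sig" strips a leading BOM and behaves like "utf-8" otherwise.
    return raw.decode("utf-8-sig").replace("\r\n", "\n")


def join_broken_sentences(text: str) -> str:
    """Remove newlines inside sentences; keep pagebreaks as blank lines."""
    pages = []
    for page in re.split(r"\n{2,}", text):
        lines, buffer = [], ""
        for line in page.split("\n"):
            if not line:
                continue
            buffer += line
            if buffer[-1] in LINE_ENDERS:
                lines.append(buffer)
                buffer = ""
        if buffer:
            # Keep a trailing fragment rather than silently dropping it.
            lines.append(buffer)
        if lines:
            pages.append("\n".join(lines))
    return "\n\n".join(pages)


def strip_ruby(text: str) -> str:
    """Drop ruby text that has been placed inside 《》."""
    return re.sub(r"《[^《》]*》", "", text)


if __name__ == "__main__":
    raw = "今日は\nいい天気《てんき》だ。\n「そうだね」\n\n次のページ。".encode("utf-8")
    print(strip_ruby(join_broken_sentences(decode_script(raw))))
```

When writing the result back out, plain `encoding="utf-8"` (not `"utf-8-sig"`) produces a file without a BOM. Speaker names are not handled here because how they appear depends entirely on the engine's markup; they have to be dropped during the rip itself.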
See also: Sanity checks