
Stats for random anime subtitles. Just what was easy to dump. No quality guarantees. Not going to be maintained at all. Just for fun.

12-episode anime series contain very little text, roughly 100 to 200 KB in UTF-8, so any analysis of them is going to be extremely unstable. For reference, Hanahira is about 180 KB in UTF-8. Also, the original line breaks cannot be reliably reconstructed from most subtitles, and the "custom metric" depends heavily on them, so it's disabled here.

| script name | kanji (unique) | kanji (2+) | lines | sentences | chars/line | chars/sentence | characters | lexemes | sjis bytes | sjis (dedup) | hours estimate | Hayashi | freqlist 90% target | freqlist 92.5% target | freqlist 95% target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| amagami brilliant park | 1198 | 888 | 6351 | 6377 | 9.44 | 9.09 | 59968 | 31252 | 122344 | 117303 | 1.89 | 81.33 | 5047.96 | 7095.28 | 10407.23 |
| bakemonogatari | 1266 | 959 | 9237 | 9225 | 9.36 | 8.98 | 86446 | 47088 | 180066 | 170112 | 2.86 | 84.67 | 3714.10 | 5360.33 | 8079.55 |
| black lagoon | 1588 | 1274 | 12360 | 12600 | 9.01 | 8.63 | 111390 | 60146 | 235635 | 226258 | 3.64 | 81.43 | 7882.46 | 10386.23 | 13947.02 |
| cardcaptor sakura | 1362 | 1142 | 32645 | 32309 | 8.01 | 7.30 | 261594 | 126381 | 553771 | 463862 | 7.73 | 84.87 | 2805.26 | 4153.60 | 7043.47 |
| cowboy bebop | 1304 | 1011 | 9382 | 9091 | 9.67 | 9.16 | 90735 | 45920 | 185698 | 171931 | 2.72 | 76.54 | 6433.00 | 8755.69 | 11828.17 |
| demi-chan wa kataritai | 1047 | 767 | 4183 | 4267 | 11.55 | 10.93 | 48318 | 26910 | 103967 | 100954 | 1.66 | 83.84 | 4048.70 | 6160.28 | 9386.85 |
| devilman crybaby | 996 | 647 | 3193 | 3554 | 10.62 | 8.91 | 33912 | 17285 | 71529 | 67021 | 1.08 | 82.80 | 4889.00 | 6567.50 | 10222.00 |
| eromanga sensei | 954 | 702 | 6708 | 7178 | 8.66 | 7.76 | 58105 | 30433 | 122488 | 115845 | 1.84 | 84.37 | 3250.91 | 4298.81 | 6333.55 |
| eureka seven | 1489 | 1240 | 24996 | 25379 | 8.20 | 7.57 | 204927 | 101555 | 432784 | 388647 | 6.16 | 79.36 | 6004.07 | 8039.99 | 12652.60 |
| flying witch | 904 | 643 | 3860 | 3659 | 11.17 | 10.51 | 43099 | 22243 | 94274 | 83625 | 1.36 | 89.10 | 3177.43 | 4512.98 | 6976.65 |
| fractale | 554 | 319 | 960 | 1439 | 11.87 | 7.50 | 11398 | 5777 | 27014 | 26416 | 0.36 | 86.49 | 5873.00 | 9855.00 | 18224.16 |
| fune wo amu | 1115 | 789 | 3236 | 2993 | 12.66 | 12.26 | 40960 | 21690 | 88702 | 78438 | 1.38 | 80.40 | 6622.45 | 7890.63 | 11149.16 |
| gabriel dropout | 1099 | 826 | 7242 | 7691 | 8.95 | 8.07 | 64811 | 31584 | 135685 | 129955 | 1.97 | 81.79 | 3940.10 | 5384.08 | 7638.05 |
| gekkan shoujo nozaki-kun | 1066 | 752 | 4173 | 4194 | 12.99 | 11.67 | 54221 | 28293 | 118960 | 107660 | 1.76 | 81.32 | 3072.30 | 4236.44 | 6146.25 |
| gochiusa | 1186 | 853 | 4246 | 4159 | 11.97 | 12.04 | 50844 | 27741 | 103554 | 99978 | 1.73 | 74.60 | 4894.10 | 6026.91 | 9014.77 |
| hyouka | 1419 | 1167 | 13490 | 13460 | 8.85 | 8.60 | 119434 | 65924 | 251850 | 236702 | 4.06 | 79.33 | 5121.25 | 7605.94 | 10470.62 |
| inu x boku | 1119 | 815 | 5691 | 5721 | 8.75 | 8.19 | 49772 | 26497 | 104383 | 96575 | 1.59 | 82.99 | 5550.25 | 8128.12 | 13213.75 |
| jinrui | 1483 | 1090 | 5062 | 4713 | 11.92 | 11.98 | 60336 | 33090 | 129157 | 118913 | 2.06 | 76.14 | 6237.70 | 8171.39 | 10783.85 |
| jojo | 1393 | 1142 | 14145 | 14769 | 8.28 | 7.66 | 117159 | 61076 | 251507 | 235637 | 3.71 | 79.54 | 6628.25 | 8652.69 | 12442.08 |
| joukamachi no dandelion | 1109 | 834 | 7446 | 7293 | 9.35 | 8.67 | 69609 | 34207 | 142146 | 126896 | 2.09 | 86.91 | 3620.60 | 5161.35 | 7159.80 |
| kono bijutsubu | 969 | 681 | 6042 | 6327 | 8.61 | 7.91 | 52008 | 25949 | 108025 | 100336 | 1.59 | 85.86 | 3246.28 | 4335.04 | 6378.02 |
| lucky star | 1608 | 1276 | 11664 | 11555 | 13.38 | 12.43 | 156029 | 79273 | 339741 | 311283 | 4.83 | 83.25 | 5128.70 | 7112.67 | 10483.63 |
| mahoutsukai no yome | 1222 | 923 | 9314 | 9617 | 8.24 | 7.60 | 76756 | 41461 | 163354 | 150031 | 2.46 | 87.73 | 3919.15 | 5679.91 | 9724.15 |
| mawaru penguindrum | 1332 | 1009 | 10363 | 10877 | 10.20 | 8.56 | 105669 | 52161 | 220701 | 196808 | 3.15 | 82.64 | 4413.99 | 6546.18 | 11071.54 |
| mikakunin | 942 | 681 | 6840 | 7289 | 8.67 | 7.76 | 59294 | 31370 | 125675 | 118803 | 1.90 | 85.11 | 3698.50 | 5659.50 | 9497.35 |
| mob psycho 100 | 1240 | 944 | 6886 | 7473 | 9.04 | 7.92 | 62249 | 33280 | 131039 | 125729 | 2.02 | 81.91 | 5289.20 | 6395.90 | 9045.60 |
| nagi no asukara | 1297 | 960 | 9225 | 8316 | 10.42 | 10.04 | 96147 | 50251 | 211835 | 174613 | 3.06 | 88.20 | 3187.85 | 5299.08 | 8652.05 |
| ngsrt airantou | 1326 | 1021 | 14463 | 16011 | 9.89 | 7.81 | 143071 | 65751 | 296831 | 264531 | 4.03 | 89.39 | 5211.30 | 6905.48 | 10198.32 |
| nichijou | 1212 | 933 | 13498 | 13719 | 7.89 | 7.22 | 106440 | 49991 | 222473 | 198180 | 3.10 | 85.32 | 4766.70 | 6580.79 | 9171.67 |
| no game no life | 1237 | 910 | 7015 | 7192 | 8.95 | 8.44 | 62792 | 32958 | 130894 | 124833 | 2.03 | 77.91 | 6547.00 | 8396.25 | 10731.50 |
| non non biyori | 940 | 706 | 6149 | 6435 | 8.46 | 7.65 | 52003 | 25933 | 108793 | 101528 | 1.59 | 86.99 | 5067.70 | 5952.78 | 8822.42 |
| noragami | 1179 | 769 | 4095 | 3951 | 9.49 | 9.41 | 38869 | 22243 | 83735 | 77204 | 1.36 | 89.62 | 4727.70 | 6876.53 | 10335.70 |
| owari no seraph 1~2 | 1125 | 890 | 10223 | 10301 | 8.34 | 7.79 | 85309 | 44984 | 180143 | 162224 | 2.76 | 82.66 | 3031.40 | 4059.30 | 6357.20 |
| panty and stocking | 1236 | 920 | 7790 | 8175 | 8.71 | 7.77 | 67862 | 31267 | 141473 | 130912 | 1.93 | 80.34 | 8380.80 | 10887.67 | 13316.81 |
| ping pong | 1023 | 710 | 4986 | 4956 | 8.31 | 7.86 | 41437 | 20829 | 86286 | 79877 | 1.29 | 81.17 | 7365.60 | 9221.06 | 12225.88 |
| psycho pass | 1511 | 1240 | 10428 | 10644 | 9.40 | 9.14 | 98018 | 51624 | 208251 | 199667 | 3.27 | 69.10 | 7431.31 | 8925.73 | 11731.61 |
| railgun 1~2 | 1555 | 1311 | 27097 | 27921 | 8.21 | 7.51 | 222468 | 114741 | 475308 | 428180 | 6.99 | 79.15 | 5353.60 | 7696.23 | 11423.80 |
| saki | 1264 | 953 | 12185 | 12709 | 8.48 | 7.72 | 103296 | 54765 | 220784 | 197145 | 3.32 | 79.70 | 8363.92 | 12227.23 | 17974.68 |
| samflam | 1337 | 1085 | 11742 | 12740 | 8.59 | 7.37 | 100867 | 52146 | 240007 | 225486 | 3.21 | 75.63 | 4993.40 | 6987.65 | 9964.52 |
| samurai champloo | 1322 | 1012 | 10894 | 11001 | 7.87 | 7.38 | 85687 | 44945 | 180967 | 164467 | 2.74 | 87.90 | 4910.30 | 6885.58 | 10353.15 |
| sayonara zetsubou sensei 1~2 | 1678 | 1317 | 13135 | 13455 | 10.26 | 9.13 | 134776 | 67299 | 281372 | 257118 | 4.19 | 80.64 | 6517.40 | 8918.29 | 12496.70 |
| scryed | 1416 | 1108 | 10629 | 10525 | 10.82 | 10.65 | 115015 | 61643 | 238704 | 225606 | 3.80 | 78.09 | 6066.43 | 8285.79 | 11795.25 |
| shiki | 1315 | 1003 | 6165 | 8866 | 13.65 | 8.51 | 84126 | 44718 | 194106 | 179941 | 2.76 | 87.84 | 4230.00 | 6232.50 | 9576.00 |
| shinsekai yori | 1478 | 1195 | 11269 | 11641 | 9.26 | 8.61 | 104381 | 57416 | 219180 | 206791 | 3.51 | 80.90 | 7282.30 | 11578.16 | 14825.16 |
| sora no woto | 1154 | 831 | 3267 | 3193 | 11.51 | 11.17 | 37596 | 20443 | 78016 | 75306 | 1.27 | 65.26 | 4879.00 | 6664.25 | 9202.25 |
| sword art online | 1352 | 1036 | 10303 | 10234 | 8.27 | 8.26 | 85248 | 47928 | 177161 | 159707 | 2.84 | 76.63 | 5145.17 | 6829.05 | 10157.85 |
| tamako market | 1182 | 835 | 5187 | 4789 | 10.66 | 10.58 | 55273 | 27676 | 112174 | 99236 | 1.70 | 85.56 | 5262.90 | 6254.85 | 10013.04 |
| toradora | 1326 | 1028 | 15392 | 16280 | 8.41 | 7.48 | 129374 | 65273 | 278068 | 258639 | 4.02 | 85.23 | 4636.14 | 6177.10 | 9959.02 |
| trigun | 1298 | 1019 | 12722 | 12879 | 8.39 | 7.65 | 106713 | 53235 | 223933 | 201750 | 3.21 | 85.57 | 4820.50 | 6760.75 | 9795.50 |
| twintails | 1103 | 794 | 5754 | 5910 | 9.72 | 9.16 | 55914 | 27898 | 112637 | 108944 | 1.69 | 77.96 | 4793.77 | 6077.15 | 8647.10 |
| uchouten kazoku | 1152 | 871 | 6248 | 6305 | 9.07 | 8.56 | 56662 | 31314 | 118300 | 110614 | 1.91 | 86.05 | 6183.60 | 8596.19 | 12169.80 |
| violet evergarden | 1078 | 810 | 5926 | 5758 | 8.38 | 8.04 | 49632 | 25780 | 105498 | 95119 | 1.58 | 78.45 | 4716.05 | 6113.58 | 9889.52 |
| youjo senki | 1330 | 1023 | 5352 | 5625 | 9.25 | 8.40 | 49531 | 26785 | 103556 | 98936 | 1.76 | 65.42 | 10532.60 | 13179.90 | 16504.56 |
| zankyou no terror | 1095 | 782 | 2979 | 2636 | 12.42 | 11.96 | 37006 | 18453 | 80014 | 68142 | 1.16 | 72.58 | 6658.17 | 8697.15 | 12689.10 |
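
A few of the simpler columns can be approximated from a dumped script with a short function. This is only a sketch under assumptions, not the tool that produced the table: `basic_stats` and `is_kanji` are hypothetical names, "kanji" is taken to mean the basic CJK Unified Ideographs block, "kanji (2+)" is read as kanji occurring at least twice, and the byte counts include one newline per line break, which may not match how the table's numbers were counted. The chars/line column does match characters divided by lines in the table; the sentence, lexeme, hours and frequency-list columns are not attempted here.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    return "\u4e00" <= ch <= "\u9fff"


def basic_stats(text):
    # Blank lines are ignored; every remaining line counts once.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    characters = sum(len(ln) for ln in lines)
    kanji = Counter(ch for ln in lines for ch in ln if is_kanji(ch))
    joined = "\n".join(lines)
    return {
        "kanji_unique": len(kanji),
        "kanji_2plus": sum(1 for n in kanji.values() if n >= 2),
        "lines": len(lines),
        "characters": characters,
        "chars_per_line": round(characters / len(lines), 2) if lines else 0.0,
        "utf8_bytes": len(joined.encode("utf-8")),
        # Characters that Shift JIS cannot represent are replaced with "?".
        "sjis_bytes": len(joined.encode("shift_jis", errors="replace")),
    }


if __name__ == "__main__":
    for key, value in basic_stats("日本語\n日本\nねこ").items():
        print(key, value)
```

Feeding it the output of one of the dumpers below gives numbers in the same spirit as the table, but they should not be expected to agree exactly.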

Dumper used for .srt files:

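    #!/usr/bin/env python3

    import re
    import sys


    def print_safe(string, end="\n"):
        # Write UTF-8 bytes directly so output does not depend on the console encoding.
        sys.stdout.buffer.write((str(string) + end).encode("utf-8"))


    # Substrings deleted from every line: sound-source tags and line-break escapes.
    nullify = [
        "[テレビ]",
        "[スピーカ]",
        r"\n",
        r"\N",
        "\r",
    ]

    # (source, replacement) pairs for quote marks. The source characters did not
    # survive on this page, so they are left empty and skipped below:
    # str.replace("", x) would insert x between every character.
    quote_map = [
        ("", "«"),
        ("", "»"),
    ]

    for arg in sys.argv[1:]:
        with open(arg, "r", encoding="utf-8-sig") as f:
            groups = f.read().split("\n\n")

        last_group = ""

        for group in groups:
            # Drop the cue index and the timestamp line; the rest is text.
            text_lines = group.split("\n")[2:]

            # Skip a cue that repeats the previous one verbatim.
            joined = "\n".join(text_lines)
            if joined == last_group:
                continue
            last_group = joined

            did_print = False
            for line in text_lines:
                # Remove parenthesized asides: full-width parentheses first
                # (assumed from context), then ASCII ones.
                line = re.sub(r"（[^）]*）", "", line)
                line = re.sub(r"\([^)]*\)", "", line)
                for src, dst in quote_map:
                    if src:
                        line = line.replace(src, dst)
                for null in nullify:
                    line = line.replace(null, "")
                line = line.strip()
                if line != "":
                    # print_safe(line)  # text output is switched off here
                    did_print = True
            if did_print:
                # print_safe("")  # blank line between cues, also switched off
                pass
        # print_safe("")
        print_safe(arg)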
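
The cue-splitting step of the .srt dumper is easier to follow on a small in-memory sample. The sketch below is illustrative only: `srt_groups` is a hypothetical helper and the sample cues are made up. It assumes cues are separated by one blank line and that the first two lines of each cue are the index and the timestamp.

```python
def srt_groups(text):
    """Yield the text lines of each cue, skipping consecutive duplicates."""
    last = None
    for block in text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = block.split("\n")[2:]  # drop the index and timestamp lines
        if not lines or lines == last:
            continue
        last = lines
        yield lines


sample = """1
00:00:01,000 --> 00:00:02,000
おはよう

2
00:00:02,000 --> 00:00:03,000
おはよう

3
00:00:03,000 --> 00:00:04,000
今日は
いい天気
"""

if __name__ == "__main__":
    for cue in srt_groups(sample):
        print(cue)
```

The second cue repeats the first and is dropped, which mirrors the `last_group` check in the dumper.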

Dumper used for .ass files:

    #!/usr/bin/env python3

    import re
    import sys


    def print_safe(string, end="\n"):
        # Write UTF-8 bytes directly so output does not depend on the console encoding.
        sys.stdout.buffer.write((str(string) + end).encode("utf-8"))


    # Substrings deleted from every line: sound-source tags and line-break escapes.
    nullify = [
        "[テレビ]",
        "[スピーカ]",
        r"\n",
        r"\N",
    ]

    # (source, replacement) pairs for quote marks. The source characters did not
    # survive on this page, so they are left empty and skipped below:
    # str.replace("", x) would insert x between every character.
    quote_map = [
        ("", "«"),
        ("", "»"),
    ]

    # Lines whose metadata fields contain one of these values are skipped
    # (disclaimers, titles, staff credits, opening and ending songs).
    skip_fields = ["人类_声明", "标题", "staff", "Opening", "Ending"]

    for arg in sys.argv[1:]:
        with open(arg, "r", encoding="utf-8") as f:
            events = False
            last_group = ""
            for line in f:
                line = line.strip("\n")
                if line == "[Events]":
                    events = True
                    continue
                if not events or not line.startswith("Dialogue:"):
                    continue
                line = line.replace("Dialogue:", "", 1)

                # Split on the first 9 commas only: the final Text field may
                # contain commas itself, so a CSV parser is the wrong tool here.
                fields = line.split(",", 9)
                if any(name in fields[:-1] for name in skip_fields):
                    continue

                line = fields[-1]
                basic_line = line

                # Drawing instructions would need a real parser to isolate and
                # remove; such a line is probably pure drawing, so drop it.
                if r"\p" in line:
                    continue

                # Remove override blocks, then parenthesized asides: full-width
                # parentheses (assumed from context) and ASCII ones.
                line = re.sub(r"\{[^}]*\}", "", line)
                line = re.sub(r"（[^）]*）", "", line)
                line = re.sub(r"\([^)]*\)", "", line)
                for src, dst in quote_map:
                    if src:
                        line = line.replace(src, dst)
                for null in nullify:
                    line = line.replace(null, "")
                # Strip last so whitespace exposed by the removals goes too.
                line = line.strip()

                # Probably per-character karaoke or something similar.
                if len(line) <= 1 and "pos" in basic_line:
                    continue
                # Skip empty lines and verbatim repeats of the previous line.
                if line == "" or line == last_group:
                    continue
                last_group = line
                print_safe(line)
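
The reason the .ass dumper splits on a fixed number of commas can be shown on a single line. This is a sketch, assuming the common `Dialogue:` layout of nine metadata fields followed by the text; `dialogue_text` is a hypothetical helper and the sample line is invented.

```python
import re


def dialogue_text(line):
    """Return the text of an ASS Dialogue line with {...} override blocks removed."""
    if not line.startswith("Dialogue:"):
        return None
    # Nine splits leave the tenth field (the text) intact, commas included.
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    return re.sub(r"\{[^}]*\}", "", fields[9])


sample = r"Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,{\i1}ねえ、待って, お願い"

if __name__ == "__main__":
    print(dialogue_text(sample))
```

A plain `split(",")` would cut the text at its own comma, which is why the dumper limits the split to the first nine separators.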

Anime (last edited 2019-07-13 00:12:22 by weh)