Sunday 1 March 2015

Training an LM on the Google 1T web corpus with IRSTLM

Thanks to http://www44.atwiki.jp/keisks/pages/50.html, here is how to train a very large LM on the Google 1T web corpus using IRSTLM:
build-sublm.pl --size 3 --ngrams "gunzip -c 3gms/*.gz" --sublm LM.000 --witten-bell
merge-sublm.pl --size 3 --sublm LM -lm g_3grams_LM.gz
compile-lm g_3grams_LM.gz g_3grams_LM.blm
(If you get the error "lt-compile-lm: lmtable.h:247: virtual double lmtable::setlogOOVpenalty(int): Assertion `dub > dict->size()' failed.", rerun compile-lm with a -dub value larger than the dictionary size:)
compile-lm -dub=100000000 g_3grams_LM.gz g_3grams_LM.blm
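
The steps above can be chained in a small shell function. This is only a sketch that I have not run against the real corpus: the function name train_lm, its three arguments and their defaults are made up for illustration, and it assumes the IRSTLM commands are on PATH. It retries compile-lm with -dub after any failure, not only after the assertion error, which is a simplification.

```shell
# Sketch only: train_lm and its arguments are hypothetical helpers,
# not part of IRSTLM. Assumes build-sublm.pl, merge-sublm.pl and
# compile-lm are on PATH.
train_lm() {
    lm_ngram_dir=${1:-3gms}        # directory holding the gzipped 3-gram files
    lm_out=${2:-g_3grams_LM}       # prefix for the merged and compiled LM
    lm_dub=${3:-100000000}         # dictionary upper bound used on retry

    # Step 1: build the sub-LM from the n-gram counts (Witten-Bell smoothing).
    build-sublm.pl --size 3 --ngrams "gunzip -c $lm_ngram_dir/*.gz" \
        --sublm LM.000 --witten-bell || return 1

    # Step 2: merge the sub-LM into a single gzipped LM.
    merge-sublm.pl --size 3 --sublm LM -lm "$lm_out.gz" || return 1

    # Step 3: compile to binary; if that fails, retry once with -dub.
    compile-lm "$lm_out.gz" "$lm_out.blm" ||
        compile-lm -dub="$lm_dub" "$lm_out.gz" "$lm_out.blm"
}
```

Calling train_lm 3gms g_3grams_LM 100000000 would then run the same commands as listed above; if the assertion still fails, pass a bigger third argument.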

I will validate it soon.
