Thursday 5 March 2015

Vowpal Wabbit

Intro: The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. Support is available through the mailing list.

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation.
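To make the command-line workflow concrete, here is a minimal sketch (the file names, feature names, and labels below are made up for illustration). VW streams examples from a plain-text file in its native "label | features" format, so the data set never has to fit in memory:

# train.vw, one example per line: label | feature[:value] ...
#   1 | height:0.25 width:1.3 has_stripes
#  -1 | height:0.9 width:0.1
vw train.vw --loss_function logistic -f model.vw   # train and save the model
vw -t -i model.vw test.vw -p predictions.txt       # reload it and predict on held-out data

Logistic loss expects -1/+1 labels; -f saves the trained model, and -t/-i reload it for test-time prediction.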

Sunday 1 March 2015

Training Google 1T web corpus with IRSTLM

Thanks to http://www44.atwiki.jp/keisks/pages/50.html, here is how to train an enormous LM on the Google 1T web corpus using IRSTLM:
# build a trigram sub-LM with Witten-Bell smoothing from the 3-gram count files
build-sublm.pl --size 3 --ngrams "gunzip -c 3gms/*.gz" --sublm LM.000 --witten-bell
# merge every sub-LM part matching the prefix LM (here just LM.000) into one gzipped ARPA LM
merge-sublm.pl --size 3 --sublm LM -lm g_3grams_LM.gz
# compile the ARPA LM into IRSTLM's binary format
compile-lm g_3grams_LM.gz g_3grams_LM.blm
If compile-lm fails with the error
lt-compile-lm: lmtable.h:247: virtual double lmtable::setlogOOVpenalty(int): Assertion `dub > dict->size()' failed.
the dictionary upper bound is smaller than the LM's dictionary (the Google 1T corpus has roughly 13.6 million unigram types, which exceeds the default bound), so rerun with a bigger -dub option:
compile-lm -dub=100000000 g_3grams_LM.gz g_3grams_LM.blm
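Once the binary LM is built, a quick sanity check is to score some held-out text and look at the reported perplexity. compile-lm has an --eval option for this (option spelling may differ across IRSTLM versions, and test.txt here is a placeholder file name):
compile-lm --eval test.txt g_3grams_LM.blm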

I will validate it soon.