Code for the bilingual lexicon induction experiments in:
Georgiana Dinu, Angeliki Lazaridou and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem.
It implements the translation matrix method of:
Tomas Mikolov, Quoc V. Le, Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation.
Translation is performed via 1) standard nearest neighbour retrieval and 2) the corrected retrieval described in Improving zero-shot learning by mitigating the hubness problem.
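The translation matrix of Mikolov et al. is a linear map learned by least squares: given matrices of source and target vectors for the training dictionary pairs, it finds W minimizing ||XW - Z||. A minimal sketch of this training step (function and variable names are illustrative, not the repo's API):

```python
import numpy as np

def train_translation_matrix(X, Z):
    """Least-squares translation matrix W such that X @ W ~ Z.

    X: (n, d1) source-language vectors for the training pairs
    Z: (n, d2) target-language vectors for the same pairs
    """
    W, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)
    return W

# Toy example: a known 2D rotation is recovered exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
R = np.array([[0.0, -1.0], [1.0, 0.0]])  # the "true" mapping
Z = X @ R
W = train_translation_matrix(X, Z)
assert np.allclose(W, R)
```

A source word is then translated by mapping its vector with W and retrieving the nearest target-space vector.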
Code + data (tested with Python 2.7, requires NumPy)
- Download the code and data
- Run bash demo.sh (standard NN retrieval), or run python train_tm.py -h and python test_tm.py -h for usage information.
For example, once a translation matrix has been learned and saved in tm.txt, running:
- python test_tm.py tm.txt data/OPUS_en_it_europarl_test.txt data/EN.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt data/IT.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt translates using NN retrieval.
- python test_tm.py -c 5000 tm.txt data/OPUS_en_it_europarl_test.txt data/EN.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt data/IT.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt translates using corrected retrieval with 5000 randomly chosen additional source elements.
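The corrected retrieval demotes "hub" targets that are near neighbours of many mapped queries: instead of taking the target closest to the query, it ranks each target candidate by where the query falls among the additional pivot queries. A simplified sketch of this globally-corrected idea (not the repo's exact implementation; names are illustrative):

```python
import numpy as np

def gc_retrieve(query, pivots, targets):
    """Globally-corrected retrieval (simplified sketch).

    query:   (d,)   mapped source vector to translate
    pivots:  (m, d) additional mapped source vectors used for correction
    targets: (k, d) target-language vocabulary vectors

    For each target, count how many pivots are closer to it than the
    query is; return the index of the target where the query ranks
    best, breaking ties by cosine similarity to the query.
    """
    def norm(M):
        return M / np.linalg.norm(M, axis=-1, keepdims=True)

    q, P, T = norm(query), norm(pivots), norm(targets)
    sim_q = T @ q                 # cosine(query, target), shape (k,)
    sim_p = T @ P.T               # cosine(pivot, target), shape (k, m)
    rank = (sim_p > sim_q[:, None]).sum(axis=1)  # query's rank per target
    order = np.lexsort((-sim_q, rank))  # lowest rank, then highest cosine
    return order[0]

# Hub demo: target 0 is a hub (all pivots point at it), so plain NN
# would pick it, but GC retrieval prefers target 1.
query = np.array([0.9, 0.1, 0.0])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pivots = np.tile(np.array([1.0, 0.05, 0.0]), (5, 1))
assert gc_retrieve(query, pivots, targets) == 1
```

The -c flag above corresponds to the number of pivot elements (5000 in the example command).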
- En -> It word pairs from the OPUS collection (the En -> It fragment of a Europarl-derived dictionary, split into train/test)
Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
- 200K English word2vec vectors learned on BNC and WackyPedia/ukWaC
- 200K Italian word2vec vectors learned on WackyPedia/itWaC
Note 1: The code is highly vectorized, which incurs significant memory overhead.
Note 2: Results depend, among other things, on the test data and on the size of the target-language vocabulary (here, 200K words and frequency-balanced test data). When using words of similar frequency, results are comparable to those of Exploiting similarities among languages for machine translation. See Improving zero-shot learning by mitigating the hubness problem for details.