Tools and Resources
- Very large Web-derived, POS-tagged and lemmatized corpora of English
(also dependency-parsed), German, Italian and French, available for
download from the WaCky project;
- DISSECT, a toolkit to construct and compose
distributional semantic representations (for other tools and data sets
produced by the COMPOSES project, please visit the project
page);
- BLIND (BLind Italian Norming Data), semantic norms
collected from congenitally blind and highly comparable sighted
subjects;
- BLESS
(Baroni-Lenci Evaluation of Semantic Similarity), a large data set to
evaluate computational models of semantic similarity;
- DM (Distributional Memory): trained model (weighted word-link-word tuples) and utility scripts;
- Semantic norms for German and Italian: available in
this archive
and documented in
this article;
- zipfR: a
toolkit for lexical statistics in R;
- BootCaT: a toolkit for bootstrapping specialized
language corpora and terms from the web;
- Morph-it!, a free Italian morphological
lexicon;
- Access
to the La Repubblica corpus, a large corpus of Italian newspaper
text;
- Web as Corpus post-processing tools, for
boilerplate stripping and near-duplicate identification;
- The TreeTagger page contains a version of this popular
tagger/lemmatizer trained on our Italian resources;
- Trained Italian models for the taggers in
the ACOPOST
toolkit;
- Knorpora: the Knoppix Linux live CD remastered with
tools and resources for students of corpus-based computational
linguistics;
- English token and document frequency lists from the
LOB and Brown corpora;
- regexp_tokenizer: a simple tokenizer based on
regular expressions.