Tools and Resources
- We constructed very large Web-derived, POS-tagged and lemmatized
corpora of English, German and Italian: to download them, visit
the WaCky project;
- Strudel: trained model (property-based concept descriptions), training scripts and resources;
- zipfR: a toolkit for lexical statistics in R;
- BootCaT: a toolkit for bootstrapping specialized
language corpora and terms from the web;
- Morph-it!: a free Italian morphological lexicon;
- Access to the La Repubblica corpus, a large corpus of Italian
newspaper text;
- Web as Corpus post-processing tools, for
boilerplate stripping and near-duplicate identification;
- The TreeTagger page contains a version of this popular
tagger/lemmatizer trained on our Italian resources;
- Trained Italian models for the taggers in the ACOPOST toolkit;
- Knorpora: the Knoppix Linux live CD remastered with
tools and resources for corpus-based computational linguistics
students;
- English token and document frequency lists from the
LOB and Brown corpora;
- regexp_tokenizer: a simple tokenizer based on
regular expressions.
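
To give a rough idea of what regular-expression tokenization and token/document frequency lists involve, here is a minimal Python sketch. It is not the code of regexp_tokenizer, nor the procedure used to build the LOB and Brown lists: the pattern and the names TOKEN_RE, tokenize and frequency_lists are assumptions made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not taken from regexp_tokenizer):
# a run of word characters, optionally joined by internal apostrophes
# or hyphens, or else a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into lowercased tokens using TOKEN_RE."""
    return TOKEN_RE.findall(text.lower())


def frequency_lists(documents):
    """Return (token_freq, doc_freq) for an iterable of document strings.

    token_freq counts every occurrence of a token;
    doc_freq counts the number of documents a token occurs in.
    """
    token_freq = Counter()
    doc_freq = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        token_freq.update(tokens)
        doc_freq.update(set(tokens))  # each document counts once per token
    return token_freq, doc_freq
```

Calling frequency_lists on a list of document strings returns two Counter objects, which can be sorted with most_common() to produce frequency lists. A real tokenizer would need language-specific rules (abbreviations, clitics, numbers with decimal points) that this sketch ignores.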