Index
Computational Linguistics, Information Retrieval and Such
- The WaCky Corpora: very large corpora from the Web
- The Wikipedia dumps
- Wikicorpus project: linguistically processed English, Spanish, Catalan Wikipedia dumps
- The ClueWeb09 Dataset: 1 billion web pages in ten languages
- Google Books Ngram Dataset
- Google Books Syntactic Ngram Dataset
- The UMBC webBase corpus: over 3 billion English words from a 2007 Web crawl
- Project Gutenberg
- Serge Sharoff's Internet Corpora
- WEBBIT, Italian Web Corpus
- Osnabrück's Interface to the Google Web 1T 5-Gram Database
- Microsoft Bing Search API
- WordNet
- ConceptNet
- ImageNet
- The IMS Open Corpus Workbench
- The TreeTagger
- C&C Tools for tagging, parsing, etc.
- MaltParser
- DM data and tools (measuring semantic similarity)
- WikiNet, a multi-language ontology from the Wikipedia
- Banca dati dell'italiano parlato
- La Repubblica Corpus
- Corpus del Español
- VIEW: Variation In English Words and Phrases
- PIE: Phrases In English
- Kevin's Word List Page
- Morph-it! Italian morphological lexicon
- CoLFIS: Frequency Lexicon of Written Italian
- MINIPAR
- The Tanl suite
- UIMA: The Apache Unstructured Information Management Architecture, including various NLP tools
- Natural Language Toolkit
- TextSTAT: Free Multi-Platform Concordancer
- FreeLing
- SenseClusters
- The UCS toolkit (for cooccurrence data analysis)
- HiDEX, the High Dimensional Explorer, to build cooccurrence vectors from very large corpora (with precompiled vectors and links to freely available corpora)
- S-Space package to build semantic spaces
- Infomap NLP Software
- Online interface to query English and Italian semantic models built with Infomap
- Semantic Vectors Package
- DISCO, to compute semantic similarity between words
- Patrick Pantel's Lexical Semantics Demos
- Divisi, a toolkit to work with semantic networks
- Luminoso, a visualizer for semantic spaces
- ChaSen: Japanese Tokenization and Morphosyntactic Analysis
- Porter Stemming Algorithm
- TextCat Language Guesser
- SFST: Stuttgart Finite State Transducer Tools
- AnalogySpace
- FrameNet
- LSA @ CU Boulder
- LINGUA: the Language Independent Neighbourhood Generator of the University of Alberta
- Gensim, Python Framework for Vector Space Modeling
- Jonathan Chang's R LDA
- Mallet's Topic Modeling Library
- Apache's Mahout Project
- The Bow Toolkit
- David Blei's LDA Implementation
- JGibbLDA (the Java variant of GibbsLDA++)
- zipfR: a toolkit for lexical statistics in R
- Speech and Language Processing book info
- Foundations of Statistical Natural Language Processing book info
- Introduction to Information Retrieval book
- ACL Anthology
- Corpora List
- Heritrix: The Internet Archive's Web Crawler
- BootCaT Toolkit
- Wikipedia Extractor, to clean up Wikipedia dumps
- Lucene, open source search software
- OpenCV, an open Computer Vision package
Probability, Statistics, Machine Learning, Math, etc.
Computers