
# DISSECT basics

## What DISSECT is for

You can use DISSECT to build and explore automated models of word, phrase and sentence meaning based on the principles of distributional semantics. The toolkit focuses in particular on compositional meaning, that is, it provides functions to derive the meaning of phrases and sentences from the meanings of their parts (e.g., to derive a meaning representation for black vomit from the representations of black and vomit). However, we hope that DISSECT will also be useful to researchers and practitioners who need models of word meaning (without composition), as it supports various methods for constructing semantic spaces, assessing similarity, and even evaluating against benchmarks, all of which are independent of the composition infrastructure.

## Main features

### Embed in Python code or use command-line tools

DISSECT is written in Python. Users with basic familiarity with this language can use DISSECT directly inside their scripts. However, DISSECT also provides many standard functionalities through a set of powerful command-line tools. For example, building several semantic spaces at once, with positive PMI or log weighting and with dimensionality reduction by SVD or NMF, is as simple as:

```shell
python2.7 build_core_space.py -i input-matrix --input_format sm \
    --output_format dm -w ppmi,plog -r svd_200,nmf_200 -o outdir
```

### Semantic space creation from co-occurrence matrices

The vectors representing words or other linguistic units in a semantic space ultimately encode values derived from co-occurrence counts extracted from corpora (or other sources). The pipeline from corpora to semantic spaces can be roughly split into two major steps: pre-processing the corpus to collect the relevant counts, and processing the extracted counts mathematically. The first step is highly language- and project-dependent: How do you tokenize the corpus? Which elements and linguistic contexts do you count? DISSECT does not handle pre-processing or counting; instead, it takes a (dense or sparse) matrix of co-occurrence counts directly as input. This lets us focus on the more general mathematical side, where we provide various association measures for reweighting the counts, dimensionality reduction (not only singular value decomposition but also the less commonly implemented non-negative matrix factorization), and the ability to build multiple semantic spaces with a single command-line call.
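As a concrete illustration of the reweighting step, here is a minimal sketch of positive PMI weighting in plain Python. It follows the standard definition of the measure, not DISSECT's internal implementation; the function name and plain-list matrix representation are ours:

```python
from math import log

def ppmi(counts):
    """Reweight a dense co-occurrence matrix with positive PMI.

    counts: list of rows, each a list of non-negative co-occurrence counts.
    Each cell (i, j) becomes max(0, log(P(i, j) / (P(i) * P(j)))),
    with probabilities estimated from the matrix's own marginals.
    """
    total = float(sum(sum(row) for row in counts))
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        new_row = []
        for j, c in enumerate(row):
            if c == 0 or row_sums[i] == 0 or col_sums[j] == 0:
                new_row.append(0.0)  # undefined or zero counts map to 0
            else:
                pmi = log((c * total) / (row_sums[i] * col_sums[j]))
                new_row.append(max(0.0, pmi))  # clip negative PMI to 0
        weighted.append(new_row)
    return weighted
```

Note how cells whose observed count falls below the chance expectation (negative PMI) are clipped to zero, which is what makes the measure "positive" PMI.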

### Composition functions

The core purpose of DISSECT is to make it easy to use a wide range of vector composition functions that have been proposed in the literature, including those of Mitchell and Lapata (2010), Baroni and Zamparelli (2010) (that we call Lexical Function) and Guevara (2010) (that we call Full Additive).
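The simplest of these models, the weighted additive and multiplicative functions of Mitchell and Lapata (2010), have closed forms that can be sketched in a few lines of plain Python (the function names below are ours for illustration, not DISSECT's API):

```python
def weighted_additive(u, v, alpha=1.0, beta=1.0):
    """Mitchell & Lapata (2010) weighted additive model: p = alpha*u + beta*v."""
    return [alpha * a + beta * b for a, b in zip(u, v)]

def multiplicative(u, v):
    """Mitchell & Lapata (2010) multiplicative model: p_i = u_i * v_i."""
    return [a * b for a, b in zip(u, v)]
```

The Lexical Function and Full Additive models replace these vector operations with learned matrices, which is where the parameter estimation described below comes in.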

#### Function estimation

Generalizing the estimation methods of Baroni and Zamparelli and Guevara, we estimate the parameters of composition functions by approximating corpus-extracted vectors that exemplify their outputs. For example, to optimize an adjective-noun composition function, we might minimize (in a least-squares sense) the distance between the vectors the function produces for, e.g., rotten meat, carnivorous zombie and toxic waste and vectors representing the same phrases that have been directly extracted from the corpus (or obtained in some other way). We have a paper, currently under review, in which we explain this approach in detail: please contact us if you want a copy.
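To make the idea concrete, here is a toy sketch (our own illustration, not DISSECT code) of least-squares estimation for the simplest case: recovering the two scalar parameters of a weighted additive model, p ≈ alpha·u + beta·v, from observed phrase vectors by solving the 2x2 normal equations in closed form. Estimating matrix-valued functions follows the same principle with larger linear systems.

```python
def estimate_weighted_additive(us, vs, ps):
    """Least-squares fit of alpha, beta in p ~ alpha*u + beta*v.

    us, vs: lists of constituent vectors (e.g. adjective and noun vectors);
    ps: the corresponding corpus-extracted phrase vectors.
    Solves the 2x2 normal equations via Cramer's rule.
    """
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    # Accumulate the sufficient statistics of the normal equations.
    suu = sum(dot(u, u) for u in us)
    svv = sum(dot(v, v) for v in vs)
    suv = sum(dot(u, v) for u, v in zip(us, vs))
    sup = sum(dot(u, p) for u, p in zip(us, ps))
    svp = sum(dot(v, p) for v, p in zip(vs, ps))
    det = suu * svv - suv * suv  # assumed nonzero for this sketch
    alpha = (sup * svv - svp * suv) / det
    beta = (suu * svp - suv * sup) / det
    return alpha, beta
```

Given training triples whose phrase vectors are exactly 2·u + 3·v, the function recovers alpha = 2 and beta = 3; with noisy, corpus-extracted phrase vectors it returns the least-squares best fit instead.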

#### Peripheral elements

Lexical semantic spaces assume a finite vocabulary of words, each represented by a vector that reflects a distinct distribution in the source corpus. However, compositional spaces contain a potentially infinite number of vectors representing phrases and sentences, and such vectors do not record independent co-occurrences: for example, the vector for slimy maggot will presumably record a subset of the contexts that are also recorded in the vector for maggot. Since both weighting schemes and dimensionality reduction techniques depend on the full set of target and context elements in a co-occurrence matrix, these redundancies risk biasing them in unwanted ways: for example, every maggot context that is also a slimy maggot context would be artificially counted twice, distorting the computation of association measures.

We solve this problem by allowing the user to specify a separate co-occurrence matrix for peripheral elements (typically, phrases or sentences). Their values are weighted and reduced using global statistics collected from the core elements only (typically, single words), and those statistics are in turn unaffected by the peripheral elements. In this way, you can keep adding new elements to the space without worrying about how they affect the overall characteristics of the space.
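One way to picture this is the following toy sketch (our own illustration, not DISSECT internals): a peripheral row is PPMI-weighted using column marginals and a grand total computed from the core matrix only, so adding peripheral rows never alters the statistics that weight them.

```python
from math import log

def ppmi_peripheral(row, core_col_sums, core_total):
    """PPMI-weight one peripheral co-occurrence row.

    core_col_sums and core_total are marginals collected from the CORE
    matrix alone; only the row sum comes from the peripheral row itself,
    so every peripheral row is weighted independently of all the others.
    """
    row_sum = float(sum(row))
    out = []
    for c, col_sum in zip(row, core_col_sums):
        if c == 0 or row_sum == 0 or col_sum == 0:
            out.append(0.0)
        else:
            pmi = log((c * core_total) / (row_sum * col_sum))
            out.append(max(0.0, pmi))  # clip negative PMI to 0
    return out
```

Because the core statistics are frozen, weighting a peripheral row gives the same result whether it is the first or the thousandth phrase added to the space.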

## How to DISSECT

After downloading and setting up the toolkit, please go through the hands-on tutorial (the required data are downloaded with the tutorial). While you can browse the technical documentation using the index and search functionalities, we hope you will be able to accomplish most of what you want to do with DISSECT by adapting examples from the tutorial.

## What’s next for DISSECT

Bug-fixing and incremental improvements are ongoing (please see below for how to contact us if you discover problems or have recommendations).

We have implemented two further composition functions, a weighted multiplicative method and a Full Lexical function inspired by Socher et al. (2012). These will soon be released in the toolkit.

In the not-so-distant future, we would like to add basic vector and matrix visualization options.

A more ambitious long-term plan is to provide an interface that reads simplified output from a syntactic parser and applies the corresponding composition functions to automatically generate a distributional representation of phrases and sentences with flexible structures.

## License, credits, authors and contact information

DISSECT is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. When publishing work based on DISSECT, please cite: G. Dinu, N. Pham and M. Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. Proceedings of the System Demonstrations of ACL 2013 (51st Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL.

DISSECT is one of the main deliverables of the COMPOSES project. It is designed, written, documented and maintained by Georgiana Dinu and The Nghia Pham, with some design and documentation help from Marco Baroni.