EEG Dataset for Workshop on Computational Neurolinguistics

NAACL HLT 2010 Workshops

To encourage submissions from researchers who do not have access to brain imaging equipment, the organisers of the Workshop on Computational Neurolinguistics are distributing neural recordings from semantic processing experiments, together with corresponding corpus models. This page describes the Trento EEG dataset, recorded from speakers of Italian during an image naming task. It was originally reported in the following paper, presented at EMNLP 2009:
  • Brian Murphy, Marco Baroni, Massimo Poesio, 2009: EEG Responds to Conceptual Stimuli and Corpus Semantics. Proceedings of ACL/EMNLP 2009, Singapore. [Paper]
The fMRI dataset from the Carnegie Mellon group can be accessed via the workshop main page.

Data Collection

EEG data was gathered from native speakers of Italian during a simple behavioural experiment at
the CIMeC/DiSCoF laboratories at Trento University. Seven participants (five male and two female;
age range 25-33; all with college education) performed a silent naming task. Each was presented on screen with a series of photographs of tools and land mammals, and had to think of the most appropriate name for each. They were not explicitly asked to group the entities into superordinate categories, or to concentrate on their semantic properties, but completing the task involved resolving each picture to its corresponding concept. Images remained on screen until the participant pressed a key to indicate that a suitable label had been found, and presentations were interleaved with three-second rest periods.

Thirty stimuli in each of the two classes were each presented six times, in random order, to give a total of 360 image presentations in the session. Response rates were over 95%, and a post-session questionnaire determined that participants agreed on image labels in approximately 90% of cases. English terms for the concepts used are listed below (the Italian terms are supplied with the corpus models - see the concept list file).

Mammals: anteater, armadillo, badger, beaver, bison, boar, camel, chamois, chimpanzee, deer, elephant, fox, giraffe, gorilla, hare, hedgehog, hippopotamus, ibex, kangaroo, koala, llama, mole, monkey, mouse, otter, panda, rhinoceros, skunk, squirrel, zebra

Tools: Allen key, axe, chainsaw, craft knife, crowbar, file, garden fork, garden trowel, hacksaw,
hammer, mallet, nail, paint brush, paint roller, pen knife, pick axe, plaster trowel, pliers, plunger, pneumatic drill, power drill, rake, saw, scissors, scraper, screw, screwdriver, sickle, spanner, tape measure

Corpus Models

We supply the three highest performing corpus models described in the EMNLP paper:
  • yahoo-mitchell: an Italian-language replication of the model used in the Mitchell 2008 paper. 25 verbs that describe embodied interaction were manually selected, and their cooccurrence with our target concepts within 5 words left or right was estimated with web document counts. The Yahoo API was used to restrict the search to Italian-language pages.
  • repubblica-window-svd: an Italian-language model based on a large tagged and lemmatized corpus of 400 million tokens of newspaper text: the la Repubblica/SSLMIT corpus. Cooccurrence statistics were collected between the most frequent 20,000 nouns and 5,000 verbs in the corpus, where the noun and verb appeared in the same sentence, without an intervening nominal. The resulting matrix was reduced with Singular Value Decomposition, and the top 25 dimensions selected. Three multi-word concepts are not included, as reasonable single-word lexicalisations could not be found: piedi-di-porco, coltellino svizzero, martello pneumatico.
  • repubblica-window-topfeat: this uses the same noun/verb matrix as the repubblica-window-svd model, but selects 25 verb-features that balance high association with individual noun concepts, and generalization across the full set of concepts.
All the models are stored as tab-delimited text files, with target concepts in rows and corpus cooccurrence features in columns. The first column gives the concept name. Due to differences in pre-processing and issues of data-sparseness, the concept name lists differ slightly between the Yahoo- and Repubblica-derived models. The correspondence between the English terms for the concepts and the two Italian variants is given in the concept list file.
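Reading these files is straightforward in most languages; here is a minimal Python sketch for parsing one of the tab-delimited model files into a dictionary of feature vectors (the function name is ours, not part of the distribution):

```python
def read_model(path):
    """Parse a tab-delimited corpus model file: one concept per row,
    the first column holding the concept name and the remaining
    columns the cooccurrence feature values."""
    model = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            concept = fields[0]
            model[concept] = [float(v) for v in fields[1:]]
    return model
```

For example, `read_model("repubblica-window-svd.txt")` (hypothetical filename) would then map each Italian concept name to its 25-dimensional feature vector.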

EEG Data

We are supplying seven data sets of pre-extracted signal power measures, one for each of the seven experimental participants. The terms of our ethical approval prevent us from allowing uncontrolled distribution of the datasets, so they can be obtained by emailing Brian Murphy to arrange a direct download (after undertaking not to use the data in a manner that is not consistent with the informed consent that participants gave). The datasets are supplied as Matlab .mat data files, which can be opened directly in Matlab and Octave, or imported into a range of programming platforms.

The EEG signals were recorded from 64 scalp locations based on the 10-20 standard montage. Preprocessing involved applying a band-pass filter at 1-50Hz and down-sampling to 120Hz sampling rate. Independent components related to eye-movement artefacts were manually identified and removed. All signal channels were z-score normalised, and a Laplacian sharpening was applied.

The features, extracted separately for each participant session, are measures of signal power at a particular scalp location, in a particular frequency band, and at a particular time latency relative to the presentation of each image stimulus. For each stimulus presentation, 14,400 signal power features are extracted: 64 electrode channels by 15 frequency bands (of width 3.3Hz, between 1 and 50Hz) by 15 time intervals (of length 67ms, in the first second after image presentation).
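The index arithmetic behind these dimensions can be sketched as follows - a hypothetical helper (not part of the distribution), assuming 15 equal intervals over the first 1000ms after stimulus onset, as described above:

```python
# Feature-grid dimensions as described in the text: 64 channels,
# 15 frequency bands, 15 time intervals = 14,400 features per trial.
N_CHANNELS, N_BANDS, N_INTERVALS = 64, 15, 15

def interval_window_ms(i):
    """Map a 1-based time-interval index to its (start, end) latency
    window in whole milliseconds, assuming 15 equal intervals of
    1000/15 ~ 66.7ms across the first second after image onset."""
    width = 1000 / N_INTERVALS
    return round((i - 1) * width), round(i * width)

# Sanity check: the grid yields the stated feature count per trial.
assert N_CHANNELS * N_BANDS * N_INTERVALS == 14400
```

For example, `interval_window_ms(4)` gives `(200, 267)`, i.e. the fourth interval covers latencies of roughly 200-267ms.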

The data from each participant is stored in a Matlab file of size ~23 MB. Each dataset has the following structure:

>> data = load(filename)

data =

                  freqResolution: 15
                  timeResolution: 15
                      freqLimits: [0 50]
                      timeLimits: [0 120]
                     numChannels: 64
                channelLocations: [1x64 struct]
            epochTriggersSamples: [360x2 double]
    selectedEpochOrderToTriggers: {1x360 cell}
                      epochPower: [4-D double]

>> size(data.epochPower)

ans =

   360    64    15    15

>> data.epochPower(1,2,3,4)

ans =


The EEG power features are stored in the epochPower field, as a trial x channel x frequency x time matrix (terminological note: a single experimental presentation of a stimulus is termed a trial, and the corresponding signal an epoch). For example, the signal power estimate stored at data.epochPower(1,2,3,4) is that for the first trial epoch of the experiment, recorded on the second channel, in the third frequency band (6.7 to 10Hz) and the fourth time interval (200 to 267ms). Since the signals have been z-score normalised, the power estimates do not share a unit scale. [Note that the first three participant sessions used an expanded set of 87 stimuli (522 trials in total), which was later whittled down to the 60 that were most reliable in behavioural terms (ease and uniformity of naming). The extra trials can be ignored.]
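For readers working outside Matlab or Octave, a sketch of loading the same matrix in Python with SciPy (the filename below is hypothetical; the field name follows the structure shown above):

```python
import scipy.io

def load_epoch_power(path):
    """Load one participant's .mat file and return the
    trial x channel x frequency x time power matrix as a NumPy array."""
    data = scipy.io.loadmat(path, squeeze_me=True)
    return data["epochPower"]
```

For example, `power = load_epoch_power("participant1.mat")` would give an array of shape (360, 64, 15, 15); note that Python indexing is 0-based, so `power[0, 1, 2, 3]` corresponds to Matlab's data.epochPower(1,2,3,4).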

The presentation order of image stimuli was randomised for each experimental session, and is stored in code form in the selectedEpochOrderToTriggers string cell array. The meaning of these 'trigger' codes is given in the concept list file (updated - an earlier version of this file was incomplete). So the codes shown below indicate that in the session in question, the first three image stimuli presented were screw, axe, and monkey. Trigger codes in the 1-50 range correspond to animals, and those in the 65-115 range to tools.

>> data.selectedEpochOrderToTriggers(1:3)

ans =

    'S 98'    'S 67'    'S 29'
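Deriving class labels from these codes is a one-line lookup; a hypothetical Python helper, using the trigger ranges given above (1-50 mammals, 65-115 tools):

```python
def trigger_class(code):
    """Map a trigger code string such as 'S 98' to its stimulus class,
    using the ranges stated above: 1-50 mammals, 65-115 tools."""
    n = int(code.split()[1])      # 'S 98' -> 98
    if 1 <= n <= 50:
        return "mammal"
    if 65 <= n <= 115:
        return "tool"
    raise ValueError("unknown trigger code: %r" % code)
```

Applied to the three codes above, `trigger_class` returns tool, tool, mammal - consistent with screw, axe, and monkey.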

Though the identities and scalp locations of the channels are usually not needed for machine learning purposes, they are supplied here in data.channelLocations using data structures adopted from the EEGLAB package. For example, the first channel recorded activity at scalp location 'Fpz' (frontal pole, midline), located at the Euclidean coordinates (85,0,-2)mm relative to the centre of the head:

>> data.channelLocations(1)

ans =

        labels: 'Fpz'
             X: 84.9814
             Y: 0
             Z: -1.7801


Brian Murphy, Language, Interaction and Computation Lab, Centre for Mind/Brain Studies, University of Trento