This page contains information and resources pertaining to the Rovereto Twitter N-Gram Corpus (RTC). RTC is an n-gram dataset based on almost 75 million short, personal social media posts in English, along with aggregated information on the gender of the authors of the posts and the time of posting. It is licensed under a Creative Commons License.
Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages every day, talking about what they think, what they find interesting, what they want to do, etc. These messages, coupled with the metadata about their authors, provide an opportunity to understand a wide variety of phenomena ranging from political polarization to geographic and demographic lexical variation. A publicly available, common dataset of tweets, enriched with the metadata, would allow researchers to tap a rich dataset and carry out replicable experiments. However, one difficulty widely experienced by researchers is that Twitter's terms of service prohibit the redistribution of raw tweets. Here, I release the English portion of the Rovereto Twitter N-Gram Corpus, an n-gram dataset based on 75 million English tweets contributed by more than 11 million users. For each n-gram, frequency statistics are broken down by the gender of the author and the posting time (hour of the day and day of the week) in the user's local time.
Here is a draft of the manuscript with the details of preparation and analysis of the corpus. Please do not cite without permission.
Rovereto Twitter N-Gram Corpus (RTC) is an n-gram dataset enriched with meta-data such as gender and time of posting. The n-gram corpus is based on 75 million English tweets extracted from a larger sample of 240 million tweets collected from the public stream of Twitter, between December 2010 and July 2011. The English tweets were identified by automated language detection techniques and the dataset was filtered by spam and bot removal heuristics.
RTC does not contain full text of tweets; therefore, it is not in conflict with the terms of service of Twitter. Each tweet in the sample is assigned to a slot which represents the gender of its author and its posting time in the user's local time. Each slot is a combination of gender, day of week, and hour of day.
The gender of a user is guessed as male or female based on the first name reported in the user profile. The hour of day (from 0 to 23) and day of week (from Monday to Sunday) labels are based on the user's local timezone and the tweet's posting time. In ambiguous or unknown cases an additional label, unknown, is used in place of each label. In total, there are 1200 slots: every possible combination of the day of week, hour of day, and gender labels (including unknown), each reported for two statistics (frequency and user count).
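The labelling described above can be sketched as follows. This is only an illustration, not the script used to build the corpus: the function names, the toy `NAME_GENDER` dictionary, and the representation of the timezone as a UTC offset in seconds are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# Hypothetical toy lookup standing in for real name-gender statistics.
NAME_GENDER = {"alice": "female", "bob": "male"}


def guess_gender(profile_name):
    """Guess gender from the first word of the profile name, else 'unknown'."""
    parts = profile_name.strip().lower().split()
    if not parts:
        return "unknown"
    return NAME_GENDER.get(parts[0], "unknown")


def assign_slot(profile_name, posted_utc, utc_offset_seconds):
    """Return a (gender, day, hour) tuple for one tweet.

    utc_offset_seconds is None when the user has no timezone setting;
    in that case both the day and the hour are labelled 'unknown'.
    """
    gender = guess_gender(profile_name)
    if utc_offset_seconds is None:
        return (gender, "unknown", "unknown")
    local = posted_utc + timedelta(seconds=utc_offset_seconds)
    return (gender, DAYS[local.weekday()], local.hour)


posted = datetime(2011, 1, 3, 2, 30, tzinfo=timezone.utc)  # a Monday in UTC
print(assign_slot("Alice Smith", posted, -5 * 3600))  # Sunday evening locally
print(assign_slot("xX_user_Xx", posted, None))
```

Note that a tweet posted early on Monday in UTC falls on Sunday for a user five hours behind UTC, which is why the labels need the user's local time.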
All tweets assigned to the same slot are aggregated together and the n-gram statistics are computed separately for each slot. For each n-gram and slot, the total number of mentions (frequencies) and the number of unique users who mention the n-gram in a tweet (user counts) are reported. For gender guessing, I used the name gender statistics released by the US Census Bureau and the US Social Security Administration. The gender guessing scripts are released to the public at github.com/amacinho/Name-Gender-Guesser.
I did not apply any deep preprocessing and used a tokenizer which keeps URLs, hashtags (#hashtag), and user mentions (@username) as single tokens. All n-grams that were observed fewer than 3 times in the entire English corpus were discarded in the final output. The tokenizer can be found at github.com/amacinho/Rovereto-Twitter-Tokenizer.
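To illustrate what keeping URLs, hashtags, and user mentions as single tokens means in practice, here is a minimal regex-based sketch. It is not the released tokenizer; the pattern and the `tokenize` and `ngrams` helper names are assumptions made for this example, and the real tool may treat punctuation and edge cases differently.

```python
import re

# Alternatives are tried left to right, so URLs, mentions, and hashtags
# are matched as whole tokens before the generic word pattern applies.
TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs
    r"|[@#]\w+"          # user mentions and hashtags
    r"|\w+(?:'\w+)?"     # words, optionally with an apostrophe
    r"|[^\w\s]"          # any other single non-space symbol
)


def tokenize(text):
    """Split a tweet into tokens (illustrative pattern only)."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("@bob check http://t.co/abc #nlp!")
print(tokens)
print(ngrams(tokens, 2))
```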
A summary of the descriptive statistics of the corpus is given below.
In addition, 78% of the English tweets were posted by users who provided a timezone setting in their profiles. This makes it possible to compute the day of the week and hour of the day labels for these tweets in the local time of the users. Below, we can see the number of tweets that were collected in each hour during a week, aggregated over all weeks between December 2010 and June 2011.
Number of hourly tweets in RTC aggregated over all weeks.
We can combine the time and gender information to see the gender differences in Twitter engagement as a function of time. Below, we see how the ratio of the number of tokens written by males and females deviates from the expected baseline.
The ratio of tokens written by females and males, deviating from the overall gender ratio. Each data point is computed by comparing the total number of tokens written by either gender during the corresponding hour and day of the week, aggregated for all tweets collected in the 7-month period.
Two kinds of statistics are reported for the n-grams: frequency and user count. Frequency is the total number of times the n-gram is mentioned in the corpus; user count is the total number of users who mentioned the n-gram at least once. As already explained, each tokenized tweet in the dataset was assigned to a slot which corresponds to a tuple consisting of gender, day of the week, and hour of the day, depending on the gender of the user and the local time of posting. Tweets in the same slot were aggregated, and for each slot the number of times each n-gram was observed was counted along with the number of unique users who mentioned the n-gram. In order to keep the corpus size at manageable levels, n-grams which were observed fewer than 3 times in the entire dataset were removed.
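The difference between the two statistics, and the way the frequency threshold applies to the whole dataset rather than to individual slots, can be sketched as below. The `aggregate` function and its input format (tuples of user id, slot, and token list) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate(tweets, n, min_total=3):
    """Compute per-slot (frequency, user count) for every n-gram.

    tweets: iterable of (user_id, slot, tokens) tuples.
    N-grams seen fewer than min_total times over all slots are dropped.
    """
    freq = defaultdict(lambda: defaultdict(int))   # ngram -> slot -> count
    users = defaultdict(lambda: defaultdict(set))  # ngram -> slot -> user ids
    for user_id, slot, tokens in tweets:
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram][slot] += 1
            users[gram][slot].add(user_id)

    result = {}
    for gram, per_slot in freq.items():
        if sum(per_slot.values()) < min_total:
            continue  # threshold is applied to the corpus-wide total
        result[gram] = {
            slot: (count, len(users[gram][slot]))
            for slot, count in per_slot.items()
        }
    return result


s1 = ("female", "monday", 9)
s2 = ("male", "sunday", 21)
tweets = [
    ("u1", s1, ["good", "morning"]),
    ("u1", s1, ["good", "night"]),
    ("u2", s2, ["good", "morning"]),
]
print(aggregate(tweets, n=1))
```

In this toy input, "good" is kept because it occurs three times overall; in slot `s1` it has frequency 2 but user count 1, since both mentions come from the same user.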
There are seven days in a week, 24 hours in a day, two genders, and two statistics (frequency and user count). Adding the unknown tag (which is used when the timezone or gender is not specified), this translates into (7 + 1) * (24 + 1) * (2 + 1) * 2 = 1200 slots. There is a small redundancy in this representation, which I decided to keep for consistency: if the day of the week is not known, because the corresponding users do not have a timezone specified in their profiles, the hour of the day is not known either.
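The slot count can be checked by enumerating the combinations directly. The label strings below (for example `user_count`) are placeholders chosen for this sketch, and the order of the tuples is not claimed to match the column order of the released files.

```python
from itertools import product

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday", "unknown"]
HOURS = list(range(24)) + ["unknown"]
GENDERS = ["male", "female", "unknown"]
STATS = ["frequency", "user_count"]

# Every (statistic, gender, hour, day) combination, including 'unknown'.
slots = list(product(STATS, GENDERS, HOURS, DAYS))
print(len(slots))  # 1200

# The redundancy mentioned above: an unknown day implies an unknown hour,
# so slots with an unknown day but a known hour can never be populated.
never_filled = [s for s in slots if s[3] == "unknown" and s[2] != "unknown"]
print(len(never_filled))  # 144
```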
For practical purposes, the corpus is released as six separate files, one for each n value of the n-grams (from 1 to 6). Each n-gram is represented on a separate line. Each line is a tab-separated list of columns where the first column is the n-gram (with space between the tokens). The remaining 1200 columns on each line are the frequency and user-count statistics of the n-gram for every combination of gender, day-of-week, and hour-of-day slots.
Note that the n-gram files do not contain a header line. If you want, you can download a single-line, tab-separated file which contains the label of the corresponding slot for each column: header line. A truncated example is given below. In this table, each column corresponds to a tab-separated field in the corpus files (with the headers coming from the header file). The slots iterate over frequency, user count, gender, hour, and day in the given order, with the "unknown" tags placed at the end of the iteration for each variable.
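A line of a corpus file can be paired with the header labels along the following lines. This is a sketch under several assumptions: the labels are treated as opaque strings taken from the header file rather than regenerated, the values are assumed to be integer counts, and because it is not stated here whether the header file also labels the first (n-gram) column, the hypothetical `parse_line` helper accepts both layouts. The three-column example uses made-up labels, not the real ones.

```python
def load_labels(header_text):
    """Split the single-line, tab-separated header into column labels."""
    return header_text.rstrip("\n").split("\t")


def parse_line(line, labels):
    """Return (ngram, {label: count}) for one tab-separated corpus line.

    Zero counts are skipped to keep the result small.
    """
    fields = line.rstrip("\n").split("\t")
    ngram, values = fields[0], fields[1:]
    # Assumption: the header may or may not include a label for the
    # n-gram column itself; drop it if it is there.
    if len(labels) == len(fields):
        labels = labels[1:]
    if len(labels) != len(values):
        raise ValueError("header and data line have different column counts")
    counts = {label: int(v) for label, v in zip(labels, values) if int(v) != 0}
    return ngram, counts


# Toy data with made-up labels; real lines carry 1200 statistic columns.
labels = load_labels("ngram\tslot_a\tslot_b\tslot_c\n")
print(parse_line("good morning\t5\t0\t2\n", labels))
```

Reading the labels from the header file, instead of rebuilding the column order in code, avoids depending on the exact nesting of the iteration.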
RTC contains statistics for 1-, 2-, 3-, 4-, 5-, and 6-grams. The statistics are broken down by slots which correspond to combinations of hour of day, day of week, and gender of the author. For each n-gram and slot, two statistics are reported: the number of unique users who used the n-gram at least once in their tweets and the total number of occurrences of the n-gram in the corresponding slot.
For an application of another Twitter-based corpus on computational linguistics and AI see: Herdağdelen, A. and Baroni, M. (2011), Stereotypical gender actions can be extracted from web text. Journal of the American Society for Information Science and Technology, 62: 1741–1749.
For details of Rovereto Twitter N-Gram Corpus, download the preprint.
For questions and suggestions about Rovereto Twitter N-Gram corpus, you can contact Amaç Herdağdelen.