LAMBADA dataset

This webpage contains the dataset described in:

D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of ACL 2016 (54th Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL, pages 1525-1534.

The LAMBADA dataset (LAnguage Modeling Broadened to Account for Discourse Aspects) is a collection of narrative passages (extracted from the BookCorpus described in Zhu, Kiros et al. 2015) sharing the characteristic that human subjects are able to guess their last word if they are exposed to a long passage, but not if they only see the last sentence preceding the target word.

For example, this is a sample data point in the dataset:

  1. Context: "Yes, I thought I was going to lose the baby." "I was scared too," he stated, sincerity flooding his eyes. "You were?" "Yes, of course. Why do you even ask?" "This baby wasn't exactly planned for."
  2. Target sentence: "Do you honestly think that I would want you to have a ________?"
  3. Target word: miscarriage

The LAMBADA task consists in predicting the target word given the whole passage (i.e., the context plus the target sentence). The official figure of merit for LAMBADA is accuracy on the test set, both for LAMBADA passages and for the control set. Please report both measures, as models that are specifically tuned to the LAMBADA passages but are not good at predicting words in other contexts are not of general interest.
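As a rough illustration of the figure of merit described above, the following sketch computes accuracy as the fraction of passages whose predicted word matches the target word. It is a sketch under stated assumptions, not the official scoring script. It assumes that each passage is a single whitespace-tokenized string whose final token is the target word, and that a prediction counts as correct only on an exact string match. The function names `split_passage` and `accuracy` are hypothetical.

```python
from typing import Sequence, Tuple


def split_passage(passage: str) -> Tuple[str, str]:
    """Split a passage into (text before the last word, last word).

    Assumption: the passage is whitespace-tokenized and its final
    token is the target word.
    """
    prefix, _, target = passage.rstrip().rpartition(" ")
    return prefix, target


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of data points whose prediction equals the target word.

    Assumption: exact string match, one prediction per data point.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

The same `accuracy` function would be applied twice, once to the LAMBADA test passages and once to the control set, so that both measures can be reported side by side.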

Besides the information in the paper cited above, further technical details on the data collection can be found in the Supplementary Material. In addition to the original LAMBADA dataset, we are making available data that was rejected during our data collection process. In particular, we release cases that participants could guess by considering the last sentence only.

To download the dataset (rejected data included), please register via the form below and get ready to dance!*

Credits: If you use the LAMBADA dataset or the rejected data in published work, please cite the article above by Paperno et al. (2016) as well as the BookCorpus article by Zhu, Kiros et al. (2015).

Sharing your results: If you use the dataset to tackle the LAMBADA task, we would be grateful if you could share the results of your model with us (although you don't have to). We plan to conduct a qualitative analysis to better understand the strengths and weaknesses of current computational models on the task. For this reason, we ask you to send us the word(s) that your model predicted for each data point, rather than quantitative results only. However, we are also happy if you share only your quantitative (accuracy) scores with us. If you agree, we will post them on this website to keep track of the LAMBADA state of the art. By sending us your results, you agree to let us discuss them in published work, with proper credit given to you.

If you want to share the results with us, just send them to the LAMBADA team. Please send us a single text file containing your word prediction(s), with each data point on a separate line (one data point per line). Please include with your results a brief description of your model, or a link to a paper or report that helps us better understand your approach to the task.
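The submission format described above (one text file, one data point per line) could be produced along these lines. This is only a minimal sketch: the function name `write_predictions` is hypothetical, and separating multiple predictions for the same data point with a single space is an assumption, since no separator is specified here.

```python
from pathlib import Path
from typing import Iterable, Sequence


def write_predictions(path: str, predictions: Iterable[Sequence[str]]) -> int:
    """Write one line per data point and return the number of lines written.

    Assumption: when a model proposes several words for a data point,
    they are written on the same line separated by a single space.
    """
    lines = [" ".join(words) for words in predictions]
    # Each data point ends with a newline, so line i is data point i.
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Keeping the line order identical to the order of the data points in the test file lets line numbers serve as the only index.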

Thank you in advance!

*NOTE: the vocabulary file included in the current version of the tarball is slightly different from the one that was used in the experiments presented in the paper. The original vocabulary had a minor technical issue, which was fixed in the current version. If you downloaded the data before January 8, 2018, you might want to get just the current (fixed) version of the vocabulary instead of downloading the whole tarball again. If you downloaded the tarball after January 8, 2018, you can still find the original version of the vocabulary here.


This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); NWO VIDI grant n. 276-89-008 (Asymmetry in Conversation). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used in our research.