Language and Vision (LaVi)
Language and Vision are two fundamental modalities through which
human beings acquire knowledge about the world. We see and speak about
things and events around us, and by doing so we learn the properties
of and the relations between objects. These two modalities are deeply
interdependent, and we constantly mix information acquired through
them. However, computational models of language and vision have
developed separately, and the two research communities have long been
unaware of each other's work. Interestingly, through these parallel
research lines, they have developed highly compatible
representations of words and images, respectively.
The importance of developing computational models of language and
vision together has been highlighted by philosophers and cognitive
scientists since the birth of the Artificial Intelligence
paradigm. Only recently, however, has the challenge been empirically
taken up by computational linguists and computer vision researchers.
In the last two decades, the availability of large amounts of text
on the web has led to tremendous improvements in NLP
research. Sophisticated textual search engines are now well
consolidated and part of everyday life. Images are the natural next
challenge for the digital society, and the combination of language and
vision is key to this new era.
UniTN researchers are at the forefront of this new
challenge. Driven by theoretical questions, we look at applications as
the test bed for our models. The focus so far has been on multimodal
models that combine linguistic and visual vector representations;
cross-modal mapping from visual to language space; and the enhancement
of visual recognizers with language models (a minimal sketch of the
first two ideas follows this paragraph). The recent LaVi results at
UniTN have profited from the close collaboration with the team of the
ERC and with the MHUG Research Group.
We are part of the COST Action "The European Network on Integrating
Vision and Language".
News: Call for one PhD position.
If you are interested in working with us, look at the CIMeC PhD call.
- Submission deadline: 16th of May 2017 at 4 pm Italian time.
- Possible topics in our research area: Interactive Visual Question
Answering; Pragmatics in Vision; Numerosity in Vision; Compositionality
in Images; Searching for Images via Natural Language; Language, Vision
and Reasoning (the topic will be defined based on the selected PhD
candidate).
- Preliminary expressions of interest should be sent to Raffaella
Bernardi, attaching a CV in PDF or TXT format or a link to an online
CV. Please specify "PhD call 2017" in the subject line.
People
- Raffaella Bernardi (language-vision area coordinator, senior researcher)
- Marco Baroni (senior researcher, on leave at Facebook)
- Aurélie Herbelot (senior researcher)
- Sandro Pezzelle (PhD student)
- Ravi Shekhar (PhD student)
- Ionut-Teodor Sorodoc (research assistant)
- Yauhen Klimovich (EM LCT student)
- Gemma Boleda (senior researcher)
- Angeliki Lazaridou (former PhD student)
- Elia Bruni (former PhD student)
- Dieu-Thu Le (former PhD student)
- Giovanni Cassani (former MSc student)
- Dat Tien Nguyen (EM LCT student)
- Laura Bostan (EM LCT student)
Publications
- Ionut-Teodor Sorodoc, Sandro Pezzelle, Mariella Dimiccoli, Aurélie
Herbelot and Raffaella Bernardi. Pay attention to those sets! Learning
quantification from images.
- Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie
Herbelot, Moin Nabi, Enver Sangineto and Raffaella Bernardi. FOIL
it! Find One mismatch between Image and Language caption. ACL 2017.
- Sandro Pezzelle, Marco Marelli and Raffaella Bernardi. Be Precise
or Fuzzy: Learning the Meaning of Cardinals and Quantifiers from
Vision. Short paper at EACL 2017.
- Angeliki Lazaridou, Marco Marelli and Marco Baroni. Multimodal
word meaning induction from minimal exposure to natural
text. Cognitive Science, to appear.
- Sorodoc, I., Lazaridou, A., Boleda, G., Herbelot, A., Pezzelle,
S. and Bernardi, R. 2016. "Look, some green circles!": Learning to
quantify from images. In Proceedings of the 5th Workshop on Vision
and Language (co-located with ACL 2016), Berlin, Germany.
- Sandro Pezzelle, Ravi Shekhar and Raffaella Bernardi. Building
a bagpipe with a bag and a pipe: Exploring Conceptual Combination in
Vision. In Proceedings of the 5th Workshop on Vision and
Language, pages 60–64, Berlin, Germany, August 12 2016. Association
for Computational Linguistics.
- Angeliki Lazaridou, Nghia The Pham and Marco Baroni. Towards
Multi-Agent Communication-Based Language Learning.
- Angeliki Lazaridou, Nghia The Pham and Marco Baroni. The red
one!: On learning to refer to things based on their discriminative
properties. ACL 2016, short paper (oral).
- Boleda, G., S. Padó and M. Baroni. 2016. "Show me the cup":
Reference with continuous representations.
arXiv e-print 1606.08777.
- Angeliki Lazaridou, Grzegorz Chrupala, Raquel Fernandez and
Marco Baroni. Multimodal semantic learning from child-directed
input. NAACL 2016.
- Marco Baroni (Submitted). Enhancing Distributional Semantics with Visual.
- Angeliki Lazaridou, Georgiana Dinu, Adam Liska and Marco
Baroni. From visual attributes to adjectives through decompositional
distributional semantics. TACL 2015.
- Angeliki Lazaridou, Nghia The Pham and Marco Baroni. Combining
Language and Vision with a Multimodal Skip-gram Model. NAACL 2015.
- Dieu-Thu Le, Jasper Uijlings and Raffaella Bernardi (Submitted). Extracting
knowledge from text corpora to aid visual recognition in images.
- E. Bruni, M. Baroni and G.B. Tran. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research, vol. 49, pp. 1-17, 2014.
- Angeliki Lazaridou, Elia Bruni and Marco Baroni. (2014) Is
this a wampimuk? Cross-modal mapping between distributional semantics
and the visual world. Proceedings of ACL 2014 (52nd Annual Meeting
of the Association for Computational Linguistics), East Stroudsburg
PA: ACL, 1403-1414.
- Dat T. Nguyen, Angeliki Lazaridou and
Raffaella Bernardi. (2014) Coloring
objects: Adjective-noun visual semantic compositionality.
Proceedings of VL (Third Workshop on Vision and Language), East
Stroudsburg PA: ACL, 112-114.
- Dieu-Thu Le, Jasper Uijlings and Raffaella Bernardi
(2014). TUHOI: Trento Universal Human Object Interaction
Dataset. Proceedings of the Third Workshop on Vision and
Language. Dublin City University and the Association for Computational
Linguistics, pp. 17-24, Dublin. [Dataset]
- A. Anderson, E. Bruni, U. Bordignon, M. Poesio and
M. Baroni. (2013) Of
words, eyes and brains: Correlating image-based distributional
semantic models with neural representations of concepts.
Proceedings of EMNLP 2013 (Conference on Empirical Methods in Natural
Language Processing), East Stroudsburg PA: ACL, 1960-1970.
- Dieu-Thu Le, Jasper Uijlings and Raffaella Bernardi
(2013). Exploiting language models for visual recognition. EMNLP 2013.
- Dieu-Thu Le, Raffaella Bernardi and Jasper Uijlings
(2013). Exploiting language models to recognize unseen actions. ACM
International Conference on Multimedia Retrieval, Dallas, Texas, USA.
- E. Bruni, J. Uijlings, M. Baroni and N. Sebe. (2012) Distributional
semantics with eyes: Using image analysis to improve computational
representations of word meaning. Brave New Idea
paper. Proceedings of MM 12 (20th ACM International Conference on
Multimedia), New York NY: ACM, 1219-1228.
- E. Bruni, G. Boleda, M. Baroni and
N. Tran (2012) Distributional
semantics in technicolor. Proceedings of ACL 2012 (50th Annual
Meeting of the Association for Computational Linguistics), East
Stroudsburg PA: ACL, 136-145.
[Data sets from this study]
- E. Bruni, G.B. Tran and
M. Baroni (2011). Distributional semantics from text and
images. Proceedings of the EMNLP 2011 Workshop on Geometrical Models
for Natural Language Semantics (GEMS), East Stroudsburg PA: ACL, 22-32.