Paper Co-Authored by IC Faculty Earns Best Paper at EMNLP 2017

A paper co-authored by Assistant Professor Dhruv Batra and Research Scientist Stefan Lee of the School of Interactive Computing will receive an award at this week’s Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), which begins Thursday in Copenhagen, Denmark.

The paper, titled “Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog,” earned one of four Best Paper awards (out of 1,500 submissions) for its findings.

It explores the conditions under which human-interpretable languages simply emerge between goal-driven interacting AI agents that invent their own communication protocols. In contrast to many recent works that have shown compositional, human-interpretable languages emerging between agents in multi-agent game settings, this work shows that while most agent-invented languages are effective, achieving near-perfect rewards, they are decidedly not interpretable or compositional.

Batra and Lee, along with collaborators from Carnegie Mellon University, used a Task and Tell reference game between two agents as a testbed to reach this conclusion.

Task and Tell is a reference game between a questioner agent and an answerer agent, set in a simple world. In the game, the answerer is presented with a simple object: a colored shape drawn in a specific style, such as a circle drawn with a red dashed line. The questioner is tasked with discovering two of the object's three attributes. The agents communicate using an ungrounded vocabulary, that is, symbols with no pre-specified meanings. After exchanging single-symbol utterances over two rounds of dialog, the questioner must predict the requested attributes.
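For readers who want to see the shape of the game, the setup described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the attribute values, the symbol sets ("X", "Y", "Z" and the digits), and the function names are all assumptions. The two agents here follow a hand-written protocol rather than a learned one, simply to show the two-round, single-symbol exchange and the reward.

```python
import itertools
import random

# Hypothetical attribute values; the article only names shape, color and line style.
ATTRIBUTES = {
    "shape": ["circle", "square", "triangle"],
    "color": ["red", "green", "blue"],
    "style": ["solid", "dashed", "dotted"],
}

# A task is an ordered pair of distinct attributes the questioner must report.
TASKS = list(itertools.permutations(ATTRIBUTES, 2))


def sample_instance(rng):
    """Draw a random object (given to the answerer) and a task (given to the questioner)."""
    obj = {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
    task = rng.choice(TASKS)
    return obj, task


def questioner_ask(task, round_idx):
    """Hand-written stand-in policy: one symbol naming the attribute wanted this round."""
    return "XYZ"[list(ATTRIBUTES).index(task[round_idx])]


def answerer_reply(obj, question_symbol):
    """Hand-written stand-in policy: one digit symbol indexing the attribute's value."""
    name = list(ATTRIBUTES)["XYZ".index(question_symbol)]
    return str(ATTRIBUTES[name].index(obj[name]))


def questioner_predict(task, answers):
    """Decode the two answer symbols back into attribute values."""
    return tuple(ATTRIBUTES[name][int(sym)] for name, sym in zip(task, answers))


def play_episode(rng):
    """Run one game and return the reward."""
    obj, task = sample_instance(rng)
    answers = []
    for round_idx in range(2):  # two rounds of single-symbol exchange
        question = questioner_ask(task, round_idx)
        answers.append(answerer_reply(obj, question))
    prediction = questioner_predict(task, answers)
    target = tuple(obj[name] for name in task)
    return 1 if prediction == target else 0  # reward 1 only if both attributes match


rng = random.Random(0)
rewards = [play_episode(rng) for _ in range(100)]
print(sum(rewards) / len(rewards))
```

The symbols carry no built-in meaning; they work only because both agents share the same convention. In the paper's setting the agents must arrive at such a convention through training, which is where the question of interpretability arises.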

While the language exchanged between the two agents was effective, it was not interpretable.

“As our goal is to explore how natural, human interpretable languages emerge in multi-agent dialogs, we consider these negative results,” Lee said.

In essence, they found that natural language does not, in fact, emerge naturally. Further, they found that restricting the agents’ vocabularies and limiting how they interact is essential for human-interpretable languages to emerge in such a setting. With just the right set of controls, the two bots invent their own communication protocol and start using certain symbols to ask or answer about certain visual attributes of a given object.

EMNLP is one of the top natural language processing conferences. This year, six papers co-authored by College of Computing faculty and students were accepted to the conference.

Read more on each below.

ABSTRACT: A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, all learned without any human supervision! In this paper, using a Task and Tell reference game between two agents as a testbed, we present a sequence of 'negative' results culminating in a 'positive' one -- showing that while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional. In essence, we find that natural language does not emerge 'naturally', despite the semblance of ease of natural-language-emergence that one may gather from recent literature. We discuss how it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.

ABSTRACT: We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at this http URL. Its goal is to provide a unified framework for sharing, training and testing of dialog models, integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning; and a repository of machine learning models for comparing with others' models, and improving upon existing architectures. Over 20 tasks are supported in the first release, including popular datasets such as SQuAD, bAbI tasks, MCTest, WikiQA, QACNN, QADailyMail, CBT, bAbI Dialog, Ubuntu, OpenSubtitles and VQA. Several models are integrated, including neural models such as memory networks, seq2seq and attentive LSTMs.

ABSTRACT: Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other's reward functions must reach an agreement (or a deal) via natural language dialogue. For the first time, we show it is possible to train end-to-end models for negotiation, which must learn both linguistic and reasoning skills with no annotated dialogue states. We also introduce dialogue rollouts, in which the model plans ahead by simulating possible complete continuations of the conversation, and find that this technique dramatically improves performance. Our code and dataset are publicly available (this https URL).

ABSTRACT: In this paper, we make a simple observation that questions about images often contain premises - objects and relationships implied by the question - and that reasoning about premises can help Visual Question Answering (VQA) models respond more intelligently to irrelevant or previously unseen questions. When presented with a question that is irrelevant to an image, state-of-the-art VQA models will still answer purely based on learned language biases, resulting in non-sensical or even misleading answers. We note that a visual question is irrelevant to an image if at least one of its premises is false (i.e. not depicted in the image). We leverage this observation to construct a dataset for Question Relevance Prediction and Explanation (QRPE) by searching for false premises. We train novel question relevance detection models and show that models that reason about premises consistently outperform models that do not. We also find that forcing standard VQA models to reason about premises during training can lead to improvements on tasks requiring compositional reasoning.

ABSTRACT: To be able to interact better with humans, it is crucial for machines to understand sound - a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic textual similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec - a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we learn that two seemingly (semantically) unrelated concepts, like leaves and paper are similar due to the similar rustling sounds they make. Our embeddings prove useful in textual tasks requiring aural reasoning like text-based sound retrieval and discovering foley sound effects (used in movies). Moreover, our embedding space captures interesting dependencies between words and onomatopoeia and outperforms prior work on aurally-relevant word relatedness datasets such as AMEN and ASLex.

ABSTRACT: Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings.


David Mitchell

Communications Officer