
Neural methods for spoken dialogue understanding

Abstract: Conversational AI has received growing interest in recent years from both the research community and industry. Products have started to emerge (e.g. Amazon's Alexa, Google Home, Apple's Siri), but the performance of such systems is still far from human-like communication. For example, conversation with the aforementioned systems is often limited to basic question-answer interactions. Among all the reasons why people communicate, the exchange of information and the strengthening of social bonds appear to be the main ones. In dialogue research, these two problems are well known and are addressed through dialogue act classification and emotion/sentiment recognition. They are made even more challenging because they involve spoken dialogue rather than written text. A spoken conversation is a complex and collective activity with its own dynamic and structure. There is thus a need to adapt natural language processing and natural language understanding techniques, which have been tailored for written text, since spoken dialogue does not share the same characteristics. This thesis focuses on methods for spoken dialogue understanding and specifically tackles the problem of spoken dialogue classification, with a particular focus on dialogue act and emotion/sentiment labels. Our contributions can be divided into two parts. In the first part, we address the problem of automatically labelling English spoken dialogues. We start by formulating this problem as a translation problem, which leads us to propose a seq2seq model for dialogue act classification. Our second contribution then focuses on a scenario relying on small annotated datasets, and involves both pre-training a hierarchical transformer encoder and proposing a new benchmark for evaluation. This first part addresses spoken language classification in a monolingual (i.e. English) and monomodal (i.e. text) setting.
However, spoken dialogue involves phenomena such as code-switching (when a speaker switches languages within a conversation) and relies on multiple channels of communication (e.g. audio or visual). Hence, the second part is dedicated to extending the previous contributions to two settings: multilingual and multimodal. We first address the problem of dialogue act classification when multiple languages are involved, extending the two previous contributions to a multilingual scenario. In our last contribution, we explore a multimodal scenario and focus on the representation and fusion of modalities for emotion prediction.
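The seq2seq formulation mentioned in the abstract treats a conversation as a source sequence of utterances and its dialogue act labels as the target sequence, by analogy with machine translation. A minimal sketch of that data framing is given below; the conversation, the tag set, and the `to_seq2seq_pair` helper (with its `<eou>` separator) are illustrative assumptions, not the thesis's actual pipeline.

```python
# Illustrative sketch: casting dialogue act classification as a
# sequence-to-sequence ("translation") task. The example conversation
# and label names are hypothetical.

# A conversation is an ordered list of utterances...
conversation = [
    "hello , how are you ?",
    "fine , thanks . and you ?",
    "good . can you book a table for two ?",
]

# ...and the target is the aligned sequence of dialogue act tags,
# one tag per utterance.
dialogue_acts = ["greeting", "greeting", "request"]


def to_seq2seq_pair(utterances, acts, sep="<eou>"):
    """Flatten a conversation into one (source, target) string pair,
    the way a translation model expects parallel sentences.
    `sep` marks utterance boundaries on the source side."""
    source = f" {sep} ".join(utterances)
    target = " ".join(acts)
    return source, target


src, tgt = to_seq2seq_pair(conversation, dialogue_acts)
print(src)  # utterances joined by <eou>
print(tgt)  # "greeting greeting request"
```

Under this framing, any off-the-shelf encoder-decoder architecture can "translate" the utterance sequence into the act sequence, with the label vocabulary playing the role of the target language.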
Contributor: ABES STAR
Submitted on: Tuesday, May 24, 2022 - 5:42:22 PM
Last modification on: Wednesday, May 25, 2022 - 3:24:01 PM


Version validated by the jury (STAR)


  • HAL Id: tel-03677637, version 1



Emile Chapuis. Neural methods for spoken dialogue understanding. Artificial Intelligence [cs.AI]. Institut Polytechnique de Paris, 2021. English. ⟨NNT : 2021IPPAT045⟩. ⟨tel-03677637⟩


