Humanoidly speaking--learning about the world and language with a humanoid friendly robot

Xavier Hinaut, Johannes Twiefel, Marcelo Borghetti, Luiza Mici, Stefan Wermter
IJCAI Video competition, Buenos Aires, Argentina - Jul 2015.
Associated documents: Hinaut_IJCAI_2015.pdf [43 KB]
This video shows a friendly human-robot interaction using humanoid Nao robots. The speaker teaches the robot the names of objects through speech. This work shows the successful integration of three different projects, mainly using Artificial Neural Networks: (1) object recognition with an RGB-D (color and depth) sensor, (2) speech-to-text using an approach that post-processes Google's speech recognition hypotheses, and (3) syntactic interpretation of sentences. The robot is able to identify surfaces in the environment (tables, floor, walls) and establish a relation between these surfaces and the segmented clusters (objects). Multiple viewpoints are easily obtained from these clusters and used to train a Convolutional Neural Network. The resulting features allow the robot to recognise objects and to generalise to unknown viewpoints and scales. The speech recognition system maps Google's results to the sentences expected in the given scenario using phonemic matching. The syntactic interpretation of the sentence is done with a Recurrent Neural Network (namely an Echo State Network), which maps each semantic word in a sentence to its thematic role. Together, the roles form predicates that indicate what should be performed (e.g. learning a new object or performing motor actions). At the start, the robot does not know any objects. As new objects are taught, increasingly complex sentences are used to describe their positions. Motor commands (e.g. pointing) are also given to check the robot's knowledge. The human user produces natural, complex sentences, so anyone could interact with the robot, not only robot programmers. Furthermore, complex sentences containing multiple commands can be correctly interpreted as a temporal action sequence (e.g. "Before doing 'B' do 'A'") without any additional mechanism.
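The idea of mapping a recognition hypothesis onto the sentences expected in a scenario can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes phonemic matching, whereas this sketch uses plain character-level edit distance as a stand-in, and the function names and the example sentences in EXPECTED are hypothetical, invented only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in for phonemic distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_expected_sentence(hypothesis, expected_sentences):
    """Return the expected sentence with the smallest edit distance to the hypothesis."""
    normalised = hypothesis.lower().strip()
    return min(expected_sentences,
               key=lambda sentence: edit_distance(normalised, sentence.lower()))


# Hypothetical scenario sentences (assumption: not taken from the original work).
EXPECTED = [
    "this is a cup",
    "point to the cup",
    "the ball is on the table",
]

if __name__ == "__main__":
    # A slightly misrecognised hypothesis is mapped back to a scenario sentence.
    print(closest_expected_sentence("this is a cap", EXPECTED))
```

A phonemic variant would first convert both the hypothesis and the expected sentences to phoneme sequences and compare those instead of characters; the matching step itself would stay the same.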


  author       = "Hinaut, Xavier and Twiefel, Johannes and Borghetti, Marcelo and Mici, Luiza and Wermter, Stefan",
  title        = "Humanoidly speaking--learning about the world and language with a humanoid friendly robot",
  booktitle    = "IJCAI Video competition, Buenos Aires, Argentina",
  month        = "Jul",
  year         = "2015",
  address      = "Buenos Aires, AR",
  url          = ""
