Semantic Visual Navigation by Watching YouTube Videos

Matthew Chang, Arjun Gupta, Saurabh Gupta
UIUC
Appearing in NeurIPS 2020
Paper
Code
Slides (Keynote)
Slides (PDF)
Poster


Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos don't come with labels for actions or goals, and may not even showcase optimal behavior. Our method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). We show that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency at the ObjectGoal task in visually realistic simulations. We observe a relative improvement of 15-83% over end-to-end RL, behavior cloning, and classical methods, while using minimal direct interaction.
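To make the transition format concrete, here is a minimal sketch of how such pseudo-labeled quadruples could be assembled from consecutive video frames. The names inverse_model, object_detector, and goal_class are hypothetical stand-ins for the learned inverse model and the off-the-shelf object detector; this is an illustration of the idea, not the released implementation.

    def pseudo_label_video(frames, inverse_model, object_detector, goal_class):
        # Convert an unlabeled egocentric video into (image, action, next_image, reward)
        # quadruples for off-policy Q-learning. Assumed interfaces:
        #   inverse_model(s, s_next) -> discrete action pseudo-label for the transition
        #   object_detector(frame)   -> set of detected object categories
        quadruples = []
        for s, s_next in zip(frames[:-1], frames[1:]):
            action = inverse_model(s, s_next)  # action pseudo-label
            reward = 1.0 if goal_class in object_detector(s_next) else 0.0  # goal pseudo-label
            quadruples.append((s, action, s_next, reward))
        return quadruples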


Value Learning from Videos

Egocentric video tours of indoor spaces are grounded in actions (by labeling them via an inverse model) and labeled with goals (using an object detector). This prepares them for Q-learning, which can extract optimal Q-functions for reaching goals purely from watching egocentric videos.
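As an illustration, one off-policy Q-learning update on such quadruples might look like the following PyTorch sketch (a standard DQN-style Bellman backup under our assumptions, not the authors' exact training code; q_net, target_net, and the batch layout are hypothetical):

    import torch
    import torch.nn.functional as F

    def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99):
        # One Bellman backup on a batch of pseudo-labeled quadruples.
        # batch: images s [B,C,H,W], actions a [B], next images s_next, rewards r [B]
        s, a, s_next, r = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
        with torch.no_grad():
            target = r + gamma * target_net(s_next).max(dim=1).values  # bootstrapped target
        loss = F.smooth_l1_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()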




Paper

 
Matthew Chang, Arjun Gupta, Saurabh Gupta.
Semantic Visual Navigation by Watching YouTube Videos.
In NeurIPS 2020.

@inproceedings{chang2020semantic,
  title={Semantic Visual Navigation by Watching YouTube Videos},
  author={Matthew Chang and Arjun Gupta and Saurabh Gupta},
  booktitle={NeurIPS},
  year={2020}
}


Acknowledgements

We thank Sanjeev Venkatesan for help with data collection. We also thank Rishabh Goyal, Ashish Kumar, and Tanmay Gupta for feedback on the paper. This material is based upon work supported by NSF under Grant No. IIS-2007035, and DARPA Machine Common Sense. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or DARPA.