Semantic Visual Navigation by
Watching Youtube Videos

Matthew Chang
Arjun Gupta
Saurabh Gupta

Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos don't come with labels for actions or goals, and may not even showcase optimal behavior. Our proposed method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). Our experiments in visually realistic simulations demonstrate that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency at goal reaching, and improve upon end-to-end RL-based methods by 66%, while at the same time using 250 times fewer interaction samples. Code, dataset, and models will be made available.

Value Learning from Videos

Egocentric video tours of indoor spaces are grounded in actions (by labeling them via an inverse model) and labeled with goals (using an object detector). This prepares them for Q-learning, which can extract optimal Q-functions for reaching goals purely by watching egocentric videos.
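The core idea above can be sketched in a few lines: run off-policy Q-learning over a fixed dataset of pseudo-labeled quadruples (image, action, next image, reward), with no environment interaction. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): states stand in for image embeddings as feature vectors, the Q-function is linear, and the toy reward mimics an object detector firing when the goal object is visible.

```python
import numpy as np

class LinearQ:
    """Q(s, a) = w[a] . s, one weight vector per discrete action.
    A stand-in for the paper's image-based Q-network."""
    def __init__(self, state_dim, num_actions, lr=0.1, gamma=0.99):
        self.w = np.zeros((num_actions, state_dim))
        self.lr, self.gamma = lr, gamma

    def q_values(self, s):
        return self.w @ s

    def update(self, s, a, s_next, r, terminal):
        # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrap at the goal.
        target = r if terminal else r + self.gamma * self.q_values(s_next).max()
        td_error = target - self.w[a] @ s
        self.w[a] += self.lr * td_error * s  # semi-gradient TD update
        return td_error

def train(q, transitions, epochs=200):
    # Off-policy: we only replay the fixed pseudo-labeled dataset,
    # never interacting with the environment.
    for _ in range(epochs):
        for s, a, s_next, r, done in transitions:
            q.update(s, a, s_next, r, done)
    return q

# Toy demo with one-hot "image features": s0 -> s1 -> goal.
e = np.eye(3)
data = [
    (e[0], 0, e[1], 0.0, False),  # action 0 (forward) makes progress
    (e[1], 0, e[2], 1.0, True),   # pseudo-reward: detector fires at the goal
    (e[0], 1, e[0], 0.0, False),  # action 1 goes nowhere
]
q = train(LinearQ(state_dim=3, num_actions=2), data)
```

After training, the learned values prefer the progress-making action from the start state (Q roughly gamma for forward vs. gamma squared for the no-op), which is exactly the semantic preference a downstream hierarchical policy can exploit.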


Matthew Chang, Arjun Gupta, Saurabh Gupta.
Semantic Visual Navigation by Watching Youtube Videos.
In arXiv 2020.

@article{chang2020semantic,
  title={Semantic Visual Navigation by Watching Youtube Videos},
  author={Matthew Chang and Arjun Gupta and Saurabh Gupta},
  journal={arXiv},
  year={2020}
}

Coming soon


We thank Sanjeev Venkatesan for help with data collection. We also thank Rishabh Goyal, Ashish Kumar, and Tanmay Gupta for feedback on the paper, and Shubham Tulsiani for feedback on the video. Website template from here, here, and here.