Anil Batra

I am a CDT Ph.D. scholar at the School of Informatics, University of Edinburgh, supervised by Prof. Frank Keller and Dr. Laura Sevilla-Lara. I am also collaborating with Prof. Marcus Rohrbach. My research interests lie at the intersection of language and vision, with a focus on developing models that can plan, reason, and execute goal-oriented tasks involving multiple complex events through text comprehension and video analysis. Currently, my work involves analyzing long procedural videos to understand and ground the temporal structure of events. This research is directed towards developing efficient models that can accurately capture the sequence and timing of events, ultimately enhancing their ability to perform complex, real-world tasks.

I also enjoy reading work on geospatial data, large language models, and how to make models more reliable. Previously, I completed my Master's in Computer Science (by Research) at IIIT Hyderabad, under the supervision of Prof. C.V. Jawahar and Facebook mentors Dr. Guan Pang and Dr. Saikat Basu. During my Master's, I was part of the Centre for Visual Information Technology (CVIT) lab, where I developed models to detect occluded roads in satellite imagery.

I worked as a Research Engineer at Facebook on the Spatial Computing team, where I designed, trained, and evaluated models for extracting connected road networks from a limited set of clean labels and large-scale noisy labels.

Email: anilbatra2185@gmail.com | CV | Google Scholar | Github | LinkedIn

Currently looking for research internships (early 2025) and postdoc positions (winter 2025).

News
  • [July 2024]: Our new work "Efficient Pre-training for Localized Instruction Generation of Videos" is accepted at ECCV 2024.
  • [Sep 2022]: Our new work "Temporal Ordering in the Segmentation of Instructional Videos" is accepted at BMVC 2022.
  • [Dec 2021]: Volunteered as session chair at the LXAI workshop, NeurIPS 2021.
  • [Nov 2021]: Will be serving as a reviewer for CVPR 2022.
  • [Jun 2021]: Served as a reviewer for ICCV 2021.
  • [Sep 2020]: Joined the CDT-NLP Ph.D. programme at the University of Edinburgh under the supervision of Dr. Laura Sevilla-Lara and Prof. Frank Keller.
  • [Jun 2019]: Successfully defended my Master's thesis. Panel: Prof. C.V. Jawahar, Prof. K. Madhava Krishna, Dr. Girish Varma.
  • [Jun 2019]: Poster presentation at CVPR 2019, Long Beach (image).
  • [May 2019]: Received travel sponsorship from the Facebook Spatial Computing team to attend CVPR 2019.
  • [Apr 2019]: Presented our CVPR work on Improved Road Connectivity to the Facebook Spatial Computing team.
  • [Mar 2019]: Paper accepted at CVPR 2019 on Improved Road Connectivity.
  • [Jan 2019]: Joined the Facebook Spatial Computing team as a Research Engineer (Contingent Worker).
  • [Dec 2018]: Submitted my Master's thesis, Road Topology Extraction from Satellite Images by Knowledge Sharing.
  • [Jun 2018]: Paper accepted at BMVC 2018 on Self-Supervised Learning.
Publications

Efficient Pre-training for Localized Instruction Generation of Videos
Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
European Conference on Computer Vision (ECCV) 2024

pdf abstract bibtex

Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.

@InProceedings{batraProcX2024,
author = {Batra, Anil and Moltisanti, Davide and Sevilla-Lara, Laura and Rohrbach, Marcus and Keller, Frank},
title = {Efficient Pre-training for Localized Instruction Generation of Videos},
booktitle = {ECCV},
year = {2024}
}


A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Anil Batra, Shreyank N Gowda, Frank Keller, Laura Sevilla-Lara
British Machine Vision Conference (BMVC), 2022

pdf suppl abstract bibtex

Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look at PSS and propose three fundamental improvements over current methods. The segmentation task is critical, as generating a correct summary requires each step of the procedure to be correctly identified. However, current segmentation metrics often overestimate the segmentation quality because they do not consider the temporal order of segments. In our first contribution, we propose a new segmentation metric that takes into account the order of segments, giving a more reliable measure of the accuracy of a given predicted segmentation. Current PSS methods are typically trained by proposing segments, matching them with the ground truth and computing a loss. However, much like segmentation metrics, existing matching algorithms do not consider the temporal order of the mapping between candidate segments and the ground truth. In our second contribution, we propose a matching algorithm that constrains the temporal order of segment mapping, and is also differentiable. Lastly, we introduce multi-modal feature training for PSS, which further improves segmentation. We evaluate our approach on two instructional video datasets (YouCook2 and Tasty) and observe an improvement over the state-of-the-art of ∼ 7% and ∼ 2.5% for procedure segmentation and summarization, respectively.

@InProceedings{batraBMVC2022pss,
author = {Batra, Anil and Gowda, Shreyank N and Keller, Frank and Sevilla-Lara, Laura},
title = {A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos},
booktitle = {BMVC},
year = {2022}
}


Improved Road Connectivity by Joint Learning of Orientation and Segmentation
Anil Batra*, Suriya Singh*, Guan Pang, Saikat Basu, C.V. Jawahar and Manohar Paluri (* equal contribution)
Computer Vision and Pattern Recognition (CVPR), 2019

pdf suppl poster abstract bibtex code

Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications. Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and the difficulty of enforcing topological constraints. In this paper, we propose a connectivity task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation. We also develop a stacked multi-branch convolutional module to effectively utilize the mutual information between the orientation learning and segmentation tasks. These contributions ensure that the model predicts topologically correct and connected road masks. We also propose a Connectivity Refinement approach to further enhance the estimated road networks. The refinement model is pre-trained to connect and refine corrupted ground-truth masks and later fine-tuned to enhance the predicted road masks. We demonstrate the advantages of our approach on two diverse road extraction datasets, SpaceNet and DeepGlobe. Our approach improves over the state-of-the-art techniques by 9% and 7.5% in road topology metric on SpaceNet and DeepGlobe, respectively.

@InProceedings{Batra_2019_CVPR,
author = {Batra, Anil and Singh, Suriya and Pang, Guan and Basu, Saikat and Jawahar, C.V. and Paluri, Manohar},
title = {Improved Road Connectivity by Joint Learning of Orientation and Segmentation},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}


Self-supervised Feature Learning for Semantic Segmentation of Overhead Imagery
Suriya Singh*, Anil Batra*, Guan Pang, Lorenzo Torresani, Saikat Basu, C.V. Jawahar and Manohar Paluri (* equal contribution)
British Machine Vision Conference (BMVC), 2018

pdf suppl abstract bibtex

Overhead imageries play a crucial role in many applications such as urban planning, crop yield forecasting, mapping, and policy making. Semantic segmentation could enable automatic, efficient, and large-scale understanding of overhead imageries for these applications. However, semantic segmentation of overhead imageries is a challenging task, primarily due to the large domain gap from existing research in ground imageries, unavailability of large-scale dataset with pixel-level annotations, and inherent complexity in the task. Readily available vast amount of unlabeled overhead imageries share more common structures and patterns compared to the ground imageries, therefore, its large-scale analysis could benefit from unsupervised feature learning techniques.
In this work, we study various self-supervised feature learning techniques for semantic segmentation of overhead imageries. We choose image semantic inpainting as a self-supervised task for our experiments due to its proximity to the semantic segmentation task. We (i) show that existing approaches are inefficient for semantic segmentation, (ii) propose architectural changes towards self-supervised learning for semantic segmentation, (iii) propose an adversarial training scheme for self-supervised learning by increasing the pretext task's difficulty gradually and show that it leads to learning better features, and (iv) propose a unified approach for overhead scene parsing, road network extraction, and land cover estimation. Our approach improves over training from scratch by more than 10% and ImageNet pre-trained network by more than 5% mIoU.

@InProceedings{singhBMVC18overhead,
author = {Singh, Suriya and Batra, Anil and Pang, Guan and Torresani, Lorenzo and Basu, Saikat and Paluri, Manohar and Jawahar, C. V.},
title = {Self-supervised Feature Learning for Semantic Segmentation of Overhead Imagery},
booktitle = {BMVC},
year = {2018}
}

Professional Activities
  • Reviewer: ICCV/ECCV/CVPR/ICLR/AAAI
Selected Awards
  • Awarded four years of Ph.D. funding by the School of Informatics and UKRI.
  • Facebook Travel Support to attend CVPR 2019
  • Qualified GATE (Electronics and Communication) with rank 254 in 2009.
  • Gold Medal in Electronics and Communication at RIMT-Institute of Engineering and Technology, affiliated to PTU-Jalandhar (2007)

Template: this, this and this