Anil Batra
I am a CDT Ph.D. scholar at the School of Informatics,
University of Edinburgh, supervised by Prof. Frank
Keller and Dr. Laura Sevilla-Lara.
I am also collaborating with Prof. Marcus Rohrbach.
My research interests lie at the intersection of Language and Vision, with a focus on developing models that can plan, reason, and execute goal-oriented tasks involving multiple complex events through text comprehension and video analysis. Currently, my work involves analyzing long procedural videos to understand and ground the temporal structure of events. This research is directed towards developing efficient models that can accurately capture the sequence and timing of events, ultimately enhancing their ability to perform complex, real-world tasks.
I also enjoy reading work on geospatial data, large language models, and methods for making models more reliable.
Previously, I completed my Master's in Computer Science (by Research) at IIIT Hyderabad,
under the supervision of Prof. C.V. Jawahar and Facebook mentors
Dr. Guan Pang and
Dr. Saikat Basu.
During my Master's, I was part of the Centre for Visual
Information Technology (CVIT) lab, where I developed models to detect occluded roads in satellite imagery.
I also worked as a Research Engineer on the Spatial Computing
team at Facebook, where I designed, trained, and evaluated models for extracting connected road networks
from a limited set of clean labels and large-scale noisy labels.
Email |
CV |
Google Scholar |
Github |
LinkedIn
Currently looking for research internships (early 2025) and PostDoc positions (winter 2025).
News
- [July 2024]: Our new work "Efficient Pre-training for Localized Instruction Generation of Videos" is accepted at
ECCV 2024.
- [Sep 2022]: Our new work "A Closer Look at Temporal Ordering in the Segmentation of
Instructional Videos" is accepted at
BMVC 2022.
- [Dec 2021]: Volunteered as a session chair at the LXAI workshop,
NeurIPS 2021.
- [Nov 2021]: Will be serving as CVPR 2022 Reviewer.
- [Jun 2021]: Served as ICCV 2021 Reviewer.
- [Sep 2020]: Joined the CDT-NLP Ph.D. program at the University of Edinburgh under the
supervision of Dr. Laura
Sevilla-Lara and Prof. Frank Keller.
- [Jun 2019]: Successfully defended my Master's thesis. Panel: Prof. C.V.
Jawahar, Prof. K. Madhava Krishna, Dr. Girish Varma.
- [Jun 2019]: Poster presentation at CVPR 2019, Long Beach (image).
- [May 2019]: Received travel sponsorship from the Facebook Spatial Computing
team to attend CVPR 2019.
- [Apr 2019]: Presented our CVPR work on Improved Road Connectivity to the
Facebook Spatial Computing team.
- [Mar 2019]: Paper accepted at CVPR 2019 on Improved Road Connectivity.
- [Jan 2019]: Joined the Facebook Spatial Computing team as a Research Engineer
(Contingent Worker).
- [Dec 2018]: Submitted my Master's thesis, "Road Topology Extraction
from Satellite Images by Knowledge Sharing".
- [Jun 2018]: Paper accepted at BMVC 2018 on Self-Supervised Learning.
Efficient Pre-training for Localized Instruction Generation of Videos
Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
European Conference on Computer Vision (ECCV) 2024
pdf
abstract
bibtex
Procedural videos show step-by-step demonstrations of tasks like recipe preparation.
Understanding such videos is challenging, involving the precise localization of steps and the
generation of textual instructions. Manually annotating steps and writing instructions is
costly, which limits the size of current datasets and hinders effective learning. Leveraging
large but noisy video-transcript datasets for pre-training can boost performance, but demands
significant computational resources. Furthermore, transcripts contain irrelevant content and
exhibit style variation compared to instructions written by human annotators. To mitigate both
issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset:
(i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction
by automatically replacing the transcripts with human-written instructions from a text-only recipe
dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets,
enables efficient training of large-scale models with competitive performance. We complement our
Sieve-&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and
instruction generation for procedural videos. When this model is pre-trained on our curated dataset,
it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and
Tasty, while using a fraction of the computational resources.
@InProceedings{batraProcX2024,
author = {Batra, Anil and Moltisanti, Davide and Sevilla-Lara, Laura and Rohrbach, Marcus and Keller, Frank},
title = {Efficient Pre-training for Localized Instruction Generation of Videos},
booktitle = {ECCV},
year = {2024}
}
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Anil Batra, Shreyank N Gowda, Frank Keller, Laura Sevilla-Lara
British Machine Vision Conference (BMVC), 2022
pdf
suppl
abstract
bibtex
Understanding the steps required to perform a task is an important skill for AI systems.
Learning these steps from instructional videos involves two subproblems: (i) identifying
the temporal boundary of sequentially occurring segments and (ii) summarizing
these steps in natural language. We refer to this task as Procedure Segmentation and
Summarization (PSS). In this paper, we take a closer look at PSS and propose three
fundamental improvements over current methods. The segmentation task is critical, as
generating a correct summary requires each step of the procedure to be correctly identified.
However, current segmentation metrics often overestimate the segmentation quality
because they do not consider the temporal order of segments. In our first contribution,
we propose a new segmentation metric that takes into account the order of segments,
giving a more reliable measure of the accuracy of a given predicted segmentation. Current
PSS methods are typically trained by proposing segments, matching them with the
ground truth and computing a loss. However, much like segmentation metrics, existing
matching algorithms do not consider the temporal order of the mapping between candidate
segments and the ground truth. In our second contribution, we propose a matching
algorithm that constrains the temporal order of segment mapping, and is also differentiable.
Lastly, we introduce multi-modal feature training for PSS, which further improves
segmentation. We evaluate our approach on two instructional video datasets (YouCook2
and Tasty) and observe an improvement over the state-of-the-art of ∼ 7% and ∼ 2.5%
for procedure segmentation and summarization, respectively.
@InProceedings{batraBMVC2022pss,
author = {Batra, Anil and Gowda, Shreyank N and Keller, Frank and Sevilla-Lara, Laura},
title = {A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos},
booktitle = {BMVC},
year = {2022}
}
Improved Road Connectivity by Joint Learning of Orientation and
Segmentation
Anil Batra*, Suriya Singh*, Guan Pang, Saikat Basu, C.V. Jawahar and
Manohar Paluri (* equal contribution)
Computer Vision and Pattern Recognition (CVPR), 2019
pdf
suppl
poster
abstract
bibtex
code
Road network extraction from satellite images often produces fragmented road
segments leading to road maps unfit for real applications. Pixel-wise
classification fails to predict topologically correct and connected road masks
due to the absence of connectivity supervision and difficulty in enforcing
topological constraints. In this paper, we propose a connectivity task called
Orientation Learning, motivated by the human behavior of annotating roads by
tracing them at a specific orientation. We also develop a stacked multi-branch
convolutional module to effectively utilize the mutual information between
orientation learning and segmentation tasks. These contributions ensure that the
model predicts topologically correct and connected road masks. We also propose a
Connectivity Refinement approach to further enhance the estimated road networks.
The refinement model is pre-trained to connect and refine the corrupted
ground-truth masks and later fine-tuned to enhance the predicted road masks. We
demonstrate the advantages of our approach on two diverse road extraction
datasets SpaceNet and DeepGlobe. Our approach improves over the state-of-the-art
techniques by 9% and 7.5% in road topology metric on SpaceNet and DeepGlobe,
respectively.
@InProceedings{Batra_2019_CVPR,
author = {Batra, Anil and Singh, Suriya and Pang, Guan and Basu, Saikat and
Jawahar, C.V. and Paluri, Manohar},
title = {Improved Road Connectivity by Joint Learning of Orientation and
Segmentation},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR)},
month = {June},
year = {2019}
}
Self-supervised Feature Learning for Semantic Segmentation of Overhead
Imagery
Suriya Singh*, Anil Batra*, Guan Pang, Lorenzo Torresani, Saikat Basu,
C.V. Jawahar and Manohar Paluri (* equal contribution)
British Machine Vision Conference (BMVC), 2018
pdf
suppl
abstract
bibtex
Overhead imageries play a crucial role in many applications such as urban
planning, crop yield forecasting, mapping, and policy making. Semantic
segmentation could enable automatic, efficient, and large-scale understanding of
overhead imageries for these applications. However, semantic segmentation of
overhead imageries is a challenging task, primarily due to the large domain gap
from existing research in ground imageries, unavailability of large-scale
dataset with pixel-level annotations, and inherent complexity in the task.
Readily available vast amount of unlabeled overhead imageries share more common
structures and patterns compared to the ground imageries, therefore, its
large-scale analysis could benefit from unsupervised feature learning
techniques. In this work, we study various self-supervised feature
learning techniques for semantic segmentation of overhead imageries. We choose
image semantic inpainting as a self-supervised task for our experiments due to
its proximity to the semantic segmentation task. We (i) show that existing
approaches are inefficient for semantic segmentation, (ii) propose architectural
changes towards self-supervised learning for semantic segmentation, (iii)
propose an adversarial training scheme for self-supervised learning by
increasing the pretext task's difficulty gradually and show that it leads to
learning better features, and (iv) propose a unified approach for overhead scene
parsing, road network extraction, and land cover estimation. Our approach
improves over training from scratch by more than 10% and ImageNet pre-trained
network by more than 5% mIoU.
@inproceedings{singhBMVC18overhead,
Author = {Singh, Suriya and Batra, Anil and Pang, Guan and Torresani, Lorenzo and Basu,
Saikat and Paluri, Manohar and Jawahar, C. V.},
Title = {Self-supervised Feature Learning for Semantic Segmentation of Overhead
Imagery},
Booktitle = {BMVC},
Year = {2018}
}