Large-Scale Human Activity Recognition in the Wild

Internship Description

With the growth of online media, surveillance, and mobile cameras, the amount and size of video databases are increasing at an incredible pace. For example, YouTube reported that over 300 hours of video are uploaded to its servers every minute. Moreover, the commercial availability of inexpensive cameras has led to an overwhelming amount of data; video streams from surveillance systems have been cited as the single largest source of big data. However, this significant improvement in camera hardware has not been matched by the automated algorithms and software needed to intelligently sift through this ever-growing data heap. The situation has become so dire that much large-scale video content (either online or in local networks) is rarely processed for meaningful semantic information. As a result, this data merely serves as a dense video sampling of the real world, devoid of connectivity, correlation, and a deeper understanding of the spatiotemporal phenomena that govern it.

Arguably, people are the most important and interesting subjects of these videos. The computer vision community has embraced this observation, validating the crucial role that human activity/action recognition plays in building smarter surveillance systems (e.g. monitoring public safety and public infrastructure usage), as well as in enabling business intelligence, semantically aware video indices (e.g. intelligent video search in large databases), and more natural human-computer interfaces (e.g. teaching robots to perform activities by example or controlling computers with natural body language). However, despite the explosion of available video data, the ability to automatically detect, recognize, and represent human activities is still rather limited. This is primarily due to challenges inherent to the task, namely the large variability in execution styles; the complexity of the visual stimuli in terms of camera motion, background clutter, and viewpoint changes; and the level of detail and number of activities to be recognized.

In this project, we will address the important problems of human activity detection/classification, summarization, and representation with a suite of algorithms that can efficiently and accurately learn from ActivityNet, a newly compiled large-scale video dataset equipped with descriptive, hierarchical, and multi-modal annotations. We will investigate different facets of these problems with the ultimate goal of improving state-of-the-art performance in detecting and classifying human activities in real-world videos at large scale.

Deliverables/Expectations

Novel techniques to classify snippets of video according to the activities they entail.

Novel techniques to quickly localize “activity proposals”, i.e. temporal segments in video where the probability of finding interesting activities is high (a minimal sketch of one such scheme appears after this list).

Combining knowledge of objects and scenes when classifying an activity, since an activity is a spatiotemporal phenomenon in which humans interact with objects in a particular place (see the fusion sketch after this list).

Crowd-sourcing framework (e.g. using Amazon Mechanical Turk) to cheaply extend the annotations of ActivityNet to object and place classes, as well as free-form text descriptions (see the Mechanical Turk sketch after this list). These annotations will enrich the dataset, forge links with other large-scale datasets, and enable new functionality (e.g. textual translation of a video that enables text queries).
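
Below is a minimal sketch, in Python, of one way the “activity proposals” deliverable could be approached: sliding temporal windows of several lengths are scored with a per-frame “actionness” signal, and the highest-scoring, weakly overlapping segments are kept after temporal non-maximum suppression. The window lengths, stride, IoU threshold, and the existence of such an actionness signal are assumptions made for illustration only, not part of the project description.

import numpy as np

def temporal_iou(a, b):
    # Intersection-over-union of two temporal segments given as (start, end).
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def temporal_proposals(actionness, window_lengths=(16, 32, 64, 128),
                       stride=8, iou_threshold=0.5, top_k=50):
    # Score sliding windows by their mean per-frame "actionness" and keep the
    # highest-scoring segments after temporal non-maximum suppression.
    # `actionness` is a 1-D array with one score per frame; how it is produced
    # (e.g. by a per-frame classifier) is outside the scope of this sketch.
    n = len(actionness)
    candidates = []
    for w in window_lengths:
        for start in range(0, max(n - w, 0) + 1, stride):
            end = min(start + w, n)
            score = float(np.mean(actionness[start:end]))
            candidates.append((start, end, score))
    candidates.sort(key=lambda c: c[2], reverse=True)
    keep = []
    for s, e, score in candidates:
        if all(temporal_iou((s, e), (ks, ke)) < iou_threshold for ks, ke, _ in keep):
            keep.append((s, e, score))
        if len(keep) == top_k:
            break
    return keep  # list of (start_frame, end_frame, score), best first

# Example: random actionness scores for a 1000-frame video, just to exercise the code.
proposals = temporal_proposals(np.random.rand(1000))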
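
Likewise, a minimal late-fusion sketch for the object/scene deliverable: softmax scores from an activity classifier are combined with object and scene classifier outputs that are projected onto the activity label space through co-occurrence matrices. The matrices W_obj and W_scene, the fusion weights, and the function name fuse_scores are hypothetical placeholders; how such statistics would actually be estimated from ActivityNet annotations is left open here.

import numpy as np

def fuse_scores(activity_scores, object_scores, scene_scores,
                W_obj, W_scene, weights=(0.6, 0.2, 0.2)):
    # Late fusion of three classifiers over one video snippet.
    # activity_scores: (A,) softmax output of a motion/appearance activity classifier
    # object_scores:   (O,) softmax output of an object classifier
    # scene_scores:    (S,) softmax output of a scene/place classifier
    # W_obj (A, O) and W_scene (A, S) map object/scene classes to activity classes,
    # e.g. co-occurrence statistics estimated from annotations (hypothetical here).
    obj_evidence = W_obj @ object_scores      # object cues projected onto activities
    scene_evidence = W_scene @ scene_scores   # scene cues projected onto activities
    a, b, c = weights
    fused = a * activity_scores + b * obj_evidence + c * scene_evidence
    return fused / fused.sum()                # renormalize to a distribution

# Example with random inputs, only to show the expected shapes.
A, O, S = 200, 1000, 365
fused = fuse_scores(np.random.dirichlet(np.ones(A)),
                    np.random.dirichlet(np.ones(O)),
                    np.random.dirichlet(np.ones(S)),
                    np.random.rand(A, O), np.random.rand(A, S))
predicted_activity = int(np.argmax(fused))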
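
Finally, a sketch of how the crowd-sourcing deliverable could post annotation tasks to Amazon Mechanical Turk through the boto3 client. The annotation page URL, reward, and HIT parameters are placeholders chosen for illustration; the sandbox endpoint is used so the task can be tested without paying workers.

import boto3

# Hypothetical external annotation page that plays one ActivityNet clip and
# collects object labels, a place label, and a one-sentence description.
EXTERNAL_QUESTION = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/annotate?video={video_id}</ExternalURL>
  <FrameHeight>700</FrameHeight>
</ExternalQuestion>"""

def post_annotation_hit(video_id, reward_usd="0.05"):
    # The sandbox endpoint lets the task be tested without paying real workers.
    client = boto3.client(
        "mturk",
        region_name="us-east-1",
        endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
    )
    return client.create_hit(
        Title="Describe the activity, objects, and place in a short video",
        Description="Watch a short clip, list the objects and the place you see, "
                    "and write one sentence describing what is happening.",
        Keywords="video, annotation, activity, objects, scenes",
        Reward=reward_usd,                # in USD, as a string
        MaxAssignments=3,                 # several answers per clip for consensus
        LifetimeInSeconds=7 * 24 * 3600,  # the HIT stays visible for one week
        AssignmentDurationInSeconds=600,  # ten minutes per assignment
        Question=EXTERNAL_QUESTION.format(video_id=video_id),
    )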

Faculty Name

Bernard Ghanem

Field of Study

Computer, Electrical and Mathematical Sciences and Engineering