Large-Scale Human Activity Recognition in the Wild
With the growth of online media, surveillance, and mobile cameras, the amount and size of video databases are increasing at an incredible pace. For example, YouTube reported that over 300 hours of video are uploaded every minute to their servers. Moreover, the commercial availability of inexpensive cameras has led to an overwhelming amount of data, of which video streams from surveillance systems have been cited as the largest source of big data. However, the significant improvement in camera hardware has not been paralleled by the automated algorithms and software that are crucial to intelligently sift through this ever-growing data heap. The situation has become so dire that much large-scale video content (whether online or in local networks) is rarely processed for meaningful semantic information. As a result, this data merely serves as a dense video sampling of the real world, devoid of connectivity, correlation, and a deeper understanding of the spatiotemporal phenomena governing it.
Arguably, people are the most important and interesting subjects of these videos. The computer vision community has embraced this observation, validating the crucial role that human activity/action recognition plays in building smarter surveillance systems (e.g. to monitor public safety and public infrastructure usage), as well as in enabling business intelligence, semantically aware video indices (e.g. intelligent video search in large databases), and more natural human-computer interfaces (e.g. teaching robots to perform activities by example or controlling computers with natural body language). However, despite the explosion of available video data, the ability to automatically detect, recognize, and represent human activities is still rather limited.
This is primarily due to challenges inherent to the task, namely the large variability in execution styles; the complexity of the visual stimuli in terms of camera motion, background clutter, and viewpoint changes; and the level of detail and number of activities that can be recognized. In this project, we will address the important problems of human activity detection/classification, summarization, and representation with a suite of algorithms capable of efficiently and accurately learning from a newly compiled large-scale video dataset equipped with descriptive, hierarchical, and multi-modal annotations, called ActivityNet. We will investigate different facets of these problems with the ultimate goal of improving state-of-the-art performance in detecting and classifying human activities in real-world videos at large scale.
Computer, Electrical and Mathematical Sciences and Engineering
Center Affiliation -
Visual Computing Center
Field of Study -
Computer, Electrical and Mathematical Sciences and Engineering
Associate Professor, Electrical and Computer Engineering
Professor Ghanem's research interests focus on topics in computer vision, machine learning, and image processing. They include:
- Modeling dynamic objects in video sequences to improve motion segmentation, video compression, video registration, motion estimation, and activity recognition.
- Developing efficient optimization and randomization techniques for large-scale computer vision and machine learning problems.
- Exploring novel means of involving human judgment to develop more effective and perceptually-relevant recognition and compression techniques.
- Developing frameworks for joint representation and classification by exploiting data sparsity and low-rankness.
Desired Project Deliverables
- Novel techniques to classify snippets of video according to the activities they entail.
- Novel techniques to quickly localize “activity proposals”, i.e. temporal segments in a video where the probability of finding interesting activities is high.
- Techniques that combine knowledge of objects and scenes when classifying an activity, since an activity is a spatiotemporal phenomenon in which humans interact with objects in a particular place.
- A crowd-sourcing framework (e.g. using Amazon Mechanical Turk) to cheaply extend the annotations of ActivityNet to object and place classes, as well as free-form text descriptions. These annotations will enrich the dataset, forge links with other large-scale datasets, and enable new functionality (e.g. textual translation of a video that enables text queries).
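The “activity proposals” deliverable above can be illustrated with a minimal sketch: assume a per-snippet classifier (hypothetical here, not the project's actual method) assigns each short temporal window an “activity-ness” score, and merge contiguous runs of high-scoring windows into candidate temporal segments. The function name, threshold, and scores below are illustrative assumptions, not part of ActivityNet.

```python
from typing import List, Tuple

def activity_proposals(
    scores: List[float],      # per-snippet "activity-ness" scores from a hypothetical classifier
    threshold: float = 0.5,   # snippets scoring above this are considered interesting
) -> List[Tuple[int, int]]:
    """Merge runs of high-scoring snippets into temporal proposal segments.

    Returns a list of (start, end) snippet indices, with end exclusive.
    """
    proposals = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i                      # a new high-scoring run begins
        elif s <= threshold and start is not None:
            proposals.append((start, i))   # the run ended; emit the segment
            start = None
    if start is not None:                  # a run extends to the end of the video
        proposals.append((start, len(scores)))
    return proposals

# Example: scores for eight consecutive snippets of a video
scores = [0.1, 0.7, 0.8, 0.2, 0.1, 0.9, 0.6, 0.3]
print(activity_proposals(scores))  # → [(1, 3), (5, 7)]
```

In practice the scoring model and the merging strategy (e.g. multi-scale windows, non-maximum suppression) would be far richer; the point is only that proposals narrow down where a more expensive activity classifier needs to look.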