Date of Award

Spring 2022

Project Type

Dissertation

Program or Major

Computer Science

Degree Name

Doctor of Philosophy

First Advisor

Momotaz Begum

Second Advisor

Laura Dietz

Third Advisor

Marek Petrik

Abstract

Modelling human-led demonstrations of high-level sequential tasks is fundamental to a number of practical inference applications including vision-based policy learning and activity recognition. Demonstrations of these tasks are captured as videos with long durations and similar spatial contents. Learning from this data is challenging since inference cannot be conducted solely on spatial feature presence and must instead consider how spatial features play out across time. To be successful these temporal representations must generalize to variations in the duration of activities and be able to capture relationships between events expressed across the scale of an entire video.

Contemporary deep learning architectures that represent time (convolution-based and Recurrent Neural Networks) do not address these concerns. Representations learned by these models describe temporal features in terms of fixed durations such as minutes, seconds, and frames. They are also developed sequentially and must use unreasonably large models to capture temporal features expressed at scale. Probabilistic temporal models have been successful in representing the temporal information of videos in a duration invariant manner that is robust to scale, however, this has only been accomplished through the use of user-defined spatial features. Such abstractions make unrealistic assumptions about the content being expressed in these videos, the quality of the perception model, and they also limit the potential applications of trained models. To that end, I present D-ITR-L, a temporal wrapper that extends the spatial features extracted from a typically CNN architecture and transforms them into temporal features.

D-ITR-L-derived temporal features are duration invariant and can identify temporal relationships between events at the scale of a full video. Validation of this claim is conducted through various vision-based policy learning and action recognition settings. Additionally, these studies show that challenging visual domains such as human-led demonstration of high-level sequential tasks can be effectively represented when using a D-ITR-L-based model.

Share

COinS