End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos (SS-TAD)


Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, Juan Carlos Niebles,
"End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos (SS-TAD)"
British Machine Vision Conference (BMVC 2017) [Oral]
action detection, temporal localization, video understanding, RNNs
2017
In this work, we present a new intuitive, end-to-end approach for temporal action detection in untrimmed videos. We introduce our new architecture for Single-Stream Temporal Action Detection (SS-TAD), which effectively integrates joint action detection with its semantic sub-tasks in a single unifying end-to-end framework. We develop a method for training our deep recurrent architecture based on enforcing semantic constraints on intermediate modules that are gradually relaxed as learning progresses. We find that such a dynamic learning scheme enables SS-TAD to achieve higher overall detection performance, with fewer training epochs. By design, our single-pass network is very efficient and can operate at 701 frames per second, while simultaneously outperforming the state-of-the-art methods for temporal action detection on THUMOS'14.
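
The training scheme described above, supervising intermediate recurrent modules with auxiliary semantic losses whose influence is gradually relaxed over training, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering, not the authors' released code: the sub-task names (`prop` and `cls` for the proposal and classification modules), the loss choices, and the linear decay schedule are all assumptions made for illustration.

```python
# A minimal sketch (not the authors' implementation) of a dynamic training
# scheme: intermediate modules receive auxiliary "semantic constraint" losses
# whose weight is annealed toward zero as learning progresses, so the final
# detection loss dominates in later epochs.

import torch
import torch.nn as nn


def constraint_weight(epoch: int, total_epochs: int, w0: float = 1.0) -> float:
    """Linearly relax the auxiliary-loss weight from w0 to 0 over training.
    (Hypothetical schedule; the paper's exact schedule may differ.)"""
    return w0 * max(0.0, 1.0 - epoch / total_epochs)


def ss_tad_loss(det_logits, det_targets,
                prop_logits, prop_targets,
                cls_logits, cls_targets,
                epoch: int, total_epochs: int):
    """Joint objective: the final detection loss plus annealed semantic
    constraints on the intermediate proposal and classification modules."""
    bce = nn.BCEWithLogitsLoss()
    ce = nn.CrossEntropyLoss()

    detection_loss = ce(det_logits, det_targets)    # final detection output
    proposal_loss = bce(prop_logits, prop_targets)  # intermediate "actionness"
    classify_loss = ce(cls_logits, cls_targets)     # intermediate class cue

    lam = constraint_weight(epoch, total_epochs)
    return detection_loss + lam * (proposal_loss + classify_loss)
```

Early in training, the auxiliary terms anchor the intermediate modules to their intended semantics; as the weight decays, the network is free to trade off those constraints against end-to-end detection performance, consistent with the dynamic learning scheme the abstract describes.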