Computer Vision


We engage in state-of-the-art research into the mathematical theory of computer vision and artificial intelligence, while keeping that research relevant to the needs of society. A particular emphasis of the group has been on real-time understanding and reconstruction of the world around us using mobile cameras, such as those on drones, smart glasses or other robots.

 

Philip Torr of Torr Vision Group has worked on the following projects:

 


Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation
Stuart Golodetz*, Tommaso Cavallari*, Nicholas A. Lord*, Victor A. Prisacariu, David W. Murray,
Philip H. S. Torr

Reconstructing dense, volumetric models of real-world 3D scenes is important for many tasks, but capturing large scenes can take significant time, and the risk of transient changes to the scene goes up as the capture time increases. These are good reasons to want instead to capture several smaller sub-scenes that can be joined to make the whole scene. Achieving this has traditionally been difficult: joining sub-scenes that may never have been viewed from the same angle requires a high-quality relocaliser that can cope with novel poses, and tracking drift in each sub-scene can prevent them from being joined to make a consistent overall scene. Recent advances in mobile hardware, however, have significantly improved our ability to capture medium-sized sub-scenes with little to no tracking drift. Moreover, high-quality regression forest-based relocalisers have recently been made more practical by the introduction of a method to allow them to be trained and used online. In this paper, we leverage these advances to present what to our knowledge is the first system to allow multiple users to collaborate interactively to reconstruct dense, voxel-based models of whole buildings. Using our system, an entire house or lab can be captured and reconstructed in under half an hour using only consumer-grade hardware.
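
As a minimal illustration of the joining step (a sketch under assumed conventions, not the system's actual pipeline), the snippet below shows how a relocaliser's pose estimate lets one sub-scene's geometry be expressed in another's coordinate frame; all names and values are invented for the example:

    import numpy as np

    # Poses are 4x4 rigid-body matrices mapping camera coordinates to a
    # sub-scene's world coordinates. If the same physical frame of sub-scene B
    # is relocalised against sub-scene A, the two poses yield the transform
    # taking B's world frame into A's.
    def relative_transform(pose_in_A, pose_in_B):
        return pose_in_A @ np.linalg.inv(pose_in_B)

    pose_in_B = np.eye(4)                        # frame's pose in B's map
    pose_in_A = np.eye(4)
    pose_in_A[:3, 3] = [1.0, 0.0, 2.0]           # relocalised pose in A's map
    T_A_B = relative_transform(pose_in_A, pose_in_B)

    point_in_B = np.array([0.5, 0.2, 1.0, 1.0])  # homogeneous point from B's model
    point_in_A = T_A_B @ point_in_B              # the same point in A's frame
    # Jointly refining many such estimates is the role of the online
    # inter-agent pose optimisation.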

 


On-the-Fly Adaptation of Regression Forests for Online Camera Relocalisation
Tommaso Cavallari, Stuart Golodetz*, Nicholas A. Lord*, Julien Valentin, Luigi Di Stefano,
Philip H. S. Torr

Camera relocalisation is a key problem in computer vision, with applications as diverse as simultaneous localisation and mapping, virtual/augmented reality and navigation. Common techniques either match the current image against keyframes with known poses coming from a tracker, or establish 2D-to-3D correspondences between keypoints in the current image and points in the scene in order to estimate the camera pose. Recently, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but must be trained offline on the target scene, preventing relocalisation in new environments. In this paper, we show how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. Our adapted forests achieve relocalisation performance that is on par with that of offline forests, and our approach runs in under 150ms, making it desirable for real-time systems that require online relocalisation.
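
A greatly simplified sketch of the adaptation idea follows (toy data structures, not the paper's forest): the split functions of a pre-trained tree stay fixed, while its leaves are emptied and refilled online with feature-to-3D-point examples from the new scene:

    import numpy as np

    # A pre-trained relocalisation tree: split functions (feat, thresh) stay
    # fixed; only the leaves, which store 3D scene points, change per scene.
    class Tree:
        def __init__(self, depth, dim, seed=0):
            rng = np.random.default_rng(seed)
            self.feat = rng.integers(0, dim, size=2 ** depth - 1)
            self.thresh = rng.random(2 ** depth - 1)
            self.depth = depth
            self.leaves = [[] for _ in range(2 ** depth)]

        def leaf_index(self, x):
            node = 0
            for _ in range(self.depth):
                node = 2 * node + 1 + int(x[self.feat[node]] > self.thresh[node])
            return node - (2 ** self.depth - 1)

        def clear(self):                       # forget the pre-training scene
            self.leaves = [[] for _ in self.leaves]

        def add(self, x, point3d):             # online refill from tracked frames
            self.leaves[self.leaf_index(x)].append(point3d)

        def predict(self, x):                  # 2D-to-3D correspondence for RANSAC
            pts = self.leaves[self.leaf_index(x)]
            return np.mean(pts, axis=0) if pts else None

    tree = Tree(depth=8, dim=16)
    tree.clear()
    f = np.random.rand(16)                     # patch feature with known depth/pose
    tree.add(f, np.array([0.1, 0.2, 1.5]))     # its world-space point
    print(tree.predict(f))                     # feeds RANSAC pose estimation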

 


Straight To Shapes
Saumya Jetley*, Michael Sapienza*, Stuart Golodetz, Philip H. S. Torr

Current object detection approaches predict bounding boxes, but these provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to directly regress to objects' shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and occlusion. To achieve this, we use a denoising convolutional auto-encoder to establish an embedding space, and place the decoder after a fast end-to-end network trained to regress directly to the encoded shape vectors. This yields what to the best of our knowledge is the first real-time shape prediction network, running at ~35 FPS on a high-end desktop.
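
The pipeline can be sketched as follows (toy sizes, and a fully connected stand-in for the paper's denoising convolutional auto-encoder): an auto-encoder learns a compact shape code, a detector regresses straight to that code, and the frozen decoder turns the predicted code back into a mask:

    import torch
    import torch.nn as nn

    embed_dim = 20
    # Denoising auto-encoder over 64x64 binary shape masks.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                            nn.Linear(256, embed_dim))
    decoder = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                            nn.Linear(256, 64 * 64), nn.Sigmoid())

    mask = (torch.rand(8, 1, 64, 64) > 0.5).float()          # stand-in shape masks
    noisy = (mask + 0.3 * torch.randn_like(mask)).clamp(0, 1)
    recon = decoder(encoder(noisy)).view_as(mask)            # denoising AE pass
    ae_loss = nn.functional.binary_cross_entropy(recon, mask)

    # A detection network (stubbed here) regresses boxes, class scores and a
    # shape code per object; decoding the code yields the instance mask.
    pred_code = torch.randn(8, embed_dim)
    pred_mask = decoder(pred_code).view(8, 1, 64, 64)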

 


Action Recognition
Harkirat S. Behl, Suman Saha, Gurkirt Singh, Michael Sapienza, Fabio Cuzzolin, Philip H. S. Torr

Human action recognition in challenging video data is becoming an increasingly important research area. Given the growing number of cameras and robots pointing their lenses at humans, the need for automatic recognition of human actions arises, promising applications such as Google-style video search.

 

Struck

Struck: Structured Output Tracking with Kernels (TPAMI 2015 Version)
Sam Hare*, Stuart Golodetz*, Amir Saffari*, Vibhav Vineet, Ming-Ming Cheng, Stephen L. Hicks, Philip H. S. Torr

We present a framework for adaptive visual object tracking based on structured output prediction. By explicitly allowing the output space to express the needs of the tracker, we avoid the need for an intermediate classification step. Our method uses a kernelised structured output support vector machine (SVM), which is learned online to provide adaptive tracking. To allow our tracker to run at high frame rates, we (a) introduce a budgeting mechanism that prevents the unbounded growth in the number of support vectors that would otherwise occur during tracking, and (b) show how to implement tracking on the GPU.
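
The budgeting idea can be illustrated with a much-simplified online kernel machine (this is not Struck's LaRank-based optimiser, and the removal heuristic below is a crude stand-in): once the support set exceeds a fixed budget, the least influential support vector is discarded, bounding evaluation cost:

    import numpy as np

    def gaussian_kernel(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

    class BudgetedScorer:
        def __init__(self, budget=30, lr=0.1):
            self.sv, self.alpha = [], []
            self.budget, self.lr = budget, lr

        def score(self, x):
            # Evaluation cost grows with the support set, hence the budget.
            return sum(a * gaussian_kernel(s, x)
                       for s, a in zip(self.sv, self.alpha))

        def update(self, x, y):                # y = +1 target patch, -1 background
            if y * self.score(x) < 1:          # hinge-style margin violation
                self.sv.append(x)
                self.alpha.append(self.lr * y)
                if len(self.sv) > self.budget:
                    drop = int(np.argmin(np.abs(self.alpha)))   # least influential
                    self.sv.pop(drop)
                    self.alpha.pop(drop)

    scorer = BudgetedScorer()
    for t in range(100):
        scorer.update(np.random.rand(16), 1 if t % 2 == 0 else -1)
    print(len(scorer.sv))                      # never exceeds the budget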

 

SemanticFusion

Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction
Vibhav Vineet*, Ondrej Miksik*, Morten Lidegaard, Matthias Nießner, Stuart Golodetz, Victor A. Prisacariu, Olaf Kahler, David W. Murray, Shahram Izadi, Patrick Perez, Philip H. S. Torr

We propose an end-to-end system that can process the data incrementally and perform real-time dense stereo reconstruction and semantic segmentation of unbounded outdoor environments. The system outputs a per-voxel probability distribution instead of a single label (soft predictions are desirable in robotics, as the vision output is usually fed as input into other subsystems). Our system is also able to handle moving objects more effectively than prior approaches by incorporating knowledge of object classes into the reconstruction process. In order to achieve fast test times, we extensively use the computational power of modern GPUs.
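
A small sketch of per-voxel label fusion (an illustrative Bayesian update, not the system's full inference machinery): each voxel keeps a probability distribution over classes, and every new segmented frame multiplies in per-pixel class likelihoods:

    import numpy as np

    num_classes = 4
    voxel_probs = np.full(num_classes, 1.0 / num_classes)       # uniform prior

    def fuse(voxel_probs, frame_likelihood):
        posterior = voxel_probs * frame_likelihood              # Bayesian update
        return posterior / posterior.sum()                      # renormalise

    for _ in range(10):                                         # ten observations
        likelihood = np.random.dirichlet(np.ones(num_classes))  # of this voxel
        voxel_probs = fuse(voxel_probs, likelihood)

    print(voxel_probs)    # soft per-voxel prediction passed to other subsystems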

 

Andrew Zisserman and Andrea Vedaldi of the Visual Geometry Group have worked on the following areas of research:

 

Sign Language Recognition:

 

Aligning Subtitles in Sign Language Videos

We propose a Transformer architecture to temporally align asynchronous subtitles in sign language videos.

 

Read and Attend: Temporal Localisation in Sign Language Videos

We show that the ability to localise signs emerges from the attention patterns of the Transformer sequence prediction model.

 

Learning to spot signs from multiple supervisors

For a given sign and its corresponding dictionary video, our task is to identify whether and where it has occurred in a continuous sign language video.

 

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

We introduce a new scalable approach to data collection for sign recognition in continuous videos.

 

Upper Body Pose Estimation and Tracking

Fast and accurate upper body pose estimation over long video sequences using a random forest framework with pose structured output.

 

Self-Supervised Learning:

 

Self-supervised Co-training for Video Representation Learning

Self-supervised video representation learning goes beyond instance discrimination by co-training both RGB and optical flow models.
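
The contrastive objective underlying this line of work can be sketched with a generic InfoNCE loss (this omits the paper's co-training loop, in which flow-model similarities mine extra positives for the RGB model and vice versa):

    import torch
    import torch.nn.functional as F

    def info_nce(query, keys, temperature=0.07):
        # query: (N, D) anchors; keys: (N, D) positives, row-aligned; the
        # other N - 1 rows serve as in-batch negatives.
        logits = query @ keys.T / temperature
        labels = torch.arange(query.size(0))      # positives on the diagonal
        return F.cross_entropy(logits, labels)

    q = F.normalize(torch.randn(16, 128), dim=1)  # clip embeddings, view 1
    k = F.normalize(torch.randn(16, 128), dim=1)  # clip embeddings, view 2
    print(info_nce(q, k))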

 

Video Representation Learning by Dense Predictive Coding

Self-supervised video representation learning by predicting spatio-temporal features in the future. The resulting RGB-stream action classification accuracy is higher than that achieved with ImageNet-pretrained weights.

 

Learning Human Pose from Unaligned Data through Image Translation

Learn landmark detectors from unlabelled videos and unaligned pose annotations. No need for paired data/labelled images.

 

Self-Supervised Learning of Geometrically Stable Features Through Probabilistic Introspection

This research aims at using self-supervision for geometry-oriented tasks such as semantic matching and part detection.

 

Audio-Visual Learning:

 

Localizing Visual Sounds the Hard Way

Localize sound sources that are visible in a video by explicitly mining hard samples during training.
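
As a generic sketch of similarity-based localisation (not the paper's exact loss), an audio embedding can be compared against every spatial position of a visual feature map, with high-similarity locations marking the predicted source:

    import torch
    import torch.nn.functional as F

    visual = F.normalize(torch.randn(1, 128, 14, 14), dim=1)  # spatial CNN features
    audio = F.normalize(torch.randn(1, 128), dim=1)           # clip-level audio embedding
    heatmap = torch.einsum('nchw,nc->nhw', visual, audio)     # cosine similarity map
    peak = heatmap.view(-1).argmax()                          # predicted source location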

 

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Transferring knowledge of emotion from faces to voices.

 

Seeing Voices and Hearing Faces: Cross-modal biometric matching

A network is trained to recognise faces from voices alone and vice versa.

 

Understanding and training convolutional neural networks:

 

Small Steps and Giant Leaps: Minimal Newton Solvers for Deep Learning

Curveball is a fast second-order method that can be used as a drop-in replacement for current deep learning solvers.
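
The core ingredient of such minimal second-order solvers is the cheap Hessian-vector product; the snippet below demonstrates it on a toy quadratic using double backpropagation (a generic illustration, not the Curveball update itself):

    import torch

    def hessian_vector_product(loss, params, vector):
        # First backward pass keeps the graph so we can differentiate again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        # Second backward pass: d(grad . v)/d(params) = H v.
        gv = (flat * vector).sum()
        hv = torch.autograd.grad(gv, params)
        return torch.cat([h.reshape(-1) for h in hv])

    w = torch.randn(5, requires_grad=True)
    A = torch.randn(5, 5)
    A = A @ A.T + torch.eye(5)                   # symmetric positive-definite
    loss = 0.5 * w @ A @ w                       # toy quadratic, Hessian = A
    v = torch.randn(5)
    print(torch.allclose(hessian_vector_product(loss, [w], v), A @ v, atol=1e-5))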

 

Deep Image Prior

In this work we show that the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning.
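
The idea can be sketched in a few lines (an assumed toy architecture, not the paper's network): an untrained ConvNet fed a fixed random code is fitted to a single noisy image, and stopping early acts as the regulariser because natural image structure is fitted before the noise:

    import torch
    import torch.nn as nn

    net = nn.Sequential(                        # small untrained generator
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )
    noisy = torch.rand(1, 3, 64, 64)            # stand-in for a noisy photograph
    z = torch.randn(1, 32, 64, 64)              # fixed random input code
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for step in range(500):                     # stopping early is the prior
        opt.zero_grad()
        loss = ((net(z) - noisy) ** 2).mean()
        loss.backward()
        opt.step()
    denoised = net(z).detach()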

 

Understanding Deep Image Representations by Inverting Them

Visualize representations by inverting them back into images with the help of a natural image prior.
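
A sketch of the inversion procedure (illustrative layer choice and weights; pretrained weights would normally be loaded): optimise an image so that its network features match those of a target image, with a total-variation term standing in for the natural image prior:

    import torch
    import torchvision

    cnn = torchvision.models.vgg16(weights=None).features[:16].eval()
    for p in cnn.parameters():                  # pretrained weights would be
        p.requires_grad_(False)                 # used in practice
    target = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        phi0 = cnn(target)                      # representation to invert

    x = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=1e-2)
    for step in range(200):
        opt.zero_grad()
        feat_loss = ((cnn(x) - phi0) ** 2).mean()
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        (feat_loss + 1e-2 * tv).backward()      # feature match + image prior
        opt.step()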

 

Search and retrieval of images and video:

 

Video retrieval using representations from collaborative experts

Collaborative Experts is a framework for combining deep neural networks for text-video retrieval.

 

Faces in Places: Compound Query Retrieval

Retrieving images containing both a target person and a target scene type (e.g. Barack Obama on the beach) from a large dataset of images.

 

Pose-based Video Retrieval

Retrieve humans striking a pose from a database of Hollywood movies in real time.

 

Video-based recognition and understanding:

 

Detecting people looking at each other in videos

The goal is to localise both spatially and temporally pairs of people looking at each other in video sequences.

 

Character Identification in TV series without a Script

The goal of this work is to recognise people under unconstrained conditions automatically, from TV show and feature film material.

 

Seeing the Arrow of Time

This work explores whether it is possible to observe Time's Arrow in a temporal sequence.

 

Counting, detecting, reading and tracking:

 

Amplifying Key Cues for Human-Object-Interaction Detection

This work introduces two methods to amplify key cues, and a method to combine cues when considering the interaction between a human and an object.

 

AutoCorrect: Deep Inductive Alignment of Noisy Geometric Annotations

The goal of this work is to train a model with noisy data, and to correct the registration noise in the annotations.

 

Learning to Count Objects in Images

Learning to count objects in images, e.g. cells in a microscopic image or humans in surveillance video frames.
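
A common formulation of this task, counting by density estimation, is sketched below (the tiny network and sizes are illustrative): regress a per-pixel density map whose integral over the image is the object count:

    import torch
    import torch.nn as nn

    net = nn.Sequential(                         # tiny density regressor
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 1), nn.ReLU(),          # ReLU keeps the density non-negative
    )
    image = torch.rand(1, 3, 128, 128)
    density = net(image)                         # per-pixel object density
    count = density.sum()                        # estimated count = integral
    # Training minimises e.g. ((density - gt_density) ** 2).mean(), where
    # gt_density places a small Gaussian at each annotated object so that it
    # sums to the true count.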

 

Art recognition and search:

 

Visual Search of Paintings

Search a large dataset of paintings on-the-fly for a given object category.

 

Faces to Paintings

Match photographs of people to similar looking paintings in a large corpus.

 

Automatic Annotation of Greek Vases

A method allowing automatic detection of gods and animals in a large dataset of Greek vases.

 

Miscellaneous:

 

Automated Labelling of Cell Cycle Phases

Automatically detect and track cells through a video, labelling cell cycle phase at every time point.

 

Learning to Detect Cells

Detect cells automatically with models learnt from simple annotations.

 

Descriptor Learning Using Convex Optimisation

Learn feature descriptors using convex formulations for keypoint matching and object instance retrieval.