Title: End-to-End Learning for Intelligent Signal Processing
Speaker: Yann LeCun, Courant Institute
When/where: Thursday, April 1st, 4:15PM, Room 200 AKW
Abstract:
Machine learning and statistical modeling are at the core of many
recent advances in data mining, forecasting, biological data analysis,
information retrieval, and human-computer interfaces. Some of the
most challenging problems for machine learning can be characterized
as high-dimensional "intelligent" signal processing. Of particular
interest is the problem of detecting and classifying objects in
multidimensional signals (images and videos, sonar, radar, etc).
Learning to classify images from a set of training samples can be seen
as approximating a function from sparse samples in very high
dimension. One is faced with three problems: (1) finding suitable
families of functions (with appropriate parameterizations) from which
to pick the solution, (2) finding suitable loss functions whose
minimization will produce the desired behavior, (3) designing
efficient optimization algorithms that find "good" minima of the loss
function.
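These three ingredients can be made concrete with a toy sketch (a hypothetical illustration, not an example from the talk): a parameterized function family, a loss whose minimum gives the desired behavior, and an optimizer that searches for that minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Function family: linear maps f(x) = W x, parameterized by W.
def f(W, x):
    return W @ x

# (2) Loss function: mean squared error over a training set.
def loss(W, X, Y):
    return np.mean((X @ W.T - Y) ** 2)

# (3) Optimization algorithm: plain gradient descent on the loss.
def train(X, Y, steps=200, lr=0.1):
    W = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(steps):
        grad = 2.0 * (X @ W.T - Y).T @ X / len(X)
        W -= lr * grad
    return W

# Samples of a known linear target; training recovers it.
W_true = rng.standard_normal((2, 5))
X = rng.standard_normal((50, 5))
Y = X @ W_true.T
W_hat = train(X, Y)
print(loss(W_hat, X, Y))  # close to 0
```

In the image-classification setting described above, each of the three choices becomes much harder: the function family must be far richer than linear maps, and the loss surface is no longer convex.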
Arguably, all these difficulties are combined in the problem of
recognizing generic object categories in images (e.g. picking out
people, animals, cars, trucks, or airplanes in natural images),
independently of the particular instance, the pose, the illumination,
and surrounding objects and textures: The input dimension is
overwhelming (>10,000 pixels), the intra-class variabilities are
extremely complex (variation of appearance due to shape, pose, and
illumination), the decision surfaces between categories are highly
non-linear, and the necessary number of training samples is very large.
Cracking this problem forces us to use very peculiar parameterizations
of the family of functions from which the learning process will pick a
solution. This comes down to building very large learning systems
composed of multiple heterogeneous modules with millions of adjustable
parameters, trained on millions of examples so as to optimize a global
performance measure. Training such systems end to end, from raw sensor
data (pixels) to object categories, requires new ways of integrating
heterogeneous trainable modules that are specialized for each
sub-problem so that they can be trained cooperatively.
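The idea of training chained modules cooperatively can be sketched as follows (a minimal hypothetical example, not the talk's actual system): each module implements a forward pass and a backward pass, so the gradient of a single global loss flows from the output all the way back through every module.

```python
import numpy as np

# Two heterogeneous trainable modules; the same forward/backward
# interface lets them be composed and trained end to end.
class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_out, lr):
        grad_in = self.W.T @ grad_out          # pass gradient upstream
        self.W -= lr * np.outer(grad_out, self.x)  # update local params
        return grad_in

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out, lr):
        return grad_out * (1.0 - self.y ** 2)

rng = np.random.default_rng(0)
modules = [Linear(4, 8, rng), Tanh(), Linear(8, 1, rng)]

x = rng.standard_normal(4)
target = np.array([0.5])
for _ in range(300):
    h = x
    for m in modules:                 # forward through the chain
        h = m.forward(h)
    grad = 2.0 * (h - target)         # gradient of squared error
    for m in reversed(modules):       # backward through the chain
        grad = m.backward(grad, lr=0.05)
print(h[0] - target[0])  # residual after training
```

The point of the interface is that any module obeying it, however specialized, can be dropped into the chain and trained jointly with the others against the global performance measure.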
We will first describe a general methodology with which to construct
such large-scale heterogeneous learning machines. We will then show
that training such systems can be seen as the process of minimizing
the average difference between two quantities that are analogous to
the Free Energy in statistical physics. We will describe the
convolutional network architecture, a biologically-inspired trainable
system specifically designed to process 2D signals with robustness to
geometric distortions.
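The two defining operations of such a layer can be sketched in a few lines (an illustrative toy, not the talk's exact architecture): local receptive fields with shared weights (convolution), followed by spatial subsampling, which gives some robustness to small shifts and distortions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation with one shared kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Same weights applied at every image location.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, s=2):
    """Average-pool non-overlapping s x s windows."""
    H, W = fmap.shape
    fmap = fmap[:H - H % s, :W - W % s]
    return fmap.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

fmap = np.tanh(conv2d(image, kernel))   # feature map
pooled = subsample(fmap)                # subsampled map
print(fmap.shape, pooled.shape)         # (6, 6) (3, 3)
```

A full network stacks several such convolution/subsampling stages with multiple feature maps per stage, so that later layers see increasingly abstract, shift-tolerant features.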
We will report results of generic object recognition experiments with
various popular methods using the NORB dataset. This dataset comprises
stereo image pairs of 50 uniform-colored toys under 18 azimuths, 9
elevations, and 6 lighting conditions. The objects belong to 5 generic
categories (with 10 instances for each): four-legged animals, human
figures, airplanes, trucks, and cars. The systems were trained on 5
instances of each category and tested on the 5 remaining instances. These
experiments demonstrate the advantages of the convolutional nets and
the limitations of template-matching-based architectures such as the
ever-popular Support Vector Machines. They also point out the
relative merits of "deep" versus "shallow" architectures.
We will briefly describe a check-reading system, trained with a
discriminative Free Energy Difference criterion, that combines
convolutional networks, trainable graph manipulation modules, and
stochastic language models. This system is integrated into several
commercial recognition engines, and currently processes an estimated
10% to 20% of all the checks written in the US with record accuracy.
We will show several live demonstrations of convolutional nets that
recognize handwriting, detect human faces in images, drive autonomous
robots, and classify generic objects independently of pose and
lighting.
Parts of this work are joint with Fu Jie Huang (NYU)
and Leon Bottou (NEC Labs).