Length: 00:58:27
06 Jul 2014

The combined emergence of very large datasets, powerful parallel computers, and new machine learning methods has enabled the deployment of highly accurate computer perception systems and is opening the door to a wide deployment of AI systems. A key component of systems that can understand natural data is a module that turns the raw data into a suitable internal representation. But designing and building such a module, often called a feature extractor, requires a considerable amount of engineering effort and domain expertise. The main objective of 'Deep Learning' is to come up with learning methods that can automatically produce good representations of data from labeled or unlabeled samples. Deep learning allows us to construct systems that are trained end to end, from raw inputs to ultimate output. Instead of having a separate feature extractor and predictor, deep architectures have multiple stages in which the data is represented hierarchically: features in successive stages are increasingly global, abstract, and invariant to irrelevant transformations of the input.

The convolutional network model (ConvNet) is a particular type of deep architecture that is somewhat inspired by biology and consists of multiple stages of filter banks, interspersed with non-linear operations and spatial pooling. ConvNets have become the record holders for a wide variety of benchmarks and competitions, including object detection, localization, and recognition in images, semantic image segmentation and labeling (2D and 3D), acoustic modeling for speech recognition, drug design, handwriting recognition, biological image segmentation, etc. The most recent speech recognition and image understanding systems deployed by Facebook, Google, IBM, Microsoft, Baidu, NEC, and others use deep learning, and many use convolutional networks. Such systems use very large and very deep ConvNets with billions of connections, trained using backpropagation with stochastic gradient descent and heavy regularization.

But many new applications require unsupervised feature learning methods. A number of methods based on sparse auto-encoders will be presented. Several applications will be shown through videos and live demos, including a category-level object recognition system that can be trained on the fly, a system that can label every pixel in an image with the category of the object it belongs to (scene parsing), a pedestrian detector, and object localization and detection systems that rank first on the ImageNet Large Scale Visual Recognition Challenge data. Specialized hardware architectures that run these systems in real time will also be described.
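To make the stage structure concrete, here is a minimal sketch of a small ConvNet in PyTorch (the framework, layer sizes, 32x32 RGB input, and 10-class output are illustrative assumptions, not the systems described in the talk). Each stage is a filter bank (convolution), a non-linear operation, and spatial pooling, followed by a linear classifier, so the whole pipeline is trainable end to end:

import torch.nn as nn

# Two filter-bank stages (convolution -> non-linearity -> spatial pooling),
# then a linear classifier. Assumes hypothetical 32x32 RGB inputs.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),   # stage 1: filter bank
    nn.ReLU(),                                    # non-linear operation
    nn.MaxPool2d(2),                              # spatial pooling -> 16x16
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # stage 2: filter bank
    nn.ReLU(),
    nn.MaxPool2d(2),                              # spatial pooling -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # classifier over 10 hypothetical classes
)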
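The training recipe mentioned above (backpropagation with stochastic gradient descent and regularization) can be sketched as follows, reusing the model from the previous sketch. The data loader, learning rate, momentum, and weight-decay values are placeholders, not the settings used in the deployed systems; weight decay stands in here for the heavier regularization those systems apply:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
# Weight decay is one simple form of the regularization mentioned above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

for inputs, targets in loader:          # stochastic gradient: one mini-batch at a time
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()                     # backpropagation computes all gradients
    optimizer.step()                    # update every connection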
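For the unsupervised setting, a sparse auto-encoder learns a representation by reconstructing its input while penalizing non-sparse codes, so no labels are required. A minimal sketch under assumed details (784-dimensional flattened inputs, an L1 sparsity penalty, and a hypothetical unlabeled_loader); the talk covers a family of such methods, not this exact formulation:

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Linear(256, 784)
opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)

for x in unlabeled_loader:              # hypothetical loader; no labels needed
    x = x.view(x.size(0), -1)           # flatten each sample to 784 values
    code = encoder(x)                   # the learned feature representation
    recon = decoder(code)
    # Reconstruction error plus an L1 sparsity penalty on the code.
    loss = nn.functional.mse_loss(recon, x) + 1e-3 * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

After training, the encoder alone can serve as the learned feature extractor for a downstream task.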