Speeding Up Stochastic Gradient Descent

author: Yoshua Bengio, University of Montreal
published: Dec. 29, 2007,   recorded: December 2007,   views: 12918

Description

In order to tackle large-scale learning problems whose solution necessarily involves a large model with many tunable parameters, difficult non-convex optimization has to be performed efficiently. Computational complexity arguments strongly suggest that deep architectures will be necessary to represent the kind of complex functions that AI involves. Unfortunately, training such architectures poses difficult optimization problems, and efficient approximate iterative optimization, rather than the regularization techniques that have been so thoroughly studied over the last two decades, becomes the key to good generalization. Furthermore, because of the size of the data sets involved in such tasks, it is imperative that computation scale no more than linearly with the number of training examples. In many cases, the algorithm to beat is stochastic gradient descent, and comparisons have to be made by looking at the curve of test error versus computation time. Following recent interest in online versions of second-order optimization methods, we present computational tricks that yield a linear-time variant of natural gradient optimization. Another issue, particularly difficult to address in the optimization of multi-layer neural networks, is how to parallelize efficiently. With SMP machines becoming cheaper and easier to use, we compare and discuss different strategies for parallelizing the training of multi-layer neural networks, showing that naive approaches fail but that those taking the communication bottleneck into account yield impressive speed-ups.
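
As a concrete illustration of the baseline mentioned in the abstract, here is a minimal sketch, not taken from the lecture, of plain stochastic gradient descent on a small logistic-regression model, recording test error against elapsed computation time. The synthetic data, the model and all hyperparameters are illustrative assumptions.

    # Minimal SGD baseline sketch (illustrative, not from the lecture):
    # track test error versus elapsed computation time.
    import time
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary classification data (a stand-in for a large data set).
    n_train, n_test, d = 10_000, 2_000, 50
    true_w = rng.normal(size=d)
    X_train = rng.normal(size=(n_train, d))
    y_train = (X_train @ true_w + 0.5 * rng.normal(size=n_train) > 0).astype(float)
    X_test = rng.normal(size=(n_test, d))
    y_test = (X_test @ true_w + 0.5 * rng.normal(size=n_test) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

    w = np.zeros(d)
    lr = 0.05                                  # constant step size, an illustrative choice
    start = time.perf_counter()
    curve = []                                 # (elapsed seconds, test error) pairs

    for epoch in range(5):
        for i in rng.permutation(n_train):
            # One example per update: the cost of a step is O(d), independent of n_train.
            p = sigmoid(X_train[i] @ w)
            grad = (p - y_train[i]) * X_train[i]   # gradient of the log loss on example i
            w -= lr * grad
        test_err = np.mean((sigmoid(X_test @ w) > 0.5) != y_test)
        curve.append((time.perf_counter() - start, test_err))

    for t, err in curve:
        print(f"{t:7.2f} s   test error {err:.3f}")

The printed curve is exactly the kind of test-error-versus-time plot against which any proposed speed-up would be compared.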

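The abstract does not spell out the computational tricks behind the linear-time natural gradient variant, but one standard way to keep such preconditioning linear in the number of parameters is to approximate the gradient covariance with a low-rank matrix built from recent gradients and apply its inverse via the Woodbury identity. The sketch below illustrates that general idea only; it is not the algorithm presented in the talk, and the buffer size k, damping lam and learning rate lr are made-up values.

    # Hedged sketch: low-rank, linear-time natural-gradient-style preconditioning.
    # Illustrative only; not the specific method from the lecture.
    import numpy as np

    def low_rank_natural_step(w, grad, G_buffer, lr=0.01, lam=0.1):
        """One update preconditioned by C ~= lam*I + (1/k) * B B^T, where the columns
        of B = G_buffer are the k most recent stochastic gradients (shape d x k)."""
        d, k = G_buffer.shape
        B = G_buffer / np.sqrt(k)
        # Woodbury identity: (lam*I + B B^T)^{-1} g = (g - B (lam*I_k + B^T B)^{-1} B^T g) / lam
        small = lam * np.eye(k) + B.T @ B          # k x k system, cheap for small k
        coeffs = np.linalg.solve(small, B.T @ grad)
        nat_grad = (grad - B @ coeffs) / lam
        return w - lr * nat_grad

    # Toy usage on a badly conditioned quadratic objective 0.5 * w^T A w (purely illustrative).
    rng = np.random.default_rng(1)
    d, k = 100, 10
    A = np.diag(np.linspace(0.1, 10.0, d))
    w = rng.normal(size=d)
    G_buffer = np.zeros((d, k))                    # ring buffer of the k most recent gradients

    for t in range(500):
        grad = A @ w + 0.01 * rng.normal(size=d)   # noisy gradient of the quadratic
        G_buffer[:, t % k] = grad
        w = low_rank_natural_step(w, grad, G_buffer)

    print("final objective:", 0.5 * w @ A @ w)

For a fixed rank k, each update costs O(d k^2 + k^3), so the overhead over plain stochastic gradient descent remains linear in the number of parameters d.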

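On the parallelization side, one simple strategy that respects the communication bottleneck is to let data-parallel workers run SGD on their own shard of the data and exchange (average) parameters only every few hundred updates, so that synchronization cost is amortized over many local steps. The sketch below is a generic illustration of that idea, not necessarily one of the strategies compared in the talk; the workers are simulated sequentially for clarity, and the problem, shard sizes and sync_every are assumptions.

    # Hedged sketch: data-parallel SGD with infrequent parameter averaging.
    # Workers are simulated sequentially; on an SMP machine each inner loop
    # would run in its own thread or process.  Illustrative, not from the talk.
    import numpy as np

    rng = np.random.default_rng(2)
    P, d, n_per_worker = 4, 20, 5_000
    true_w = rng.normal(size=d)

    # Each worker gets its own shard of a synthetic least-squares problem.
    shards = []
    for _ in range(P):
        X = rng.normal(size=(n_per_worker, d))
        y = X @ true_w + 0.1 * rng.normal(size=n_per_worker)
        shards.append((X, y))

    w_workers = [np.zeros(d) for _ in range(P)]
    lr, sync_every, n_rounds = 0.01, 50, 20

    for r in range(n_rounds):
        # Local phase: each worker performs sync_every SGD steps without communicating.
        for p, (X, y) in enumerate(shards):
            w = w_workers[p]
            for _ in range(sync_every):
                i = rng.integers(n_per_worker)
                err = X[i] @ w - y[i]
                w -= lr * err * X[i]           # squared-error gradient for one example
            w_workers[p] = w
        # Communication phase: one parameter average per round is the only traffic.
        w_avg = np.mean(w_workers, axis=0)
        w_workers = [w_avg.copy() for _ in range(P)]

    print("distance to true_w:", np.linalg.norm(w_avg - true_w))

Raising sync_every trades communication cost against the staleness of the shared parameters; the naive alternative of synchronizing after every single example makes communication the dominant cost.
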
Download slides: eml07_bengio_ssg_01.pdf (1.5 MB)

