Neural networks are an important class of highly flexible and powerful models inspired by the structure of the brain. They consist of a sequence of interconnected layers, each composed of basic computational units similar to the gates of a classical circuit. Like circuits, they have the capacity to perform simple computational procedures, such as those which might underlie the generating process of the dataset they are trained on. The most popular and successful approach to learning neural networks is to optimize their parameters with respect to an objective function using standard methods for nonlinear optimization. Because basic methods like stochastic gradient descent (SGD) are often very slow for deeply layered neural networks, or for ones with recurrent connections, it is worthwhile to consider more advanced methods. In this thesis we review and analyze various such methods that have been proposed over the past few decades, with a particular focus on approximate-Newton/second-order methods, and develop two of our own, which we call Hessian-free optimization (HF) and Kronecker-factored Approximate Curvature (K-FAC). Our experiments show that K-FAC can be much faster in practice at optimizing deep neural networks than well-tuned SGD with momentum.
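To give a flavor of the second-order methods at issue, the following is an illustrative sketch of the Kronecker-factored approximation underlying K-FAC, written in standard notation rather than the notation fixed later in the thesis. K-FAC approximates the block of the Fisher information matrix associated with a layer's weight matrix $W_\ell$ by a Kronecker product,
\[
  F_\ell \;\approx\; A_{\ell-1} \otimes G_\ell,
  \qquad
  A_{\ell-1} = \mathrm{E}\big[\, a_{\ell-1} a_{\ell-1}^{\top} \big],
  \quad
  G_\ell = \mathrm{E}\big[\, g_\ell\, g_\ell^{\top} \big],
\]
where $a_{\ell-1}$ denotes the layer's input activation vector and $g_\ell$ the gradient of the loss with respect to the layer's pre-activations. Because of the identity $(A_{\ell-1} \otimes G_\ell)^{-1}\operatorname{vec}(V) = \operatorname{vec}\big(G_\ell^{-1}\, V\, A_{\ell-1}^{-1}\big)$, applying the approximate inverse curvature to a gradient reduces to two comparatively small matrix inversions per layer, which is what makes such an update practical at scale.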