Stochastic optimization


Stochastic gradient descent

Let f: \mathbb{R}^d \rightarrow \mathbb{R} be a smooth real-valued function of w \in \mathbb{R}^d, where d is the dimension. The target problem over the model variable w is expressed as

 \displaystyle{\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w) = \frac{1}{n} \sum_{i=1}^n \underbrace{L(w, x_i, y_i)}_{\text{loss function}} \ + \ \lambda \underbrace{R(w)}_{\text{regularizer}},}

where w \in \mathbb{R}^d represents the model parameter and n denotes the number of samples (x_i, y_i). L(w, x_i, y_i) is the loss function and R(w) is the regularizer with regularization parameter \lambda \geq 0. A wide variety of machine learning (ML) models fit this formulation. Taking L(w, x_i, y_i) = (w^Tx_i - y_i)^2 with x_i \in \mathbb{R}^d, y_i \in \mathbb{R} and R(w) = \| w \|_2^2 yields the l_2-norm regularized linear regression problem (a.k.a. ridge regression) for the n training samples (x_1, y_1), \cdots, (x_n, y_n). For the binary classification problem with class labels y_i \in \{+1,-1\} and R(w) = \| w \|_1, the l_1-norm regularized logistic regression (LR) problem is obtained as f_i(w) = \log(1+\exp(-y_i w^Tx_i)) + \lambda \| w \|_1, which encourages sparsity in the solution w. Other examples include matrix completion, support vector machines (SVM), and sparse principal component analysis, to name but a few.
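
As a concrete illustration, the two objectives above can be written out directly. The following MATLAB snippet is a minimal sketch under assumed synthetic data (the variables n, d, X, y, lambda and the function handles are illustrative only, not part of SGDLibrary):

% Minimal sketch of the two example objectives above (illustrative only, not SGDLibrary's API).
rng(0);
n = 100; d = 10; lambda = 0.1;         % synthetic problem size and regularization parameter
X = randn(n, d);                       % samples x_i stacked as the rows of X
y = sign(randn(n, 1));                 % binary labels y_i in {+1,-1}, also used as regression targets here

% Ridge regression: L(w, x_i, y_i) = (w'*x_i - y_i)^2 and R(w) = ||w||_2^2.
f_ridge = @(w) mean((X * w - y).^2) + lambda * norm(w, 2)^2;

% l1-regularized logistic regression: f_i(w) = log(1 + exp(-y_i*w'*x_i)) + lambda*||w||_1.
f_logreg = @(w) mean(log(1 + exp(-y .* (X * w)))) + lambda * norm(w, 1);

f_ridge(zeros(d, 1))                   % evaluate the ridge objective at w = 0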

A popular choice of algorithm for solving this problem is the gradient descent method, which computes the full gradient at every iteration. However, this computation becomes prohibitively costly when n is extremely large. A popular alternative is the stochastic gradient descent (SGD) algorithm. Because SGD computes the gradient of only a single, randomly selected i-th sample per iteration, the complexity per iteration is independent of the sample size n.
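
This per-iteration cost can be seen in the following minimal MATLAB sketch, which applies SGD with a decaying step size to the ridge regression objective above; it is a generic illustration under assumed parameter names, not SGDLibrary's implementation:

function w = sgd_ridge(X, y, lambda, step0, max_epoch)
    % Minimal SGD sketch for the ridge regression objective above
    % (generic illustration; not SGDLibrary's implementation).
    [n, d] = size(X);
    w = zeros(d, 1);
    t = 0;
    for epoch = 1:max_epoch
        for i = randperm(n)                  % visit each sample once per epoch, in random order
            t = t + 1;
            step = step0 / (1 + step0 * t);  % decaying step-size sequence
            % Stochastic gradient of f_i(w) = (w'*x_i - y_i)^2 + lambda*||w||_2^2,
            % computed from a single sample only, so the cost is independent of n.
            grad_i = 2 * (X(i, :) * w - y(i)) * X(i, :)' + 2 * lambda * w;
            w = w - step * grad_i;
        end
    end
end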

SGD is hindered by a slow convergence rate due to its decaying step-size sequence. In addition, vanilla SGD is a first-order algorithm: it guarantees convergence only to the first-order optimality condition, i.e., \nabla f(w) = {\bf 0}, using gradient information alone. As a result, its performance suffers on ill-conditioned problems due to poor curvature approximation. To address these issues, various advanced techniques have been proposed, as summarized below.

Three recent techniques for acceleration and improvement

  • Variance reduction (VR) techniques, which exploit occasional full gradient computations to reduce the variance of the noisy stochastic gradient. 

Illustration of the stochastic variance reduced gradient (SVRG) method. SVRG is a representative VR algorithm: it reduces the variance of noisy stochastic gradients by means of periodic full gradient computations, which yields a linear convergence rate (a minimal sketch is given after this list and the figure below).
  • Second-order (SO) algorithms, which address the weakness of first-order algorithms on ill-conditioned problems. 
  • Sub-sampled Hessian algorithms, which aim at the second-order optimality condition.
Convergence behavior of gradient-based optimizers. (From Sebastian Ruder's site, http://ruder.io/optimizing-gradient-descent/.)
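
To make the VR idea concrete, the following minimal MATLAB sketch shows the core SVRG update for the ridge regression objective above (a generic illustration under assumed parameters, not SGDLibrary's implementation): the full gradient is recomputed only once per outer loop at a snapshot point and is then used to correct each noisy stochastic gradient.

function w_tilde = svrg_ridge(X, y, lambda, step, max_outer, m)
    % Minimal SVRG sketch for the ridge regression objective above
    % (generic illustration; not SGDLibrary's implementation).
    [n, d] = size(X);
    w_tilde = zeros(d, 1);                                  % snapshot point
    grad_i = @(w, i) 2 * (X(i, :) * w - y(i)) * X(i, :)' + 2 * lambda * w;
    for s = 1:max_outer
        % Full gradient, computed once per outer loop at the snapshot point.
        full_grad = 2 * X' * (X * w_tilde - y) / n + 2 * lambda * w_tilde;
        w = w_tilde;
        for t = 1:m
            i = randi(n);
            % Variance-reduced gradient: unbiased, and its variance shrinks as w and
            % w_tilde approach the solution, which allows a constant step size.
            v = grad_i(w, i) - grad_i(w_tilde, i) + full_grad;
            w = w - step * v;
        end
        w_tilde = w;                                        % update the snapshot point
    end
end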

Software library: SGDLibrary

The SGDLibrary is a pure-MATLAB library providing a collection of stochastic optimization algorithms. Please see the details on the software page.

SGDLibrary is available on GitHub.

Convergence behaviors of some stochastic gradient algorithms.

Publication

SGDLibrary poster (December 2018)

Riemannian generalization of stochastic gradient descent

Please see the details here.