Lecture “Optimization theory and applications for machine learning (最適化理論とその応用)” started

The lecture “Optimization theory and applications for machine learning” for the graduate school has started. 

  • Instructor: H. Kasai
  • Abstract: Many problems in machine learning (ML) and statistical learning can be formulated as the optimization of some cost function, possibly under constraints. In the era of big data, the design of efficient and scalable algorithms for large-scale optimization problems has attracted a surge of research interest, from the viewpoint of theory as well as practice. This includes inherently high-dimensional problems involving truly massive data sets. The main goal of this lecture is to expose students to a wide variety of modern algorithmic developments in optimization, especially in convex optimization, and to bring them near the frontier of research in large-scale optimization. The focus is on non-linear programming; note that some important topics, including combinatorial optimization, are not explored.
  • Language: English (and Japanese in Q&A)
  • Style: Real-time online lecture (a recorded video will be provided after each class)
  • Other:
    • All materials are provided via WASEDA Moodle
    • Contact the instructor via email (hiroyuki dot kasai at waseda dot jp)

Contents (tentative)

  1. Introduction 
    1. Notations 
    2. Machine learning and optimization 
    3. First two tours of ML optimization problems
      1. Regression problem: Linear least squares 
      2. Classification problem: Logistic regression 
    4. Empirical risk minimization 
      1. Definition and interest 
      2. Examples 
    5. How to solve them?
      1. Closed-form approach 
      2. Iterative approach 
    6. What do we learn in this lecture?
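As a runnable preview of the closed-form versus iterative distinction in the introduction, here is a minimal NumPy sketch for linear least squares; the problem instance, stepsize, and iteration count are illustrative assumptions, not part of the lecture materials:

```python
import numpy as np

# Linear least squares min_x ||Ax - b||^2 admits the closed-form
# normal-equations solution x = (A^T A)^{-1} A^T b when A has full column rank.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)
x_closed = np.linalg.solve(A.T @ A, A.T @ b)

# Most ML objectives (e.g. logistic regression) have no closed form and must
# be solved iteratively; even least squares can be handled this way.
x_iter = np.zeros(3)
step = 1.0 / np.linalg.eigvalsh(A.T @ A).max()  # 1/L for the smooth quadratic
for _ in range(1000):
    x_iter -= step * (A.T @ (A @ x_iter - b))

print(np.linalg.norm(x_closed - x_iter))  # the two approaches agree
```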
  2. Convexity 
    1. Convex sets 
      1. Definitions 
      2. Examples of convex sets 
      3. Combinations and hulls 
      4. Preservation of convexity of sets 
      5. Practical approach to establish convexity of a set 
      6. Geometry of convex sets 
    2. Convex functions 
      1. Preliminaries 
      2. Definitions 
      3. Examples 
      4. Function characterization 
      5. Operations preserving convexity of functions 
      6. Practical approach to verify convexity of a function 
  3. Smoothness and strong convexity
    1. Lipschitz continuity
    2. Smoothness 
    3. Strong convexity 
    4. Smoothness and strong convexity 
  4. Subgradient and subdifferential 
    1. Subgradient 
    2. Subdifferential 
  5. Optimization preliminaries 
    1. Optimization problem 
      1. Definition and terminology 
      2. Problem types 
    2. Which optimization method is efficient?
      1. Black-box oriented algorithms
      2. What does “efficiency” mean?
  6. Unconstrained optimization I (gradient methods) 
    1. Optimality conditions 
    2. Optimality condition for convex function 
    3. Iterative descent method
      1. Fundamentals 
      2. Stepsize selection 
    4. Gradient Descent (GD) method 
      1. Fundamentals 
      2. Sufficient decrease lemma
      3. Convergence rate analysis for smooth case
      4. Convergence rate analysis for Lipschitz and convex case
      5. Convergence rate analysis for smooth and convex case
      6. Convergence rate analysis for Lipschitz and strongly-convex case
    5. Newton’s method 
      1. Fundamentals 
      2. Convergence analysis 
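The iterative methods of the section above can be previewed with a short sketch of Newton's method; the toy test function f(x) = Σᵢ (xᵢ⁴ + xᵢ²) and the iteration budget are illustrative assumptions:

```python
import numpy as np

def newton(grad, hess, x0, n_iters):
    """Newton's method: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = x0
    for _ in range(n_iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Toy smooth, strongly convex function f(x) = sum_i (x_i**4 + x_i**2),
# minimized at x = 0; its Hessian is diagonal and positive definite everywhere.
grad = lambda x: 4.0 * x**3 + 2.0 * x
hess = lambda x: np.diag(12.0 * x**2 + 2.0)

x = newton(grad, hess, np.array([1.0, -0.5]), 10)
print(np.linalg.norm(x))  # local quadratic convergence drives this near zero
```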
  7. Unconstrained optimization II (optimal methods) 
    1. Momentum (The heavy ball method) 
      1. Fundamentals 
      2. Convergence analysis 
    2. Nesterov’s accelerated gradient descent (AGD) method 
      1. Fundamentals 
      2. Convergence analysis 
  8. Constrained convex optimization I (without equality and inequality) 
    1. Optimality conditions 
    2. Conditional Gradient Descent (CGD) method (a.k.a. Frank–Wolfe method) 
      1. Fundamentals 
      2. Convergence analysis 
    3. Projected Gradient Descent (PGD) method 
      1. Orthogonal projection operator 
      2. Fundamentals 
      3. Convergence analysis
    4. (Block) Coordinate descent (CD) methods
      1. Questions for this approach 
      2. Coordinate descent for unconstrained problem 
      3. Block coordinate descent (BCD) for constrained problem 
      4. Variants of coordinate descent methods 
      5. Example 
      6. Convergence analysis 
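The projected gradient step of the section above fits in a few lines; the toy objective, the constraint set (the Euclidean unit ball, whose orthogonal projection has a closed form), and the stepsize are illustrative assumptions:

```python
import numpy as np

def projected_gd(grad, project, x0, stepsize, n_iters):
    """PGD: a gradient step followed by orthogonal projection onto the set."""
    x = x0
    for _ in range(n_iters):
        x = project(x - stepsize * grad(x))
    return x

# Minimize ||x - c||^2 subject to ||x|| <= 1.
c = np.array([3.0, 4.0])
grad = lambda x: 2.0 * (x - c)
project = lambda x: x / max(1.0, np.linalg.norm(x))  # projection onto unit ball

x = projected_gd(grad, project, np.zeros(2), 0.25, 100)
print(x)  # the minimizer is c rescaled to the unit sphere: [0.6, 0.8]
```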
  9. Constrained convex optimization II (with equality and inequality constraints) 
    1. Karush–Kuhn–Tucker (KKT) conditions 
      1. Lagrangian function 
      2. KKT conditions 
      3. Examples 
    2. Augmented Lagrangian methods 
      1. Equality constraint problem 
      2. General form 
  10. Non-smooth constrained convex optimization I (subgradient method) 
    1. Optimality conditions 
    2. Subgradient method 
      1. Fundamentals 
      2. Convergence analysis 
    3. Projected subgradient method 
      1. Fundamentals
      2. Convergence analysis 
    4. Mirror descent method 
      1. Bregman divergence 
      2. Mirror descent method 
  11. Non-smooth constrained convex optimization II (proximal methods) 
    1. Proximal operator (mapping) 
    2. Moreau decomposition and Moreau envelope
    3. Proximal algorithm (Proximal minimization algorithm, Proximal point algorithm) 
    4. Proximal gradient method (PG, ISTA)
      1. Fundamentals 
      2. Convergence analysis 
    5. Fast proximal gradient method (APG, FISTA) 
      1. Fundamentals 
      2. Convergence analysis 
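For the ℓ1-regularized least-squares (lasso) objective, the proximal gradient method of the section above reduces to gradient steps followed by elementwise soft-thresholding; the problem instance and regularization weight below are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iters):
    """Proximal gradient (ISTA) for min_x 0.5||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant of the quadratic
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]  # sparse ground truth
b = A @ x_true                 # noiseless observations
x = ista(A, b, lam=0.1, n_iters=2000)
print(x)  # approximately sparse and close to x_true
```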
  12. Non-smooth constrained convex optimization III (Non-Euclidean method)
    1. Bregman divergence
    2. Mirror Descent (MD) method
      1. Convergence analysis for a fixed number of iterations
      2. Convergence analysis for a dynamic stepsize rule
      3. Numerical example
    3. Bregman Proximal Gradient (BPG) (Mirror Descent Composite (MD-c))
      1. Fundamentals
      2. Connection to Proximal Subgradient method
  13. Duality
    1. First look at duality 
    2. Lagrange duality in LP 
    3. Lagrange duality 
      1. Definitions and properties (weak duality)
      2. Strong duality 
      3. Connection to KKT condition in strong duality 
      4. Examples 
    4. Conjugate function 
    5. Fenchel duality 
      1. Fundamentals 
      2. Fenchel duality examples 
    6. Why dual?
  14. Dual methods
    1. Dual projected subgradient method
    2. Dual subgradient method and dual gradient ascent with affine constraint
    3. Dual decomposition
    4. Dual proximal algorithm and augmented Lagrangian method
      1. Dual proximal gradient
      2. Connection to augmented Lagrangian method (method of multiplier) 
  15. Advanced method I
    1. Alternating direction method of multipliers (ADMM)
    2. Primal-dual hybrid proximal method
  16. Advanced method II (smoothing methods) 
    1. Smoothable function 
    2. Moreau–Yosida regularization
      1. Definition 
      2. Properties 
    3. Nesterov’s smoothing 
      1. Fundamentals 
      2. Definition 
      3. Example: f(x)=|x|
      4. Connection to the Moreau–Yosida envelope
      5. Algorithm 
      6. Convergence analysis
  17. Stochastic gradient
    1. Fundamentals
    2. Online stochastic optimization
    3. Stochastic gradient descent (SGD)   
    4. Accelerated variants of SGD
      1. SAG (Stochastic average gradient), SAGA   
      2. SVRG (Stochastic variance reduced gradient)   
      3. Adaptive stochastic gradient (or adaptive learning rate)   
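The basic SGD iteration of the section above also fits in a few lines; the interpolation (noiseless) setting, stepsize, and epoch count are illustrative assumptions under which a constant stepsize suffices:

```python
import numpy as np

def sgd(A, b, stepsize, n_epochs, rng):
    """SGD for min_x (1/n) sum_i 0.5*(a_i^T x - b_i)^2, one sample per step."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):  # random reshuffling each epoch
            x -= stepsize * (A[i] @ x - b[i]) * A[i]  # single-sample gradient
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true  # noiseless, so SGD can drive the error to zero
x = sgd(A, b, stepsize=0.02, n_epochs=200, rng=rng)
print(np.linalg.norm(x - x_true))
```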
  18. Appendix : Mathematical preliminaries 
    1. Norms 
    2. Basic topology 
    3. Classification of matrices 
    4. Eigenvalues and eigenvectors 
    5. Singular value decomposition (SVD) 
    6. Dual space and dual norm