Module 9: Optimization


Types of optimization problems

First, what is an optimization problem?

Discrete optimization:

Continuous optimization:

Function vs. parameter (or variable) optimization:

Ways to categorize continuous (parameter) optimization problems:

In this module:


What we learned in calculus

Let's go back to our example:

Exercise 3: Go back to your calc book and find an example of a "hard to differentiate" function.
Exercise 4: What's an example of a function that's continuous but not differentiable? Can you think of a function that's continuous everywhere but differentiable nowhere? [Hint: it's rather bizarre].
Exercise 5: Consider the function
          f(x) = x/(μ1-λx) + (1-x)/(μ2-λ(1-x)).
Compute the derivative f'(x). Can you solve f'(x)=0?

One other important point from our calc course:


Simple bracketing search

Key ideas:

  • First, find a large enough initial range (bracket), [a,b].
  • Pick values for the algorithm parameters
              M = number of intervals.
              N = number of iterations.

  • Then, divide the bracket into M intervals
              => Let δ = (b-a)/M.

  • Evaluate f(x) at every interval boundary.

  • Pick the best such x
              => Call this x*.

  • Set the new bracket to be [x*-δ, x*+δ].

  • Repeat (for a total of N times).

  • Pseudocode:
         Algorithm: bracketSearch (a, b)
         Input: the initial range [a,b]
         1.    for i=1 to N                 
         2.        δ = (b-a) / M            // Divide current bracket.
         3.        x* = a
         4.        bestf = f(x*)
         5.        for x=a to b step δ      // Search for best x in current bracket
         6.            f = f(x)
         7.            if f < bestf
         8.                bestf = f
         9.                x* = x
         10.           endif
         11.       endfor
         12.       a = x* - δ               // Shrink bracket.
         13.       b = x* + δ
         14.   endfor
         15.   return x*
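
The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the course's BracketSearch.java: the class name BracketSearchSketch, the method signature, and the test function (t-2)^2 are assumptions made for the example. The inner loop indexes the boundaries as a + jδ instead of repeatedly adding δ, which avoids accumulating floating-point drift.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the bracket-search pseudocode. */
final class BracketSearchSketch {

    /** Returns the best x found after n rounds of splitting the bracket into m intervals. */
    static double bracketSearch(DoubleUnaryOperator f, double a, double b, int m, int n) {
        double bestX = a;
        for (int i = 0; i < n; i++) {
            double delta = (b - a) / m;          // Divide current bracket.
            bestX = a;
            double bestF = f.applyAsDouble(a);
            for (int j = 1; j <= m; j++) {       // Visit every interval boundary.
                double x = a + j * delta;
                double fx = f.applyAsDouble(x);
                if (fx < bestF) {
                    bestF = fx;
                    bestX = x;
                }
            }
            a = bestX - delta;                   // Shrink bracket around the best point.
            b = bestX + delta;
        }
        return bestX;
    }

    public static void main(String[] args) {
        // Assumed test function with its minimum at x = 2.
        double xStar = bracketSearch(t -> (t - 2) * (t - 2), 0, 10, 10, 8);
        System.out.println("x* is approximately " + xStar);
    }
}
```

Each round evaluates f at M+1 points, which is worth keeping in mind for the exercises below.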
         
Exercise 6: Download and execute BracketSearch.java.
  • What is the running time in terms of M and N?
  • If we keep MN constant (e.g., MN=24), what values of M and N produce best results?
Exercise 7: Draw an example of a function for which bracket-search fails miserably, that is, the true minimum is much lower than what's found by bracket search even for large M and N.

How does one evaluate such algorithms?

  • In the above case, the actual computation inside the innermost loop was simple
              => Computing f(x) takes constant time.

  • However, in many real-world examples
              => Computing f(x) can take a lot of time.

  • Example: f(x) may be the result of solving a differential equation
              => Computing f(x) itself needs multiple iterations.

  • Thus, one would like to reduce the number of function evaluations.
Exercise 8: What is the number of function evaluations in terms of M and N for the bracket-search algorithm?

Can bracket search work for multivariate functions such as f(x,y) = 2x + 3y^2?

  • Yes, one simply divides every dimension at each step
              => The brackets become "cells"

  • For example, in 2D (2 variables)
         Algorithm: bracketSearch (a1, b1, a2, b2)
         Input: the initial ranges for each dimension [a1,b1] and [a2,b2]
         1.    for i=1 to N                 
                   // The ranges could be different in each dimension.
         2.        δ1 = (b1-a1) / M
         3.        δ2 = (b2-a2) / M
         4.        x1* = a1,  x2* = a2
         5.        bestf = f(x1*, x2*)
         6.        for x1=a1 to b1 step δ1      
         7.            for x2=a2 to b2 step δ2      
         8.                f = f(x1,x2)
         9.                if f < bestf
         10.                   bestf = f
         11.                   x1* = x1,  x2* = x2
         12.               endif
         13.           endfor
         14.       endfor
                   // Shrink cell.
         15.       a1 = x1* - δ1,   b1 = x1* + δ1
         16.       a2 = x2* - δ2,   b2 = x2* + δ2
         17.   endfor
         18.   return x1*, x2*
         

  • Notice that we've used the notation (x1,x2) instead of (x,y).
              => For an n-dimensional space we'd use (x1,x2, ..., xn) to represent a particular point.
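
A two-variable version might look like the sketch below. The class name MultiBracketSketch and the example function (x1-1)^2 + (x2+2)^2 are assumptions chosen for illustration; this is not the course's MultiBracketSearch.java.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch of 2D bracket search: each round scans an (m+1) x (m+1) grid of cell corners. */
final class MultiBracketSketch {

    static double[] bracketSearch2D(DoubleBinaryOperator f,
                                    double a1, double b1, double a2, double b2,
                                    int m, int n) {
        double best1 = a1, best2 = a2;
        for (int i = 0; i < n; i++) {
            double d1 = (b1 - a1) / m;           // The ranges may differ per dimension.
            double d2 = (b2 - a2) / m;
            best1 = a1;
            best2 = a2;
            double bestF = f.applyAsDouble(a1, a2);
            for (int j = 0; j <= m; j++) {
                for (int k = 0; k <= m; k++) {
                    double x1 = a1 + j * d1;
                    double x2 = a2 + k * d2;
                    double fx = f.applyAsDouble(x1, x2);
                    if (fx < bestF) {
                        bestF = fx;
                        best1 = x1;
                        best2 = x2;
                    }
                }
            }
            a1 = best1 - d1;  b1 = best1 + d1;   // Shrink cell.
            a2 = best2 - d2;  b2 = best2 + d2;
        }
        return new double[] {best1, best2};
    }

    public static void main(String[] args) {
        // Assumed example with its minimum at (1, -2).
        double[] p = bracketSearch2D((u, v) -> (u - 1) * (u - 1) + (v + 2) * (v + 2),
                                     -5, 5, -5, 5, 10, 8);
        System.out.println("x1* = " + p[0] + ", x2* = " + p[1]);
    }
}
```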
Exercise 9: What is the number of function evaluations in terms of M and N for the 2D bracket-search algorithm? How does this generalize to n dimensions?
Exercise 10: Add code to MultiBracketSearch.java to find the minimum of f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.

When to stop?

  • We have fixed the number of iterations at N.

  • We could, instead, stop when some desired accuracy is achieved.

  • Let's try the following:
    • Let f_k be the best value after k iterations.
    • We will compare f_k and f_(k-1).
                If |f_k - f_(k-1)| < ε, then stop.
    • Here, ε is some suitably small number.

  • In pseudocode:
         Algorithm: bracketSearch (a, b)
         Input: the initial range [a,b]
         1.    bestf = some-large-value
         2.    prevBestf = bestf + 2ε         // So that we enter the loop.
         3.    N = 0
         4.    while |bestf - prevBestf| > ε
         5.        prevBestf = bestf          // Remember the best value from the previous iteration.
         6.        δ = (b-a) / M
         7.        x* = a
         8.        bestf = f(x*)
         9.        for x=a to b step δ
         10.           f = f(x)
         11.           if f < bestf
         12.               bestf = f
         13.               x* = x
         14.           endif
         15.       endfor
         16.       a = x* - δ
         17.       b = x* + δ
         18.       N = N + 1                  // Track N for printing/evaluation
         19.   endwhile
               // Print N if desired.
         20.   return x*
         

  • How do we choose ε?
    • If, in our problem, f values happen to be very large (e.g., 10^6),
                => ε=0.1 may be unnecessarily small.
    • If, on the other hand, f values happen to be very small (e.g., 10^-6),
                => ε=0.1 may be too large.
                => No optimization occurs.

  • A better solution:
              If |(f_k - f_(k-1)) / f_(k-1)| < ε, then stop.
              => Thus, if the proportional change is very small, stop.
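
To see the difference between the two tests on the magnitudes mentioned above, here is a small sketch. The class and method names are made up for illustration, and the relative test assumes the previous value is nonzero.

```java
/** Hypothetical helpers contrasting the absolute and proportional stopping tests. */
final class StoppingRuleSketch {

    /** Absolute test: stop when |f_k - f_(k-1)| < eps. */
    static boolean absoluteStop(double fk, double fkPrev, double eps) {
        return Math.abs(fk - fkPrev) < eps;
    }

    /** Proportional test: stop when |(f_k - f_(k-1)) / f_(k-1)| < eps. Assumes fkPrev != 0. */
    static boolean relativeStop(double fk, double fkPrev, double eps) {
        return Math.abs((fk - fkPrev) / fkPrev) < eps;
    }

    public static void main(String[] args) {
        double eps = 0.1;
        // f near 10^-6: the value halved, yet the absolute test already says "stop".
        System.out.println(absoluteStop(1e-6, 2e-6, eps));      // true
        System.out.println(relativeStop(1e-6, 2e-6, eps));      // false
        // f near 10^6: a change of one part in a million keeps the absolute test running.
        System.out.println(absoluteStop(1e6 + 1, 1e6, eps));    // false
        System.out.println(relativeStop(1e6 + 1, 1e6, eps));    // true
    }
}
```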
Exercise 11: Modify BracketSearch2.java to use the proportional-difference stopping condition.

Golden ratio search

First, an important observation:

  • Suppose a < b are two real numbers.
              => e.g., they represent an interval [a,b].

  • Next, let r be a number strictly between 0 and 1.
              => i.e., 0 < r < 1.

  • Now consider the numbers c = ra + (1-r)b, and d = (1-r)a + rb.

  • Then, it is true that a < c < b
              => That is, c is between a and b.

  • Similarly, a < d < b

  • Intuition:
    • Pick r=0.3 and a=4, b=9.
    • Then, c = 0.3*4 + 0.7*9 = 7.5, a weighted average of 4 and 9
                => weighted average must be between 4 and 9.
    Exercise 12: Prove this result.

Next, let's look at the ideas in ratio search (before we tackle golden-ratio search):

  • Recall that, in bracket-search, we fixed the number of intervals M.

  • In ratio search, we will use M=1 but adjust the ends.

  • Start with some interval [a,b].

  • Compute the ends of a smaller interval contained within:
              c = ra + (1-r)b  and  d = (1-r)a + rb, with 1/2 < r < 1 so that a < c < d < b.

  • At each step, shrink the interval to either [a,d] or [c,b].

    • If f(c) ≤ f(d), set b = d.
    • If f(c) > f(d), set a = c.

  • Then, repeat until interval is small enough to stop.

  • Why does this work?

    • First, we assume the function is unimodal
                => f(x) decreases from a to the optimal x, and increases after that.
    • If f(c) ≤ f(d), then the minimum cannot occur to the right of d.
                => We can shrink the interval on the right.
    • Same reasoning for the left side.

  • After shrinking the interval, we again compute two interior points and repeat.

Golden-ratio search:

  • Recall what happens after shrinking an interval:

    • We re-compute new c and d values after shrinking.

  • Instead, suppose we could re-use one of the previous values:

  • How do we arrange for this to happen? Is there a value of r that will make this happen?

  • First, observe


              (c-a) / (b-a) = (1-r).
              => (1-r) = ratio of the distance c-a to the whole interval

  • Similarly,
              (d-a) / (b-a) = r.
    Exercise 13: Prove this result.

  • Next, consider two iterations, where the first shrinks [a,b] to [a',b'] = [a,d] and we want the new d' to coincide with the old c:

              The ratio r
              = (d' - a') / (b' - a')
              = (d' - a') / (d - a)
              = (c - a) / (d - a)
              = (1-r)(b-a) / ( r(b-a) )
              = (1-r) / r.

  • Simplifying, we get r^2 + r - 1 = 0.

  • This famous quadratic has the positive solution r = 0.618 (approx.)
              => called the golden ratio (strictly, its reciprocal: the golden ratio itself is about 1.618).
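
For reference, the steps from r = (1-r)/r to the numerical value can be written out with the standard quadratic formula, keeping only the root that lies between 0 and 1:

```latex
\begin{align*}
r = \frac{1-r}{r}
  &\;\Longrightarrow\; r^2 = 1 - r
   \;\Longrightarrow\; r^2 + r - 1 = 0, \\
r &= \frac{-1 \pm \sqrt{1 + 4}}{2} = \frac{-1 \pm \sqrt{5}}{2}, \\
r &= \frac{\sqrt{5} - 1}{2} \approx 0.618
   \quad \text{(the root with } 0 < r < 1\text{)}, \qquad 1 - r \approx 0.382 .
\end{align*}
```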

  • Pseudocode:
         Algorithm: goldenRatioSearch (a, b)
         Input: the initial range [a,b]
         1.    c = ra + (1-r)b,  d = (1-r)a + rb,  fc = f(c),  fd = f(d)     // Initial interior points, with r = 0.618.
         2.    while |(c-d)/d| > ε
         3.        if fc ≤ fd
         4.            b = d                  // Shrink from right side.
         5.            d = c                  // Re-use old c, f(c).
         6.            fd = fc
         7.            c = ra + (1-r)b        // Compute new c, f(c).
         8.            fc = f(c)
         9.        else
         10.           a = c                  // Shrink from left side.
         11.           c = d                  // Re-use old d, f(d).
         12.           fc = fd  
         13.           d = (1-r)a + rb        // Compute new d, f(d).
         14.           fd = f(d)
         15.       endif
         16.   endwhile
         17.   return (c+d)/2
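
One possible Java rendering of this pseudocode is sketched below. It is not the GoldenRatio.java template: the class name is hypothetical, and the loop stops when the interval width b - a drops below a tolerance, which is a substitution for the |(c-d)/d| test used above.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of golden-ratio search for a unimodal function on [a,b]. */
final class GoldenRatioSketch {

    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;   // approximately 0.618

    static double goldenRatioSearch(DoubleUnaryOperator f, double a, double b, double tol) {
        double c = R * a + (1 - R) * b;
        double d = (1 - R) * a + R * b;
        double fc = f.applyAsDouble(c);
        double fd = f.applyAsDouble(d);
        while (b - a > tol) {                    // Assumed stopping rule: interval width.
            if (fc <= fd) {
                b = d;                           // Shrink from right side.
                d = c;                           // Re-use old c, f(c).
                fd = fc;
                c = R * a + (1 - R) * b;         // Only one new evaluation per iteration.
                fc = f.applyAsDouble(c);
            } else {
                a = c;                           // Shrink from left side.
                c = d;                           // Re-use old d, f(d).
                fc = fd;
                d = (1 - R) * a + R * b;
                fd = f.applyAsDouble(d);
            }
        }
        return (c + d) / 2;
    }

    public static void main(String[] args) {
        // Assumed unimodal test function with its minimum at x = 2.
        System.out.println(goldenRatioSearch(t -> (t - 2) * (t - 2), 0, 10, 1e-6));
    }
}
```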
         
Exercise 14: Describe what could go wrong if we replaced the while-condition with
     2.    while |(fc-fd)/fd| > ε
     
How would you address this problem?
Exercise 15: Implement golden-ratio search in this template: GoldenRatio.java.

Gradient descent

Let's start by understanding what gradient means:

  • Consider a (single-dimensional) function f(x):

    • Let f'(x) denote the derivative of f(x).
    • The gradient at a point x is the value of f'(x).
                => Graphically, the slope of the tangent to the curve at x.

  • Observe the following:

    • To the left of the optimal value x*, the gradient is negative.
    • To the right, it's positive.

  • We seek an iterative algorithm of the form
         while not over
             if gradient < 0
                 move rightwards
             else if gradient > 0
                 move leftwards
             else
                 stop               // gradient = 0 (unlikely in practice, of course)
             endif
         endwhile
         

  • The gradient descent algorithm is exactly this idea:
         while not over
             x = x - α f'(x)
         endwhile
         
    Here, we add a scaling factor α in case f'(x) values are of a different order-of-magnitude.

  • Why we need α
    • For example, it could be that x=0.1, x*=0 and f'(0.1)=1000.
    • Then one iterative step without α would produce x = 0.1 - 1000 = -999.9
                => Which would be way out of bounds.
    • For such a problem, we'd use a small stepsize such as α = 0.00001 so that
                x = 0.1 - 0.00001*1000 = 0.09

  • The algorithm parameter α is sometimes called the stepsize.

  • In pseudocode:
         Algorithm: gradientDescent (a, b)
         Input: the range [a,b]
         1.    x = a                      // Alternatively, x = b.
         2.    while |f'(x)| > ε
         3.        x = x - α f'(x)        // Note: f'() is evaluated at current value of x (before changing x).
         4.    endwhile
         5.    return x
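
As a rough Java sketch of this loop (the class name, the iteration cap maxIter, and the example f(x) = (x-3)^2 are assumptions added for illustration, not part of GradientDemo.java):

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of one-dimensional gradient descent. */
final class GradientDescentSketch {

    /**
     * Iterates x = x - alpha * f'(x) until |f'(x)| <= eps.
     * maxIter is an added safeguard against a badly chosen alpha.
     */
    static double gradientDescent(DoubleUnaryOperator fPrime, double x0,
                                  double alpha, double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            double grad = fPrime.applyAsDouble(x);   // Evaluated at the current x, before the update.
            if (Math.abs(grad) <= eps) {
                break;
            }
            x = x - alpha * grad;
        }
        return x;
    }

    public static void main(String[] args) {
        // Assumed example: f(x) = (x-3)^2, so f'(x) = 2(x-3) and the minimum is at x = 3.
        System.out.println(gradientDescent(t -> 2 * (t - 3), 0.0, 0.1, 1e-6, 10_000));
    }
}
```

With α = 0.1 on this example, each step multiplies the distance to the minimum by 0.8, so the iterates approach x = 3 steadily.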
         

  • Stopping conditions:
    • The obvious stopping condition is to see whether f'(x) is close enough to zero.
    • However, this may not always work:
                => If α is too small, the gradient may not get close enough to zero within a reasonable number of iterations.
    • Thus, it may help to also evaluate the actual proportional change in x, that is, |(prevX-x)/prevX|.
Exercise 16: Download and execute GradientDemo.java.
  • How many iterations does it take to get close to the optimum?
  • What is the effect of using a small α (e.g., α=0.001)?
  • In the method nextStep(), print out the current value of x and the value of αf'(x) before the update.
  • Set α=1. Explain what you observe.
  • What happens when α=10?

Picking the right stepsize:

  • Clearly, the performance of gradient descent is sensitive to the choice of the stepsize α.
    • If α is too small
                => It can take too long to converge.
    • If α is too large
                => It can diverge or oscillate, and never converge.

  • One rule of thumb: the term αf'(x) should be an order-of-magnitude less than x.
              => This way, the changes in x are small relative to x.

  • Another idea: try different α values in each iteration:
    • Start with a small α.
    • Gradually increase until it "does something bad".
    • What is "bad"?
                => Causes an increase in f(x) value.

    • This idea is called line-search
                => Because we're searching along the (single) dimension of α.

  • At first, it may seem that one could merely try increasing α:
      Algorithm: gradientDescentLineSearch (a, b)
      Input: the range [a,b]
      1.    x = a                      
      2.    while |f'(x)| > ε
      3.        α_trial = α_small              // α_small is the first value to try.
      4.        do                             // Search for the right alpha.
      5.            α = α_trial
      6.            α_trial = α + δα
      7.            f = f(x - α f'(x))
      8.            ftrial = f(x - α_trial f'(x))
      9.        while ftrial < f                // Keep increasing stepsize until you get an increase in f
      10.       x = x - α f'(x)  
      11.   endwhile
      12.   return x
             

  • However, this approach has two problems:
    • It is highly dependent on the choice of δα.
    • What if the true best value is extremely small (smaller than δα)?
                => We'd overlook it because it falls inside the first step, between α_small and α_small + δα.

  • The better way is to realize that we can use bracketing search, which automatically adjusts to the "scale" by shrinking intervals:
      Algorithm: gradientDescentLineSearch (a, b)
      Input: the range [a,b]
      1.    x = a                      
      2.    Pick α_small                  // Left end of α interval
      3.    Pick α_big                    // Right end of α interval
      4.    while |f'(x)| > ε
                // Define the function g(α) = f(x - α f'(x)) in the interval [α_small, α_big]
      5.        α* = bracketSearch (g, α_small, α_big)
      6.        x = x - α* f'(x)  
      7.    endwhile
      8.    return x
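
The combination of gradient descent with a bracket search over α could be sketched as follows. All names here are hypothetical, the bracket-search parameters (M=10, N=6) are arbitrary choices, and maxIter is an added safeguard.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: gradient descent where each stepsize is chosen by bracket search. */
final class LineSearchSketch {

    /** Plain bracket search over [a,b] with m intervals and n rounds. */
    static double bracketSearch(DoubleUnaryOperator g, double a, double b, int m, int n) {
        double best = a;
        for (int i = 0; i < n; i++) {
            double delta = (b - a) / m;
            best = a;
            double bestG = g.applyAsDouble(a);
            for (int j = 1; j <= m; j++) {
                double alpha = a + j * delta;
                double value = g.applyAsDouble(alpha);
                if (value < bestG) {
                    bestG = value;
                    best = alpha;
                }
            }
            a = best - delta;
            b = best + delta;
        }
        return best;
    }

    static double gradientDescentLineSearch(DoubleUnaryOperator f, DoubleUnaryOperator fPrime,
                                            double x0, double alphaSmall, double alphaBig,
                                            double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            final double grad = fPrime.applyAsDouble(x);
            if (Math.abs(grad) <= eps) {
                break;
            }
            final double xCur = x;
            // g(alpha) = f(x - alpha f'(x)): the objective seen along the search direction.
            DoubleUnaryOperator g = alpha -> f.applyAsDouble(xCur - alpha * grad);
            double alphaStar = bracketSearch(g, alphaSmall, alphaBig, 10, 6);
            x = xCur - alphaStar * grad;
        }
        return x;
    }

    public static void main(String[] args) {
        // Assumed example: f(x) = (x-3)^2 with derivative 2(x-3).
        System.out.println(gradientDescentLineSearch(
                t -> (t - 3) * (t - 3), t -> 2 * (t - 3), 0.0, 0.0, 1.0, 1e-6, 100));
    }
}
```

Note that every outer iteration now costs many evaluations of f, so the line search trades fewer iterations for more function evaluations per iteration.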
             

Let's now turn to an important concept that pervades all of optimization: local vs. global minima.

Exercise 17: Download GradientDemo2.java and examine the function being optimized.

  • Fill in the code for computing the derivative.
  • Try an initial value of x at 1.8. Does it converge?
  • Next, try an initial value of x at 5.8. What is the gradient at the point of convergence?

About local vs. global optima:

  • A zero gradient does not, by itself, mean we've found the global optimum.

  • A function can have several different local minima, as we've seen.

    Exercise 18: Can a function have several global minima?

  • The gradient f'(x) at a point x merely describes the behavior of the function f near that point.

  • A gradient-descent algorithm will find a local minimum
              => There's no guarantee that it'll find the global minimum.

  • In general, apart from searching the whole space of solutions, there's no method that guarantees finding a global minimum.

  • Thus, finding a local minimum is considered a high-enough standard for an algorithm.

What if we cannot compute the gradient?

  • For some functions, a formula for the gradient may be difficult to obtain.

  • However, it's easy to approximate the gradient:
    • Pick some small value s.
    • Compute
                => f'approx(x) = (f(x+s) - f(x)) / s

  • In pseudocode:
         Algorithm: gradientDescentApprox (a, b)
         Input: the range [a,b]
         1.    x = a                      
         2.    f'approx = 2ε
         3.    while |f'approx| > ε
         4.        f'approx = (f(x+s) - f(x)) / s       // s is an algorithm parameter.
         5.        x = x - α f'approx       
         6.    endwhile
         7.    return x
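
A minimal Java sketch of the approximate-gradient version is shown below. The class name, the iteration cap, and the example function are assumptions for illustration (this is not GradientDemo3.java). For f(x) = (x-3)^2 the forward difference equals 2(x-3) + s, so the loop settles at x = 3 - s/2 and not at the true minimum.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of gradient descent with a forward-difference gradient. */
final class ApproxGradientSketch {

    static double gradientDescentApprox(DoubleUnaryOperator f, double x0, double alpha,
                                        double s, double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            // s is an algorithm parameter: the size of the finite-difference step.
            double fApprox = (f.applyAsDouble(x + s) - f.applyAsDouble(x)) / s;
            if (Math.abs(fApprox) <= eps) {
                break;
            }
            x = x - alpha * fApprox;
        }
        return x;
    }

    public static void main(String[] args) {
        // Assumed example: f(x) = (x-3)^2 with s = 0.01.
        System.out.println(gradientDescentApprox(t -> (t - 3) * (t - 3), 0.0, 0.1, 0.01, 1e-6, 10_000));
    }
}
```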
         
Exercise 19: Add code to GradientDemo3.java to implement approximate gradients. Use s=0.01. Explain what could go wrong if s is too large.

Theoretical issues:

  • The big questions:
              Does the gradient-descent algorithm always converge to a local minimum?
              Under what conditions does the algorithm converge?

  • Theoreticians usually describe such iterative algorithms using this notation:
              x(n+1) = x(n) - α f'(x(n))
    Here, x(n) is the n-th iterate.

  • The convergence question:
    • Let S* be the set of local minima.
    • Question 1: Does the sequence x(n) have a limit?
    • Question 2: If so, is the limit in S*?

  • It can be shown that:
    • If the function f is "smooth" (twice-differentiable);
    • and if α is small enough.
    Then, x(n) → x*, where x* is some point in S*.

  • In the case that we use approximate gradients, we need an additional condition:
    • Recall that our approximate gradient was computed as:
                => f'approx(x) = (f(x+s) - f(x)) / s
    • We'll write the algorithm as
               while |f'approx| > ε
                   f'approx = (f(x+s) - f(x)) / s 
                   x = x - α f'approx       
                   n = n + 1
               endwhile
           
    • Because s is never exactly zero, the approximate gradient is slightly off, so the iterates may never converge to the true minimum.
    • To solve this problem, we need to gradually decrease s as we iterate.
                => i.e., let s → 0 as n increases.
    • One way to do this:
               while |f'approx| > ε
                   s_n = s / n
                   f'approx = (f(x+s_n) - f(x)) / s_n
                   x = x - α f'approx       
                   n = n + 1
               endwhile
           

Gradient descent in multiple dimensions

Recall the intuition behind the gradient-descent algorithm in one dimension:

  • First, what does gradient mean? Roughly, the change in f with respect to increasing x:
              => f'(x) = (f(x+s) - f(x)) / s   (for small s)

  • We use the gradient to take a "step" in the direction towards the minimum:
             x = x - α f'(x)
         
  • Here, the "step" has two meanings:
    • A direction
                => determined by the sign of f'(x).
    • A magnitude
                => determined by the magnitude of f'(x).

  • The same idea works in multiple dimensions, provided we define "gradient" correctly.

Gradients for multivariable functions:

  • Let's first understand what we need:
    • We want an iterative algorithm for two variables.
    • What would this look like?
               while not over
                  x1 = x1 - (some gradient)
                  x2 = x2 - (some gradient)
               endwhile
               

  • Example: consider the function f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
              => This is a function of two variables, x1 and x2.

  • One option for defining the gradient would be
              => f'(x1,x2) = (f(x1+s,x2+s) - f(x1,x2)) / s   (for small s)
    • This results in a single number.
    • This means that the iteration would look like
               while not over
                  x1 = x1 - α f'(x1,x2)
                  x2 = x2 - α f'(x1,x2)
               endwhile
               
    • Thus, the two variables would change by the same amount in the same direction.
    Exercise 20: Explain why this won't work. Think up a function for which it won't work.

  • Instead, we need to compute gradients independently for each of the two variables:
    • Define f'1 = (f(x1+s,x2) - f(x1,x2)) / s   (for small s)
    • Define f'2 = (f(x1,x2+s) - f(x1,x2)) / s   (for small s)

    This kind of a derivative is called a partial derivative.
              => There are two partial derivatives above, one for each variable.

  • In this case, our gradient descent algorithm looks like:
             while not over
                x1 = x1 - α f'1(x1,x2)
                x2 = x2 - α f'2(x1,x2)
             endwhile
             
Exercise 21: Compute by hand the partial derivatives of f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
Exercise 22: Compute by hand the partial derivatives of f(x1,x2) = x1/(μ1-λx1) + x2/(μ2-λx2) .

A little more detail:

  • What about the stopping condition?
              => We should keep going as long as any one of the gradients is not close to zero.

  • Same stepsize for both variables?
    • In many problems, a single stepsize will suffice.
    • A line-search will likely result in different stepsizes.

  • At first glance, it may seem that the pseudocode could be written as:
         1.    while |f1'(x1,x2)| > ε or |f2'(x1,x2)| > ε
         2.        x1 = x1 - α f1'(x1,x2)
         3.        x2 = x2 - α f2'(x1,x2)
         4.    endwhile
         
    While this is mathematically more elegant, it creates a small programming issue, as we'll see.

  • Instead, let's use this pseudocode:
         Algorithm: twoVariableGradientDescent (a1, b1, a2, b2)
         Input: the ranges [a1, b1] and [a2, b2]
         1.    x1 = a1, x2 = a2
         2.    f1' = 2ε
         3.    f2' = 2ε
         4.    while |f1'| > ε or |f2'| > ε
         5.        f1' = f1'(x1,x2)
         6.        f2' = f2'(x1,x2)
         7.        x1 = x1 - α f1'
         8.        x2 = x2 - α f2'
         9.    endwhile
         10.   return x1,x2
         
    Exercise 23: What would go wrong if we interchanged lines 6 and 7 above?

  • Note: just like the unidimensional case, we could use approximate gradients.
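
The two-variable pseudocode might be sketched in Java as follows. The class name and the example function f(x1,x2) = (x1-1)^2 + (x2+2)^2 + x1*x2 (minimum at x1 = 8/3, x2 = -10/3) are assumptions chosen for illustration, and maxIter is an added safeguard.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch of two-variable gradient descent using exact partial derivatives. */
final class TwoVariableGradientSketch {

    static double[] descend(DoubleBinaryOperator partial1, DoubleBinaryOperator partial2,
                            double x1, double x2, double alpha, double eps, int maxIter) {
        for (int i = 0; i < maxIter; i++) {
            // Evaluate both partial derivatives at the same point before updating either variable.
            double g1 = partial1.applyAsDouble(x1, x2);
            double g2 = partial2.applyAsDouble(x1, x2);
            if (Math.abs(g1) <= eps && Math.abs(g2) <= eps) {
                break;                           // Both partials are close to zero.
            }
            x1 = x1 - alpha * g1;
            x2 = x2 - alpha * g2;
        }
        return new double[] {x1, x2};
    }

    public static void main(String[] args) {
        // Partials of the assumed example: 2(x1-1) + x2 and 2(x2+2) + x1.
        double[] p = descend((u, v) -> 2 * (u - 1) + v,
                             (u, v) -> 2 * (v + 2) + u,
                             0.0, 0.0, 0.1, 1e-6, 10_000);
        System.out.println("x1 = " + p[0] + ", x2 = " + p[1]);
    }
}
```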
Exercise 24: Add code to MultiGradient.java to compute the partial derivatives of the function f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2. Then execute to find the minimum. You might need to experiment with different values of α.

Stochastic gradient descent and simulation optimization

Let's start with an application:

  • Recall the routing problem with two queues:

    • An arriving customer chooses queue 0 with Pr[choose 0] = x.
                => Pr[choose 1] = 1-x.
    • In the simulation,
               if uniform() < x
                   choose 0
               else
                   choose 1
               

  • Thus, one can consider x to be a routing variable.

  • For a given routing probability x, there will be some average system time.

    • Here, for a particular value of x, we will get some estimate of the system time.
    • Let the random variable S(x) denote the system time when using routing value x.
    The goal: find that value x that minimizes average system time E[S(x)].
Exercise 25: Examine QueueControl.java and find the part of the code that chooses the queue. Try running the simulation with different values of x to guess the minimum.

Let's try using gradient descent:

  • Recall the gradient-descent algorithm (for one variable):
             x = x - α f'(x)
          

  • How do we know what f'(x) is?

  • One option: approximate it using finite differences
         while not over
             fxs = estimate system time with x+s       // Run the simulation with x+s
             fx  = estimate system time with x         // Run the simulation with x
             f' = (fxs - fx) / s                       // Compute finite difference.
             x = x - α f'                              // Apply gradient descent.
         endwhile
         
Exercise 26: Confirm that QueueOpt.java implements this approach. Run the algorithm - does it work? How many samples are used in each estimate?

The problem with noise:

  • Observe that we are not calculating the approximate-gradient but, rather, are estimating it.

  • For any finite number of samples, an estimate will be off by some error.

  • Thus, the algorithm can "wander" all over x space when using "noisy" estimates.

  • Fortunately, there are (interesting) ways around this problem.

Addressing the noise problem:

  • First, we'll identify how many samples are being used in each estimate.

    • Recall that, in the example, we ran the simulation for 1000 departures.
    • Obviously the more departures, the better the estimate.
    • Let S(k,x) = estimate obtained using k samples.

  • We will allow the number of such samples to vary by iteration.

  • Second, we'll allow the stepsize to vary by iteration.

  • Putting these ideas together, our iterative algorithm becomes
         while not over
             k = k_n                       // k = #samples to be used in iteration n
             fxs = S(k, x+s)               // Run the simulation with x+s
             fx  = S(k, x)                 // Run the simulation with x
             f' = (fxs - fx) / s           // Compute finite difference.
             α = α_n                       // stepsize to be used in iteration n
             x = x - α f'                  // Apply gradient descent.
         endwhile
         

  • We could write this more mathematically as:
             x(n+1) = x(n) - α_n (S(k_n, x(n)+s) - S(k_n, x(n))) / s
         

  • What values should be used for k_n and α_n?
              => There are two approaches, stepsize-control and sampling-control
    We'll look at each of these next.

Stepsize-control:

  • The key idea in stepsize control is to let the stepsize decrease gradually.
    • Initially, we make possibly large noise-driven steps.
    • Later, as we get closer to the optimum, the stepsizes get small.

  • One caveat: if the stepsize gets small too quickly, we may converge to some point away from the optimum.

  • Consider stepsizes that meet these conditions:
    • α_n → 0.
    • Σ_n α_n = ∞ (the sum of the stepsizes diverges).
    Thus, the stepsizes α_n decrease, but not too quickly.
    Exercise 27: Give an example of such a sequence. (Hint: we saw one such example in Module 1). Also give an example of a sequence that does decrease quickly, that violates the second condition.

  • What about the number of samples k_n?
              => We'll pick a fixed number, e.g., k_n = 1000.

  • Then, the algorithm can be written as follows:
      Algorithm: stepsizeControl 
           // Initialize x, other variables ... (not shown)
    
      1.   while not over
      2.       fxs = S(k, x+s)               // Run the simulation with x+s using k samples
      3.       fx  = S(k, x)                 // Run the simulation with x using k samples
      4.       f' = (fxs - fx) / s           // Compute finite difference.
      5.       α = α_n                       // stepsize to be used in iteration n
      6.       x = x - α f'                  // Apply gradient descent.
      7.       n = n + 1
      8.  endwhile
      9.  return x
         

  • The stepsize-control approach is sometimes called the Robbins-Monro algorithm.
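
To experiment with stepsize control without the queue simulation, one can use a synthetic noisy objective. In the sketch below everything is an assumption made for illustration: the class name, the stand-in estimate S(k,x) (the mean of k noisy samples of (x-0.6)^2), the noise level, and the stepsizes 0.5/n. Because the forward difference with s = 0.1 is biased, the iterates settle near x = 0.55, not 0.6.

```java
import java.util.Random;

/** Hypothetical sketch of stepsize control on a synthetic noisy objective. */
final class StepsizeControlSketch {

    /** Stand-in for S(k,x): mean of k samples of (x-0.6)^2 plus Gaussian noise (sd 0.1). */
    static double estimate(Random rng, int k, double x) {
        double sum = 0;
        for (int i = 0; i < k; i++) {
            sum += (x - 0.6) * (x - 0.6) + 0.1 * rng.nextGaussian();
        }
        return sum / k;
    }

    static double stepsizeControl(long seed, double x0, double s, int k, int iterations) {
        Random rng = new Random(seed);
        double x = x0;
        for (int n = 1; n <= iterations; n++) {
            double fxs = estimate(rng, k, x + s);    // Noisy estimate at x+s using k samples.
            double fx = estimate(rng, k, x);         // Noisy estimate at x using k samples.
            double fPrime = (fxs - fx) / s;          // Finite difference of two estimates.
            double alpha = 0.5 / n;                  // Assumed stepsizes: decreasing, with a divergent sum.
            x = x - alpha * fPrime;
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(stepsizeControl(42L, 0.2, 0.1, 100, 1000));
    }
}
```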
Exercise 28: Modify StepsizeControl.java to use decreasing stepsizes. Does it work?

Sampling-control:

  • The other approach is to increase the number of samples:
    • Initially, use fewer samples.
    • Later, as we approach the optimum, use more samples.

  • For example, k_n = n

  • In this case, the algorithm can be written as:
      Algorithm: samplingControl 
           // Initialize x, other variables ... (not shown)
    
      1.  while not over
      2.       k = k_n                       // # samples for iteration n
      3.       fxs = S(k, x+s)               // Run the simulation with x+s using k samples
      4.       fx  = S(k, x)                 // Run the simulation with x using k samples
      5.       f' = (fxs - fx) / s           // Compute finite difference.
      6.       x = x - α f'                  // Note fixed stepsize.
      7.       n = n + 1
      8.  endwhile
      9.  return x
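
Sampling control can be tried on a synthetic noisy objective in the same spirit. As before, the class name, the stand-in estimate (the mean of k noisy samples of (x-0.6)^2), the noise level, and the parameter values are assumptions for illustration, not the course's simulation. The stepsize is fixed and the number of samples grows as k_n = n.

```java
import java.util.Random;

/** Hypothetical sketch of sampling control on a synthetic noisy objective. */
final class SamplingControlSketch {

    /** Stand-in for S(k,x): mean of k samples of (x-0.6)^2 plus Gaussian noise (sd 0.1). */
    static double estimate(Random rng, int k, double x) {
        double sum = 0;
        for (int i = 0; i < k; i++) {
            sum += (x - 0.6) * (x - 0.6) + 0.1 * rng.nextGaussian();
        }
        return sum / k;
    }

    static double samplingControl(long seed, double x0, double s, double alpha, int iterations) {
        Random rng = new Random(seed);
        double x = x0;
        for (int n = 1; n <= iterations; n++) {
            int k = n;                               // k_n = n: more samples in later iterations.
            double fxs = estimate(rng, k, x + s);
            double fx = estimate(rng, k, x);
            double fPrime = (fxs - fx) / s;          // Noise in this estimate shrinks as k grows.
            x = x - alpha * fPrime;                  // Note fixed stepsize.
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(samplingControl(7L, 0.2, 0.1, 0.2, 500));
    }
}
```

The early iterations are cheap but noisy, and the later ones are expensive but accurate, so the total simulation effort grows quadratically with the number of iterations for this choice of k_n.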
         

Finally, how do we know these methods worked for our routing example?

  • It turns out that one can analytically derive the system-time in terms of x:
              f(x) = x/(μ1-λx) + (1-x)/(μ2-λ(1-x)).

  • This can even be minimized analytically (see earlier exercise).

  • However, let's use our gradient-descent algorithm.
Exercise 29: Add code to QueueGradientDescent.java to find the minimum. Does this correspond with what you found when using the stepsize-control algorithm?

Summary

We have taken an in-depth look at single-variable unconstrained optimization:

  • Simple non-gradient methods: bracketing, golden-ratio search.
  • Gradient descent.
  • The difference between local and global minima.
  • Why gradient descent can get stuck in a local minimum.
  • Stochastic version of gradient descent.
  • Multivariable version of gradient descent.

Related topics we haven't covered in non-linear optimization:

  • Ways to speed up gradient-descent
              => Using 2nd derivatives (e.g., Newton-Raphson method).

  • Constrained optimization
              => A huge topic in itself.

  • Function optimization
              => Another huge topic in itself.