Module 9: Optimization
Types of optimization problems
First, what is an optimization problem?
- Typically, there is some objective
=> Maximize or minimize some value.
- There must also be variables whose values we can choose
=> The value depends on choices made for these variables.
- Key concepts in optimization techniques vary according
to the type of optimization problem.
=> It's very rare for a principle to work across different types.
- Broadly, one may categorize as follows:
- Discrete optimization problems
=> Also called combinatorial optimization problems.
- Continuous optimization problems
=> This has many sub-categories: linear, non-linear, quadratic etc.
- Stochastic optimization problems:
=> Where there is some element of randomness.
Discrete optimization:
- Example: the Traveling Salesman Problem (TSP).

- Input: n points on the plane.
- Objective: find a tour of minimum length.
- A given instance has a finite (possibly large) number of
candidate solutions.
Exercise 1:
How many candidate tours can one construct for an n-point
TSP problem?
- Note: the problem is inherently discrete
- The solution is a finite set of edges.
- The set of candidate solutions is finite.
- Algorithms for discrete problems search in a discrete space.
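To make "candidate solution" concrete, here is a minimal Java sketch that evaluates the objective (tour length) for one candidate tour. The class name, method name, and the representation of a tour as an array of point indices are assumptions made for illustration; they are not taken from the course code.

```java
class TourLengthSketch {
    // Length of the closed tour that visits the points in the given order
    // and then returns to the starting point.
    static double tourLength(double[][] points, int[] order) {
        double total = 0.0;
        for (int i = 0; i < order.length; i++) {
            double[] p = points[order[i]];
            double[] q = points[order[(i + 1) % order.length]];
            total += Math.hypot(p[0] - q[0], p[1] - q[1]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] square = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };
        // Walking around the unit square: length 4.0.
        System.out.println(tourLength(square, new int[] {0, 1, 2, 3}));
        // A tour that crosses both diagonals is longer (about 4.83).
        System.out.println(tourLength(square, new int[] {0, 2, 1, 3}));
    }
}
```

A discrete algorithm for TSP searches over such orderings, comparing their lengths.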
Continuous optimization:
- In contrast, consider this problem:
minimize f(x) = 5 + (x-4.71)^2

- In other words, find that value of x where
f(x) is smallest.
- The set of candidate solutions?
=> uncountably infinite (the real line)
- Algorithms for these problems search a continuous space.
- This example had only one variable: x.
- In general, continuous optimization problems will have
more than one variable
e.g., minimize f(x,y,z) = 5 + (x-4.71)^2 + 1/(3y-z) + sin(xy)
- Jargon: the function to be optimized is called
the objective function.
Function vs. parameter (or variable) optimization:
- Recall the car acceleration problem from
Module 1:
- Similarly, recall the two-segment incline problem from
Module 3:

- The goal was to see which two segments result in the bead
sliding down soonest.
- It turns out that f2(x) is better than
the other two.
- In both examples above, the goal is to find the best function.
- However in problems like
e.g., minimize f(x,y,z) = 5 + (x-4.71)^2 + 1/(3y-z) + sin(xy)
We are given a function, and need to find the best values for
its arguments (variables).
=> This is called parameter optimization.
Ways to categorize continuous (parameter) optimization problems:
- Constrained vs. unconstrained:
- Consider the example:
minimize f(x,y) = 5 + (x-4.71)^2 + 2y^3
such that x + y = 5.
- This is an example of a problem with constraints
(on the variables).
- Linear vs. non-linear objective function
- Non-linear example: f(x,y) = 5 + (x-4.71)^2 + 2y^3.
- Linear example: f(x,y) = 5x + 7y.
- Linear vs. non-linear constraints:
- Linear constraints example:
minimize f(x,y) = 5 + (x-4.71)^2 + 2y^3
such that x + y = 5.
- Non-linear constraints example:
minimize f(x,y) = 5 + (x-4.71)^2 + 2y^3
such that (x + y)^2 = 5.
- Note: it's the variables that are constrained.
- Combinations of these lead to the important categories:
- Linear objective, unconstrained variables: this is a trivial problem.
Exercise 2:
Why is this true? What is the minimum (unconstrained) value of f(x,y) = 3x + 4y?
- Linear objective, linear constraints: one of the most important, and
successful types of problems.
- Nonlinear, unconstrained: many kinds are solvable with
iterative algorithms.
- Nonlinear, constrained: the hardest type of problem to solve
=> Different techniques for sub-categories (e.g. linear constraints).
In this module:
- We'll focus on nonlinear (mostly unconstrained) problems.
- Mostly, we'll assume we have a function to minimize
(as opposed to maximize).
- Maximizing f(x) is the same as minimizing
-f(x) (its reflection about the x-axis).
- Notation:
- f(x) denotes the objective function we wish to minimize.
- Let x* be the x-value that minimizes f(x)
=> i.e., f(x*) ≤ f(x) for any other x.
What we learned in calculus
Let's go back to our example:
- Consider the problem
minimize f(x) = 5 + (x-4.71)^2
- In calculus, we learned to set f'(x)=0:
f'(x)=0
=> f'(x) = 2(x-4.71) = 0
=> x = 4.71
- This approach relies on successfully solving two
sub-problems:
- Problem 1: calculating the derivative f'(x).
=> i.e., finding an analytical formula for f'(x).
(such as f'(x) = 2(x-4.71)).
- Problem 2: solving f'(x)=0.
- In some real-world problems, it's hard to directly
calculate f'(x).
- In most real-world problems, it's very difficult
to solve (by hand) f'(x)=0.
- Example: f(x) = sin(x) - x^2
- The first part is easy: f'(x) = cos(x) - 2x.
- The second part is very difficult: solving cos(x)-2x=0.
- However, it's fairly easy to write an algorithm to search
for the optimal solution.
=> the focus of this module.
Exercise 3:
Go back to your calc book and
find an example of a "hard to differentiate" function.
Exercise 4:
What's an example of a function that's continuous but not
differentiable? Can you think of a function that's continuous
everywhere but differentiable nowhere? [Hint: it's rather bizarre].
Exercise 5:
Consider the function
f(x) = x/(μ1-λx) + (1-x)/(μ2-λ(1-x)).
Compute the derivative f'(x). Can you solve f'(x)=0?
One other important point from our calc course:
- Merely solving f'(x) = 0 did not tell us whether
this minimized or maximized f
=> We needed something else.
- One way to tell: plot f(x) and see where the gradient
is zero.
- A better way: evaluate the second derivative at the point where f'(x) = 0.
positive 2nd derivative => minimum; negative 2nd derivative => maximum.
Simple bracketing search
Key ideas:

- First, find a large enough initial range (bracket), [a,b].
- Pick values for the algorithm parameters
M = number of intervals.
N = number of iterations.
- Then, divide the bracket into M intervals
=> Let δ = (b-a)/M.
- Evaluate f(x) at every interval boundary.
- Pick the best such x
=> Call this x*.
- Set the new bracket to be [x*-δ, x*+δ].
- Repeat (for a total of N times).
- Pseudocode:
Algorithm: bracketSearch (a, b)
Input: the initial range [a,b]; M and N are algorithm parameters
    for i=1 to N
        δ = (b-a) / M              // Divide current bracket.
        x* = a
        bestf = f(x*)
        for x=a to b step δ        // Search for best x in current bracket.
            fx = f(x)
            if fx < bestf
                bestf = fx
                x* = x
            endif
        endfor
        a = x* - δ                 // Shrink bracket.
        b = x* + δ
    endfor
    return x*
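The pseudocode above can be sketched in Java as follows. This is a minimal illustration, not the course's BracketSearch.java: the class name, the method signature, and the choice to pass M and N as arguments are assumptions. The grid is indexed by an integer so that rounding error does not accumulate in x.

```java
import java.util.function.DoubleUnaryOperator;

class BracketSearchSketch {
    // m = number of intervals per bracket, n = number of shrinking iterations.
    static double bracketSearch(DoubleUnaryOperator f, double a, double b, int m, int n) {
        double best = a;
        for (int iter = 0; iter < n; iter++) {
            double delta = (b - a) / m;          // divide the current bracket
            best = a;
            double bestF = f.applyAsDouble(best);
            for (int j = 1; j <= m; j++) {       // evaluate f at each interval boundary
                double x = a + j * delta;
                double fx = f.applyAsDouble(x);
                if (fx < bestF) {
                    bestF = fx;
                    best = x;
                }
            }
            a = best - delta;                    // shrink the bracket around the best point
            b = best + delta;
        }
        return best;
    }

    public static void main(String[] args) {
        double x = bracketSearch(v -> 5 + (v - 4.71) * (v - 4.71), 0.0, 10.0, 10, 20);
        System.out.println("x* is approximately " + x);
    }
}
```

Note that the shrunken bracket may extend slightly beyond the original [a,b] when the best point is an endpoint.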
Exercise 6:
Download and execute
BracketSearch.java.
- What is the running time in terms of M and N?
- If we keep MN constant (e.g., MN=24), what
values of M and N produce best results?
Exercise 7:
Draw an example of a function for which bracket-search
fails miserably, that is, the true minimum is much
lower than what's found by bracket search even
for large M and N.
How does one evaluate such algorithms?
- In the above case, the actual computation inside the innermost
loop was simple
=> Computing f(x) takes constant time.
- However, in many real-world examples
=> Computing f(x) can take a lot of time.
- Example: f(x) may be the result of solving
a differential equation
=> Computing f(x) itself needs multiple iterations.
- Thus, one would like to reduce the number
of function evaluations.
Exercise 8:
What is the number of function evaluations
in terms of M and N for the bracket-search algorithm?
Can bracket search work for multivariate functions such as
f(x,y) = 2x + 3y^2?
- Yes, one simply divides every dimension at each step
=> The brackets become "cells"

- For example, in 2D (2 variables)
Algorithm: bracketSearch (a1, b1, a2, b2)
Input: the initial ranges for each dimension [a1,b1] and [a2,b2]
    for i=1 to N
        // The ranges could be different in each dimension.
        δ1 = (b1-a1) / M
        δ2 = (b2-a2) / M
        x1* = a1, x2* = a2
        bestf = f(x1*, x2*)
        for x1=a1 to b1 step δ1
            for x2=a2 to b2 step δ2
                fx = f(x1,x2)
                if fx < bestf
                    bestf = fx
                    x1* = x1, x2* = x2
                endif
            endfor
        endfor
        // Shrink cell.
        a1 = x1* - δ1, b1 = x1* + δ1
        a2 = x2* - δ2, b2 = x2* + δ2
    endfor
    return x1*, x2*
- Notice that we've used the notation (x1,x2)
instead of (x,y).
=> For an n-dimensional space we'd use
(x1,x2, ..., xn)
to represent a particular point.
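A two-variable version of the search might look like the Java sketch below. The class name, signature, and the test function (x1-1)^2 + (x2+2)^2 are hypothetical choices for illustration; the course's MultiBracketSearch.java may be organized differently.

```java
import java.util.function.DoubleBinaryOperator;

class BracketSearch2DSketch {
    // m = intervals per dimension, n = number of shrinking iterations.
    static double[] bracketSearch(DoubleBinaryOperator f,
                                  double a1, double b1, double a2, double b2,
                                  int m, int n) {
        double best1 = a1, best2 = a2;
        for (int iter = 0; iter < n; iter++) {
            double d1 = (b1 - a1) / m;           // cell size may differ per dimension
            double d2 = (b2 - a2) / m;
            best1 = a1;
            best2 = a2;
            double bestF = f.applyAsDouble(best1, best2);
            for (int i = 0; i <= m; i++) {
                for (int j = 0; j <= m; j++) {
                    double x1 = a1 + i * d1;
                    double x2 = a2 + j * d2;
                    double fx = f.applyAsDouble(x1, x2);
                    if (fx < bestF) {
                        bestF = fx;
                        best1 = x1;
                        best2 = x2;
                    }
                }
            }
            a1 = best1 - d1;                     // shrink the cell around the best grid point
            b1 = best1 + d1;
            a2 = best2 - d2;
            b2 = best2 + d2;
        }
        return new double[] { best1, best2 };
    }

    public static void main(String[] args) {
        // Hypothetical test function with its minimum at (1, -2).
        double[] r = bracketSearch((u, v) -> (u - 1) * (u - 1) + (v + 2) * (v + 2),
                                   -5, 5, -5, 5, 10, 15);
        System.out.println(r[0] + ", " + r[1]);
    }
}
```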
Exercise 9:
What is the number of function evaluations
in terms of M and N for the 2D bracket-search algorithm?
How does this generalize to n dimensions?
Exercise 10:
Add code to
MultiBracketSearch.java
to find the minimum of
f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
When to stop?
- We have fixed the number of iterations at N.
- We could, instead, stop when some desired accuracy is achieved.
- Let's try the following:
- Let f_k be the best value after k iterations.
- We will compare f_k and f_{k-1}.
If |f_k - f_{k-1}| < ε, then stop.
- Here, ε is some suitably small number.
- In pseudocode:
Algorithm: bracketSearch (a, b)
Input: the initial range [a,b]
    N = 0
    bestf = some-large-value
    prevBestf = bestf + 2ε           // So that we enter the loop.
    while |bestf - prevBestf| > ε
        prevBestf = bestf            // Remember the previous iteration's best value.
        δ = (b-a) / M
        x* = a
        bestf = f(x*)
        for x=a to b step δ
            fx = f(x)
            if fx < bestf
                bestf = fx
                x* = x
            endif
        endfor
        a = x* - δ
        b = x* + δ
        N = N + 1                    // Track N for printing/evaluation.
    endwhile
    // Print N if desired.
    return x*
- How do we choose ε?
- If, in our problem, f values happen to be very large
(e.g., 10^6),
=> ε=0.1 may be unnecessarily small.
- If, on the other hand, f values happen to be very small
(e.g., 10^-6),
=> ε=0.1 may be too large.
=> No optimization occurs.
- A better solution:
If |(f_k - f_{k-1}) / f_{k-1}| < ε, then stop.
=> Thus, if the proportional change is very small, stop.
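The proportional test can be written as a small helper. The Java sketch below is only an illustration: the names are hypothetical, and the guard against dividing by zero (using Double.MIN_NORMAL as a floor) is an assumption of this sketch, not something the notes prescribe.

```java
class StoppingRuleSketch {
    // True when the proportional change between successive best values is below eps.
    static boolean smallRelativeChange(double prevBest, double best, double eps) {
        // Assumed safeguard: avoid dividing by zero when prevBest is 0.
        double scale = Math.max(Math.abs(prevBest), Double.MIN_NORMAL);
        return Math.abs(best - prevBest) / scale < eps;
    }

    public static void main(String[] args) {
        double eps = 0.01;
        // Large f values: an absolute change of 1 is a tiny proportional change.
        System.out.println(smallRelativeChange(1e6, 1e6 - 1, eps));   // true
        // Small f values: an absolute change of 1e-6 is a 50% proportional change.
        System.out.println(smallRelativeChange(2e-6, 1e-6, eps));     // false
    }
}
```

With an absolute threshold of 0.1, the second case would have stopped the search even though the best value had just halved.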
Exercise 11:
Modify
BracketSearch2.java
to use the proportional-difference stopping condition.
Golden ratio search
First, an important observation:
- Suppose a < b are two real numbers.
=> e.g., they represent an interval [a,b].
- Next, let r be a number strictly between 0 and 1.
=> i.e., 0 < r < 1.
- Now consider the numbers c = ra + (1-r)b, and d
= (1-r)a + rb.
- Then, it is true that a < c < b
=> That is, c is between a and b.
- Similarly, a < d < b
- Intuition:
- Pick r=0.3 and a=4, b=9.
- Then, c = 0.3*4 + 0.7*9 = some weighted average of
4 and 9
=> weighted average must be between 4 and 9.
Exercise 12:
Prove this result.
Next, let's look at the ideas in ratio search (before we tackle
golden-ratio search):
- Recall that, in bracket-search, we fixed the number of
intervals M.
- In ratio search, we will use M=1 but adjust the ends.
- Start with some interval [a,b].

- Compute the ends of a smaller interval [c,d] contained within:
c = ra + (1-r)b and d = (1-r)a + rb, with 1/2 < r < 1 so that c < d.

- At each step, shrink the interval to either [a,d] or [c,b].

- If f(c) ≤ f(d), set b = d.
- If f(c) > f(d), set a = c.
- Then, repeat until interval is small enough to stop.
- Why does this work?

- First, we assume the function is unimodal
=> f(x) decreases from a to the optimal
x, and increases after that.
- If f(c) ≤ f(d), then the minimum cannot occur to
the right of d.
=> We can shrink the interval on the right.
- Same reasoning for the left side.
- After shrinking the interval, we again compute two interior
points and repeat.
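The ratio-search steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, f is assumed unimodal on [a,b], r must satisfy 1/2 < r < 1 so that c < d, and the loop stops when the interval is shorter than a tolerance tol (which should be well above floating-point resolution). The constant GOLDEN = (sqrt(5)-1)/2, about 0.618, is the ratio used in golden-ratio search.

```java
import java.util.function.DoubleUnaryOperator;

class RatioSearchSketch {
    static final double GOLDEN = (Math.sqrt(5.0) - 1.0) / 2.0;   // about 0.618

    // Requires 0.5 < r < 1 (so that c < d) and a unimodal f on [a, b].
    static double ratioSearch(DoubleUnaryOperator f, double a, double b, double r, double tol) {
        while (b - a > tol) {
            double c = r * a + (1 - r) * b;      // left interior point
            double d = (1 - r) * a + r * b;      // right interior point
            if (f.applyAsDouble(c) <= f.applyAsDouble(d)) {
                b = d;                           // minimum cannot lie to the right of d
            } else {
                a = c;                           // minimum cannot lie to the left of c
            }
        }
        return (a + b) / 2;
    }

    public static void main(String[] args) {
        double x = ratioSearch(v -> 5 + (v - 4.71) * (v - 4.71), 0.0, 10.0, GOLDEN, 1e-6);
        System.out.println("x* is approximately " + x);
    }
}
```

For simplicity this sketch evaluates f at both interior points in every iteration. With r equal to the golden ratio value, one of the new interior points coincides with an old one (because r^2 = 1 - r), so an implementation can reuse that function value and perform a single new evaluation per iteration.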
Golden-ratio search:
Exercise 14:
Describe what could go wrong if we replaced the while-condition with
2. while |(fc-fd)/fd| > ε
How would you address this problem?
Exercise 15:
Implement golden-ratio search in this template:
GoldenRatio.java.
Gradient descent
Let's start by understanding what gradient means:
- Consider a (single-dimensional) function f(x):

- Let f'(x) denote the derivative of f(x).
- The gradient at a point x is the value of f'(x).
=> Graphically, the slope of the tangent to the curve at x.
- Observe the following:

- To the left of the optimal value x*, the gradient
is negative.
- To the right, it's positive.
- We seek an iterative algorithm of the form
while not over
if gradient < 0
move rightwards
else if gradient > 0
move leftwards
else
stop // gradient = 0 (unlikely in practice, of course)
endif
endwhile
- The gradient descent algorithm is exactly this idea:
while not over
x = x - α f'(x)
endwhile
Here, we add a scaling factor α in case f'(x)
values are of a different order-of-magnitude:
- Why we need α
- For example, it could be that x=0.1, x*=0 and f'(0.1)=1000.
- Then one iterative step without α
would produce x = 0.1 - 1000 = -999.9
=> Which would be way out of bounds.
- For such a problem, we'd use α = 0.00001 so that
x = 0.1 - 0.00001*1000 = 0.09
- The algorithm parameter α is sometimes called
the stepsize.
- In pseudocode:
Algorithm: gradientDescent (a, b)
Input: the range [a,b]
1. x = a // Alternatively, x = b.
2. while |f'(x)| > ε
3. x = x - α f'(x) // Note: f'() is evaluated at current value of x (before changing x).
4. endwhile
5. return x
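A direct Java rendering of this pseudocode might look like the sketch below. The class name and signature are hypothetical, the derivative is supplied by the caller, and the maxIter cap is an added safeguard (an assumption of this sketch) so that a poor choice of α cannot loop forever.

```java
import java.util.function.DoubleUnaryOperator;

class GradientDescentSketch {
    // fPrime is the derivative f'(x); alpha is the stepsize; eps is the gradient tolerance.
    static double gradientDescent(DoubleUnaryOperator fPrime, double x0,
                                  double alpha, double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            double slope = fPrime.applyAsDouble(x);   // gradient at the current x
            if (Math.abs(slope) <= eps) {
                break;                                // gradient is close enough to zero
            }
            x = x - alpha * slope;                    // step against the gradient
        }
        return x;
    }

    public static void main(String[] args) {
        // f(x) = 5 + (x-4.71)^2, so f'(x) = 2(x-4.71).
        double x = gradientDescent(v -> 2 * (v - 4.71), 0.0, 0.1, 1e-8, 10_000);
        System.out.println("x* is approximately " + x);
    }
}
```

For this quadratic and α = 0.1, each step multiplies the distance to 4.71 by 0.8, so the iteration converges steadily.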
- Stopping conditions:
- The obvious stopping condition is to see whether
f'(x) is close enough to zero.
- However, this may not always work:
=> If α is too small, the gradient may never
get close enough.
- Thus, it may help to also evaluate the actual proportional
change in x, that is, |(prevX-x)/prevX|.
Exercise 16:
Download and execute
GradientDemo.java.
- How many iterations does it take to get close to the optimum?
- What is the effect of using a small α
(e.g, α=0.001)?
- In the method nextStep(), print out the current value
of x, and the value of αf'(x) before the update.
- Set α=1. Explain what you observe.
- What happens when α=10?
Picking the right stepsize:
- Clearly, the performance of gradient descent is sensitive to
the choice of the stepsize α.
- If α is too small
=> It can take too long to converge.
- If α is too large
=> It can diverge or oscillate, and never converge.
- One rule of thumb:
the term αf'(x) should be an
order-of-magnitude less than x.
=> This way, the changes in x are small relative to x.
- Another idea: try different α values in each iteration:
- Start with a small α.
- Gradually increase until it "does something bad".
- What is "bad"?
=> Causes an increase in f(x) value.

- This idea is called line-search
=> Because we're searching along the (single) dimension of α.
- At first, it may seem that one could merely try increasing α:
Algorithm: gradientDescentLineSearch (a, b)
Input: the range [a,b]
1. x = a
2. while |f'(x)| > ε
3. αtrial = αsmall // αsmall is the first value to try.
4. do // Search for the right alpha.
5. α = αtrial
6. αtrial = α + δα
7. f = f(x - αf'(x))
8. ftrial = f(x - αtrialf'(x))
9. while ftrial < f // Keep increasing stepsize until you get an increase in f
10. x = x - α f'(x)
11. endwhile
12. return x
- However, this approach has two problems:
- It is highly dependent on the choice of δα.
- What if the true best value is extremely small (smaller
than δα)?
=> We'd overlook it because it falls between
[αsmall, αsmall+δα]
- The better way is to realize that we can use bracketing
search, which automatically adjusts to the "scale" by shrinking intervals:
Algorithm: gradientDescentLineSearch (a, b)
Input: the range [a,b]
1. x = a
2. Pick αsmall // Left end of α interval
3. Pick αbig // Right end of α interval
4. while |f'(x)| > ε
// Define the function g(α) = f(x - αf'(x)) in the interval [αsmall, αbig]
5. α* = bracketSearch (g, αsmall, αbig)
6. x = x - α* f'(x)
7. endwhile
8. return x
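The bracketing version of line search can be sketched in Java as below. All names are hypothetical, the inner bracketSearch uses fixed parameters (10 intervals, 10 shrinking iterations) chosen only for illustration, and maxIter is an added safeguard.

```java
import java.util.function.DoubleUnaryOperator;

class LineSearchSketch {
    // Bracket search over one variable: m intervals per bracket, n shrinking iterations.
    static double bracketSearch(DoubleUnaryOperator g, double a, double b, int m, int n) {
        double best = a;
        for (int iter = 0; iter < n; iter++) {
            double delta = (b - a) / m;
            best = a;
            double bestG = g.applyAsDouble(best);
            for (int j = 1; j <= m; j++) {
                double t = a + j * delta;
                double gt = g.applyAsDouble(t);
                if (gt < bestG) {
                    bestG = gt;
                    best = t;
                }
            }
            a = best - delta;
            b = best + delta;
        }
        return best;
    }

    static double descend(DoubleUnaryOperator f, DoubleUnaryOperator fPrime, double x0,
                          double alphaSmall, double alphaBig, double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            final double xk = x;
            final double slope = fPrime.applyAsDouble(xk);
            if (Math.abs(slope) <= eps) {
                break;
            }
            // g(alpha) = f(x - alpha f'(x)): the objective along the search direction.
            DoubleUnaryOperator g = alpha -> f.applyAsDouble(xk - alpha * slope);
            double alphaStar = bracketSearch(g, alphaSmall, alphaBig, 10, 10);
            x = xk - alphaStar * slope;
        }
        return x;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator f = v -> 5 + (v - 4.71) * (v - 4.71);
        DoubleUnaryOperator fPrime = v -> 2 * (v - 4.71);
        System.out.println(descend(f, fPrime, 0.0, 0.001, 1.0, 1e-6, 100));
    }
}
```

For this quadratic the best stepsize is α = 0.5, which lands on the minimum in a single step, so the outer loop needs very few iterations.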
Let's now turn to an important concept that pervades all of
optimization: local vs. global minima.
Exercise 17:
Download
GradientDemo2.java
and examine the function being optimized.
- Fill in the code for computing the derivative.
- Try an initial value of x at 1.8. Does it converge?
- Next, try an initial value of x at 5.8.
What is the gradient at the point of convergence?
About local vs. global optima:
- Just because the gradient is zero, doesn't mean we've found the optimum.
- A function can have several different local minima, as we've seen.

Exercise 18:
Can a function have several global minima?
- The gradient f'(x) at a point x merely describes the
behavior of the function f near that point.
- A gradient-descent algorithm will find a local minimum
=> There's no guarantee that it'll find the global minimum.
- In general, apart from searching the whole space of
solutions, there's no method that guarantees finding a global minimum.
- Thus, finding a local minimum is considered a high-enough
standard for an algorithm.
What if we cannot compute the gradient?
- For some functions, a formula for the gradient may be difficult
to obtain.
- However, it's easy to approximate the gradient:
- Pick some small value s.
- Compute
=> f'approx(x) = (f(x+s) - f(x)) / s
- In pseudocode:
Algorithm: gradientDescentApprox (a, b)
Input: the range [a,b]
1. x = a
2. f'approx = 2ε
3. while |f'approx| > ε
4. f'approx = (f(x+s) - f(x)) / s // s is an algorithm parameter.
5. x = x - α f'approx
6. endwhile
7. return x
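A Java sketch of gradient descent with the approximate gradient follows. The names and the maxIter safeguard are assumptions; unlike the pseudocode, the sketch checks the stopping condition before taking a step, which is a small design choice rather than a different algorithm.

```java
import java.util.function.DoubleUnaryOperator;

class ApproxGradientSketch {
    // Forward-difference approximation of f'(x) with step s.
    static double forwardDifference(DoubleUnaryOperator f, double x, double s) {
        return (f.applyAsDouble(x + s) - f.applyAsDouble(x)) / s;
    }

    static double gradientDescentApprox(DoubleUnaryOperator f, double x0, double s,
                                        double alpha, double eps, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            double slope = forwardDifference(f, x, s);
            if (Math.abs(slope) <= eps) {
                break;
            }
            x = x - alpha * slope;
        }
        return x;
    }

    public static void main(String[] args) {
        double x = gradientDescentApprox(v -> 5 + (v - 4.71) * (v - 4.71),
                                         0.0, 0.01, 0.1, 1e-6, 10_000);
        System.out.println("x is approximately " + x);
    }
}
```

For f(x) = 5 + (x-4.71)^2 the forward difference equals 2(x-4.71) + s exactly, so the iteration settles near 4.71 - s/2 rather than at 4.71; the offset shrinks as s does.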
Exercise 19:
Add code to GradientDemo3.java
to implement approximate gradients. Use s=0.01.
Explain what could go wrong if s is too large.
Theoretical issues:
- The big questions:
Does the gradient-descent algorithm always converge to a local minimum?
Under what conditions does the algorithm converge?
- Theoreticians usually describe such iterative algorithms
using this notation:
x(n+1) = x(n) - α f'(x(n))
Here, x(n) is the n-th iterate.
- The convergence question:
- Let S* be the set of local minima.
- Question 1: Does the sequence x(n) have a limit?
- Question 2: If so, is the limit in S*?
- It can be shown that:
- If the function f is "smooth" (twice-differentiable);
- and if α is small enough.
Then, x(n) → x*, where x* is some
point in S*.
- In the case that we use approximate gradients, we need an
additional condition:
- Recall that our approximate gradient was computed as:
=> f'approx(x) = (f(x+s) - f(x)) / s
- We'll write the algorithm as
while |f'approx| > ε
f'approx = (f(x+s) - f(x)) / s
x = x - α f'approx
n = n + 1
endwhile
- Because s is never exactly zero, the approximate gradient is never exact, so
the iterates can settle near, but not exactly at, the true minimum.
- To solve this problem, we need to gradually decrease
s as we iterate.
=> i.e., let s → 0 as n increases.
- One way to do this:
while |f'approx| > ε
sn = s / n
f'approx = (f(x+sn) - f(x)) / sn
x = x - α f'approx
n = n + 1
endwhile
Gradient descent in multiple dimensions
Recall the intuition behind the gradient-descent algorithm in one dimension:
- First, what does gradient mean? Roughly, the change in f
with respect to increasing x:
=> f'(x) = (f(x+s) - f(x)) / s (for small s)
- We use the gradient to take a "step" in the direction towards
the minimum:
x = x - α f'(x)
- Here, the "step" has two meanings:
- A direction
=> determined by the sign of f'(x).
- A magnitude
=> determined by the magnitude of f'(x).
- The same idea works in multiple dimensions, provided we
define "gradient" correctly.
Gradients for multivariable functions:
- Let's first understand what we need:
- We want an iterative algorithm for two variables.
- What would this look like?
while not over
x1 = x1 - (some gradient)
x2 = x2 - (some gradient)
endwhile
- Example: consider the function
f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
=> This is a function of two variables,
x1 and x2.
- One option for defining the gradient would be
=> f'(x1,x2)
= (f(x1+s,x2+s) - f(x1,x2)) / s (for small s)
- This results in a single number.
- This means that the iteration would look like
while not over
x1 = x1 - α f'(x1,x2)
x2 = x2 - α f'(x1,x2)
endwhile
- Thus, the two variables would change by the same amount in
the same direction.
Exercise 20:
Explain why this won't work. Think up a function for which it won't work.
- Instead, we need to compute gradients independently for
each of the two variables:
- Define f'1 = (f(x1+s,x2) - f(x1,x2)) / s (for small s)
- Define f'2 = (f(x1,x2+s) - f(x1,x2)) / s (for small s)
This kind of a derivative is called a partial derivative.
=> There are two partial derivatives above, one for each variable.
- In this case, our gradient descent algorithm looks like:
while not over
x1 = x1 - α f'1(x1,x2)
x2 = x2 - α f'2(x1,x2)
endwhile
Exercise 21:
Compute by hand the partial derivatives of
f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
Exercise 22:
Compute by hand the partial derivatives of
f(x1,x2) = x1/(μ1-λx1) + x2/(μ2-λx2).
A little more detail:
- What about the stopping condition?
=> We should keep going as long as any one of the gradients
is not close to zero.
- Same stepsize for both variables?
- In many problems, a single stepsize will suffice.
- A line-search will likely result in different stepsizes.
- At first glance, it may seem that the pseudocode could be
written as:
1. while |f1'(x1,x2)| > ε or |f2'(x1,x2)| > ε
2. x1 = x1 - α f1'(x1,x2)
3. x2 = x2 - α f2'(x1,x2)
4. endwhile
While this is mathematically more elegant, it creates
a small programming issue, as we'll see.
- Instead, let's use this pseudocode:
Algorithm: twoVariableGradientDescent (a1, b1, a2, b2)
Input: the ranges [a1, b1] and [a2, b2]
1. x1 = a1, x2 = a2
2. f1' = 2ε
3. f2' = 2ε
4. while |f1'| > ε or |f2'| > ε
5. f1' = f1'(x1,x2)
6. f2' = f2'(x1,x2)
7. x1 = x1 - α f1'
8. x2 = x2 - α f2'
9. endwhile
10. return x1,x2
Exercise 23:
What would go wrong if we interchanged lines 6 and 7 above?
- Note: just like the unidimensional case, we could use
approximate gradients.
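The two-variable algorithm with approximate (finite-difference) partial derivatives might be sketched in Java as follows. The class name, signature, maxIter safeguard, and the test function (x1-1)^2 + 2(x2+2)^2 are hypothetical. Both partial derivatives are computed at the current point before either variable is updated.

```java
import java.util.function.DoubleBinaryOperator;

class TwoVariableDescentSketch {
    static double[] descend(DoubleBinaryOperator f, double x1, double x2,
                            double s, double alpha, double eps, int maxIter) {
        for (int i = 0; i < maxIter; i++) {
            double base = f.applyAsDouble(x1, x2);
            // Approximate partial derivatives, both evaluated at the current (x1, x2).
            double g1 = (f.applyAsDouble(x1 + s, x2) - base) / s;
            double g2 = (f.applyAsDouble(x1, x2 + s) - base) / s;
            if (Math.abs(g1) <= eps && Math.abs(g2) <= eps) {
                break;                      // both gradients are close to zero
            }
            x1 = x1 - alpha * g1;
            x2 = x2 - alpha * g2;
        }
        return new double[] { x1, x2 };
    }

    public static void main(String[] args) {
        // Hypothetical test function with its minimum at (1, -2).
        double[] r = descend((u, v) -> (u - 1) * (u - 1) + 2 * (v + 2) * (v + 2),
                             0.0, 0.0, 1e-4, 0.1, 1e-6, 10_000);
        System.out.println(r[0] + ", " + r[1]);
    }
}
```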
Exercise 24:
Add code to
MultiGradient.java
to compute the partial derivatives of the function
f(x1,x2) = (x1-4.71)^2 + (x2-3.2)^2 + 2(x1-4.71)^2 (x2-3.2)^2.
Then execute to find the minimum. You might need to experiment
with different values of α.
Stochastic gradient descent and simulation optimization
Let's start with an application:
- Recall the routing problem with two queues:

- An arriving customer chooses queue 0 with Pr[choose 0] = x.
=> Pr[choose 1] = 1-x.
- In the simulation,
if uniform() < x
choose 0
else
choose 1
- Thus, one can consider x to be a routing variable.
- For a given routing probability x, there will be
some average system time.

- Here, for a particular value of x, we will get some
estimate of the system time.
- Let the random variable S(x) denote the system time when using routing
value x.
The goal: find that value x that minimizes average
system time E[S(x)].
Exercise 25:
Examine
QueueControl.java
and find the part of the code that chooses the queue.
Try running the simulation with different values of x
to guess the minimum.
Let's try using gradient descent:
- Recall the gradient-descent algorithm (for one variable):
x = x - α f'(x)
- How do we know what f'(x) is?
- One option: approximate it using finite differences
while not over
fxs = estimate system time with x+s // Run the simulation with x+s
fx = estimate system time with x // Run the simulation with x
f' = (fxs - fx) / s // Compute finite difference.
x = x - α f' // Apply gradient descent.
endwhile
Exercise 26:
Confirm that
QueueOpt.java
implements this approach.
Run the algorithm - does it work?
How many samples are used in each estimate?
The problem with noise:
- Observe that we are not calculating the approximate-gradient
but, rather, are estimating it.

- For any finite number of samples, an estimate will
be off by some error.
- Thus, the algorithm can "wander" all over x space
when using "noisy" estimates.
- Fortunately, there are (interesting) ways around this problem.
Addressing the noise problem:
- First, we'll identify how many samples are being used in each estimate.

- Recall that, in the example, we ran the simulation for 1000 departures.
- Obviously the more departures, the better the estimate.
- Let S(k,x) = estimate obtained using k samples.
- We will allow the number of such samples to vary by iteration.
- Second, we'll allow the stepsize to vary by iteration.
- Putting these ideas together, our iterative algorithm becomes
while not over
k = kn // k = #samples to be used in iteration n
fxs = S(k, x+s) // Run the simulation with x+s
fx = S(k, x) // Run the simulation with x
f' = (fxs - fx) / s // Compute finite difference.
α = αn // stepsize to be used in iteration n
x = x - α f' // Apply gradient descent.
endwhile
- We could write this more mathematically as:
x(n+1) = x(n) - α_n (S(k_n, x(n)+s) - S(k_n, x(n))) / s
- What values should be used for kn
and αn?
=> There are two approaches, stepsize-control
and sampling-control
We'll look at each of these next.
Stepsize-control:
- The key idea in stepsize control is to let the stepsize
decrease gradually.
- Initially, we make possibly large noise-driven steps.
- Later, as we get closer to the optimum, the stepsizes get small.
- One caveat: if the stepsize gets small too quickly, we may
converge to some point away from the optimum.
- Consider stepsizes that meet these conditions:
(1) α_n → 0 as n → ∞
(2) α_1 + α_2 + α_3 + ... = ∞
Thus, the stepsizes α_n decrease, but not too quickly.
Exercise 27:
Give an example of such a sequence. (Hint: we saw one such example
in Module 1).
Also give an example of
a sequence that does decrease quickly, that violates the second condition.
- What about the numbers of samples kn?
=> We'll pick a fixed number, e.g., k_n = 1000.
- Then, the algorithm can be written as follows:
Algorithm: stepsizeControl
// Initialize x, other variables ... (not shown)
1. while not over
2. fxs = S(k, x+s) // Run the simulation with x+s using k samples
3. fx = S(k, x) // Run the simulation with x using k samples
4. f' = (fxs - fx) / s // Compute finite difference.
5. α = αn // stepsize to be used in iteration n
6. x = x - α f' // Apply gradient descent.
7. n = n + 1
8. endwhile
9. return x
- The stepsize-control approach is sometimes called the
Robbins-Monro algorithm.
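To see stepsize control at work without the queueing simulation, the Java sketch below replaces the simulation with a hypothetical noisy function: each sample is (x-0.6)^2 plus Gaussian noise. The stand-in function, the noise level, the stepsize sequence α_n = 0.5/n, and all names are assumptions made for illustration; this is not the course's StepsizeControl.java.

```java
import java.util.Random;

class StepsizeControlSketch {
    // Hypothetical stand-in for S(k, x): the average of k noisy samples of (x - 0.6)^2.
    static double estimate(Random rng, int k, double x) {
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += (x - 0.6) * (x - 0.6) + 0.1 * rng.nextGaussian();
        }
        return sum / k;
    }

    static double stepsizeControl(Random rng, double x0, double s, int k, int iterations) {
        double x = x0;
        for (int n = 1; n <= iterations; n++) {
            double fxs = estimate(rng, k, x + s);   // "run the simulation" with x+s
            double fx = estimate(rng, k, x);        // "run the simulation" with x
            double slope = (fxs - fx) / s;          // noisy finite difference
            double alpha = 0.5 / n;                 // decreasing stepsize (assumed form)
            x = x - alpha * slope;
        }
        return x;
    }

    public static void main(String[] args) {
        double x = stepsizeControl(new Random(42), 0.1, 0.05, 1000, 200);
        System.out.println("estimated minimizer: " + x);
    }
}
```

Because the forward difference is biased by about s/2 for this function, the result should land slightly below 0.6; the decreasing stepsizes average out the noise in the individual gradient estimates.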
Exercise 28:
Modify StepsizeControl.java
to use decreasing stepsizes. Does it work?
Sampling-control:
- The other approach is to increase the number of samples:
- Initially, use fewer samples.
- Later, as we approach the optimum, use more samples.
- For example, kn = n
- In this case, the algorithm can be written as:
Algorithm: samplingControl
// Initialize x, other variables ... (not shown)
1. while not over
2. k = kn // # samples for iteration n
3. fxs = S(k, x+s) // Run the simulation with x+s using k samples
4. fx = S(k, x) // Run the simulation with x using k samples
5. f' = (fxs - fx) / s // Compute finite difference.
6. x = x - α f' // Note fixed stepsize.
7. n = n + 1
8. endwhile
9. return x
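Sampling control can be illustrated in Java the same way, again with a hypothetical noisy stand-in for the simulation (samples of (x-0.6)^2 plus Gaussian noise). The stand-in, the parameter values, and the names are assumptions; the sketch uses k_n = n as in the example above and keeps the stepsize fixed.

```java
import java.util.Random;

class SamplingControlSketch {
    // Hypothetical stand-in for S(k, x): the average of k noisy samples of (x - 0.6)^2.
    static double estimate(Random rng, int k, double x) {
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += (x - 0.6) * (x - 0.6) + 0.1 * rng.nextGaussian();
        }
        return sum / k;
    }

    static double samplingControl(Random rng, double x0, double s,
                                  double alpha, int iterations) {
        double x = x0;
        for (int n = 1; n <= iterations; n++) {
            int k = n;                              // k_n = n: more samples in later iterations
            double fxs = estimate(rng, k, x + s);
            double fx = estimate(rng, k, x);
            double slope = (fxs - fx) / s;          // noisy finite difference
            x = x - alpha * slope;                  // note: fixed stepsize
        }
        return x;
    }

    public static void main(String[] args) {
        double x = samplingControl(new Random(42), 0.1, 0.05, 0.1, 2000);
        System.out.println("estimated minimizer: " + x);
    }
}
```

Early iterations take large, noisy steps because each estimate uses only a few samples; later estimates are much less noisy, so the iterates settle close to the minimizer.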
Finally, how do we know these methods worked for our routing example?
- It turns out that one can analytically derive the system-time
in terms of x:
f(x) = x/(μ1-λx) + (1-x)/(μ2-λ(1-x)).
- This can even be minimized analytically (see earlier exercise).
- However, let's use our gradient-descent algorithm.
Exercise 29:
Add code to
QueueGradientDescent.java
to find the minimum. Does this correspond with what you found
when using the stepsize-control algorithm?
Summary
We have taken an in-depth look at unconstrained optimization, mostly in a single variable:
- Simple non-gradient methods: bracketing, golden-ratio search.
- Gradient descent.
- The difference between local and global minima.
- Why gradient descent can get stuck in a local minimum.
- Stochastic version of gradient descent.
- Multivariable version of gradient descent.
Related topics we haven't covered in non-linear optimization:
- Ways to speed up gradient-descent
=> Using 2nd derivatives (e.g., Newton-Raphson method).
- Constrained optimization
=> A huge topic in itself.
- Function optimization
=> Another huge topic in itself.