## Optimization

### General Formulation

The general formulation for any optimization problem is this:

• Minimize $f(x) \,$, our cost/objective function, subject to a series of constraints:
• $g_i(x) = 0 \,$
• $h_i(x) \le 0$
• Where x is the variabe for optimization (can be multidimensional) and g and h are our constraints

### What do we mean when we say f(x)?

• f(x) is a function in the general sense, which can be continuous or discontinuous
• Think of f(x) as a black box. For any input given, we will receive an output

### Standard Optimization vs. Combinatorial Optimization

• Standard optimization (continuous) deals with functions where the output is continuous (both variables and functions are continuous)
• Combinatorial optimization deals with functions where the output is a discrete set (the fact that it is discrete does not mean that it is finite). It is important to note that calculus does not apply to these systems, making them more difficult

### Traveling Salesman Problem

The traveling salesman problem is a classical combinatorial optimization problem formulated as such:

• A salesman has to visit N number of cities. In what sequence do you travel through each city to minimize the distance you travel?
• You can see that there are N! number of possible sequences for the N number of cities

Possible ways of solving this problem:

• Heuristics - i.e. a lot of informal guesses as to the correct path (properly defined as methods/algorithms that tend to give the correct answer, but are not guaranteed to give the solution)
• Complete Enumeration - going through all possible N! solutions to find the optimal path. Here, the computational time is related to the number of paths, or in big-O notation: O(N!)
• O(N!) is known as combinatorial explosion and is a computationally NP-hard problem (Non-deterministic Polynomial time)
• Some problems have been proven to only be solvable by complete enumeration. Other methods might sometimes yield the optimal solution, but do not guarantee it for the problem.

### Dimensionality

Recall that for a one dimensional system and we want to find the minimum for a given function f(x), a continuous, differentiable function, we look to the derivative such that:

$\frac{df}{dx} = 0$

To ensure that we have a minimum, we want:

$\frac{d^2f}{dx^2} > 0$

The first derivative tells us if there is a change in the monotonicity of the function while the second derivative guarantees us a minimum. For a two-dimensional system, f(x,y), we look to the gradient and the also the second derivative (Hessian) such that:

$\nabla f = 0$

And that:

$H= \begin{pmatrix} \frac{d^2f}{dx^2} & \frac{d^2f}{dxdy} \\ \frac{d^2f}{dxdy} & \frac{d^2f}{dy^2} \end{pmatrix} > 0$

The condition above is known as positive definite (all elements in the matrix are positive), and this guarantees that we have a minimum. H is also positive definite if the following condition exists:

$x^THx > 0, \forall x$

The function is also positive definite if the eigenvalues of H are both positive for symmetric matrices (usually this matrix will be symmetric). The eigenvalues being positive is related to the function being convex in both directions. Recall that if one eigenvalue is positive and the other negative, then the point is a saddle node.

### Steepest Descent/Newton Method for Optimization

General formulation:

• We wish to minimize u = f(x) given that it is positive definite.

Methods for approaching the problem:

• fsolve() function in Matlab is an implementation of the steepest descent method and finds the zeros of a function (so one must supply the derivative of a function to obtain an extremum)
• ode45() if we can define the function to be minimized in terms of a potential function (and similar to fsolve(), the gradient of the function must also be supplied)
• Let u(x) be our energy function such that if we integrate $\frac{dx}{dt} = -\nabla u$ with respect to time, then we are assured that our energy function will decrease along all trajectories.
• We must supply ode45() with an initial condition to run and with this method, we are guaranteed a minimum, but not necessarily the global minimum. Only if the function is convex that we are guaranteed a global minimum and hence convex functions are easier to optimize than non-convex functions.

Similarities between fsolve() and ode45()

• Both must be supplied the gradient function in order to work
• Both must sample many initial conditions, in fact an infinite number of them, to guarantee that the global minimum is achieved. This is particularly important for energy functions with very narrow wells.
• Both take large time steps if possible to maximize the change in gradient and speed up calculations

Differences between fsolve() and ode45()

• The output of fsolve() is the x value at which an extremum occurs
• The output of ode45() is a time trace of the x value as time progresses and it is the final value of x over time which determines the value at which the extremum occurs. See below for an explanation as to why this is the case.
• The time steps of ode45() can be changed to suit the needs of the user, which is not available for fsolve().

Convex functions - functions in which for all points in x that our Hessian is positive definite (will only have one extremum for the entire function). Convexity is the result of the variables and if we can redefine the variables such that our function becomes convex, then this makes optimization much easier.

In the computer, steepest descent is performed by the Euler method of integration given our function:

$\frac{dx}{dt} = - \nabla u$

The Euler method expands out the differential to give:

$x_{t+1} = x_t - (\nabla u)\delta t$

The computer adjusts δt to make it more efficient to reach the minimum and will minimize $\nabla u$ as much as possible until it reaches a given tolerance value such that:

$\parallel x_{t+1} - x_t \parallel < \varepsilon$

When using ode45(), the computer integrates using the above Euler's method and evaluates the derivative after every time step. When the derivate equals to zero, then xt + 1 = xt and the value of x will stop changing and be constant over time. This is why when we use ode45() to obtain our extrema, then we must take the final value of x to be the point where the extremum occurs.

Possible problems with steepest descent:

• If the function has a very narrow well with the minimum, then the initial conditions must be very close to the minimum to achieve optimization
• If the function has several minima of various depths, then many initial conditions must be used to guarantee that we obtain the global minimum of the function.
• Convergence of steepest descent is related to how fast you get to the solution. For very flat, but convex functions, then this will take relatively long to converge since we are not sure of the step size that we should be taking because the variation in the function is too small. To correct for flatness in a function we can use conjugate gradient methods which don't always take the path against the gradient but will take steps out of the direction of the gradient (similar to going up the ridges of a valley) before again taking steps in the direction of th gradient.
• Although we might have a convex function, steepest descent will only get to the minimum of a function given an infinite time (remember that this only works for continuous functions).

Can we use steepest descent for combinatorial optimization? Yes! For example, you take take your discrete solution $x_0 = \{ \mu_1, \mu_2, \dots, \mu_N \}$ and change the state of each μ that minimizes the cost function. In effect, steepest descent is similar to a greedy algorithm in that it can minimize each μ locally in an attempt to minimize the cost function globally. The only difference between using steepest descent instead of simulated annealing with is that we allow upward changes in energy for simulated annealing and not for steepest descent.

### Combinatorial Optimization with Greedy Algorithms

For discrete sets in combinatorial problems, we utilize greedy algorithms which takes a starting condition and looks to it's nearest neighbors and minimizes that distance before continuing. Remember that without complete enumeration, this will not necessarily give the best solution and may get caught in large distances later on in the algorithm. This algorithm minimizes locally in an attempt to minimize globally and this is no longer an O(N!) problem anymore, making it computationally easier than complete enumeration. A good example of a greedy algorithm is Dijkstra's algorithm for finding the minimum pathway through a network. Other minimum tree spanning algorithms include Kruskal's or Prim's Algorithm.

### Formulating a Combinatorial Optimization Problem

Although it is easy to see how we can optimize combinatorial problems, it may be more difficult to formulate the problem in terms of equations which we want to minimize. For an example, consider a circuit design where we are given N circuits to be divided into two different connected chips. The traffic matrix A is the interchip cost matrix or our "energy" look up table in the form:

$A_{ij} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \\ \end{bmatrix}$

We first define the state for each circuit being either $\mu = \pm 1$ where if the value of μ = +1, then it is on the first chip, and if the value of μ = -1, then it is on the second chip.

The state of the system will be given by the list of all the circuits: $\{ \mu_1, \dots, \mu_n \}$. It can be seen that there are 2n possible solutions for the circuit design.

But what exactly do we want to optimize? First, we want to reduce the interchip traffic, or the energy required to transfer information between chips. All related chips (as defined by a very low value in the matrix A) would ideally be together on one chip by having only a very limited number of connections between the two chips. Second, we would like to make the size of each chip roughly equivalent. We don't want it such that all the circuits are contained on one chip while the other chip only has a few circuits.

How do we first deal with the interchip traffic? Because we defined each circuit to having a state of $\pm 1$, this makes it relatively easy. If they are on the same chip, then μi − μj = 0 and we can formulate the total cost of the function as:

$u_{traffic} = \frac{1}{2} \sum_{i,j} (\frac{\mu_i - \mu_j}{2})^2 A_{ij}$

Now to deal with the sizes of both chips, we can just sum up the values of the -1 and +1. If they are the same size, then the sum will be equal to 0.

$u_{size} = ( \sum \mu_i )^2$

The total cost function that we now want to minimize is:

$u_{total} = \frac{1}{2} \sum_{i,j} (\frac{\mu_i - \mu_j}{2})^2 A_{ij} + ( \sum \mu_i )^2$

We can also give different weightings (λ) to each cost value to change the importance of the size of the chips or the interchip traffic by multiplying with a factor:

$u_{total} = \frac{1}{2} \sum_{i,j} (\frac{\mu_i - \mu_j}{2})^2 A_{ij} + \lambda( \sum \mu_i )^2$

### Simulated Annealing

This method attempts to correct for the presence of several minima and is a heuristic algorithm for non-convex problems derived from materials physics. We slightly adapt the Newton method by including a random term which will cause our function to sometimes go against the gradient.

$x_{t+1} = x_t - (\nabla u)\delta t + \alpha w(0,\sigma)$
• $w(0,\sigma) \,$ is a randomly generated number from a normal distribution with mean centered at zero and a standard deviation σ.
• α is our control parameter (eg temperature) to determine the "jumpiness" of the step. The function usually begins with a high temperature and slowly decreases α allowing a complete exploration of the energy function. If we can adjust the temperature infinitely slowly, then we are guaranteed to be at the global minimum. The adjustment of α over time is known as the cooling schedule and is arbitrarily set. For some energy functions, a initially slow cooling period is required, then rapidly cools down after a certain number of time steps. For generally more convex functions, it is possible to decrease the temperature quickly at the beginning and slowly towards the end. Several cooling schedules are usually used to see which one obtains the best results.
• For any change in energy that decreases, then we will always accept the change, but for an increase in energy, we will go upwards with a certain probability defined by the Boltzmann distribution:
$p = e^{\frac{-\Delta E}{kT}}$

Pseudocode for the Simulated Annealing Algorithm (can be used for both standard or combinatorial optimization problems):

1. Select an initial state x0 at random and evaluate the energy u(x0).
2. Select x1, the next state at random and evaluate the energyu(x1).
3. Accept the change if the cost function is decreased
4. Otherwise, accept the change with a probability $p = e^{\frac{-\Delta E}{kT}}$, where ΔE is the difference in energy between the new state and the previous state
5. Decrease T (temperature) according to the cooling schedule
6. Repeat steps 2 - 5 until we reach the desired number of time steps

An example of simulated annealing motivated by Lawrence David from MIT:

Imagine that there is a large hilly region with several peaks and valleys and one small fish. You, God, suddenly made it rain very hard and the entire region was filled with water (remember Noah and the ark?). Now because the water level is quite high, the fish can explore almost every part of the region. However, now that you've finished cleansing the land of all the sinners, you want to reduce the water level so that man can once again survive. You slowly reduce the water level. The fish however, wants to live as long as possible so will tend to explore all the places until he has found the lowest point in the region. By reducing the water level, the fish sometimes can get stuck in local valleys (minima), but if we can reduce the water very slowly, then the fish will eventually be able to seek out the lowest valley and survive.

Now you might be asking WTF? But this is similar to the simulated annealing problem in that the water level is equivalent to the temperature and the highest point is what our Boltzmann probability maximum that we can jump. As God, you control the cooling schedule until your fish reaches the global minimum.

### Genetic/Evolutionary Algorithms

This method is only for combinatorial problems and was inspired by genetics. Evolutionary algorithms are generally faster and get around some of the problems found in simulated annealing. Instead of minimizing an energy function, we maximize a "fitness" when we use evolutionary algorithms. The advantages of using this algorithm over simulated annealing are the parallelism (faster because you are considering several solutions at each step) and the increased exploration of state space with "reproduction/crossovers" and "mutations" (ie random factors) of each solution. Remember that this is a heuristic algorithm and that there are no guarantees that the optimal solution will be found.

We first define our function u in state space with a discrete set of solutions.

1. Start out with a population of randomly generated solutions (initial guesses) of size P
2. Reproduction - take random pairs of parents and create their offspring (generating a total of P + P/2 solutions), reproduction is achieved through mutating the strands and then performing crossovers.
1. Mutation - Mutate each solution with a given probability at each location
2. Crossover - Move entire portions of the solution to different positions from the two randomly selected parent solutions
3. Rank the population according to their energy u(x)
4. Eliminate (Selection) the bottom P/2 to optimize u(x) leaving again a population of size P.
5. Repeat from the reproduction step until we've achieved the number of time steps that are desired.

Pseudocode for an Evolutionary Algorithm:

generation = 0;
initialize population
while generation < max_generation
for i from 1 to population_size
select two parents
crossover parents to produce a child
mutate child with certain probability
calculate the fitness of each individual in population
end for
eliminate the bottom P/2 members
generation++
update current population
end while

### Constrained Optimization

If we have constraints that are in the form of equalities, then we want to use Lagrange multipliers to solve for the solution. An example of this is to find out which distribution yields maximum entropy.

Suppose we wish to find the discrete probability distribution with maximal information entropy. Then

$f(p_1,p_2,\ldots,p_n) = -\sum_{k=1}^n p_k\log_2 p_k$

Of course, the sum of these probabilities equals 1, so our constraint is

$\sum_{k=1}^n p_k = 1$

We can use Lagrange multipliers to find the point of maximum entropy (depending on the probabilities). For all k from 1 to n, we require that

$\frac{\partial}{\partial p_k}(f+\lambda (\sum_{k=1}^n p_k-1))=0,$

(see that $\sum_{k=1}^n p_k-1 = 0$ allowing us to do the above operation) which gives

$\frac{\partial}{\partial p_k}\left(-\sum_{k=1}^n p_k \log_2 p_k + \lambda (\sum_{k=1}^n p_k - 1) \right) = 0.$

Carrying out the differentiation of these n equations, we get

$-\left(\frac{1}{\ln 2}+\log_2 p_k \right) + \lambda = 0$

If we now solve for pk:

$p_k = 2^{\frac{1}{\ln 2} - \lambda}$

This shows that all pi are constant and equal (because they depend on λ only). By using the constraint ∑k pk = 1, we find

$p_k = \frac{1}{n}.$

Hence, the uniform distribution is the distribution with the greatest entropy.

If the constraints are in the form of inequalities, then we want to use linear programming (used in the field of operations research and pioneered by Dantzig with the Simplex Algorithm). Because all our constraints are linear, then the feasible region is a convex polytope, so that the optimal solution will be at one of the vertices. This can be expanded to N-dimensions, except now, we cannot use a graphical method for solving. However, since we only care about the vertices of the polytope, we can go through each vertex in turn to maximize the cost function. This is known as the Simplex Algorithm which efficiently checks each of the vertices in our polytope until an optimal solution is found. First, we find one vertex and check if it's optimal. Then, we change one of the variables to get the next vertex and move on to increase the cost function maximally. Check each vertex to see if it is optimal until you have maximized the cost function. Computationally, the Simplex Algorithm is also combinatorial (ie O(N C N-m)) where N is the number of variables and m is the number of inequalities.

### Standard Least Squares Optimization

Standard (Linear) Least Squares Optimization is used for data fitting given one or more independent variables and one dependent variable. We establish the relationship between the variables either by a theoretical relationship or by observation and we wish to know which "line" best describes the data obtained from experimentation. For standard least squares, this problem has an analytical solution.

We have a collection of N points (xi,yi) and we have strong belief that this dependency is linear. We must assume that x is an independent variable and that there is no error involved with the measurement of this parameter. We would like to find out the coefficients of the following equation that best describes the data.

$y = a_0 + a_1x \,$

For the least squares method, we optimize the distance between the predicted line and the actual points that we have. We have a overdetermined system which we can write in matrix form below:

$\hat{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = a_0 \begin{pmatrix} \ 1 \\ \vdots \\ 1 \end{pmatrix} + a_1 \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = Xa$

The error function is now the difference between the predicted values and the observed values:

$e = \hat{y_i} - y_i$

For the least squares optimization we would like to minimize eTe

\begin{alignat}{2} e^Te & = (\hat{y_i} - y_i)^T (\hat{y_i} - y_i) \\ & = \hat{y}^T \hat{y} - \hat{y}^T y - y^T \hat{y} + y^T y \end{alignat}

But we know that $\hat{y} = Xa$, so substituting, we get:

\begin{alignat}{2} E = e^Te & = (Xa)^T (Xa) - (Xa)^T y - y^T (Xa) + y^T y \\ & = a^TX^TXa - a^TX^Ty - y^TXa + y^Ty \end{alignat}

To minimize the error, we want $\nabla E_a = 0$:

\begin{alignat}{2} \nabla E_a & = 2X^TXa - 2X^Ty = 0 \\ a & = (X^TX)^{-1}X^Ty \end{alignat}

This can be expanded into N variables, each with a linear dependence. The result of this is that it will find the best hyperplane in m-1 dimensions if we have m number of independent variables.

Let us compare the above with a determined system where there are enough equations to find the coefficients exactly.

\begin{alignat}{2} y_1 & = a_0 + a_1x_1 \\ y_2 & = a_0 + a_1x_2 \\ \end{alignat}

We can easily rewrite this in matrix form:

$\vec y = \begin{bmatrix} y_1 \\ y_2 \\ \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \end{bmatrix} = \vec X \vec a$

Now we can see to solve for the coefficients, we can just take the inverse matrix of X:

$\vec a = (\vec X)^{-1} \vec y$

Compared to the overdetermined system above calculated for the standard least squares, we can see a similarity between the matrices inverted. Because we are "inverting" a rectangular matrix for the least squares method and not a square matrix, we call this the pseudoinverse.

Pseudoinverse = (XTX) − 1XT

In Matlab, this is implemented by the function pinv().

### Standard Least Squares Example for a Straight Line Fit

If we reinvert the matrix, we obtain that

$X^TXa = X^Ty \,$

Let us take the four points (0,0), (1,8), (3,8), and (4,20) and perform standard least squares to find the line which best fits the points (minimizes the vertical error).

$X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 3 \\ 1 & 4 \\ \end{bmatrix}$
$y = \begin{bmatrix} 0 \\ 8 \\ 8 \\ 20 \\ \end{bmatrix}$

Performing the calculations:

$X^TX = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 3 \\ 1 & 4 \\ \end{bmatrix} = \begin{bmatrix} 4 & 8 \\ 8 & 26 \\ \end{bmatrix}$

and

$X^Ty = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 3 & 4 \end{bmatrix} \begin{bmatrix} 0 \\ 8 \\ 8 \\ 20 \\ \end{bmatrix} = \begin{bmatrix} 36 \\ 112 \\ \end{bmatrix}$

The system is then solved with

$\begin{bmatrix} 4 & 8 \\ 8 & 26 \\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \end{bmatrix} = \begin{bmatrix} 36 \\ 112 \\ \end{bmatrix}$

By solving the system of equations, we obtain that

$\begin{bmatrix} a_0 \\ a_1 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \\ \end{bmatrix}$

Hence the best fit line through the points is $y = 1 + 4x \,$

### Total Least Squares, Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

Total least squares is a method to finding the best fit curve given that we don't know the independent and dependent variables of the system. For example, given several different parameters that one is looking at, we are unsure of which variables depend on the other. In effect, total least squares finds the best fit line such that the error minimized is the perpendicular distance between the measured point and the optimal line (as opposed to the vertical distance in standard least squares).

Singular value decomposition yields the covariance matrix between the variables that we have and the matrix of principle components from which we can perform Principal Component Analysis. Singular value decomposition is equivalent to the diagonalization of rectangular matrices and typically yield a series of ranked numbers. The largest of these values are known as the principal components of the system and allow us to determine which of the variables are most strongly correlated to one another. For example, given 5 variables and our SVD yields that 2 of the variables are much higher than the other, we have 2 principal components which should be sufficient to describe the trends in the data and the other variables have little or no correlation to what is being observed. Using PCA allows us to discard redundant variables in our system leaving what is known as the lower rank approximation to the system.

The decomposition of a matrix X yields the following:

$X = U\Sigma V^T \,$

Where U and V are unitary matrices and Σ is the matrix with the singular values. Usually, the singular value decomposition is preformed on the matrix of the variables with the mean subtracted from all the data points and not usually on the raw data. This allows an easier interpretation of the decomposition. The covariance matrix can then be calculated from the altered data and the singular values correspond to the eigenvalues of this n x n covariance matrix (with n number of variables).

To obtain the lower rank approximation, we reduce the number of rows of Σ such that it becomes an r x r matrix, where r is the number of variables. The reconstructed values of X now take into account only the singular values with the highest value and discard those with lower values that are may not be pertinent to explaining the trendline.

### Artificial Neural Networks

Optimization within artificial neural networks can also be a challenging problem. We first define a neural network through the perceptron, or a series of inputs, hidden nodes, and outputs in a funnel like structure which allow fast computation of certain types of problems. Connections exist between layers but do not exist within layers. Each edge contains a weight by which the input from the previous layer is multiplied to obtain the next layer.

The output is a nested series of functions that are weighted by each node in the system. This is usually a non-trivial solution and very non-linear. The weights of the edges are usually obtained through "learning", or taking several known inputs and outputs and allowing the weights of each to change such that the known output is obtained. The three types of learning are used and are derived from biological heuristics: the Hebbian Rule, the Darwinian Rule, and Steepest Descent.

The Hebbian rule (potentiation) is based upon the theory that synapses that are used most are strengthened, ie the weights of those edges are higher based on how many times they are used in the learning sequence. The Darwinian rule initially assigns random weights to each edge and evolves them over time to minimize the error in the output. Steepest descent attempts to minimize the error function similar to the steepest descent methods described earlier. An implementation of this method is known as back propagation (or this website on back propagation is also good), where we first start from the output and calculate and minimize the local errors involved to find the best weights. Although back propagation is not a biological example, it is a heuristic algorithm that tends to work when training neural networks.