
Optimization

General Formulation

The general formulation for any optimization problem is this:

  • Minimize [math]\displaystyle{ f(x) \, }[/math], our cost/objective function, subject to a series of constraints:
    • [math]\displaystyle{ g_i(x) = 0 \, }[/math]
    • [math]\displaystyle{ h_i(x) \le 0 }[/math]
  • Where [math]\displaystyle{ x }[/math] is the variable for optimization (which can be multidimensional) and [math]\displaystyle{ g_i }[/math] and [math]\displaystyle{ h_i }[/math] are our constraints

What do we mean when we say [math]\displaystyle{ f(x) }[/math]?

  • [math]\displaystyle{ f(x) }[/math] is a function in the general sense, which can be continuous or discontinuous
  • Think of [math]\displaystyle{ f(x) }[/math] as a black box. For any input given, we will receive an output

Standard Optimization vs. Combinatorial Optimization

  • Standard optimization (continuous) deals with functions where the output is continuous (both variables and functions are continuous)
  • Combinatorial optimization deals with functions where the output is a discrete set (the fact that it is discrete does not mean that it is finite). It is important to note that calculus does not apply to these systems, making them more difficult to optimize

Traveling Salesman Problem

The traveling salesman problem is a classical combinatorial optimization problem formulated as such:

  • A salesman has to visit N cities. In what sequence should he travel through the cities to minimize the total distance travelled?
    • You can see that there are N! possible sequences for the N cities

Possible ways of solving this problem:

  • Heuristics - informed guesses at the correct path (more properly, methods/algorithms that tend to give a good answer, but are not guaranteed to find the optimal solution)
  • Complete Enumeration - going through all N! possible solutions to find the optimal path. Here, the computational time is proportional to the number of paths, or in big-O notation: O(N!)
    • This O(N!) growth is known as combinatorial explosion, and the problem is NP-hard (NP: non-deterministic polynomial time)
    • Some problems have been proven to be solvable only by complete enumeration; other methods do not guarantee the optimal solution. A brute-force enumeration sketch follows this list.
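
As an illustration of complete enumeration, here is a minimal brute-force solver in Python; the 5-city distance matrix D is a made-up example:

 from itertools import permutations
 
 # Hypothetical symmetric distance matrix for 5 cities (illustrative numbers).
 D = [[0, 2, 9, 10, 7],
      [2, 0, 6, 4, 3],
      [9, 6, 0, 8, 5],
      [10, 4, 8, 0, 6],
      [7, 3, 5, 6, 0]]
 
 def tour_length(tour):
     """Total distance of a closed tour visiting every city once."""
     return sum(D[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))
 
 # Complete enumeration is O(N!); fixing city 0 as the start avoids
 # counting rotations of the same tour more than once.
 best = min(((0,) + p for p in permutations(range(1, len(D)))), key=tour_length)
 print(best, tour_length(best))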

Dimensionality

Recall that for a one-dimensional system, when we want to find the minimum of a given function [math]\displaystyle{ f(x) }[/math], a continuous, differentiable function, we look to the derivative such that:

[math]\displaystyle{ \frac{df}{dx} = 0 }[/math]

To ensure that we have a minimum, we want:

[math]\displaystyle{ \frac{d^2f}{dx^2} \gt 0 }[/math]

The first derivative tells us if there is a change in the monotonicity of the function, while the second derivative guarantees us a minimum. For a two-dimensional system, [math]\displaystyle{ f(x,y) }[/math], we look to the gradient and to the matrix of second derivatives (the Hessian) such that:

[math]\displaystyle{ \nabla f = 0 }[/math]

And that:

[math]\displaystyle{ H= \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} \gt 0 }[/math]

The condition above is known as positive definiteness, and it guarantees that we have a minimum. H is also positive definite if the following condition holds:

[math]\displaystyle{ x^THx \gt 0, \quad \forall x \neq 0 }[/math]

For a symmetric matrix (usually this matrix will be symmetric), H is also positive definite if all of its eigenvalues are positive. Positive eigenvalues correspond to the function being convex in both directions. Recall that if one eigenvalue is positive and the other negative, then the point is a saddle point. A quick eigenvalue check is sketched below.
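
A minimal positive-definiteness check via the eigenvalues, assuming a hypothetical symmetric Hessian evaluated at a candidate point:

 import numpy as np
 
 # A hypothetical symmetric Hessian evaluated at a candidate point.
 H = np.array([[2.0, 0.5],
               [0.5, 1.0]])
 
 # For a symmetric matrix, positive definiteness is equivalent to all
 # eigenvalues being positive; eigvalsh is the symmetric eigensolver.
 eigenvalues = np.linalg.eigvalsh(H)
 print(eigenvalues, "positive definite:", np.all(eigenvalues > 0))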

Steepest Descent/Newton Method for Optimization

General formulation:

  • We wish to minimize [math]\displaystyle{ f(x) }[/math], given that its Hessian is positive definite.

Methods for approaching the problem:

  • The fsolve() function in Matlab finds the zeros of a function (so one must supply the derivative of the function and solve [math]\displaystyle{ \nabla f = 0 }[/math] to obtain an extremum)
  • ode45() can be used if we can define the function to be minimized in terms of a potential function (and, as with fsolve(), the gradient of the function must be supplied)
    • Let [math]\displaystyle{ u(x) }[/math] be our energy function such that if we integrate [math]\displaystyle{ \frac{dx}{dt} = -\nabla u }[/math] with respect to time, then we are assured that our energy function will decrease along all trajectories.
    • We must supply ode45() with an initial condition to run, and with this method we are guaranteed a minimum, but not necessarily the global minimum. Only if the function is convex are we guaranteed a global minimum; hence convex functions are easier to optimize than non-convex functions. A gradient-flow sketch using Python's ODE solver follows this list.
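
A minimal gradient-flow sketch, with scipy's solve_ivp playing the role of ode45; the convex energy u(x, y) = (x - 1)² + 2(y + 0.5)² and its hand-supplied gradient are made-up examples:

 import numpy as np
 from scipy.integrate import solve_ivp
 
 # Right-hand side dx/dt = -grad u for the hypothetical energy
 # u(x, y) = (x - 1)^2 + 2*(y + 0.5)^2.
 def neg_grad_u(t, x):
     return [-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 0.5)]
 
 # Integrate from an arbitrary initial condition; the trajectory descends
 # the energy surface toward the minimizer (1, -0.5).
 sol = solve_ivp(neg_grad_u, t_span=(0.0, 20.0), y0=[5.0, 5.0])
 print(sol.y[:, -1])  # final state, close to the minimum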

Convex functions - functions for which the Hessian is positive definite at every point x (the function has only one extremum). Convexity is a property of the choice of variables; if we can redefine the variables such that our function becomes convex, then this makes optimization much easier.

On the computer, steepest descent is performed by the Euler method of integration, given our function:

[math]\displaystyle{ \frac{dx}{dt} = - \nabla u }[/math]

The Euler method expands out the differential to give:

[math]\displaystyle{ x_{t+1} = x_t - (\nabla u)\delta t }[/math]

The solver adjusts [math]\displaystyle{ \delta t }[/math] to reach the minimum more efficiently and will decrease [math]\displaystyle{ \nabla u }[/math] as far as possible, iterating until a given tolerance is reached such that (see the sketch below):

[math]\displaystyle{ \parallel x_{t+1} - x_t \parallel \lt \varepsilon }[/math]
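
A fixed-step Euler implementation of this update, reusing the hypothetical energy from the previous sketch (a production solver would adapt δt):

 import numpy as np
 
 def grad_u(x):
     # Gradient of the hypothetical energy u(x, y) = (x - 1)^2 + 2*(y + 0.5)^2.
     return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])
 
 x = np.array([5.0, 5.0])   # initial condition
 dt, eps = 0.1, 1e-8        # fixed step size and stopping tolerance
 
 while True:
     x_next = x - grad_u(x) * dt                    # x_{t+1} = x_t - (grad u) dt
     converged = np.linalg.norm(x_next - x) < eps   # ||x_{t+1} - x_t|| < eps
     x = x_next
     if converged:
         break
 
 print(x)  # approximately (1, -0.5)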

Possible problems with steepest descent:

  • If the function has a very narrow well around the minimum, then the initial condition must be very close to the minimum to achieve optimization
  • If the function has several minima of various depths, then many initial conditions must be used to guarantee that we obtain the global minimum of the function.
  • Convergence of steepest descent is related to how fast you get to the solution. Very flat but convex functions take relatively long to converge, since the variation in the function is too small to indicate the step size that we should be taking. To correct for flatness in a function we can use conjugate gradient methods, which don't always take the path against the gradient but will take steps out of the direction of the gradient (similar to going up the ridges of a valley) before again taking steps in the direction of the gradient.
  • Even for a convex function, steepest descent will only reach the exact minimum given infinite time (remember that this only works for continuous functions).

Can we use steepest descent for combinatorial optimization? Yes! For example, you take your discrete solution [math]\displaystyle{ x_0 = \{ \mu_1, \mu_2, \dots, \mu_N \} }[/math] and change the state of each μ that decreases the cost function. In effect, steepest descent is similar to a greedy algorithm in that it minimizes each μ locally in an attempt to minimize the cost function globally. The only difference between steepest descent and simulated annealing is that simulated annealing allows upward changes in energy while steepest descent does not. A bit-flip local search in this spirit is sketched below.
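
A minimal sketch of this idea for binary states, assuming a made-up cost function (here, disagreement with a hidden target pattern):

 def local_search(x, cost):
     """Greedy descent on a binary state: flip any mu_i that lowers the
     cost; stop when no single flip improves (a local minimum)."""
     best = cost(x)
     improved = True
     while improved:
         improved = False
         for i in range(len(x)):
             x[i] ^= 1          # tentatively flip mu_i
             c = cost(x)
             if c < best:       # keep the flip only if it lowers the cost
                 best, improved = c, True
             else:
                 x[i] ^= 1      # otherwise undo the flip
     return x, best
 
 # Hypothetical cost: number of disagreements with a hidden target pattern.
 target = [1, 0, 1, 1, 0, 0, 1, 0]
 cost = lambda x: sum(a != b for a, b in zip(x, target))
 print(local_search([0] * 8, cost))  # reaches the target with cost 0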

Combinatorial Optimization with Greedy Algorithms

For discrete sets in combinatorial problems, we utilize greedy algorithms, which take a starting condition, look to its nearest neighbours and minimize that distance before continuing. Remember that without complete enumeration this will not necessarily give the best solution and may get caught paying large distances later on in the algorithm. The algorithm minimizes locally in an attempt to minimize globally, and this is no longer an O(N!) problem, making it computationally easier than complete enumeration. A good example of a greedy algorithm is Dijkstra's algorithm for finding the minimum pathway through a network. Minimum spanning tree algorithms include Kruskal's and Prim's algorithms. A nearest-neighbour sketch for the travelling salesman problem is given below.
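
A minimal nearest-neighbour greedy heuristic for the travelling salesman problem, reusing the same made-up 5-city distance matrix as in the enumeration sketch:

 # Same hypothetical 5-city distance matrix as in the enumeration sketch.
 D = [[0, 2, 9, 10, 7],
      [2, 0, 6, 4, 3],
      [9, 6, 0, 8, 5],
      [10, 4, 8, 0, 6],
      [7, 3, 5, 6, 0]]
 
 def nearest_neighbour_tour(D, start=0):
     """Greedy TSP heuristic: from the current city, always travel to the
     nearest unvisited city. Fast, but not guaranteed to be optimal."""
     tour, unvisited = [start], set(range(len(D))) - {start}
     while unvisited:
         here = tour[-1]
         nxt = min(unvisited, key=lambda city: D[here][city])
         tour.append(nxt)
         unvisited.remove(nxt)
     return tour
 
 print(nearest_neighbour_tour(D))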

Simulated Annealing

This method attempts to correct for the presence of several minima and is a heuristic algorithm for non-convex problems derived from materials physics. We slightly adapt the steepest descent update by including a random term which will sometimes cause the iterate to move against the gradient.

[math]\displaystyle{ x_{t+1} = x_t - (\nabla u)\delta t + \alpha w(0,\sigma) }[/math]
  • [math]\displaystyle{ w(0,\sigma) \, }[/math] is a randomly generated number from a normal distribution with mean zero and standard deviation σ.
  • α is our control parameter (e.g. temperature) which determines the "jumpiness" of the step. The algorithm usually begins with a high temperature and slowly decreases α, allowing a complete exploration of the energy function. If we could adjust the temperature infinitely slowly, then we would be guaranteed to arrive at the global minimum. The adjustment of α over time is known as the cooling schedule and is set arbitrarily. For some energy functions an initially slow cooling period is required, followed by rapid cooling after a certain number of time steps. For generally more convex functions, it is possible to decrease the temperature quickly at the beginning and slowly towards the end. Several cooling schedules are usually tried to see which one obtains the best results.
  • Any change that decreases the energy is always accepted; an increase in energy is accepted with a certain probability defined by the Boltzmann distribution:
[math]\displaystyle{ p = e^{\frac{-\Delta E}{kT}} }[/math]

Pseudocode for the Simulated Annealing Algorithm (can be used for both standard and combinatorial optimization problems; a Python sketch follows the steps):

  1. Select an initial state x0 at random and evaluate the energy u(x0).
  2. Select x1, the next state, at random and evaluate the energy u(x1).
  3. Accept the change if the cost function is decreased
  4. Otherwise, accept the change with a probability [math]\displaystyle{ p = e^{\frac{-\Delta E}{kT}} }[/math]
  5. Decrease T according to the cooling schedule
  6. Repeat steps 2 - 5 until we reach the desired number of time steps
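
A minimal simulated-annealing sketch, assuming a made-up one-dimensional energy with many local minima and a simple geometric cooling schedule (one of many possible choices):

 import math
 import random
 
 def simulated_annealing(x0, energy, neighbour, T0=1.0, cooling=0.995, steps=10000):
     """Heuristic minimization: always accept downhill moves, accept uphill
     moves with Boltzmann probability exp(-dE / T), and cool T each step."""
     x, e, T = x0, energy(x0), T0
     for _ in range(steps):
         x_new = neighbour(x)        # select the next state at random
         e_new = energy(x_new)
         dE = e_new - e
         if dE < 0 or random.random() < math.exp(-dE / T):
             x, e = x_new, e_new     # accept the change
         T *= cooling                # decrease T per the cooling schedule
     return x, e
 
 # Toy non-convex energy (made up) and a Gaussian w(0, sigma) jump.
 energy = lambda x: x * x + 10.0 * math.sin(5.0 * x)
 neighbour = lambda x: x + random.gauss(0.0, 0.5)
 print(simulated_annealing(2.5, energy, neighbour))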

Genetic/Evolutionary Algorithms

This method is only for combinatorial problems and was inspired by genetics. Evolutionary algorithms are generally faster and get around some of the problems found in simulated annealing. Instead of minimizing an energy function, we maximize a "fitness" when we use evolutionary algorithms. The advantages of using this algorithm over simulated annealing are its parallelism (speed) and the increased exploration of state space through "reproduction" and "mutation" of each solution. Remember that this is a heuristic algorithm and there is no guarantee that the optimal solution will be found.

We first define our function u in state space with a discrete set of solutions.

  1. Start out with a population of randomly generated solutions (initial guesses) of size P
  2. Reproduction - take random pairs of parents and create their offspring (generating a total of P + P/2 solutions); reproduction is achieved by mutating the strands and then performing crossovers.
    1. Mutation - mutate each solution with a given probability at each location
    2. Crossover - exchange entire portions of the solution between the two randomly selected parent solutions
  3. Rank the population according to its energy u(x)
  4. Eliminate (Selection) the bottom P/2 to optimize u(x), leaving again a population of size P.
  5. Repeat from the reproduction step until we've achieved the desired number of time steps.

Pseudocode for an Evolutionary Algorithm (note that the reproduction loop creates P/2 children, so that eliminating the bottom P/2 restores a population of size P):

 generation = 0
 initialize population of size P
 while generation < max_generation
     for i from 1 to P/2
         select two parents
         crossover parents to produce a child
         mutate child with certain probability
     end for
     calculate the fitness of each individual in population
     eliminate the bottom P/2 members
     update current population
     generation++
 end while
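
A runnable Python sketch of this scheme, assuming a made-up fitness (counting 1-bits, so the optimum is the all-ones string):

 import random
 
 P, GENOME, MAX_GEN = 20, 16, 200
 
 fitness = lambda x: sum(x)   # toy fitness: number of 1-bits
 
 def crossover(a, b):
     """Single-point crossover of two parent bit strings."""
     cut = random.randrange(1, GENOME)
     return a[:cut] + b[cut:]
 
 def mutate(x, rate=0.05):
     """Flip each bit independently with probability `rate`."""
     return [b ^ (random.random() < rate) for b in x]
 
 # Initial population of P random solutions.
 population = [[random.randint(0, 1) for _ in range(GENOME)] for _ in range(P)]
 
 for generation in range(MAX_GEN):
     # Reproduction: create P/2 children, giving P + P/2 candidates.
     for _ in range(P // 2):
         a, b = random.sample(population, 2)      # select two parents
         population.append(mutate(crossover(a, b)))
     # Selection: rank by fitness and eliminate the bottom P/2.
     population.sort(key=fitness, reverse=True)
     population = population[:P]
 
 print(fitness(population[0]), population[0])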

Constrained Optimization

If we have constraints that are in the form of equalities, then we want to use Lagrange multipliers to solve for the solution. An example of this is to find out which distribution yields maximum entropy.

Below is taken from the Wikipedia page on Lagrange Multipliers:

Suppose we wish to find the discrete probability distribution with maximal information entropy. Then

[math]\displaystyle{ f(p_1,p_2,\ldots,p_n) = -\sum_{k=1}^n p_k\log_2 p_k. }[/math]

Of course, the sum of these probabilities equals 1, so our constraint is g(p) = 1 with

[math]\displaystyle{ g(p_1,p_2,\ldots,p_n)=\sum_{k=1}^n p_k. }[/math]

We can use Lagrange multipliers to find the point of maximum entropy (depending on the probabilities). For all k from 1 to n, we require that

[math]\displaystyle{ \frac{\partial}{\partial p_k}(f+\lambda (g-1))=0, }[/math]

which gives

[math]\displaystyle{ \frac{\partial}{\partial p_k}\left(-\sum_{k=1}^n p_k \log_2 p_k + \lambda (\sum_{k=1}^n p_k - 1) \right) = 0. }[/math]

Carrying out the differentiation of these n equations, we get

[math]\displaystyle{ -\left(\frac{1}{\ln 2}+\log_2 p_k \right) + \lambda = 0. }[/math]

This shows that all pk are equal (because they depend on λ only). By using the constraint ∑k pk = 1, we find

[math]\displaystyle{ p_k = \frac{1}{n}. }[/math]

Hence, the uniform distribution is the distribution with the greatest entropy.

If the constraints are in the form of inequalities, then we want to use linear programming (used in the field of operations research and pioneered by Dantzig with the Simplex Algorithm). Because all our constraints are linear, the feasible region is a convex polytope, and the optimal solution will be at one of its vertices. This extends to N dimensions, except that we can no longer solve graphically. However, since only the vertices of the polytope matter, we can move from vertex to vertex to maximize the cost function. This is the Simplex Algorithm, which efficiently checks vertices of the polytope until an optimal solution is found. First, find one vertex and check whether it is optimal. Then change one of the variables to move to the neighbouring vertex that increases the cost function the most, and repeat until the cost function cannot be improved. A small sketch using an off-the-shelf solver is given below.
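
A small sketch using scipy's off-the-shelf linear programming solver, linprog; the toy objective and constraints are made up:

 from scipy.optimize import linprog
 
 # Hypothetical toy problem: maximize 3x + 2y subject to
 #   x + y <= 4,  x + 3y <= 6,  x >= 0,  y >= 0.
 # linprog minimizes, so we negate the objective.
 c = [-3.0, -2.0]
 A_ub = [[1.0, 1.0],
         [1.0, 3.0]]
 b_ub = [4.0, 6.0]
 
 res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
 print(res.x, -res.fun)  # optimum at a vertex of the polytope: (4, 0), value 12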

Standard Least Squares Optimization

Standard (Linear) Least Squares Optimization is used for data fitting given one or more independent variables and one dependent variable. We establish the relationship between the variables either by a theoretical relationship or by observation and we wish to know which "line" best describes the data obtained from experimentation. For standard least squares, this problem has an analytical solution.

We have a collection of N points [math]\displaystyle{ (x_i,y_i) }[/math] and strong reason to believe that the dependency is linear. We must assume that x is an independent variable and that there is no error involved with the measurement of this parameter. We would like to find the coefficients of the following equation that best describe the data.

[math]\displaystyle{ y = a_0 + a_1x \, }[/math]

For the least squares method, we optimize the distance between the predicted line and the actual points that we have. We have an overdetermined system, which we can write in matrix form below:

[math]\displaystyle{ \hat{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = a_0 \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + a_1 \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = Xa }[/math]

The error function is now the difference between the predicted values and the observed values:

[math]\displaystyle{ e = \hat{y} - y }[/math]

For the least squares optimization we would like to minimize [math]\displaystyle{ e^Te }[/math]

[math]\displaystyle{ \begin{alignat}{2} e^Te & = (\hat{y} - y)^T (\hat{y} - y) \\ & = \hat{y}^T \hat{y} - \hat{y}^T y - y^T \hat{y} + y^T y \end{alignat} }[/math]

But we know that [math]\displaystyle{ \hat{y} = Xa }[/math], so substituting, we get:

[math]\displaystyle{ \begin{alignat}{2} E = e^Te & = (Xa)^T (Xa) - (Xa)^T y - y^T (Xa) + y^T y \\ & = a^TX^TXa - a^TX^Ty - y^TXa + y^Ty \end{alignat} }[/math]

To minimize the error, we want [math]\displaystyle{ \nabla E_a = 0 }[/math]:

[math]\displaystyle{ \begin{alignat}{2} \nabla E_a & = 2X^TXa - 2X^Ty = 0 \\ a & = (X^TX)^{-1}X^Ty \end{alignat} }[/math]

This can be expanded to N variables, each with a linear dependence. The result is that it finds the best-fit hyperplane when we have m independent variables. A numerical sketch of the solution is given below.
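
A numerical sketch of the normal-equations solution on made-up data:

 import numpy as np
 
 # Synthetic data (made up): y is roughly 1.5 + 0.8x plus noise.
 rng = np.random.default_rng(0)
 x = np.linspace(0.0, 10.0, 50)
 y = 1.5 + 0.8 * x + rng.normal(0.0, 0.3, x.size)
 
 # Design matrix X with a column of ones for the intercept a0.
 X = np.column_stack([np.ones_like(x), x])
 
 # Normal equations: a = (X^T X)^{-1} X^T y (solve rather than invert).
 a = np.linalg.solve(X.T @ X, X.T @ y)
 print(a)  # approximately [1.5, 0.8]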

Total Least Squares, Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

Total least squares is a method of finding the best-fit curve given that we do not know which variables in the system are independent and which are dependent. For example, given several different parameters that one is looking at, we are unsure of which variables depend on the others. In effect, total least squares finds the best-fit line such that the error minimized is the perpendicular distance between the measured point and the optimal line (as opposed to the vertical distance in standard least squares).

Singular value decomposition yields the covariance structure between the variables that we have and the matrix of principal components from which we can perform Principal Component Analysis. Singular value decomposition is the equivalent of diagonalization for rectangular matrices and typically yields a series of ranked singular values. The directions with the largest of these values are known as the principal components of the system and allow us to determine which of the variables are most strongly correlated to one another. For example, if we have 5 variables and the SVD yields 2 singular values much larger than the others, then we have 2 principal components, which should be sufficient to describe the trends in the data; the other variables have little or no correlation to what is being observed. Using PCA allows us to discard redundant variables in our system, leaving what is known as the lower-rank approximation to the system. A PCA sketch via the SVD is given below.
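
A PCA sketch via the SVD on made-up data, in which two latent factors drive most of the variance in five observed variables:

 import numpy as np
 
 # Made-up data matrix: 100 samples of 5 variables driven by 2 latent factors.
 rng = np.random.default_rng(1)
 latent = rng.normal(size=(100, 2))
 mixing = rng.normal(size=(2, 5))
 data = latent @ mixing + 0.05 * rng.normal(size=(100, 5))
 
 # Centre the variables, then take the SVD; the singular values rank how
 # much variance each principal direction explains.
 centred = data - data.mean(axis=0)
 U, s, Vt = np.linalg.svd(centred, full_matrices=False)
 print(s)  # two singular values dominate -> two principal components
 
 # Lower-rank approximation: keep only the top 2 components.
 rank2 = (U[:, :2] * s[:2]) @ Vt[:2]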

Artificial Neural Networks

Optimization within artificial neural networks can also be a challenging problem. We first define a neural network through the perceptron: a series of inputs, hidden nodes, and outputs in a funnel-like structure which allows fast computation of certain types of problems. Connections exist between layers but do not exist within layers. Each edge carries a weight by which the input from the previous layer is multiplied to obtain the next layer.

The output is a nested series of functions weighted by each node in the system; it is usually non-trivial and very non-linear. The weights of the edges are usually obtained through "learning": taking several known inputs and outputs and allowing the weights to change such that the known outputs are reproduced. Three types of learning are used, derived from biological heuristics: the Hebbian rule, the Darwinian rule, and steepest descent.

The Hebbian rule (potentiation) is based upon the theory that the synapses used most are strengthened, i.e. the weights of those edges are increased according to how often they are used in the learning sequence. The Darwinian rule initially assigns random weights to each edge and evolves them over time to minimize the error in the output. Steepest descent attempts to minimize the error function, similar to the steepest descent methods described earlier. An implementation of this method is known as back propagation, where we start from the output and calculate and minimize the local errors involved to find the best weights. Although back propagation is not a biological example, it is a heuristic algorithm that tends to work when training neural networks. A minimal back-propagation sketch is given below.
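
A minimal back-propagation sketch: a made-up 2-4-1 sigmoid network trained by steepest descent on the squared error of the XOR task (all sizes, rates, and data are illustrative assumptions):

 import numpy as np
 
 rng = np.random.default_rng(2)
 
 # Inputs and targets for the XOR task (a made-up illustration).
 X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
 t = np.array([[0], [1], [1], [0]], dtype=float)
 
 # Edge weights and biases for a 2-4-1 network.
 W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
 W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
 sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
 
 for step in range(20000):
     # Forward pass: each layer multiplies its input by the edge weights.
     h = sigmoid(X @ W1 + b1)     # hidden layer
     y = sigmoid(h @ W2 + b2)     # output layer
     # Backward pass: propagate the output error back through the layers.
     d_out = (y - t) * y * (1 - y)
     d_hid = (d_out @ W2.T) * h * (1 - h)
     # Steepest-descent update of weights and biases.
     W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
     W1 -= 0.5 * X.T @ d_hid;  b1 -= 0.5 * d_hid.sum(axis=0)
 
 print(np.round(y.ravel(), 2))  # should approach the XOR targets 0, 1, 1, 0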