Drummond:Coupling: Difference between revisions

Revision as of 21:46, 13 July 2008

Prediction of probability of protein folding

Assume that folding is a binary outcome represented by the random variable [math]\displaystyle{ F }[/math]. Given some predictor [math]\displaystyle{ X }[/math] (such as mean pair probability, max % identity, or number of missed pairs), we want to infer [math]\displaystyle{ \Pr(F|X) }[/math]. We assume that there is a sigmoidal relationship between [math]\displaystyle{ X }[/math] and the probability of folding,

[math]\displaystyle{ \Pr(F|X) = p = 1/(1 + e^{aX + b}) }[/math]

where [math]\displaystyle{ a }[/math] and [math]\displaystyle{ b }[/math] quantify the steepness and position of the sigmoid. This formulation is equivalent to assuming a linear relationship between the predictor [math]\displaystyle{ X }[/math] and the log-odds of not folding,

[math]\displaystyle{ aX + b = \ln {1-p \over p} }[/math].
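The equivalence between the sigmoid and the linear log-odds form can be checked numerically. The following is a minimal sketch; the function names and the values of a and b are hypothetical, chosen only for illustration.

```python
import math

def fold_probability(x, a, b):
    """Pr(F | X = x) = 1 / (1 + exp(a*x + b)), the sigmoid form used in the text."""
    return 1.0 / (1.0 + math.exp(a * x + b))

def log_odds_not_folding(p):
    """ln((1 - p) / p), which the text equates with a*x + b."""
    return math.log((1.0 - p) / p)

# Hypothetical parameter values (not fitted to any data).
a, b = -2.0, 1.0

for x in (-1.0, 0.0, 0.5, 3.0):
    p = fold_probability(x, a, b)
    # The log-odds of not folding should recover the linear predictor.
    assert math.isclose(log_odds_not_folding(p), a * x + b, abs_tol=1e-9)
    print(x, p)
```

Note that with this sign convention a negative a means the probability of folding increases with the predictor.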

We can write down the likelihood of the observed data, where [math]\displaystyle{ x_i }[/math] is the value of the predictor [math]\displaystyle{ X }[/math] for the [math]\displaystyle{ i }[/math]-th synthesized protein:

[math]\displaystyle{ L(F|\{x_i\})\! }[/math]	[math]\displaystyle{ = \prod_{i \in \textrm{folded}} \Pr(F|X=x_i) \prod_{j \in \textrm{unfolded}} (1 - \Pr(F|X=x_j)) }[/math]
[math]\displaystyle{ \ln L(F|\{x_i\})\! }[/math]	[math]\displaystyle{ = \sum_{i \in \textrm{folded}} \ln \Pr(F|X=x_i) + \sum_{j \in \textrm{unfolded}} \ln (1 - \Pr(F|X=x_j)) }[/math]
[math]\displaystyle{ \ln L(F|\{x_i\})\! }[/math]	[math]\displaystyle{ = -\sum_{i \in \textrm{all}} \ln (1 + e^{ax_i + b}) + \sum_{j \in \textrm{unfolded}} (a x_j + b) }[/math]

The parameters can then be fit by maximizing the log-likelihood function. The whole process is termed logistic regression.
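As an illustration of the fitting step, the sketch below maximizes the final form of the log-likelihood with Newton's method, using only the Python standard library. The data are invented toy values, not the Socolich et al. measurements, and the choice of Newton's method, the function names, and the iteration count are assumptions; the text does not say which optimizer was used.

```python
import math

def log_likelihood(a, b, xs, folded):
    """ln L = -sum_all ln(1 + e^(a*x + b)) + sum_unfolded (a*x + b)."""
    total = 0.0
    for x, f in zip(xs, folded):
        z = a * x + b
        total -= math.log1p(math.exp(z))
        if not f:
            total += z
    return total

def gradient(a, b, xs, folded):
    """Partial derivatives of ln L with respect to a and b."""
    ga = gb = 0.0
    for x, f in zip(xs, folded):
        q = 1.0 / (1.0 + math.exp(-(a * x + b)))  # q = 1 - Pr(F | x)
        r = (0.0 if f else 1.0) - q
        ga += r * x
        gb += r
    return ga, gb

def fit(xs, folded, iterations=25):
    """Newton's method on the concave log-likelihood, starting from a = b = 0."""
    a = b = 0.0
    for _ in range(iterations):
        ga, gb = gradient(a, b, xs, folded)
        # Entries of the negated Hessian [[A, B], [B, C]].
        A = B = C = 0.0
        for x in xs:
            q = 1.0 / (1.0 + math.exp(-(a * x + b)))
            w = q * (1.0 - q)
            A += w * x * x
            B += w * x
            C += w
        det = A * C - B * B
        a += (C * ga - B * gb) / det
        b += (A * gb - B * ga) / det
    return a, b

# Toy data (hypothetical): predictor values and whether each protein folded.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
folded = [False, False, True, False, True, False, True, True]

a_hat, b_hat = fit(xs, folded)
print(a_hat, b_hat, log_likelihood(a_hat, b_hat, xs, folded))
```

Newton's method can fail when the folded and unfolded groups are perfectly separated by the predictor, because the maximum-likelihood estimate is then not finite; the toy data above are deliberately overlapping.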

Application to WW domains

Given only the Socolich et al. data, we can estimate the probability of folding as a function of mean pair probability, max % identity, or number of missed pairs. Specifically, for each predictor we can estimate the curve

[math]\displaystyle{ f(x) = \Pr(F|X=x) = 1/(1 + e^{ax + b})\! }[/math]

In the overview figure below, the sigmoid curves show these estimated curves, based on the variable in each column, derived from maximum-likelihood fitting of the Socolich et al. data (excluding random sequences). "Missed pairs" (MP) counts pairs of amino acids that do not appear at all in the overall natural MSA (Russ + Socolich, 407 proteins). "Mean prob. of unmissed pairs" (MPUP) quantifies the mean probability of pairs which are found at least once in the overall natural MSA. These two measures are in principle independent.

Reasoning from these curves:

Probability that a random engineered protein folds, considering pair.p = 0.338
- Excess over Socolich et al. = 1.32
Probability that a random engineered protein folds, considering missed.pairs = 0.274
- Excess over Socolich et al. = 1.69

As the table below shows, folded proteins (considering all synthesized proteins, or just CC proteins) have statistically significantly fewer missed pairs, higher mean pair probabilities, higher max % identity, and higher MPUP than unfolded proteins. (One-sided Wilcoxon signed rank test.)

Data	X	[math]\displaystyle{ \sim }[/math]	[math]\displaystyle{ \Pr(X(\textrm{folded}) \sim X(\textrm{unfolded})) }[/math]
all	missed.pairs	<	4.759025e-10
all	max.id	>	9.231519e-10
all	pair.p.unmissed	>	3.62112e-05
all	pair.p	>	2.034076e-07
cc	missed.pairs	<	0.002047423
cc	max.id	>	0.003763384
cc	pair.p.unmissed	>	0.0008415346
cc	pair.p	>	0.0008970336
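A one-sided rank comparison of this kind can be sketched as follows. The text names a Wilcoxon signed rank test; because folded and unfolded proteins are not naturally paired, this sketch instead uses the Wilcoxon rank-sum (Mann-Whitney) statistic with a normal approximation, which is an assumption about what was intended. The data and the function name are hypothetical.

```python
import math

def rank_sum_p_less(xs, ys):
    """One-sided p-value for the alternative 'xs tend to be smaller than ys'.

    Uses the Mann-Whitney U statistic with a normal approximation and
    no tie correction, so it is only a rough guide for small samples.
    """
    n, m = len(xs), len(ys)
    # U counts pairs with x > y; ties contribute one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean) / sd
    # Standard normal CDF evaluated at z.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Toy data (hypothetical): missed-pair counts for folded and unfolded proteins.
folded_missed = [0, 1, 1, 2, 3]
unfolded_missed = [4, 5, 6, 7, 9, 10]

print(rank_sum_p_less(folded_missed, unfolded_missed))
```

For the rows marked ">" in the table, the same function applies with the two arguments swapped.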

As the following figures show, the new engineered WW domains have no significant coupling according to SCA.

SCA matrices
All natural WW domains (Russ + Socolich et al.)
120 natural WW domains
120 CC WW domains
120 engineered WW domains

@@ Line 43: / Line 43: @@
 :{|
-|align="center" | <h3>Data</h3>
+|align="center" | <em>Data</em>
-|align="center" | <h3>X</h3>
+|align="center" | <em>X</em>
 |align="center" | <math>\sim</math>
 |align="center" | <math>\Pr(X(\textrm{folded}) \sim X(\textrm{unfolded})) </math>
