[[Image:hist.jpg|thumb||left|Figure 1: Density chart of the distribution of links between superfamilies for "old" folds and "young" folds, based on a fragment length of 30. The histogram also shows that the number of links per superfamily is not normally distributed. The difference between old and young folds is significant, with the unpaired Wilcoxon test assigning a p-value of 1.2e-09.]]
[[Image:wilcox.jpg|thumb||left|Figure 2: Fragment length versus W score of Wilcoxon's signed-rank test. Wilcoxon signed-rank tests were performed on paired data: the number of links each superfamily has to the group of young superfamilies and to the group of old superfamilies, normalised for the size of the age groups. Each test shows that superfamilies have significantly fewer links to young superfamilies (p-values < 2.2e-16). Since the number of compared folds is identical in each test set, the W scores can be compared directly.]]
Revision as of 08:02, 29 September 2006
Linking evolution of protein structures through fragments
Here we use a structural fragment library to investigate evolutionary links between protein folds. We show that 'older' folds have relatively more such links than 'younger' folds.
At present there is no universal understanding of how proteins can change topology during evolution, or of how such pathways can be determined in a systematic way. The ability to create links between fold topologies would have important consequences for structural classification, structure prediction and homology modelling. It has proven difficult, however, to show the evolutionary relevance of such links between topologies based on geometrical measures. Here we use our previously determined age measure for protein folds or superfamilies  to investigate the effect of structural fragments on protein structure evolution .
Results & Discussion
When comparing the number of links that young and old superfamilies make with other superfamilies, it becomes clear that the distribution for younger folds is skewed towards fewer links (Figure 1).
Similarly, we can compare the number of links each fold has with young and old folds respectively. Again we see that folds share significantly fewer links with the group of young folds. This effect becomes stronger when we use larger fragment lengths in our test set (Figure 2), possibly indicating stronger evolutionary links.
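The paired comparison described above can be sketched as follows. This is a minimal illustration, not the code used in this study: the function name and arguments are hypothetical, and the statistic returned is the sum of ranks of the positive differences, which is one common definition of the signed-rank statistic (the exact definition behind the W scores in Figure 2 is an assumption here). No p-value is computed; in practice a statistics package would be used for that.

```python
def signed_rank_statistic(links_to_young, links_to_old, n_young, n_old):
    """Sum of ranks of the positive paired differences (young minus old).

    links_to_young[i] and links_to_old[i] are the link counts of the same
    superfamily; n_young and n_old are the sizes of the two age groups.
    """
    # Normalise each count by the size of the age group it refers to.
    diffs = [y / n_young - o / n_old
             for y, o in zip(links_to_young, links_to_old)]
    # Zero differences carry no sign and are dropped.
    diffs = [d for d in diffs if d != 0]

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    return sum(r for r, d in zip(ranks, diffs) if d > 0)
```

A small statistic relative to its maximum means that most superfamilies have fewer normalised links to the young group than to the old group, which is the direction reported in Figure 2.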
Structural similarity has long been linked with evolutionary relatedness, but as a sole measure it is not sufficient to establish an evolutionary link between two proteins, because structures or topologies may also arise through convergent evolution. With these results we show that the age of a superfamily separates the abundance of fragment-based structural similarity into two different distributions.
In a generally accepted evolutionary model, new proteins are created through duplication and point mutation of structural domains. Here we show the first evidence that this might also occur on a scale below the domain level: fragments are shared more often with superfamilies that have existed for longer, which is expected in a model in which new topologies can be built through an assembly of, or multiple insertions of, fragments from existing proteins. Some care has to be taken here, as these results could also be caused by a scenario of convergent evolution towards more stable fragments. Interestingly, however, the results become stronger with longer fragment length, in which case the probability of convergence should decrease.
These results would have important implications for structure prediction, as they might explain why current 'fragment based' modelling approaches are so successful.
The fragment library generated for this study contains fragment pairs of length 10, 15, 20 and 30, with a maximum allowed gap length of 2, 3, 4 and 6 respectively. All fragments are based on pairwise comparisons between structural domains as defined by SCOP. The pairs are scored for similarity purely on structural grounds, using the coordinates of the C-alpha atoms. This avoids dependencies between the fragments and the age estimates, which are generated through fold recognition techniques based on sequence similarity.
All possible fragment pairs of the given lengths between two domains are first screened and aligned using a method similar to the prefilter used by MAMMOTH . Each fragment pair with an alignment score above a threshold is then superimposed to obtain an RMSD score for the fragment pair.
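As an illustration of the final scoring step, the sketch below computes the RMSD over paired C-alpha coordinates. It assumes the two fragments have already been aligned residue by residue and optimally superimposed; the superposition itself and the MAMMOTH-like prefilter are not shown. The function name and the coordinate representation (lists of (x, y, z) tuples) are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) C-alpha coordinates, taken in aligned order.

    The coordinates are assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("fragments must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For two fragments that differ by a uniform shift of 1 along one axis, this returns 1.0; identical fragments give 0.0.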
Age estimates for protein folds or superfamilies are generated using fold recognition of structural domains on a set of completed genomes. The occurrence patterns of such predictions are analysed with a parsimony algorithm to estimate an age for a superfamily or fold; for more details see .
The age of a fold or superfamily is a score in the range [0.0, 1.0], with 0.0 indicating a most recent common ancestor at the leaves of the species tree (youngest), and 1.0 indicating presence at the root of the species tree (oldest). Here an 'old' fold is defined as a fold with an age of 1.0, and a 'young' fold as a fold with an age < 0.5.
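The age grouping defined above can be written as a small function. This is only a sketch of the two thresholds given in the text; the function name and the label "intermediate" for folds with an age from 0.5 up to, but not including, 1.0 are assumptions introduced for illustration, since the text does not name that group.

```python
def age_class(age):
    """Assign a fold or superfamily to an age group from its age score."""
    if not 0.0 <= age <= 1.0:
        raise ValueError("age score must lie in [0.0, 1.0]")
    if age == 1.0:
        return "old"    # present at the root of the species tree
    if age < 0.5:
        return "young"
    return "intermediate"  # hypothetical label, not used in the text
```

Folds in the unnamed middle group belong to neither of the two sets compared in this study.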
Since secondary structure is not taken into account, the number of shared fragments needs to be normalised for how often a fragment occurs in general. Friedberg and Godzik (2005) used a superfamily-based normalisation to overcome this problem . We use a similar approach, although the fragment pairs in this study are based on structural similarity only, whereas Friedberg and Godzik (2005) used a combination of sequence and structural similarity.
A link between two superfamilies (I and J) is established when f(I,J) > 0.1, which is calculated as:
Here Sim(A,B) is the number of shared fragments between two sets of domains (e.g. superfamilies), and A is the set of all domains. In this study we do not consider self-similarity of superfamilies.
We show that younger folds have relatively fewer shared fragments with other folds than old protein folds do. This might indicate that evolutionary links above the superfamily or fold level could be established through such shared fragments.
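Given normalised scores f(I,J) that have already been computed, the linking rule can be sketched as below. The sketch does not implement the normalisation itself; it only applies the 0.1 threshold and excludes self-similarity. The function name, the dictionary layout and the treatment of a link as undirected (a pair counts once even if both f(I,J) and f(J,I) pass the threshold) are assumptions made for illustration.

```python
LINK_THRESHOLD = 0.1  # a link requires f(I, J) > 0.1


def count_links(scores, threshold=LINK_THRESHOLD):
    """Count links per superfamily.

    scores maps (I, J) pairs of superfamily identifiers to a precomputed
    normalised score f(I, J).
    """
    links = set()
    for (i, j), f in scores.items():
        if i == j:
            continue  # self-similarity is not considered
        if f > threshold:
            links.add(frozenset((i, j)))  # undirected: (I, J) == (J, I)

    counts = {}
    for pair in links:
        for superfamily in pair:
            counts[superfamily] = counts.get(superfamily, 0) + 1
    return counts
```

The resulting counts per superfamily are the kind of quantity whose distribution is compared between age groups in Figure 1.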