Effect of microarray data heterogeneity on regulatory gene module discovery

Author(s): Alok Mishra, Duncan Gillies
Affiliations: Imperial College, London
Contact: email: am2203@doc.ic.ac.uk
Keywords: microarrays, cluster similarity, regulatory gene module

Introduction

Microarrays allow us to study the expression of a large proportion of the genome simultaneously, and these data have been used to build models of gene regulatory networks. In some recent work (TSKS05, SFKR04), "integrative genomics", in which data from experiments on different conditions or even different organisms are merged together, has been suggested as a way of discovering these regulatory networks. One important reason is that identification of gene activity is a small-sample-size problem: the number of variables is large, and there are rarely enough data-sets to model the networks reliably. Another reason behind this approach is that many genes are believed to play similar roles in different organisms, so we may gain insights by analysing the data collectively.

The motivation behind the research presented in this paper is that, in our view, blind integrative genomics might not be the right way forward for this specific problem. It should be used only when we are prepared to lose signals that are visible in individual experiments and are concerned only with the dominant signal that suppresses the others. The global regulatory network is the sum of individual regulatory networks, and in our view the integrative approach loses information from those individual networks.

Our hypothesis is that when microarrays from different experimental conditions but the same experiment type (e.g. stress) are merged, we should be able to readily identify the stress-specific regulatory network. The clusters of co-regulated genes obtained should reinforce the local (stress-specific) regulatory network while suppressing noise. By contrast, when microarrays from several different experiment types are merged, the local regulatory networks of the individual types will be masked by the dominant ones. The individual signals should fade, and only a common denominator of genes will behave consistently, while the rest are expressed unpredictably because of the mixing of conditions. The set of genes that are over-expressed should be those most dominantly expressed across all the conditions whose datasets have been mixed.

In order to validate our hypothesis, the primary requirement is to obtain the regulatory networks from various datasets and then measure their similarity to each other. In this way we can show that progressively mixing in data of other types degrades the quality of the local regulatory networks in favour of the more dominant ones. A number of researchers have worked on the problem of finding regulatory networks; some of the most important approaches are SSR+03, BJGL+03 and TSKS04, which incorporate prior knowledge in the form of known transcription factors or DNA-binding data to guide the clustering process. The results of these works have shown that the resulting clusters of transcription factor (TF) modules, as well as of regulated gene modules, are biologically meaningful. We have used the Module Networks algorithm (SSR+03), a well-established approach that has had success in finding biologically relevant modules. For measuring the similarity among the sets of regulated gene clusters produced by this algorithm, we chose the adjusted Rand index (HA85), which has been shown to be a very stable index of partition similarity.

Previous work

Learning the structure of genetic regulatory networks has attracted a great deal of attention in recent years. A slightly dated review of these techniques can be found in dJ02. Some of the newer methods that have proved quite successful are Module Networks (SSR+03) and SAMBA (TSKS04). Unlike earlier algorithms, both use prior biological knowledge in order to arrive at clusters of similarly behaving genes. Module Networks, given a list of potential regulators and microarray expression data, uses an iterative procedure that searches for a regulation program for each module (set of genes) and then reassigns each gene to the module whose program best predicts its behaviour. SAMBA uses a diverse set of prior knowledge in the form of protein interactions, TF binding and phenotype sensitivity; all of this information is represented as nodes of a graph and the algorithm searches for statistically significant subgraphs. Recently, some researchers have focused on integrated approaches in which a large compendium of microarrays gathered from various sources is analysed; TSKS05 used data from 2000 experiments from 60 different publications to find regulatory programs.
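The following is a much-simplified sketch of this iterative structure, not the authors' implementation: as an illustrative assumption, a shallow regression tree over regulator expression stands in for a module's regulation program, and squared prediction error drives the reassignment.

<pre>
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def module_network_sketch(expr, regulator_expr, n_modules=10, n_iter=20, seed=0):
    """Toy version of the Module Networks loop: fit a 'regulation program' per
    module, then reassign each gene to the module whose program predicts it best.
    expr: genes x arrays matrix; regulator_expr: regulators x arrays matrix."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_modules, size=expr.shape[0])  # random initial modules
    X = regulator_expr.T                                   # arrays x regulators
    for _ in range(n_iter):
        programs = []
        for m in range(n_modules):
            genes = np.flatnonzero(assign == m)
            if genes.size == 0:
                programs.append(None)
                continue
            # the 'program': a shallow regression tree over the regulators that
            # predicts the mean expression profile of the module
            tree = DecisionTreeRegressor(max_depth=3).fit(X, expr[genes].mean(axis=0))
            programs.append(tree)
        # predicted profile of each module across the arrays
        preds = np.stack([p.predict(X) if p is not None
                          else np.full(X.shape[0], np.inf) for p in programs])
        # reassign every gene to the module with the lowest squared prediction error
        errors = ((expr[:, None, :] - preds[None, :, :]) ** 2).sum(axis=2)
        assign = errors.argmin(axis=1)
    return assign
</pre>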

Methodology

In order to validate our hypothesis we chose to work with two very diverse datasets. One comes from experiments studying gene expression when yeast is exposed to stress conditions. The other major class of dataset comes from the study of cell-cycle-related genes. They are very different datasets because the change in expression under stress conditions is much more drastic (both repressed and induced genes) than in cell-cycle experiments, where the conditions are optimal for growth. We want to see how such a mixture affects the final outcome: do the expression values from stress experiments, being much larger, dominate as expected?

All the data used in our analysis were taken from the Stanford Microarray Database (SMD), which hosts cDNA microarray data-sets from various experimenters. We decided to focus our study on yeast, as the regulatory mechanisms of more complex organisms are more involved and yeast has been studied extensively in recent years. We started by analysing data from individual researchers for experiments related to stress: GSK+00, referred to in this paper as DS-STRESS1 (76 microarrays), SBB04, called DS-STRESS2 (49 microarrays), and GSK+00, called DS-STRESS3 (41 microarrays). In the next stage we merged all 166 microarrays from these three sources to create and analyse the data set we call DS-STRESS. To compare these clusterings against an entirely different category, we took 93 microarray data sets from cell-cycle experiments (SSZ+98), referred to in this article as DS-CCYCLE. A further mix of both stress and cell-cycle data was named DS-STRESS-CCYCLE. Finally, we extracted all available data for yeast (1082 microarrays, not only stress), named DS-ALL, and compared the earlier results against it. In order to provide a statistical baseline for our results we also generated a random microarray dataset for all the genes by drawing random numbers from a Gaussian distribution with zero mean and unit standard deviation. This dataset was named DS-RANDOM.
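A minimal sketch of how the merged and baseline data-sets could be assembled, assuming each data-set is already a genes x arrays matrix over a shared gene list (the file names below are placeholders, not the actual SMD identifiers):

<pre>
import numpy as np

ds_stress1 = np.loadtxt("ds_stress1.tsv")   # GSK+00, 76 arrays
ds_stress2 = np.loadtxt("ds_stress2.tsv")   # SBB04,  49 arrays
ds_stress3 = np.loadtxt("ds_stress3.tsv")   # 41 arrays
ds_ccycle  = np.loadtxt("ds_ccycle.tsv")    # SSZ+98, 93 arrays

ds_stress        = np.hstack([ds_stress1, ds_stress2, ds_stress3])  # 166 arrays
ds_stress_ccycle = np.hstack([ds_stress, ds_ccycle])                # 259 arrays

# DS-RANDOM: Gaussian noise with zero mean and unit standard deviation,
# matching the dimensions of the real data
rng = np.random.default_rng(0)
ds_random = rng.normal(loc=0.0, scale=1.0, size=ds_stress.shape)
</pre>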

To analyse the data we performed standard pre-processing of the datasets. We used the [math]\displaystyle{ log_{2} }[/math] of the ratio of the mean of Channel 2 to the mean of Channel 1, and included only those features that had not been flagged as unreliable by the scanning software. For normalization we used total intensity, which rests on the (slightly naive) assumption that the average log ratio on an array should be zero. We then filtered the genes, keeping those whose log2(R/G) ratio exceeded 2 in at least one experiment; this retains only genes that have shown a significant change in expression. We did not perform any scale normalization across slides to account for different experimental conditions or different data-sets.
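A minimal sketch of this pre-processing in Python, assuming the per-channel mean intensities are already loaded as genes x arrays matrices (the variable names, and the use of the absolute log ratio for the cutoff, are our assumptions):

<pre>
import numpy as np

def preprocess(ch1_mean, ch2_mean, reliable, cutoff=2.0):
    """Sketch of the pre-processing described above.
    ch1_mean, ch2_mean: genes x arrays matrices of mean channel intensities;
    reliable: boolean mask of spots not flagged by the scanning software."""
    # log2 ratio of Channel 2 to Channel 1, dropping unreliable spots
    log_ratio = np.where(reliable, np.log2(ch2_mean / ch1_mean), np.nan)
    # total-intensity normalisation: force the average log ratio of each array to zero
    log_ratio -= np.nanmean(log_ratio, axis=0, keepdims=True)
    # keep genes whose |log2 ratio| exceeds the cutoff in at least one experiment
    keep = np.nanmax(np.abs(log_ratio), axis=1) > cutoff
    return log_ratio[keep], keep
</pre>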

We also needed a list of transcription factors on which to base the clustering. Our TFs were taken from the Yeastxtract website (cite), a curated database listing 145 TFs. Once we had the required data, we analysed it with the software package Genomica (see SSR+03), provided by the authors of the Module Networks algorithm. We chose the Module Networks algorithm because it has been shown in the literature to identify biologically meaningful clusters, as the clustering process is driven with the help of known regulators. Only a few other algorithms use prior biological knowledge (TSKS04) to guide the clustering process, and in comparison to them Module Networks has produced more significant biological findings. Every data set was analysed four times in order to validate the stability of the resulting clusters.

Since we are comparing the results across different data-sets, our goal is to check how close the resulting clusterings (on different data-sets) are to each other. This closeness was assessed using cluster similarity, which is described in detail in the next section.

Cluster similarity

Cluster similarity measures how close to each other the clusterings obtained from different data-sets are. We have used a well-established measure of cluster similarity, the adjusted Rand index, proposed by HA85. Based on an extensive empirical comparison of several such measures, Milligan and Cooper (1986) recommended this index as the measure of agreement even when comparing partitions with different numbers of clusters.

The Rand index works on the concept of pair-wise matching between the two clusterings being compared. Given a set of objects of cardinality n, [math]\displaystyle{ S = \{s_1, \ldots, s_n\} }[/math], suppose we have two clusterings C1 and C2 of these objects, [math]\displaystyle{ C1 = \{c1_1, \ldots, c1_k\} }[/math] and [math]\displaystyle{ C2 = \{c2_1, \ldots, c2_k\} }[/math], such that [math]\displaystyle{ \bigcup_{i=1}^{k}c1_{i} = S = \bigcup_{j=1}^{k}c2_{j} }[/math]. Let

[math]\displaystyle{ N_{11} }[/math] be the number of pairs of objects in the same cluster in both C1 and C2
[math]\displaystyle{ N_{00} }[/math] be the number of pairs of objects in different clusters in both C1 and C2
[math]\displaystyle{ N_{01} }[/math] be the number of pairs of objects in different clusters in C1 but same cluster in C2
[math]\displaystyle{ N_{10} }[/math] be the number of pairs of objects in the same cluster in C1 but different clusters in C2

The Rand index is simply the fraction of agreements out of the total, i.e. [math]\displaystyle{ \frac{N_{11} + N_{00}}{N_{11} + N_{00} + N_{01} + N_{10}} }[/math], and it lies between 0 and 1. When the two partitions are identical the Rand index is 1, whereas it is 0 when they have nothing in common. A problem with the Rand index is that its expected value for two random partitions is not a constant. The adjusted Rand index (HA85) corrects for this by assuming the general form [math]\displaystyle{ \frac{\text{index}-\text{expected index}}{\text{maximum index}-\text{expected index}} }[/math]. Its maximum value is 1 and its expected value for random clusterings is 0. We use this index in all our analyses, as it was found to be stable even when the numbers of clusters differ and it performed better than other indices (HA85).
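A self-contained implementation of the adjusted Rand index from two label vectors, computed from the contingency table in the standard way (the function and variable names are ours):

<pre>
import numpy as np

def comb2(x):
    """Number of unordered pairs that can be formed from x objects."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels1, labels2):
    """Adjusted Rand index (HA85) between two partitions of the same objects,
    each given as a vector of cluster labels."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    c1, c2 = np.unique(labels1), np.unique(labels2)
    # contingency table: n_ij = number of objects in cluster i of C1 and j of C2
    contingency = np.array([[np.sum((labels1 == i) & (labels2 == j)) for j in c2]
                            for i in c1])
    sum_ij = comb2(contingency).sum()             # pairs together in both clusterings
    sum_a = comb2(contingency.sum(axis=1)).sum()  # pairs together in C1
    sum_b = comb2(contingency.sum(axis=0)).sum()  # pairs together in C2
    expected = sum_a * sum_b / comb2(labels1.size)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
</pre>

For two identical labelings the function returns 1, and for independent random labelings it is close to 0; sklearn.metrics.adjusted_rand_score computes the same quantity and can serve as a cross-check.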

Results

Table 1: Rand indices for repeated runs on the same data sets
DS-STRESS1
R1 R2 R3 R4
R1 1.0000 0.4670 0.5401 0.4654
R2 0.0000 1.0000 0.4905 0.4736
R3 0.0000 0.0000 1.0000 0.5041
R4 0.0000 0.0000 0.0000 1.0000
DS-STRESS2
R1 R2 R3 R4
R1 1.0000 0.3045 0.2985 0.3505
R2 0.0000 1.0000 0.3125 0.3469
R3 0.0000 0.0000 1.0000 0.3411
R4 0.0000 0.0000 0.0000 1.0000
DS-STRESS3
R1 R2 R3 R4
R1 1.0000 0.3861 0.4051 0.3833
R2 0.0000 1.0000 0.4538 0.3888
R3 0.0000 0.0000 1.0000 0.3432
R4 0.0000 0.0000 0.0000 1.0000
DS-STRESS
R1 R2 R3 R4
R1 1.0000 0.4001 0.3887 0.4105
R2 0.0000 1.0000 0.3744 0.3981
R3 0.0000 0.0000 1.0000 0.4165
R4 0.0000 0.0000 0.0000 1.0000



Table 2: Cluster variation among stress datasets
DS-STRESS1 DS-STRESS2 DS-STRESS3 DS-STRESS
DS-STRESS1 0.490 0.1747 0.2417 0.3425
DS-STRESS2 0.1747 0.325 0.1155 0.1060
DS-STRESS3 0.2417 0.1155 0.39 0.2470



Table 3: Comparison of clustering of individual stress datasets versus progressively mixed datasets
DS-STRESS DS-CCYCLE DS-STRESS-CCYCLE DS-ALL DS-RANDOM
DS-STRESS1 0.3425 0.0981 0.3378 0.3434 0.0037
DS-STRESS2 0.1060 0.0252 0.0920 0.0759 0.0022
DS-STRESS3 0.2470 0.0925 0.2534 0.2325 0.0023



Table 4: Comparison of cell-cycle to mixed data clustering
DS-STRESS DS-CCYCLE DS-STRESS-CCYCLE DS-ALL DS-RANDOM
DS-CCYCLE 0.0663 0.2152 0.0812 0.0614 0.00068
DS-STRESS 0.3979 0.0663 0.3067 0.2244 0.0013


To reiterate, the clustering algorithm groups together functionally similar sets of genes. Using the adjusted Rand index, we measure the similarity between the resulting sets of clusters. In our interpretation, a higher index value suggests that two data-sets are functionally similar to each other, while a lower value suggests dissimilarity. The index values range from 0 (completely dissimilar) to 1 (fully similar).

In order to validate that the algorithm produces consistent results, we ran it several times over the same data-sets and compared the resulting clusterings. For each stress data-set we performed four clustering runs and measured the similarities among them. The results in Table 1 show that different runs on the same data-set produce similar clusterings, as the variance among the indices is small. The results are presented as a symmetric matrix, with each run compared against every other run. This is what one would expect of any good clustering algorithm that behaves predictably. An interesting side effect of this analysis is that we can comment not only on the algorithm but also on the quality of the data sets. We see that the similarity values are higher for DS-STRESS1, DS-STRESS3 and DS-STRESS than for DS-STRESS2. This may indicate that DS-STRESS2 contained too much noise, so that the clusters could not be identified consistently across runs, resulting in lower similarity values.
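As an illustration of how a matrix such as Table 1 can be produced, assuming the module assignment of each run has been exported as a label vector over the genes (the helper name is ours; we show the full symmetric matrix rather than only the upper triangle):

<pre>
import numpy as np
from sklearn.metrics import adjusted_rand_score

def run_similarity_matrix(run_labels):
    """Pairwise adjusted Rand indices among repeated clustering runs
    (run_labels: list of label vectors over the same genes), as in Table 1."""
    k = len(run_labels)
    sim = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            sim[i, j] = sim[j, i] = adjusted_rand_score(run_labels[i], run_labels[j])
    return sim
</pre>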

After validating the algorithm's consistency, we conducted further analysis. Comparing the clusters obtained from different data sets of the same class (stress), as shown in Table 2, gives some interesting results. Since the DS-STRESS1 and DS-STRESS3 clusterings show higher similarity to each other than either does to DS-STRESS2, we can say that they are more similar to each other than to DS-STRESS2. DS-STRESS1 shows much higher similarity to DS-STRESS (the mix of all three datasets) than to either of the other individual datasets. One probable reason is that DS-STRESS1 contributes the largest number of microarrays and is therefore proportionally over-represented in the mix. DS-STRESS2 is not very similar to either DS-STRESS1 or DS-STRESS3, but its similarity to DS-STRESS is even lower, which is strange, since one might expect this value to lie above its similarity with DS-STRESS3 (its minimum). One possible explanation is that the relationship between the similarity index and the amount of dissimilar data is not linear.

These results indicate that similar data keep the similarity among clusters high, while dissimilar data bring it down. This reinforces our hypothesis: if datasets are similar then the similarity index should be higher, as the resulting functional clusters will be similar, but as we mix in different datasets this similarity decreases.

So far we have analysed very similar data-sets, i.e. all taken under stress conditions. For further validation we continued with more analysis. We took a data-set that is entirely unrelated to the stress experiments, taken from experiments on the cell cycle. To verify that it is indeed very different, we calculated the similarity indices against all three of our earlier datasets; as shown in Table 3, all three stress data-sets have very low similarity with it. As described in the methodology section, the mixture of the stress and cell-cycle data is called DS-STRESS-CCYCLE. Our next step was to compare the similarity of the individual stress datasets with DS-STRESS-CCYCLE and DS-ALL, relative to DS-STRESS. The objective is to see how this mixing of diverse data-sets affects the similarity among them. The results in Table 3 show that the similarity values in the DS-STRESS and DS-STRESS-CCYCLE columns are not very different, which indicates that the data with the much larger expression changes (stress) dominated the final clusters when mixed with data whose expression changes were smaller (cell cycle). Another interesting observation is that, even though many other types of data-sets were mixed into DS-ALL, the results are not significantly different from those for DS-STRESS. This leads us to conclude that the stress data-set is very dominant compared to the others. As expected, the similarity values for the random data-set are near zero in all cases.

We also made a comparison at the level of the combined data-sets rather than the individual data sets used earlier. Here we compared the cell-cycle and stress data-sets with each other, with DS-ALL and with DS-RANDOM. The results in Table 4 generalise and substantiate our earlier observations. DS-CCYCLE's similarity to DS-STRESS is very low, and since DS-STRESS has dominant expression values, the similarity of DS-CCYCLE to the mixed sets (stress-cellcycle and all) is also very low. DS-STRESS, on the other hand, is most similar to itself, with its similarity gradually declining as mixing progressively attenuates its dominance. Again, the similarity values against the random data are negligible.

Conclusion and Future Work

As argued in OR02, all cellular regulatory mechanisms are very local in nature, and a blind integrative approach is likely to prove futile if we are interested in meaningful results. We have tried to establish this from a different point of view: as more diverse data-sets are merged, the similarity to the individual data-sets (which carry more local patterns) is reduced, and the dominant signals shadow the weaker ones. One source of error in our results is that our similarity index is based on pairwise matches of genes in each set. It is a good measure for cluster comparison, but a better approach would be to calculate the functional similarity of the clusters. Since the gene ontology databases provide extra information about each gene, we could use it to calculate how functionally similar the resulting clusters are. As no such index exists, we are working on one and intend to extend our work by using it in the future.

Although we are not proponents of blind integration of diverse microarray data-sets, we do believe that an integrated approach of analysing different families of data to learn specific regulatory models is the right approach. We now have four major families of data: microarray expression, ChIP-chip DNA binding, sequence and protein-protein interaction data. We can integrate these to arrive at a better picture of the regulatory networks, as each complements the partial picture painted by the others. Kernel methods (LDC+04, LDBC+04) give us a principled way to merge these seemingly disparate data and then to use kernelised versions of algorithms (e.g. clustering) for further analysis.
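As a simple illustration of the fusion idea, the sketch below combines normalised kernel matrices, one per data family, with fixed weights; LDBC+04 instead learn the weights by semidefinite programming. The function name and the fixed-weight scheme are our assumptions, not the cited method.

<pre>
import numpy as np

def fuse_kernels(kernels, weights=None):
    """(Weighted) sum of normalised kernel matrices, one per data family
    (expression, ChIP-chip, sequence, protein interactions)."""
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    fused = np.zeros_like(kernels[0], dtype=float)
    for K, w in zip(kernels, weights):
        d = np.sqrt(np.diag(K))
        fused += w * (K / np.outer(d, d))   # normalise so that K(i, i) = 1
    return fused
</pre>

The fused kernel can then be passed to a kernelised clustering algorithm in place of any single-source kernel.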

There is huge scope for further work in this field. If we can find a way to mix datasets so that the final mixture acts as a background model, then we can remove the genes that are always enriched in the background from other analyses, since we know they are always enriched. In this way more significant genes can be studied in individual data-sets without having to deal with information that is not specific to those experiments. Currently, we can see that experiments with wider fluctuations dominate the final mixture. A first step towards a good background model might be scale normalization of the data across experiments, which would bring the expression values of different types of experiments onto the same level. Once we have such a background model, we can use it to stabilise the covariance matrix, shrinking it towards the background model when we do not have enough data in an individual data-set to estimate the covariance matrix reliably because of singularity. Pooled covariance matrices are used in Graphical Gaussian Model analysis to recover genetic regulatory networks.
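A minimal sketch of the covariance-shrinkage idea (the function name and the fixed shrinkage intensity are ours; in practice the intensity would be chosen analytically or by cross-validation):

<pre>
import numpy as np

def shrink_covariance(sample_cov, background_cov, lam=0.5):
    """Shrink a poorly conditioned sample covariance matrix towards a
    background model; lam is the shrinkage intensity."""
    return lam * background_cov + (1.0 - lam) * sample_cov

# the shrunk matrix is better conditioned and can be inverted to obtain the
# partial correlations used in Graphical Gaussian Model analysis, e.g.
# precision = np.linalg.inv(shrink_covariance(S_sample, S_background))
</pre>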

References

BJGL+03
Ziv Bar-Joseph, Georg K Gerber, Tong Ihn Lee, Nicola J Rinaldi, Jane Y Yoo, François Robert, D Benjamin Gordon, Ernest Fraenkel, Tommi S Jaakkola, Richard A Young, and David K Gifford.
Computational discovery of gene modules and regulatory networks.
Nature Biotechnology, 21(11):1337-1342, 2003.
dJ02
H. de Jong.
Modeling and simulation of genetic regulatory systems: a literature review.
J Comput Biol, 9(1):67-103, 2002.
GSK+00
A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown.
Genomic expression programs in the response of yeast cells to environmental changes.
Mol Biol Cell, 11(12):4241-4257, December 2000.
HA85
L. Hubert and P. Arabie.
Comparing partitions.
Journal of Classification, 1985.
LDBC+04
G. R. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble.
A statistical framework for genomic data fusion.
Bioinformatics, 20(16):2626-2635, November 2004.
LDC+04
G. R. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble.
Kernel-based data fusion and its application to protein function prediction in yeast.
pages 300-311, January 2004.
OR02
George Orphanides and Danny Reinberg.
A unified theory of gene expression.
Cell, 108(4):439-451, February 2002.
SBB04
Alok J. Saldanha, Matthew J. Brauer, and David Botstein.
Nutritional Homeostasis in Batch and Steady-State Culture of Yeast.
Mol. Biol. Cell, 15(9):4089-4104, 2004.
SFKR04
Eran Segal, Nir Friedman, Daphne Koller, and Aviv Regev.
A module map showing conditional activity of expression modules in cancer.
Nat Genet, 36(10):1090-8, Oct 2004.
SSR+03
Eran Segal, Michael Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman.
Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data.
Nature Genetics, 34(2):166-176, 2003.
SSZ+98
PT Spellman, G Sherlock, MQ Zhang, VR Iyer, K Anders, MB Eisen, PO Brown, D Botstein, and B Futcher.
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol Biol Cell, 9(12):3273-97, Dec 1998.
TSKS04
Amos Tanay, Roded Sharan, Martin Kupiec, and Ron Shamir.
Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data.
PNAS, 101(9):2981-2986, 2004.
TSKS05
Amos Tanay, Israel Steinfeld, Martin Kupiec, and Ron Shamir.
Integrative analysis of genome-wide experiments in the context of a large high-throughput data compendium.
Molecular Systems Biology, 1(1):msb4100005-E1-msb4100005-E10, March 2005.