User:Matthew Whiteside/Notebook/Malaria Microarray/2009/05/29
Project name | Main project page Previous entry Next entry | |||||
Task 4.1: Prepare gene expression data matrix for meta-analysisDatasetsThere are 5 experiments that i will be using:
Normalizing DataNormalized data was provided from GEO for all datasets (GDSXXXX). Datasets 1-3 were GEO datasets. Dataset 4 was from ArrayExpress and was also normalized. Dataset 5 was not normalized. Karsten helped with the normalization and data checks. Data quality checks were already performed and provided on ArrayWiki (http://arraywiki.bme.gatech.edu/index.php/GSE9861). The first array showed some artifacts & low quality score (88.63%) - these were cleaned: the spots with high variance have been replaced with the median value of this probe from other chips in the dataset (see here for more details: http://arraywiki.bme.gatech.edu/index.php/BioPNG_format). Karsten generated two normalized versions of the data: I have placed files with normalized values into /tmp on koch: khokamp@koch:/tmp> l GSE9861_* -rw-r--r-- 1 khokamp wg-users 3257047 2009-05-13 13:56 GSE9861_clean_gcrma.txt -rw-r--r-- 1 khokamp wg-users 3256320 2009-05-13 13:56 GSE9861_gcrma.txt But if you are going to analyse them with limma, you could easily re-do this yourself in R: 1. Change into the directory containing the (compressed) CEL files, e.g. cd /tmp/GSE9861_cleanCEL 2. start R 3. read data and normalize: library(gcrma) data = ReadAffy() norm_data = gcrma(data) norm_data is an expressionSet that should be readily usable with limma The last step that I applied for extracting the data was write.exprs(norm_data, file='GSE9861_clean_gcrma.txt') It might be best to run both the original and the clean data set through limma to see if the resulting gene lists differ greatly. I found no differentially-expressed genes when using the cleaned data. I will proceed with the original data - but may decide to drop this dataset altogether. See scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R for scripts that were used to obtain the data.
Data preparationBroadly speaking there are two types of meta-analysis techniques that i plan to use:
There are a number of arrays (1) and contrasts (2) i could use from each dataset. My first attempt where i considered a large number of contrasts (bioc_workflow_h*.R) can be found in the ~.../de_genes directory. This data contained a few problems (dataset 4 was labelled incorrectly). So always reference the ~.../de_genes/2 directory, even though its less extensive. Probe AnnotationRefer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. Basic strategy for probe annotation:
Data ChecksRefer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. I produced a heatmap showing how each array clustered within a dataset. I used pearson correlation of DE genes that were found across multiple arrays (between 100-1000 genes) and hierarchial clustering. I checked that arrays of uninfected and infected clustered together. Only for dataset 1.1 & 1.2, did i find any problems - i removed the arrays that did not cluster as expected (see h1.*.before_arrays_removed.heatmap.png for the before version). I also generated volcano plots to visualize the microarray data for particular contrasts of interest. These files are in ~/data/projects/b_malaria/de_genes/2 ContrastsI considered the following contrasts and output files are named as follows: Microarray datasets & associated filenames ------------------------------------------ Microarray Set: h1 PBMC Study, pmid:16988231 Subsets: h1.1: US volunteer uninfected vs. US volunteer infected h1.2 US volunteer uninfected vs. African volunteer infected h1.3 African volunteer treated vs. African volunteer infected h1.4 US volunteer infected vs. African volunteer infected Microarray Set: h2 Whole Blood Study, pmid:16738667 Subsets: h2.1: Gabonese children infected w/ uncomplicated malaria vs. uninfected Gabonese children h2.2: Gabonese children infected w/ complicated malaria vs. uninfected Gabonese children h2.3 Complicated malaria + uncomplicated malaria vs. uninfected Microarray Set: h3 Placenta Study, pmid:17579077 Subsets: h3.1: Chronic, inflamed malaria infected Placenta vs. healthy control placenta Microarray Set: h4 Plasmodium falciparum infected RBC (pRBC) & HUVEC endothelial study, pmid:17383656 Subsets (HUVEC monolayers incubated with): h4.1: Plasmodium falciparum infected Red Blood Cells (PRBC) vs. Red Blood Cells (RBC) h4.2: PRBC + Tumor Necrosis Factor (PRBCTNF) vs. RBC + TNF (RBCTNF) h4.3 PRBCTNF + PRBC vs. RBCTNF + RBC Microarray Set: h5 Plasmodium falciparum infected RBC (pRBC) & brain endothelial study, pmid:16714553 (original version not the 'clean'ed version with artifacts removed) Subsets: h5.1: high-binding PRBC + low-binding PRBC vs. RBC control + Medium control I decided to go with the following contrasts: 1.1, 2.3, 3.1, 4.1, 5.1. Summarized Expression MatrixI used LIMMA to compute p-values and produce a summarized logFC value for a particular contrast. A inter-study comparison of this summarized logFC values was also performed (see data in ~/.../meta_analysis/sample_selection/2/ - parent directory contains some false data but is more extensive and therefore was left in). A heatmap of the hierarchical clustering of the pearson correlation was performed for all contrasts considered (*all*) and for the final subset (*final*) of contrasts that was decided upon to go forward with. Overlapping DE genes for different combinations of contrasts in the *all* and *final* groups was also computed. The script ~/.../meta_analysis/de_gene_set_workflow.R was used to compute these figures. Some notes on these figures. The heatmaps show that no two study are tightly clustered (approx equi-distant) and interestingly, the cell-line studies did not separate from the in vivo tissue studies. Dataset 5 has the smallest overlap with the other studies, in terms of DE genes identified. Only 2 genes were common to all studies (as being DE). Array-Level Expression MatrixThe arrays for a particular contrast were also output as a large combined matrix. There were common 12769 Entrez Genes that had recorded expression values in all studies (The contrasts 1.1, 2.3, 3.1, 4.1, 5.1).
|