User:Matthew Whiteside/Notebook/Malaria Microarray/2009/05/29

From OpenWetWare
Jump to navigationJump to search
Project name Main project page
Previous entry      Next entry

Task 4.1: Prepare gene expression data matrix for meta-analysis

Datasets

There are 5 experiments that i will be using:

  1. Ockenhouse CF, Hu WC, Kester KE, Cummings JF, Stewart A, Heppner DG, Jedlicka AE, Scott AL, Wolfe ND, Vahey M, and Burke DS. Common and divergent immune response signaling pathways discovered in peripheral blood mononuclear cell gene expression patterns in presymptomatic and clinically apparent malaria. Infect Immun. 2006 Oct;74(10):5561-73. DOI:10.1128/IAI.00408-06 | PubMed ID:16988231 | HubMed [ocken2006]

    PBMC study from US & African volunteers

  2. Boldt AB, Luty A, Grobusch MP, Dietz K, Dzeing A, Kombila M, Kremsner PG, and Kun JF. Association of a new mannose-binding lectin variant with severe malaria in Gabonese children. Genes Immun. 2006 Jul;7(5):393-400. DOI:10.1038/sj.gene.6364312 | PubMed ID:16738667 | HubMed [boldt2006]

    African children whole blood study

  3. Muehlenbachs A, Fried M, Lachowitzer J, Mutabingwa TK, and Duffy PE. Genome-wide expression analysis of placental malaria reveals features of lymphoid neogenesis during chronic infection. J Immunol. 2007 Jul 1;179(1):557-65. DOI:10.4049/jimmunol.179.1.557 | PubMed ID:17579077 | HubMed [muehl2007]

    Placental malaria study

  4. Chakravorty SJ, Carret C, Nash GB, Ivens A, Szestak T, and Craig AG. Altered phenotype and gene transcription in endothelial cells, induced by Plasmodium falciparum-infected red blood cells: pathogenic or protective?. Int J Parasitol. 2007 Jul;37(8-9):975-87. DOI:10.1016/j.ijpara.2007.02.006 | PubMed ID:17383656 | HubMed [chakra2007]

    Human Umbilical Endothelial cell-line study

  5. Tripathi AK, Sullivan DJ, and Stins MF. Plasmodium falciparum-infected erythrocytes increase intercellular adhesion molecule 1 expression on brain endothelium through NF-kappaB. Infect Immun. 2006 Jun;74(6):3262-70. DOI:10.1128/IAI.01625-05 | PubMed ID:16714553 | HubMed [tripath2006]

    Brain Endothelial cell-line study

All Medline abstracts: PubMed | HubMed

Normalizing Data

Normalized data was provided from GEO for all datasets (GDSXXXX). Datasets 1-3 were GEO datasets. Dataset 4 was from ArrayExpress and was also normalized. Dataset 5 was not normalized. Karsten helped with the normalization and data checks. Data quality checks were already performed and provided on ArrayWiki (http://arraywiki.bme.gatech.edu/index.php/GSE9861). The first array showed some artifacts & low quality score (88.63%) - these were cleaned: the spots with high variance have been replaced with the median value of this probe from other chips in the dataset (see here for more details: http://arraywiki.bme.gatech.edu/index.php/BioPNG_format). Karsten generated two normalized versions of the data:

I have placed files with normalized values into /tmp on koch:
khokamp@koch:/tmp> l GSE9861_*
-rw-r--r-- 1 khokamp wg-users 3257047 2009-05-13 13:56 GSE9861_clean_gcrma.txt
-rw-r--r-- 1 khokamp wg-users 3256320 2009-05-13 13:56 GSE9861_gcrma.txt
But if you are going to analyse them with limma, you could easily re-do 
this yourself in R:
1. Change into the directory containing the (compressed) CEL files, e.g. 
cd /tmp/GSE9861_cleanCEL
2. start R
3. read data and normalize:
library(gcrma)
data = ReadAffy()
norm_data = gcrma(data)
norm_data is an expressionSet that should be readily usable with limma
The last step that I applied for extracting the data was
write.exprs(norm_data, file='GSE9861_clean_gcrma.txt')
It might be best to run both the original and the clean data set through 
limma to see if the resulting gene lists differ greatly.

I found no differentially-expressed genes when using the cleaned data. I will proceed with the original data - but may decide to drop this dataset altogether.

See scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R for scripts that were used to obtain the data.


Data preparation

Broadly speaking there are two types of meta-analysis techniques that i plan to use:

  1. accounts for variation and sample size (MAID). This would use array-level data.
  2. does not account for sample size (RankProd). This would use single summary expression column / study (so studies with more arrays are not more represented).

There are a number of arrays (1) and contrasts (2) i could use from each dataset. My first attempt where i considered a large number of contrasts (bioc_workflow_h*.R) can be found in the ~.../de_genes directory. This data contained a few problems (dataset 4 was labelled incorrectly). So always reference the ~.../de_genes/2 directory, even though its less extensive.

Probe Annotation

Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. Basic strategy for probe annotation:

  • I used the annotation packages from Bioconductor to load the probe to gene mappings for each affy platform (bioconductor version 2.4) to create a uniform source (geo also provides this info, but not all datasets are in geo)
  • I mapped affy probeset IDs to NCBI Unigene IDs (Entrez Genes)
  • I removed probes that mapped to multiple genes (none actually - affy probes are built off of Unigenes i believe)
  • I removed probes that mapped to no genes.
  • I selected the probe that had the smallest p-value from the LIMMA analysis, when there were multiple probes that mapped to the same gene. (this was the smallest in any of the contrasts that i considered for that dataset)

Data Checks

Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. I produced a heatmap showing how each array clustered within a dataset. I used pearson correlation of DE genes that were found across multiple arrays (between 100-1000 genes) and hierarchial clustering. I checked that arrays of uninfected and infected clustered together. Only for dataset 1.1 & 1.2, did i find any problems - i removed the arrays that did not cluster as expected (see h1.*.before_arrays_removed.heatmap.png for the before version).

I also generated volcano plots to visualize the microarray data for particular contrasts of interest. These files are in ~/data/projects/b_malaria/de_genes/2

Contrasts

I considered the following contrasts and output files are named as follows:

Microarray datasets & associated filenames
------------------------------------------
Microarray Set: h1
 PBMC Study, pmid:16988231
Subsets:
h1.1:
 US volunteer uninfected vs. US volunteer infected
h1.2
 US volunteer uninfected vs. African volunteer infected
h1.3
 African volunteer treated vs. African volunteer infected
h1.4
 US volunteer infected vs. African volunteer infected

Microarray Set: h2
 Whole Blood Study, pmid:16738667
Subsets:
h2.1:
 Gabonese children infected w/ uncomplicated malaria vs. uninfected Gabonese children
h2.2:
 Gabonese children infected w/ complicated malaria vs. uninfected Gabonese children
h2.3
 Complicated malaria + uncomplicated malaria vs. uninfected

Microarray Set: h3
 Placenta Study, pmid:17579077
Subsets:
h3.1:
 Chronic, inflamed malaria infected Placenta vs. healthy control placenta

Microarray Set: h4
 Plasmodium falciparum infected RBC (pRBC) & HUVEC endothelial study, pmid:17383656
Subsets (HUVEC monolayers incubated with):
h4.1:
 Plasmodium falciparum infected Red Blood Cells (PRBC) vs. Red Blood Cells (RBC)
h4.2:
 PRBC + Tumor Necrosis Factor (PRBCTNF) vs. RBC + TNF (RBCTNF)
h4.3
 PRBCTNF + PRBC vs. RBCTNF + RBC

Microarray Set: h5
 Plasmodium falciparum infected RBC (pRBC) & brain endothelial study, pmid:16714553
 (original version not the 'clean'ed version with artifacts removed)
 Subsets:
h5.1:
 high-binding PRBC + low-binding PRBC vs. RBC control + Medium control

I decided to go with the following contrasts: 1.1, 2.3, 3.1, 4.1, 5.1.

Summarized Expression Matrix

I used LIMMA to compute p-values and produce a summarized logFC value for a particular contrast. A inter-study comparison of this summarized logFC values was also performed (see data in ~/.../meta_analysis/sample_selection/2/ - parent directory contains some false data but is more extensive and therefore was left in). A heatmap of the hierarchical clustering of the pearson correlation was performed for all contrasts considered (*all*) and for the final subset (*final*) of contrasts that was decided upon to go forward with. Overlapping DE genes for different combinations of contrasts in the *all* and *final* groups was also computed. The script ~/.../meta_analysis/de_gene_set_workflow.R was used to compute these figures.

Some notes on these figures. The heatmaps show that no two study are tightly clustered (approx equi-distant) and interestingly, the cell-line studies did not separate from the in vivo tissue studies. Dataset 5 has the smallest overlap with the other studies, in terms of DE genes identified. Only 2 genes were common to all studies (as being DE).

Array-Level Expression Matrix

The arrays for a particular contrast were also output as a large combined matrix. There were common 12769 Entrez Genes that had recorded expression values in all studies (The contrasts 1.1, 2.3, 3.1, 4.1, 5.1).