Dahlquist:Microarray Data Processing in R: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Alternate Way to Filter GCAT Chips: Indented rownames and colnames lines of code so that the reader knows that they are lines of code.)
(deleted a double and after and in first bullet point)
 
(110 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Template:Dahlquist}}
{{Template:Dahlquist}}


==Normalizing the Data==
The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).
* The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 ([http://cran.r-project.org/bin/windows/base/old/3.1.0/ link to download site]) and version 3.20.1 of the limma package ([[Media:Limma_3.20.1.zip | direct link to download zipped file]]) on the Windows 7 platform. 
** Note that using other versions of R or the limma package might give different results.
** Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10<sup>-13</sup> or 10<sup>-14</sup> decimal place.  The Dahlquist Lab is standardizing on using the 64-bit version of R.
* To install R for the first time, download and run the installer from the link above, accepting the default installation.
* To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program.  If you accept the default location, that will be C:\Program Files\R\R-3.1.0\library.


First, target files must be created to input all of the GPR files into R. In order to do so, open up Microsoft Excel. The first column should be labeled "FileName", consisting of all the GPR file names. This should be followed by a column labeled "Header" that has the column names for the GPR files, then "Strain", "TimePoint", "Flask", and "DyeSwap" for each GPR. Make sure they match up to their respective GPR files in the excel sheet and save this as a CSV file. Repeat this step for the GCAT chips, placing the top nine chips in the row 2-10 and the bottom nine in rows 11-19. You will also need to load a CSV file containing the gene IDs into R for both the Ontario and GCAT chips. This can be accomplished by running a single GPR file through R until you duplicate the spots. This is where R alphabetizes the IDs and their corresponding spots with them, this can then be saved to a table and copied to a new CSV file. This file can be loaded into R and used to add the gene IDs to the data. Use these lines of code:


Into the R Console window, type in "library(limma)".
==Input Files for R==


*Set the directory (File > Change dir...)to the folder containing the Ontario GPR files. Load the target file into R and read the first GPR file so that R can Loess normalize it.  
The normalization protocol was written to normalize two types of chips:
Targets<-read.csv("Targets.csv",sep=",")
* "GCAT" chips:  containing 70-mer oligonucleotides printed on epoxy chips obtained from the [http://www.bio.davidson.edu/GCAT/GCATchip.html Genome Consortium for Active Teaching] and produced by Washington University, St. Louis (Fall 2005 chips)
f<-function(x) as.numeric(x$Flags > -99)
* "Ontario" chips:  the Yeast 6.4K Array containing full-length PCR products obtained from the [http://www.microarrays.ca/ UHN Microarray Centre] in Toronto, Ontario, Canada.
RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)
 
For each chip that you want to normalize in R, you must import the .gpr file created for that chip after scanning it with GenePix Pro Software. Make sure that all of the .gpr files you will need for normalization are in one folder, along with the targets file(s) described in the section below.  


*Loess Normalize the Ontario data.
===Create a targets file with information about the .gpr files===
MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")


*Tell R how many rows the target matrix is going to have.
One targets file must be created for the Ontario chips and a separate targets file must be created for the GCAT chips. To create the targets file for the Ontario chips, create the following spreadsheet in Excel, where each row corresponds to a .gpr file:
M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
n1<-length(M1)
*Tell R how many columns the target matrix is going to have.
n0<-length(MA$M[1,])
*Create the target matrix
MM<-matrix(nrow=n1,ncol=n0)


*Average the duplicate spots so that only unique spots remain.
*The first column (Column A, labeled "FileName") contains the file names of all the .gpr files. Make sure that you include the ".gpr" extension at the end of each file name.
MM<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
*The second column (Column B, labeled "Header") describes the strain, the time point, and flask from which the data represented in the .gpr file was taken during the cold shock experiment. Create this designation for each .gpr file in the form <strain>_LogFC_t<time point>-<flask number>. For example, wt_LogFC_t15-2, where wt is the wild type (note that future analysis of the data in MATLAB will be case-sensitive, so be careful when choosing appropriate headers).
*The third column (Column C, labeled "Strain") contains the name of the strain to which the .gpr file corresponds (match this to the designation you used in Column B).
*The fourth column (Column D, labeled "TimePoint") contains the experimental time point to which the .gpr file corresponds.
*The fifth column (Column E, labeled "Flask") contains the flask number to which the .gpr file corresponds, assuming that replicates were performed for each time point for each strain.
*The sixth column (Column F, labeled "Dyeswap") indicates the orientation of the dyes on the chip. If the control time point (t0) was  labeled with Cy3 dye and the other time points where labeled with Cy5 dye on a particular chip, then put a "1" in this column in the row corresponding to the chip. If the orientation of the dyes was reversed for a chip, then put a "-1" in this column in the row corresponding to that chip.
*The seventh column (Column G, labeled "MasterList") orders the chips by strain (wild type strain first and the deletion strains afterwards), then time point, and then flask. In this column, put a "1" in the row corresponding to the chip from the wild type, the first time point, and the first flask of that time point. Then, put a "2" in the row corresponding to the chip from the wild type strain, the first time point, and the second flask of that time point. Continue to number the chips in this order until all the chips have been numbered.


*Write a table to your directory so you can save the names for your Gene IDs target file.
Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the Ontario chips.
write.table(MM,"ONT_Index_ID.csv",sep=",",row.names=TRUE,append=FALSE)
* The original file used for this analysis is [[Media:Ontario_Targets.csv‎ | Ontario_Targets.csv]], which contains data for the wild type and deletion strains for CIN5, GLN3, HMO1, and ZAP1.
* An updated file is [[Media:Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv | Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv]] which contains the above data, plus data from the SWI4 deletion strain and ''S. paradoxus''.


Open the CSV file you just created. Relabel cell A1 as "Gene ID" and save the file.
The targets file for the GCAT chips should contain the aforementioned columns in the order in which they were mentioned. However, there should be a column that precedes this aforementioned set of columns (Column A,labeled "Location"), which designates which half of the GCAT chip was scanned to create the .gpr file. The rest of the columns will be shifted to the right.  If the top half was scanned for a particular chip, then put "Top" in this column in the row corresponding to that chip. If the bottom half was scanned for a particular chip, then put "Bottom" in this column in the row corresponding to that chip. Make sure that the sequence of GCAT chips in the MasterList of .gpr files corresponding to the top of the chips is the same as the sequence of GCAT chips in the list of the .gpr files corresponding to the bottom of the chips.  This is necessary because the .gal file for the GCAT chips only corresponded to one-half of the chip; the entire genome was printed twice in two large blocks on the top and bottom half of the chip.


For the GCAT chips use these lines of code:
Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the GCAT chips.  The file used for this analysis is [[Media:GCAT_Targets.csv | GCAT_Targets.csv]].


*Set the directory (File > Change dir...)to the folder containing the GCAT GPR files. Read the GCAT target file into R.
==Data Normalization==
targets<-read.csv("GCAT_Targets.csv",sep=",")
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f)
*Loess normalize the GCAT GPR files
MAG<-normalizeWithinArrays(RT,method="loess",bc.method="normexp")


*Tell R how many rows the target matrix will have.
Microarray data for ''S. cerevisiae'' wild type and mutant strains was initially collected using GCAT chips before switching to Ontario chips. The within array normalization procedure for data from the GCAT chips has some differences in comparison to the procedure for data from the Ontario chips. In contrast, microarray data for wild type ''S. paradoxus'' has been collected using Ontario chips only.
R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
r1<-length(R1)
*Tell R how many columns the target matrix will have.
r0<-length(MAG$M[1,])
*Create the target matrix
RR<-matrix(nrow=r1,ncol=r0)


*Average all duplicate spots so that only unique spots remain.
To normalize data from both the Ontario and GCAT chips, begin with the section [[#Within Array Normalization for the Ontario Chips|"Within Array Normalization for the Ontario Chips"]]. Begin with the same section to normalize data collected with Ontario chips only.
RR<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)


*Write this to a table and save it as a CSV file for later use.
===Within Array Normalization for the Ontario Chips===
write.table(RR,"GCAT_ID.csv",sep=",",row.names=TRUE,append=FALSE)


Open the CSV file you just created. Relabel cell A1 as "Gene ID" and save the file.
* Launch R 3.1.0.
* Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips.
* Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method.  Hit enter after every line of code.
** Alternately, you can run a script containing all of these commands.
** Download the zipped file [[Media:Ontario Chip Within-Array Normalization corrected 20140721.zip | Ontario Chip Within-Array Normalization corrected 20140721.zip]] and unzip it to the folder containing the [[Media:Ontario_Targets.csv‎ | Ontario_Targets.csv]] file (or [[Media:Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv | Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv]]) and the GPR files corresponding to the Ontario chips.
** In R, select the menu item File > Source R code..., and select the Ontario Chip Within-Array Normalization corrected 20140721.R script.
** You will be prompted by an Open dialog for the Ontario targets file.


You will also need a CSV file consisting of a single column with the correct order of the Headers. This must be ordered by Strain (starting with the wild type and then the deletion strains in alphabetical order) then by TimePoint and finally by Flask.
* Line-by-line instructions follow:


Also make sure that the target file and the GPR files for the Ontario chips are in one folder and the the target file and the GPR files for the GCAT chips are in another folder.
*Load the limma library.
library(limma)


Open up R. Change the directory (File > Change dir...) to the folder containing the target file and the GPR files for the Ontario chips.
*When prompted after entering the line below, select the targets file with information about the GPR files corresponding to the Ontario chips. The file used for this analysis is [[Media:Ontario_Targets.csv‎ | Ontario_Targets.csv]] (or [[Media:Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv | Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv]]).
Otargets<-read.csv(file.choose())


*Read the Ontario "Target" CSV file and the "ID" CSV file into R.
*Read in the .gpr files by designating the column within the targets file (imported as Otargets) that lists all the .gpr file names.  
Targets<-read.csv("Targets.csv",sep=",")
Names<-read.csv("ONT_Index_ID.csv",sep=",")
  f<-function(x) as.numeric(x$Flags > -99)
  f<-function(x) as.numeric(x$Flags > -99)
*Separate the individual columns into their own locations in R so they may be called on later with ease.
  RGO<-read.maimages(Otargets[,1], source="genepix.median", wt.fun=f)
ds<-Targets[,6]
 
row<-Names[,1]
*Extract the column of values from the targets file that indicates for which chips the orientation of the dyes was swapped.
col<-Targets[,2]
ds<-Otargets$Dyeswap
*Read the GPR files into R so they can be normalized.
  RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)


*Loess normalize the GPR files and allow R to process all the GPR files, the more GPR files the more time this will take.
*Perform Loess normalization. This step may take some time depending upon the number of chips that need to be normalized.  
  MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
  MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")
*Tell R how many rows the target matrix is going to have.
 
  M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
*Create a list of the names of the genes on the Ontario chips.
  M1<-tapply(MAO$M[,1],as.factor(MAO$genes$Name),mean)
ontID<-rownames(M1)
 
*Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t<time point>-<flask number>.
headers<-Otargets$Header
 
*Create a matrix MO, which will store the normalized data after replicate spots on the chips have been averaged. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
n0<-length(MAO$M[1,])
  n1<-length(M1)
  n1<-length(M1)
*Tell R how many columns the target matrix is going to have.
MO<-matrix(nrow=n1,ncol=n0)
  n0<-length(MA$M[1,])
 
*Create the new matrix.
*Average all duplicate spots on the chip.
  MM<-matrix(nrow=n1,ncol=n0)
  for(i in 1:n0) {MO[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes$Name),mean)}
 
*Create a new matrix MD to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
  MD<-matrix(nrow=n1,ncol=n0)


*Average all duplicate spots so that only unique spots remain.
*Multiply each row of the data matrix MO by the vector ds, which contains the values from the targets file that indicate for which chips the dye orientation was swapped. Store the output of the calculation in the matrix MD.
  for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
  for(i in 1:n0) {MD[,i]<-ds[i]*MO[,i]}


*Tell R how many rows the target matrix will have.
*Convert matrix MD to a data frame.  
  M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
  MD2<-as.data.frame.matrix(MD)
n3<-length(M2)
*Set the row names of the data frame to the names of the genes on the Ontario chips.  
*Tell R how many columns the target matrix will have.
  colnames(MD2)<-headers
  n2<-length(MA$M[1,])
*Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.
*Create the target matrix.
  rownames(MD2)<-ontID
MN<-matrix(nrow=n3,ncol=n2)
*Dye swap the GPR files.
  for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}


*Tell R that the resulting matrix should be a data frame.
*Remove the control spots on the chip (i.e. Arabidopsis and 3XSSC).
MO<-as.data.frame.matrix(MN)
  MD3<-subset(MD2,row.names(MD2)!="Arabidopsis")
*Assign Headers to the data frames columns.
  MD4<-subset(MD3,row.names(MD3)!="3XSSC")
colnames(MO)<-col
*Assign IDs to the data frames rows.
rownames(MO)<-row
*Tell R to dispose of the two Ontario controls.
  ont1<-subset(MO,row.names(MO)!="Arabidopsis")
  MP<-subset(ont1,row.names(ont1)!="3XSSC")


Switch to GCAT directory and start on the GCAT chips.
If you are also normalizing data from GCAT chips, proceed to the next section [[#Within Array Normalization for the GCAT Chips|"Within Array Normalization for the GCAT Chips"]]. If you are normalizing only data from Ontario chips, then proceed to the section [[#Between Array Normalization of Only Ontario Chip Data|"Between Array Normalization of Only Ontario Chip Data"]].


*Read the GCAT target file into R.
===Within Array Normalization for the GCAT Chips===
  targets<-read.csv("GCAT_Targets.csv",sep=",")
 
* These instructions assume that you have just completed [[Dahlquist:Microarray_Data_Processing_in_R#Within_Array_Normalization_for_the_Ontario_Chips | Within Array Normalization for the Ontario Chips]].
* If necessary, change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips.
* Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method.  Hit enter after every line of code.
** Alternately, you can run a script containing all of these commands (and the commands from [[Dahlquist:Microarray_Data_Processing_in_R#Between_Array_Normalization_of_Merged_GCAT_and_Ontario_Chip_Data | Between Array Normalization of Merged GCAT and Ontario Chip Data]]) below.
** Download the zipped file [[Media:Within-Array Normalization GCAT and Merged Ontario-GCAT Between-Chip Normalization corrected 20140721.zip | Within-Array Normalization GCAT and Merged Ontario-GCAT Between-Chip Normalization corrected 20140721.zip]] and unzip it to the folder containing the [[Media:GCAT_Targets.csv‎ | GCAT_Targets.csv]] file and the GPR files corresponding to the GCAT chips.
** In R, select the menu item File > Source R code..., and select the Within-Array Normalization GCAT and Merged Ontario-GCAT Between-Chip Normalization corrected 20140721.R script.
** You will be prompted by an Open dialog for the GCAT targets file.
 
*Line-by-line instructions follow:
*Read the targets file for the GCAT chips.  The file used for this analysis is [[Media:GCAT_Targets.csv‎ | GCAT_Targets.csv]].
  Gtargets<-read.csv(file.choose())
 
*Create a subset of the targets file that contains only the GPR files created after scanning the top of the GCAT chips. Read those GPR files into R.
Gtop<-subset(Gtargets,Gtargets$Location %in% 'Top')
  f<-function(x) as.numeric(x$Flags > -99)
  f<-function(x) as.numeric(x$Flags > -99)
*Separate the Top and Bottom chips into their own locations in R.
  RT<-read.maimages(Gtop, source="genepix.median", wt.fun=f)
  RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f)
 
  RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f)
*Create a subset of the targets file that contains only the GPR files created after scanning the bottom of the GCAT chips. Read those GPR files into R.
*Combine the Top GPRs with the Bottom GPRs so that there are only 9 chips left.
Gbottom<-subset(Gtargets,Gtargets$Location %in% 'Bottom')
  RB<-read.maimages(Gbottom, source="genepix.median", wt.fun=f)
 
*Bind the data from the top and bottom of the GCAT chip for each chip.  
  RGG<-rbind(RT,RB)
  RGG<-rbind(RT,RB)


*Loess normalize the GCAT data.
*Extract the column of values from the subset of targets file corresponding to the top of the GCAT chips that indicates which chips were dye-swapped.
MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")
  dw<-Gtop$Dyeswap
*Tell the R how rows the target matrix should have.
R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
r1<-length(R1)
*Tell R how many columns the target matrix should have.
r0<-length(MAG$M[1,])
*Create the target matrix.
RR<-matrix(nrow=r1,ncol=r0)
*Average any duplicate spots in the GCAT data so that only unique spots remain.
  for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}


*Read the GCAT IDs file into R.
*Perform Loess normalization.
  GNames<-read.csv("GCAT_ID.csv",sep=",")
  MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")
*Separate the column Headers into their own location.
Gcol<-targets[1:9,2]
*Separate the row ID names into their own location.
Grow<-GNames[,1]
*Tell R the target matrix should be a data frame instead.
GD<-as.data.frame.matrix(RR)
*Assign column names to the data frame.
colnames(GD)<-Gcol
*Assign row names to the data frame.
rownames(GD)<-Grow


*Merge the GCAT and Ontario data together into a single data frame, any IDs that appear in one chip and not the other will appear as NA's in the data frame.
*Create a list of the names of the genes on the GCAT chip.
Q<-merge(MP,GD,by="row.names",all=T)
  M0<-tapply(MAG$M[,1],as.factor(MAG$genes$ID),mean)
*Tell R to get rid of the spots that are only in the GCAT chips, and keep all the spots that are in the Ontario chips, for the entire data set.
  gcatID<-rownames(M0)
  Z<-subset(Q,Q[,1] %in% Names[,1])
*Specify the number of columns in the target matrix.
x0<-length(Z[1,])
*Specify the number of rows in the target matrix.
x1<-length(Z[,1])
*Create the target matrix.
  XX<-matrix(nrow=x1,ncol=x0)
*Tell R the row names from the merged data are the same as the row names for the new target matrix.
XX[,1]=Z[,1]
*Tell R that the column Headers from the merged data are the same as the Headers for the new target matrix.
colnames(XX)=colnames(Z)
*Divide each chip by its own MAD to scale the data.
for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}


Make sure you pick the correct directory where your ordered header CSV file is located
*Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t<time point>-<flask number>.
chips<-Gtop$Header


*Read the correctly ordered Headers into R.
*Create a matrix MG, which will store the normalized data after replicate spots on the chips have been averaged. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
  XZ<-read.csv("Ordered_headers.csv",sep=",")
  m0<-length(MAG$M[1,])
*Tell R that the chip data should be a data frame instead of a matrix.
  m1<-length(M0)
  XV<-as.data.frame.matrix(XX)
  MR<-matrix(nrow=m1,ncol=m0)
*Sort the columns from the data frame into a new data frame using the ordered headers as the sorting criteria.
  XY<-XV[,match(XZ[,1],colnames(XV))]


*Write the final data set to a table, this should consist of all of the data, Loess normalized, scaled after the fact that the controls were gone, and then sorted into their correct order.
*Average the duplicates spots on the GCAT chips.
  write.table(XY,"Master_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)
  for(i in 1:length(chips)) {MR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes$ID),mean)}


*Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.
*Create a new matrix MG to store the data. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
  MG<-matrix(nrow=m1,ncol=m0)


===Alternate Way to Filter GCAT Chips===
*Set up a for loop to multiply each row of the data matrix MR by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MG.
for(i in 1:m0) {MG[,i]<-dw[i]*MR[,i]}


There is an alternate set of code that can be used to filter genes from the GCAT chips. The code below can be used to replace the code above for the GCAT chips at the point at which the GCAT and Ontario is merged together.
*Convert matrix MG to a data frame.
MG2<-as.data.frame.matrix(MG)
*Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.
colnames(MG2)<-chips
*Set the row names of the data frame to the names of the genes on the GCAT chips.
rownames(MG2)<-gcatID


*Get rid of the GCAT genes that are not on the Ontario chips.
*'''''NOTE: In normalizing dSWI4 GCAT chip data collected by two classes of Biology 478 on 4/29/2014, I noticed that the gene names in the GPR files included an "_01" at the end of the proper name, which is unlike the GCAT chip data for other deletion strains that had been previously analyzed. To following line of code was used to remove the "_01":'''''
names.to.keep<-row
  gcatIDtruncated<-strsplit(gcatID,"_01",fixed=TRUE)
  GO<-subset(GD,row.names(GD) %in% names.to.keep)
:*'''''The row names of the data frame MG2 were then set using the following line of code:'''''
  rownames(MG2)<-gcatIDtruncated
:*'''''The normalization was finished following the remaining part of the normalization protocol as listed below.'''''


*Find the Ontario genes that the GCAT chips do not have data for.
*Merge the GCAT and Ontario data into a single data frame. Any gene names that only appear on the GCAT chip will have an NA in each column of the row corresponding to that gene in the new data frame.  
  ONT<-as.data.frame(subset(MP, !row.names(MP) %in% GNames[,1]))
  GO<-merge(MD4,MG2,by="row.names",all=T)


*Create a new matrix for the subset of Ontario genes that are not on the GCAT chips. There should be as many rows as there Ontario genes in the subset (ONT). There should be as many columns as there are GCAT GPR files.
*Set the row names of the new data frame to the names of the genes on both the Ontario and GCAT chips.
  subO<-matrix(nrow=length(ONT[,1]),ncol=length(targets[1:9,1]))
rownames(GO)<-GO$Row.names
 
*Convert the matrix to a data frame.
*Get rid of the genes on the GCAT chips that are not on the Ontario chips.
  subO2<-as.data.frame.matrix(subO)
GO2<-subset(GO,rownames(GO) %in% ontID)
 
*Create a list to order the merged GCAT and Ontario chip data by strain, then time point, and then flask.
MasterList<-rbind(Otargets[,c('Header','MasterList')],Gtop[,c('Header','MasterList')])
OrderedML<-MasterList[sort.list(MasterList$MasterList),]
 
*Rearrange the columns so that they are ordered by strain (wild type then deletion strains in alphabetical order), then time point, and then flask.
GO3<-GO2[,match(OrderedML$Header,colnames(GO2))]
 
*Write the normalized Ontario and GCAT data to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
write.table(GO3,"GCAT_and_Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)
 
At this point, the normalization within each GCAT and Ontario array is complete. To merge the normalized data from both chips and to perform between array normalization, proceed to the section [[#Between Array Normalization of Merged GCAT and Ontario Chip Data|"Between Array Normalization of Merged GCAT and Ontario Chip Data"]].
 
===Between Array Normalization of Merged GCAT and Ontario Chip Data===
 
*Create a new matrix to store the merged data. It should have as many rows as there are genes common to both the GCAT and Ontario chips and as many columns as there are GPR files that have been normalized.
  g0<-length(GO3[1,])
g1<-length(GO3[,1])
MADM<-matrix(nrow=g1,ncol=g0)
 
*Scale each chip by its median absolute deviation.
for (i in 1:g0) {MADM[,i]<-GO3[,i]/mad(GO3[,i],na.rm=TRUE)}
 
*Convert matrix MADM to a data frame.  
  MAD<-as.data.frame.matrix(MADM)
*Set the row names of the data frame to names of the genes common to the Ontario and GCAT chips.
rownames(MAD)<-rownames(GO3)
*Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t<time point>-<flask number> ordered by strain, then time point, and then flask number.
colnames(MAD)<-colnames(GO3)
 
*Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
write.table(MAD,"GCAT_and_Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)
 
*Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option  Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.


*Label the rows with the names of the Ontario genes in the subset (ONT).
===Between Array Normalization of Only Ontario Chip Data===
rownames(subO2)<-row.names(ONT)


*Label the columns with the names of the GPR files.
'''''NOTE:''' this only applies if you are normalizing Ontario Chip Data OnlyIf you are combining GCAT and Ontario data, do not do this section.''
  colnames(subO2)<-targets[1:9,2]
   
   
*Bind the filtered GCAT data (GO) to the data frame with the Ontario genes not on the GCAT chips (subO2).
*Save the Ontario data after performing within array normalization to a CSV file.
  G<-rbind(GO,subO2)
  write.table(MD4,"Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)


*Sort the data frame final so that the genes are in alphabetical order.
*Create a list to order the Ontario chip data by strain, then time point, and then flask.
  G.sort<-G[order(row.names(G)),]
  MasterList<-Otargets[,c('Header','MasterList')]
OrderedML<-MasterList[sort.list(MasterList$MasterList),]


*Merge GCAT and Ontario within array normalized data.
*Reorder the columns of the data frame MD4 so that the data is ordered by the time points.
  merged<-cbind(G.sort,MP)
  MD5<-MD4[,match(OrderedML$Header,colnames(MD4))]


*Read .csv file with the list of all the headers (the GPR files) in the correct order.
*Create a matrix MADM, which will store the data after performing between array normalization. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
  XZ<-read.csv("Ordered_headers.csv",sep=",")
  t0<-length(MD5[1,])
t1<-length(MD5[,1])
MADM<-matrix(nrow=t1,ncol=t0)


*Rearrange the columns so that they are ordered by strain (wildtype then deletion strains in alphabetical order), timepoint, and then flask.
*Scale each chip by its median absolute deviation.
merged.sort<-merged[,match(XZ[2:104,1],colnames(merged))]
for (i in 1:t0) {MADM[,i]<-MD5[,i]/mad(MD5[,i],na.rm=TRUE)}


*Write the within array normalized data for all chips to a table.
*Convert matrix MADM to a data frame.
  write.table(merged.sort,"GCAT_and_Ontario_WAnorm.csv",sep=",",col.names=NA,row.names=TRUE)
  MAD<-as.data.frame.matrix(MADM)


*Create a blank matrix the as many rows (r) and as many columns (col) as there are in the data frame with the GCAT and Ontario chip within array normalized data merged.
*Set the row names of the data frame to names of the unique genes on the Ontario chip.
r<-length(merged.sort[,1])
  rownames(MAD)<-rownames(MD5)
  col<-length(merged.sort[1,])
MADM<-matrix(nrow=r,ncol=col)


*Scale each GPR file by its own MAD.
*Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t<time point>-<flask number> ordered by strain, then time point, and then flask number.  
  for (i in 1:103) {MADM[,i]<-merged.sort[,i]/mad(merged.sort[,i],na.rm=TRUE)}
  colnames(MAD)<-colnames(MD5)


*Convert the matrix into a data frame.
*Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
  merged.MAD<-as.data.frame.matrix(MADM)
  write.table(MAD,"Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)


*Label the rows with the row names from the merged and sorted (merged.sort) data frame.
Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.
rownames(merged.MAD)<-row.names(merged.sort)


*Label the columns with the column names from the merged and sorted (merged.sort) data frame.
==Visualize the Normalized Data==
colnames(merged.MAD)<-colnames(merged.sort)


*Write scale normalized data for all chips to a table.
There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots (M for log fold change and A for intensity) display the ratio of the quantity of the red dye to that of the green dye (log2(red/green)), which will be referred to as the log fold change, for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.
write.table(merged.MAD,"ONT_and_GCAT_final_scaled_data.csv",sep=",",col.names=NA,row.names=TRUE)


*Edit the .csv file in Excel in as described in the last bullet of the previous section.
===Create MA Plots and Box Plots for the GCAT Chips===


==Generating MA Plots and Boxplots==
First, you will create MA plots for the data before the normalization has occurred. Input the following code in the same window in R that was used to normalize the data.


Use the following lines of code to create MA plots and boxplots for the GCAT chips.
*Create a variable to store the names of the genes on the GCAT chip before the controls have been removed and before the replicates have been averaged.
GCAT.GeneList<-RGG$genes$ID


First, you will create MA plots for the data before the normalization has occurred.  
*Calculate the log fold change of each gene before normalization has been performed.
lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))


*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. Originally, there were 9 GCAT chips so in the line of code below there are 3 columns and 3 rows of graphs.
* If you get a message saying "NaNs produced" this is OK, proceed to the next step.
  par(mfrow=c(3,3))
*Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
  r0<-length(lg[1,])
rx<-tapply(lg[,1],as.factor(GCAT.GeneList),mean)
r1<-length(rx)
MM<-matrix(nrow=r1,ncol=r0)


*Set a variable (GeneList) for all of the GCAT gene IDs before the controls have been taken out and before replicates have been averaged.
*Calculate the log fold changes after averaging duplicate genes.
  GeneList<-RGG$genes$ID
  for(i in 1:r0) {MM[,i]<-tapply(lg[,i],as.factor(GCAT.GeneList),mean)}


*Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred.
*Create a new matrix MC to store the data. It should have as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
  lr<-(log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb)))
  MC<-matrix(nrow=r1,ncol=r0)


*Create a blank matrix with as many columns as there are GPR files and as rows as there are genes after replicates have been averaged.
*Set up a for loop to multiply each row of the matrix MM by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MC.
  r0<-length(lr[1,])
  for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}
RX<-tapply(lr[,1],as.factor(GeneList),mean)
r1<-length(RX)
MG<-matrix(nrow=r1,ncol=r0)


*Calculate the log fold changes (M values) for each spot on each chip after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files.
*Convert the matrix MC to a data frame and set the column names of the data frame to the names of the GCAT chips in the form <strain>_LogFC_t<time point>-<flask number>.
  for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}
  MCD<-as.data.frame(MC)
colnames(MCD)<-chips
rownames(MCD)<-gcatID


*Calculate the intensity values (A values) for each spot on each chip before normalization has occurred.
*Calculate the intensity values before normalization has occurred.
  la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
  la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))


*Create a blank matrix with as many columns as there are GPR files and as many rows as there are genes after replicates have been averaged.
* If you get these Warning messages, it's OK:
  r3<-length(la[1,])
:1: In (RGG$R - RGG$Rb) * (RGG$G - RGG$Gb) :
  RQ<-tapply(la[,1],as.factor(GeneList),mean)
:NAs produced by integer overflow
  r4<-length(RQ)
:2: NaNs produced
  AG<-matrix(nrow=r4,ncol=r3)
*Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
  r2<-length(la[1,])
  ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean)
  r3<-length(ri)
  AG<-matrix(nrow=r3,ncol=r2)


*Calculate the intensity values (A values) after averaging duplicate genes. In the for loop, make sure that the range reflects the number of GPR files.
*Calculate the intensity values after averaging duplicate genes.
  for(i in 1:9) {AG[,i]<-tapply(la[,i],as.factor(GeneList),mean)}
  for(i in 1:r2) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}


*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.
*Set how many rows and columns the graphs will be partitioned in to. In the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
  for(i in 1:9) {plot(AG[,i],MG[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}
  par(mfrow=c(3,3))


Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
*Plot the log fold changes (labeled "M" on the y axis) against the intensity values (labeled "A" on the x axis) before each chip  has been normalized. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
for(i in 1:r2) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()


Next, you will create MA plots for the data after within array normalization has been performed.  
Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter


*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window.
Next, create MA plots of the data after within array normalization.
par(mfrow=c(3,3))


The log fold changes after normalization is saved in R's memory under the variable RR. Therefore, just the intensity values have to be calculated after within array normalization has occurred.
The log fold changes after within array normalization were calculated earlier and stored in the data frame MG2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.


*Create a blank matrix with as many columns as there are columns in GPR files and as many rows as there are averaged duplicate genes.
*Create a blank matrix with as many columns as there are GPR files and as many rows as there are unique genes.
  X1<-tapply(MAG$A[,1],as.factor(MAG$genes[,4]),mean)
  x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean)
  y0<-length(MAG$A[1,])
  y0<-length(MAG$A[1,])
  y1<-length(X1)
  x1<-length(x0)
  AAG<-matrix(nrow=y1,ncol=y0)
  AAG<-matrix(nrow=x1,ncol=y0)


*Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files.
*Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.
  for(i in 1:9) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes[,4]),mean)}
  for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}


*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.
*Set how many rows and columns the graphs will be partitioned in to.
  for(i in 1:9) {plot(AAG[,i],RR[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}
  par(mfrow=c(3,3))


Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
*Plot the log fold changes (labeled "M" on the y axis) versus the intensity values (labeled "A" on the x axis) for each chip after within array normalization has been performed performed. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()


Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.
Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter.


*Change the dimensions of the window in which the graphs will appear to reflect how many graphs need to be fit into the window. Since you will be generating three graphs, one for each stage in the normalization process, you can set the dimensions to one row with three columns.
Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after between array normalization has occurred.
 
*Set how many rows and columns the graphs will be partitioned in to.  
  par(mfrow=c(1,3))
  par(mfrow=c(1,3))


*Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.  
*Create a box plot of the log fold changes before normalization has occurred.  
  boxplot(MG[,1],MG[,2],MG[,3],MG[,4],MG[,5],MG[,6],MG[,7],MG[,8],MG[,9],ylim=c(-5,5))
  boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')


*Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.
*Specify where the tick marks will appear on the x axis of the plot.
  boxplot(RR[,1],RR[,2],RR[,3],RR[,4],RR[,5],RR[,6],RR[,7],RR[,8],RR[,9],ylim=c(-5,5))
  axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)


*Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes a GCAT GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of GCAT GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.
*Place the tick mark labels along the x axis of the plot.
boxplot(XY[,1],XY[,5],XY[,6],XY[,10],XY[,11],XY[,14],XY[,15],XY[,19],XY[,20],ylim=c(-5,5))
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)


If you used the alternative way to filter the GCAT chips, then use the following code:
*Create a box plot of the log fold changes after within array normalization has been performed.
  boxplot(merged.MAD[,1],merged.MAD[,5],merged.MAD[,6],merged.MAD[,10],merged.MAD[,11],merged.MAD[,14],merged.MAD[,15],
  boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  merged.MAD[,19],XY[,20],ylim=c(-5,5))
 
*Place the tick mark labels along the x axis of the plot.
Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
axis(1,at=xy.coords(chips)$x,labels=FALSE)
 
*Place the tick mark labels along the x axis of the plot.
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
 
*Create a box plot of the log fold changes after between array normalization has been performed.  
  boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')


Use the following lines of code to create MA plots and boxplots for the Ontario chips.
*Place the tick mark labels along the x axis of the plot.
axis(1, at=xy.coords(chips)$x,labels=FALSE)


First, you will create MA plots for the wildtype data before the normalization has occurred.  
*Place the tick mark labels along the x axis of the plot.
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)


*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file. Since there were originally 14 GPR files for the wildtype the code below creates a a window to fit four rows and four columns of graphs.
Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
par(mfrow=c(4,4))


*Set a variable (genelist) for all of the Ontario gene IDs before the controls have been taken out and before replicates have been averaged.
===Create MA Plots and Box Plots for the Ontario Chips===
genelist<-RG$genes$Name


*Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred. The log fold changes should also by multiplied by the list of dyeswaps taken from the targets file previously imported into R. In the for loop, alter the range to reflect the number of GPR files in RG for all strains.
*Create a variable to store the names of the genes on the Ontario chip before the controls have been removed and before the replicates have been averaged.
  for(i in 1:94) {lfm<-ds[i]*(log2((RG$R-RG$Rb)/(RG$G-RG$Gb)))}
  Ontario.GeneList<-RGO$genes$Name


*Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.
*Calculate the log fold change of each gene before normalization has been performed.
  z0<-length(lfm[1,])
  lr<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))
ZX<-tapply(lfm[,1],as.factor(genelist),mean)
z1<-length(ZX)
MZ<-matrix(nrow=z1,ncol=z0)


*Calculate the log fold changes (M values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.
* Warning message: "NaNs produced" is OK.  
for(i in 1:14) {MZ[,i]<-tapply(lf[,i],as.factor(genelist),mean)}


*Calculate the intensity values (A values) for each spot for each chip for the wildtype before normalization has occurred.
*Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
  lfa<-(1/2*log2((RG$R-RG$Rb)*(RG$G-RG$Gb)))
  z0<-length(lr[1,])
v0<-tapply(lr[,1],as.factor(Ontario.GeneList),mean)
z1<-length(v0)
MT<-matrix(nrow=z1,ncol=z0)


*Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.
*Calculate the log fold changes after averaging duplicate genes.
  z3<-length(lfa[1,])
  for(i in 1:z0) {MT[,i]<-tapply(lr[,i],as.factor(Ontario.GeneList),mean)}
ZQ<-tapply(lfa[,1],as.factor(genelist),mean)
z4<-length(ZQ)
AZ<-matrix(nrow=z4,ncol=z3)


*Calculate the intensity values (M values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.
*Create a new matrix MI to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
  for(i in 1:14) {AZ[,i]<-tapply(lfa[,i],as.factor(genelist),mean)}
  MI<-matrix(nrow=z1,ncol=z0)


*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.
*Set up a for loop to multiply each row of the matrix MT by the vector of values ds indicating the dye-swapped chips. Store the output to the matrix MI.
  for(i in 1:14) {plot(AZ[,i],MZ[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}
  for(i in 1:z0) {MI[,i]<-ds[i]*MT[,i]}


Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
*Convert the matrix MI to a data frame and set the column names of the data frame to the names of the Ontario chips in the form <strain>_LogFC_t<time point>-<flask number>.
MID<-as.data.frame(MI)
colnames(MID)<-headers
rownames(MID)<-ontID


Next, you will create MA plots for the wildtype data after within array normalization has been performed.  
*Calculate the intensity values before normalization has occurred.
ln<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))


*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file.
* Warning messages are OK:
par(mfrow=c(4,4))
:1: In (RGO$R - RGO$Rb) * (RGO$G - RGO$Gb) :
: NAs produced by integer overflow
:2: NaNs produced


The within array normalized log fold changes are already in R's memory under the variable MN. Therefore, just the intensity values have to be calculated after within array normalization has occurred.  
*Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
z2<-length(ln[1,])
zi<-tapply(ln[,1],as.factor(Ontario.GeneList),mean)
z3<-length(zi)
AO<-matrix(nrow=z3,ncol=z2)


*Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.
*Calculate the intensity values after averaging duplicate genes.
  v1<-tapply(MA$A[,1],as.factor(MA$genes[,5]),mean)
  for(i in 1:z0) {AO[,i]<-tapply(ln[,i],as.factor(Ontario.GeneList),mean)}
w0<-length(MA$A[1,])
w1<-length(v1)
AAO<-matrix(nrow=w1,ncol=w0)


*Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files for the wildtype.
*Create a list of all of the strains for which there are Ontario chips.
  for(i in 1:14) {AAO[,i]<-tapply(MA$A[,i],as.factor(MA$genes[,5]),mean)}
  strains<-c('wt','dCIN5','dGLN3','dHMO1','dZAP1')


*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.
*Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) before each chip has been normalized. "st" designates for which strain to create MA plots. "lt" determines which columns of the data correspond to the selected strain.
  for(i in 1:14) {plot(AAO[,i],MN[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}
*Set how many rows and columns the graphs will be partitioned in to with the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
*The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
*After entering the call browser(), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
  for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AO[,i],MI[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
}
*To continue generating plots, type the letter c and press enter.


Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
Next, create MA plots of the data after within array normalization.


Use the following code to generate boxplots of the log fold changes for the wildtype chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.
The log fold changes after within array normalization were calculated earlier and stored in the data frame MD2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.


*Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.  
*Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
  boxplot(MZ[,1],MZ[,2],MZ[,3],MZ[,4],MZ[,5],MZ[,6],MZ[,7],MZ[,8],MZ[,9],MZ[,10],MZ[,11],MZ[,12],MZ[,13],MZ[,14],ylim=c(-5,5))
  j0<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean)
k0<-length(MAO$A[1,])
j1<-length(j0)
AAO<-matrix(nrow=j1,ncol=k0)


*Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.
*Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.
  boxplot(MN[,1],MN[,2],MN[,3],MN[,4],MN[,5],MN[,6],MN[,7],MN[,8],MN[,9],MN[,10],MN[,11],MN[,12],MN[,13],MN[,14],ylim=c(-5,5))
  for(i in 1:k0) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}


*Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes a Ontario GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of Ontario GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.
*Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) for each chip after within array normalization has been performed performed.
  boxplot(XY[,2],XY[,3],XY[,4],XY[,7],XY[,8],XY[,9],XY[,12],XY[,13],XY[,16],XY[,17],XY[,18],XY[,21],XY[,22],XY[,23],ylim=c(-5,5))
*Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
  for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AAO[,i],MD2[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
}
*To continue generating plots, type the letter c and press enter.


If you used the alternative way to filter the GCAT chips, then use the following code:
Use the following code to generate box plots of the log fold changes before normalization has occurred, after within array normalization has been performed, and after between array normalization has been performed.
boxplot(merged.MAD[,2],merged.MAD[,3],merged.MAD[,4],merged.MAD[,7],merged.MAD[,8],merged.MAD[,9],merged.MAD[,12],
merged.MAD[,13],merged.MAD[,16],merged.MAD[,17],merged.MAD[,18],merged.MAD[,21],merged.MAD[,22],merged.MAD[,23],ylim=c(-5,5))


After MA plots and boxplots for the wildtype have been generated, you should make the same types of plots for the deletion strains. Work with one strain first creating the MA Plots and the three different boxplots for that strain before moving on to another strain. The same code as depicted above for the Ontario chips can be used for the deletion strains with some modifications. When designating the dimensions of the window in which the plots will appear, make sure that there are enough rows and columns to fit a graph for each GPR file for the strain. You do not have to reinput the code assigning the Ontario gene ID's to a variable nor the code that calculates the log fold changes before normalization nor the code that calculates intensities before normalization. For the MA plots, the range of the for loop must match the number of GPR files for the strain you are working on. For the boxplots, the number in the bracket next to the variable must correspond to the correct GPR for the strain you are working on. When generating the boxplot for the nonnormalized data, refer to the target file for the correct order of the GPR files for the strain you are working on. When generating the boxplot for the within array normalized data, refer to the data frame with the within array normalized data (MN) for the correct order of the GPR files for the strain you are working on. When generating the boxplot for the scale normalized data, refer to the final R output with the scale normalized data for all the chips for the correct order of the GPR files for the strain you are working on. When generating MA plots and boxplots for different strain, keep the x and y limits of the MA plot and the y limits of the boxplot the same for all the strains.
*"xcoord" sets the coordinates at which the tick mark labels will be placed along the x axis of the plot. The value which is being subtracted may need to be changed depending upon the font size and the number of tick mark labels.
*"fsize" sets the font size of the tick mark labels. This value may need to be changed depending upon the number of characters in the tick mark label.
*"ft" determines what columns of the data frame MAD (which contains the data from the GCAT and Ontario chips after performing between array normalization) correspond to the strain determined by "st".  
for (i in 1:length(strains)) {
  par(mfrow=c(1,3))
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      xcoord<-xy.coords(lt)$x-1
      fsize<-0.9
  } else {
      xcoord<-xy.coords(lt)$x-1.7
      fsize<-0.8
  }
  boxplot(MID[,lt],main='Before Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  boxplot(MD2[,lt],main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  ft<-Otargets$MasterList[which(Otargets$Strain %in% st)]
  boxplot(MAD[,ft],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  browser()
}
*To continue generating plots, type the letter c and press enter.
* Warnings are OK.

Latest revision as of 13:22, 18 May 2015

Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       


The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).

  • The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 (link to download site) and version 3.20.1 of the limma package ( direct link to download zipped file) on the Windows 7 platform.
    • Note that using other versions of R or the limma package might give different results.
    • Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10-13 or 10-14 decimal place. The Dahlquist Lab is standardizing on using the 64-bit version of R.
  • To install R for the first time, download and run the installer from the link above, accepting the default installation.
  • To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program. If you accept the default location, that will be C:\Program Files\R\R-3.1.0\library.


Input Files for R

The normalization protocol was written to normalize two types of chips:

  • "GCAT" chips: containing 70-mer oligonucleotides printed on epoxy chips obtained from the Genome Consortium for Active Teaching and produced by Washington University, St. Louis (Fall 2005 chips)
  • "Ontario" chips: the Yeast 6.4K Array containing full-length PCR products obtained from the UHN Microarray Centre in Toronto, Ontario, Canada.

For each chip that you want to normalize in R, you must import the .gpr file created for that chip after scanning it with GenePix Pro Software. Make sure that all of the .gpr files you will need for normalization are in one folder, along with the targets file(s) described in the section below.

Create a targets file with information about the .gpr files

One targets file must be created for the Ontario chips and a separate targets file must be created for the GCAT chips. To create the targets file for the Ontario chips, create the following spreadsheet in Excel, where each row corresponds to a .gpr file:

  • The first column (Column A, labeled "FileName") contains the file names of all the .gpr files. Make sure that you include the ".gpr" extension at the end of each file name.
  • The second column (Column B, labeled "Header") describes the strain, the time point, and flask from which the data represented in the .gpr file was taken during the cold shock experiment. Create this designation for each .gpr file in the form <strain>_LogFC_t
  • The third column (Column C, labeled "Strain") contains the name of the strain to which the .gpr file corresponds (match this to the designation you used in Column B).
  • The fourth column (Column D, labeled "TimePoint") contains the experimental time point to which the .gpr file corresponds.
  • The fifth column (Column E, labeled "Flask") contains the flask number to which the .gpr file corresponds, assuming that replicates were performed for each time point for each strain.
  • The sixth column (Column F, labeled "Dyeswap") indicates the orientation of the dyes on the chip. If the control time point (t0) was labeled with Cy3 dye and the other time points where labeled with Cy5 dye on a particular chip, then put a "1" in this column in the row corresponding to the chip. If the orientation of the dyes was reversed for a chip, then put a "-1" in this column in the row corresponding to that chip.
  • The seventh column (Column G, labeled "MasterList") orders the chips by strain (wild type strain first and the deletion strains afterwards), then time point, and then flask. In this column, put a "1" in the row corresponding to the chip from the wild type, the first time point, and the first flask of that time point. Then, put a "2" in the row corresponding to the chip from the wild type strain, the first time point, and the second flask of that time point. Continue to number the chips in this order until all the chips have been numbered.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the Ontario chips.

The targets file for the GCAT chips should contain the aforementioned columns in the order in which they were mentioned. However, there should be a column that precedes this aforementioned set of columns (Column A,labeled "Location"), which designates which half of the GCAT chip was scanned to create the .gpr file. The rest of the columns will be shifted to the right. If the top half was scanned for a particular chip, then put "Top" in this column in the row corresponding to that chip. If the bottom half was scanned for a particular chip, then put "Bottom" in this column in the row corresponding to that chip. Make sure that the sequence of GCAT chips in the MasterList of .gpr files corresponding to the top of the chips is the same as the sequence of GCAT chips in the list of the .gpr files corresponding to the bottom of the chips. This is necessary because the .gal file for the GCAT chips only corresponded to one-half of the chip; the entire genome was printed twice in two large blocks on the top and bottom half of the chip.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the GCAT chips. The file used for this analysis is GCAT_Targets.csv.

Data Normalization

Microarray data for S. cerevisiae wild type and mutant strains was initially collected using GCAT chips before switching to Ontario chips. The within array normalization procedure for data from the GCAT chips has some differences in comparison to the procedure for data from the Ontario chips. In contrast, microarray data for wild type S. paradoxus has been collected using Ontario chips only.

To normalize data from both the Ontario and GCAT chips, begin with the section "Within Array Normalization for the Ontario Chips". Begin with the same section to normalize data collected with Ontario chips only.

Within Array Normalization for the Ontario Chips

  • Launch R 3.1.0.
  • Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips.
  • Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method. Hit enter after every line of code.
  • Line-by-line instructions follow:
  • Load the limma library.
library(limma)
Otargets<-read.csv(file.choose())
  • Read in the .gpr files by designating the column within the targets file (imported as Otargets) that lists all the .gpr file names.
f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(Otargets[,1], source="genepix.median", wt.fun=f)
  • Extract the column of values from the targets file that indicates for which chips the orientation of the dyes was swapped.
ds<-Otargets$Dyeswap
  • Perform Loess normalization. This step may take some time depending upon the number of chips that need to be normalized.
MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")
  • Create a list of the names of the genes on the Ontario chips.
M1<-tapply(MAO$M[,1],as.factor(MAO$genes$Name),mean)
ontID<-rownames(M1)
  • Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t
headers<-Otargets$Header
  • Create a matrix MO, which will store the normalized data after replicate spots on the chips have been averaged. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
n0<-length(MAO$M[1,])
n1<-length(M1)
MO<-matrix(nrow=n1,ncol=n0)
  • Average all duplicate spots on the chip.
for(i in 1:n0) {MO[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes$Name),mean)}
  • Create a new matrix MD to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
MD<-matrix(nrow=n1,ncol=n0)
  • Multiply each row of the data matrix MO by the vector ds, which contains the values from the targets file that indicate for which chips the dye orientation was swapped. Store the output of the calculation in the matrix MD.
for(i in 1:n0) {MD[,i]<-ds[i]*MO[,i]}
  • Convert matrix MD to a data frame.
MD2<-as.data.frame.matrix(MD)
  • Set the row names of the data frame to the names of the genes on the Ontario chips.
colnames(MD2)<-headers
  • Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.
rownames(MD2)<-ontID
  • Remove the control spots on the chip (i.e. Arabidopsis and 3XSSC).
MD3<-subset(MD2,row.names(MD2)!="Arabidopsis")
MD4<-subset(MD3,row.names(MD3)!="3XSSC")

If you are also normalizing data from GCAT chips, proceed to the next section "Within Array Normalization for the GCAT Chips". If you are normalizing only data from Ontario chips, then proceed to the section "Between Array Normalization of Only Ontario Chip Data".

Within Array Normalization for the GCAT Chips

  • Line-by-line instructions follow:
  • Read the targets file for the GCAT chips. The file used for this analysis is GCAT_Targets.csv.
Gtargets<-read.csv(file.choose())
  • Create a subset of the targets file that contains only the GPR files created after scanning the top of the GCAT chips. Read those GPR files into R.
Gtop<-subset(Gtargets,Gtargets$Location %in% 'Top')
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(Gtop, source="genepix.median", wt.fun=f)
  • Create a subset of the targets file that contains only the GPR files created after scanning the bottom of the GCAT chips. Read those GPR files into R.
Gbottom<-subset(Gtargets,Gtargets$Location %in% 'Bottom')
RB<-read.maimages(Gbottom, source="genepix.median", wt.fun=f)
  • Bind the data from the top and bottom of the GCAT chip for each chip.
RGG<-rbind(RT,RB)
  • Extract the column of values from the subset of targets file corresponding to the top of the GCAT chips that indicates which chips were dye-swapped.
dw<-Gtop$Dyeswap
  • Perform Loess normalization.
MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")
  • Create a list of the names of the genes on the GCAT chip.
M0<-tapply(MAG$M[,1],as.factor(MAG$genes$ID),mean)
gcatID<-rownames(M0)
  • Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t
chips<-Gtop$Header
  • Create a matrix MG, which will store the normalized data after replicate spots on the chips have been averaged. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
m0<-length(MAG$M[1,])
m1<-length(M0)
MR<-matrix(nrow=m1,ncol=m0)
  • Average the duplicates spots on the GCAT chips.
for(i in 1:length(chips)) {MR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes$ID),mean)}
  • Create a new matrix MG to store the data. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
MG<-matrix(nrow=m1,ncol=m0)
  • Set up a for loop to multiply each row of the data matrix MR by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MG.
for(i in 1:m0) {MG[,i]<-dw[i]*MR[,i]}
  • Convert matrix MG to a data frame.
MG2<-as.data.frame.matrix(MG)
  • Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.
colnames(MG2)<-chips
  • Set the row names of the data frame to the names of the genes on the GCAT chips.
rownames(MG2)<-gcatID
  • NOTE: In normalizing dSWI4 GCAT chip data collected by two classes of Biology 478 on 4/29/2014, I noticed that the gene names in the GPR files included an "_01" at the end of the proper name, which is unlike the GCAT chip data for other deletion strains that had been previously analyzed. To following line of code was used to remove the "_01":
gcatIDtruncated<-strsplit(gcatID,"_01",fixed=TRUE)
  • The row names of the data frame MG2 were then set using the following line of code:
 rownames(MG2)<-gcatIDtruncated
  • The normalization was finished following the remaining part of the normalization protocol as listed below.
  • Merge the GCAT and Ontario data into a single data frame. Any gene names that only appear on the GCAT chip will have an NA in each column of the row corresponding to that gene in the new data frame.
GO<-merge(MD4,MG2,by="row.names",all=T)
  • Set the row names of the new data frame to the names of the genes on both the Ontario and GCAT chips.
rownames(GO)<-GO$Row.names
  • Get rid of the genes on the GCAT chips that are not on the Ontario chips.
GO2<-subset(GO,rownames(GO) %in% ontID)
  • Create a list to order the merged GCAT and Ontario chip data by strain, then time point, and then flask.
MasterList<-rbind(Otargets[,c('Header','MasterList')],Gtop[,c('Header','MasterList')])
OrderedML<-MasterList[sort.list(MasterList$MasterList),]
  • Rearrange the columns so that they are ordered by strain (wild type then deletion strains in alphabetical order), then time point, and then flask.
GO3<-GO2[,match(OrderedML$Header,colnames(GO2))]
  • Write the normalized Ontario and GCAT data to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
write.table(GO3,"GCAT_and_Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)

At this point, the normalization within each GCAT and Ontario array is complete. To merge the normalized data from both chips and to perform between array normalization, proceed to the section "Between Array Normalization of Merged GCAT and Ontario Chip Data".

Between Array Normalization of Merged GCAT and Ontario Chip Data

  • Create a new matrix to store the merged data. It should have as many rows as there are genes common to both the GCAT and Ontario chips and as many columns as there are GPR files that have been normalized.
g0<-length(GO3[1,])
g1<-length(GO3[,1])
MADM<-matrix(nrow=g1,ncol=g0)
  • Scale each chip by its median absolute deviation.
for (i in 1:g0) {MADM[,i]<-GO3[,i]/mad(GO3[,i],na.rm=TRUE)}
  • Convert matrix MADM to a data frame.
MAD<-as.data.frame.matrix(MADM)
  • Set the row names of the data frame to names of the genes common to the Ontario and GCAT chips.
rownames(MAD)<-rownames(GO3)
  • Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t
colnames(MAD)<-colnames(GO3)
  • Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
write.table(MAD,"GCAT_and_Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)
  • Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.

Between Array Normalization of Only Ontario Chip Data

NOTE: this only applies if you are normalizing Ontario Chip Data Only. If you are combining GCAT and Ontario data, do not do this section.

  • Save the Ontario data after performing within array normalization to a CSV file.
write.table(MD4,"Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)
  • Create a list to order the Ontario chip data by strain, then time point, and then flask.
MasterList<-Otargets[,c('Header','MasterList')]
OrderedML<-MasterList[sort.list(MasterList$MasterList),]
  • Reorder the columns of the data frame MD4 so that the data is ordered by the time points.
MD5<-MD4[,match(OrderedML$Header,colnames(MD4))]
  • Create a matrix MADM, which will store the data after performing between array normalization. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
t0<-length(MD5[1,])
t1<-length(MD5[,1])
MADM<-matrix(nrow=t1,ncol=t0)
  • Scale each chip by its median absolute deviation.
for (i in 1:t0) {MADM[,i]<-MD5[,i]/mad(MD5[,i],na.rm=TRUE)}
  • Convert matrix MADM to a data frame.
MAD<-as.data.frame.matrix(MADM)
  • Set the row names of the data frame to names of the unique genes on the Ontario chip.
rownames(MAD)<-rownames(MD5)
  • Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t
colnames(MAD)<-colnames(MD5)
  • Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.
write.table(MAD,"Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)

Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.

Visualize the Normalized Data

There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots (M for log fold change and A for intensity) display the ratio of the quantity of the red dye to that of the green dye (log2(red/green)), which will be referred to as the log fold change, for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.

Create MA Plots and Box Plots for the GCAT Chips

First, you will create MA plots for the data before the normalization has occurred. Input the following code in the same window in R that was used to normalize the data.

  • Create a variable to store the names of the genes on the GCAT chip before the controls have been removed and before the replicates have been averaged.
GCAT.GeneList<-RGG$genes$ID
  • Calculate the log fold change of each gene before normalization has been performed.
lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))
  • If you get a message saying "NaNs produced" this is OK, proceed to the next step.
  • Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
r0<-length(lg[1,])
rx<-tapply(lg[,1],as.factor(GCAT.GeneList),mean)
r1<-length(rx)
MM<-matrix(nrow=r1,ncol=r0)
  • Calculate the log fold changes after averaging duplicate genes.
for(i in 1:r0) {MM[,i]<-tapply(lg[,i],as.factor(GCAT.GeneList),mean)}
  • Create a new matrix MC to store the data. It should have as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
MC<-matrix(nrow=r1,ncol=r0)
  • Set up a for loop to multiply each row of the matrix MM by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MC.
for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}
  • Convert the matrix MC to a data frame and set the column names of the data frame to the names of the GCAT chips in the form <strain>_LogFC_t
MCD<-as.data.frame(MC)
colnames(MCD)<-chips
rownames(MCD)<-gcatID
  • Calculate the intensity values before normalization has occurred.
la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
  • If you get these Warning messages, it's OK:
1: In (RGG$R - RGG$Rb) * (RGG$G - RGG$Gb) :
NAs produced by integer overflow
2: NaNs produced
  • Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
r2<-length(la[1,])
ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean)
r3<-length(ri)
AG<-matrix(nrow=r3,ncol=r2)
  • Calculate the intensity values after averaging duplicate genes.
for(i in 1:r2) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}
  • Set how many rows and columns the graphs will be partitioned in to. In the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
par(mfrow=c(3,3))
  • Plot the log fold changes (labeled "M" on the y axis) against the intensity values (labeled "A" on the x axis) before each chip has been normalized. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
for(i in 1:r2) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter

Next, create MA plots of the data after within array normalization.

The log fold changes after within array normalization were calculated earlier and stored in the data frame MG2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.

  • Create a blank matrix with as many columns as there are GPR files and as many rows as there are unique genes.
x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean)
y0<-length(MAG$A[1,])
x1<-length(x0)
AAG<-matrix(nrow=x1,ncol=y0)
  • Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.
for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}
  • Set how many rows and columns the graphs will be partitioned in to.
par(mfrow=c(3,3))
  • Plot the log fold changes (labeled "M" on the y axis) versus the intensity values (labeled "A" on the x axis) for each chip after within array normalization has been performed performed. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter.

Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after between array normalization has occurred.

  • Set how many rows and columns the graphs will be partitioned in to.
par(mfrow=c(1,3))
  • Create a box plot of the log fold changes before normalization has occurred.
boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  • Specify where the tick marks will appear on the x axis of the plot.
axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)
  • Place the tick mark labels along the x axis of the plot.
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
  • Create a box plot of the log fold changes after within array normalization has been performed.
boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  • Place the tick mark labels along the x axis of the plot.
axis(1,at=xy.coords(chips)$x,labels=FALSE)
  • Place the tick mark labels along the x axis of the plot.
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
  • Create a box plot of the log fold changes after between array normalization has been performed.
boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  • Place the tick mark labels along the x axis of the plot.
axis(1, at=xy.coords(chips)$x,labels=FALSE)
  • Place the tick mark labels along the x axis of the plot.
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Create MA Plots and Box Plots for the Ontario Chips

  • Create a variable to store the names of the genes on the Ontario chip before the controls have been removed and before the replicates have been averaged.
Ontario.GeneList<-RGO$genes$Name
  • Calculate the log fold change of each gene before normalization has been performed.
lr<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))
  • Warning message: "NaNs produced" is OK.
  • Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
z0<-length(lr[1,])
v0<-tapply(lr[,1],as.factor(Ontario.GeneList),mean)
z1<-length(v0)
MT<-matrix(nrow=z1,ncol=z0)
  • Calculate the log fold changes after averaging duplicate genes.
for(i in 1:z0) {MT[,i]<-tapply(lr[,i],as.factor(Ontario.GeneList),mean)}
  • Create a new matrix MI to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.
MI<-matrix(nrow=z1,ncol=z0)
  • Set up a for loop to multiply each row of the matrix MT by the vector of values ds indicating the dye-swapped chips. Store the output to the matrix MI.
for(i in 1:z0) {MI[,i]<-ds[i]*MT[,i]}
  • Convert the matrix MI to a data frame and set the column names of the data frame to the names of the Ontario chips in the form <strain>_LogFC_t
MID<-as.data.frame(MI)
colnames(MID)<-headers
rownames(MID)<-ontID
  • Calculate the intensity values before normalization has occurred.
ln<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))
  • Warning messages are OK:
1: In (RGO$R - RGO$Rb) * (RGO$G - RGO$Gb) :
NAs produced by integer overflow
2: NaNs produced
  • Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
z2<-length(ln[1,])
zi<-tapply(ln[,1],as.factor(Ontario.GeneList),mean)
z3<-length(zi)
AO<-matrix(nrow=z3,ncol=z2)
  • Calculate the intensity values after averaging duplicate genes.
for(i in 1:z0) {AO[,i]<-tapply(ln[,i],as.factor(Ontario.GeneList),mean)}
  • Create a list of all of the strains for which there are Ontario chips.
strains<-c('wt','dCIN5','dGLN3','dHMO1','dZAP1')
  • Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) before each chip has been normalized. "st" designates for which strain to create MA plots. "lt" determines which columns of the data correspond to the selected strain.
  • Set how many rows and columns the graphs will be partitioned in to with the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
  • The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
  • After entering the call browser(), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AO[,i],MI[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
} 
  • To continue generating plots, type the letter c and press enter.

Next, create MA plots of the data after within array normalization.

The log fold changes after within array normalization were calculated earlier and stored in the data frame MD2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.

  • Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.
j0<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean)
k0<-length(MAO$A[1,])
j1<-length(j0)
AAO<-matrix(nrow=j1,ncol=k0)
  • Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.
for(i in 1:k0) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}
  • Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) for each chip after within array normalization has been performed performed.
  • Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AAO[,i],MD2[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
}
  • To continue generating plots, type the letter c and press enter.

Use the following code to generate box plots of the log fold changes before normalization has occurred, after within array normalization has been performed, and after between array normalization has been performed.

  • "xcoord" sets the coordinates at which the tick mark labels will be placed along the x axis of the plot. The value which is being subtracted may need to be changed depending upon the font size and the number of tick mark labels.
  • "fsize" sets the font size of the tick mark labels. This value may need to be changed depending upon the number of characters in the tick mark label.
  • "ft" determines what columns of the data frame MAD (which contains the data from the GCAT and Ontario chips after performing between array normalization) correspond to the strain determined by "st".
for (i in 1:length(strains)) {
  par(mfrow=c(1,3))
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      xcoord<-xy.coords(lt)$x-1
      fsize<-0.9
  } else {
      xcoord<-xy.coords(lt)$x-1.7
      fsize<-0.8
  }
  boxplot(MID[,lt],main='Before Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  boxplot(MD2[,lt],main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  ft<-Otargets$MasterList[which(Otargets$Strain %in% st)]
  boxplot(MAD[,ft],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  browser()
} 
  • To continue generating plots, type the letter c and press enter.
  • Warnings are OK.