Dahlquist:Microarray Data Processing in R

Home Research Protocols Notebook People Publications Courses Contact

The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).

The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 (link to download site) and and version 3.20.1 of the limma package ( direct link to download zipped file) on the Windows 7 platform.
- Note that using other versions of R or the limma package might give different results.
- Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10^-13 or 10^-14 decimal place. The Dahlquist Lab is standardizing on using the 64-bit version of R.
To install R for the first time, download and run the installer from the link above, accepting the default installation.
To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program. If you accept the default location, that will be C:\Program Files\R\R-3.1.0\library.

Input Files for R

The normalization protocol was written to normalize two types of chips:

"GCAT" chips: containing 70-mer oligonucleotides printed on epoxy chips obtained from the Genome Consortium for Active Teaching and produced by Washington University, St. Louis (Fall 2005 chips)
"Ontario" chips: the Yeast 6.4K Array containing full-length PCR products obtained from the UHN Microarray Centre in Toronto, Ontario, Canada.

For each chip that you want to normalize in R, you must import the .gpr file created for that chip after scanning it with GenePix Pro Software. Make sure that all of the .gpr files you will need for normalization are in one folder, along with the targets file(s) described in the section below.

Create a targets file with information about the .gpr files

One targets file must be created for the Ontario chips and a separate targets file must be created for the GCAT chips. To create the targets file for the Ontario chips, create the following spreadsheet in Excel, where each row corresponds to a .gpr file:

The first column (Column A, labeled "FileName") contains the file names of all the .gpr files. Make sure that you include the ".gpr" extension at the end of each file name.
The second column (Column B, labeled "Header") describes the strain, the time point, and flask from which the data represented in the .gpr file was taken during the cold shock experiment. Create this designation for each .gpr file in the form <strain>_LogFC_t-<flask number>. For example, wt_LogFC_t15-2, where wt is the wild type (note that future analysis of the data in MATLAB will be case-sensitive, so be careful when choosing appropriate headers).
The third column (Column C, labeled "Strain") contains the name of the strain to which the .gpr file corresponds (match this to the designation you used in Column B).
The fourth column (Column D, labeled "TimePoint") contains the experimental time point to which the .gpr file corresponds.
The fifth column (Column E, labeled "Flask") contains the flask number to which the .gpr file corresponds, assuming that replicates were performed for each time point for each strain.
The sixth column (Column F, labeled "Dyeswap") indicates the orientation of the dyes on the chip. If the control time point (t0) was labeled with Cy3 dye and the other time points where labeled with Cy5 dye on a particular chip, then put a "1" in this column in the row corresponding to the chip. If the orientation of the dyes was reversed for a chip, then put a "-1" in this column in the row corresponding to that chip.
The seventh column (Column G, labeled "MasterList") orders the chips by strain (wild type strain first and the deletion strains afterwards), then time point, and then flask. In this column, put a "1" in the row corresponding to the chip from the wild type, the first time point, and the first flask of that time point. Then, put a "2" in the row corresponding to the chip from the wild type strain, the first time point, and the second flask of that time point. Continue to number the chips in this order until all the chips have been numbered.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the Ontario chips.

The original file used for this analysis is Ontario_Targets.csv, which contains data for the wild type and deletion strains for CIN5, GLN3, HMO1, and ZAP1.
An updated file is Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv which contains the above data, plus data from the SWI4 deletion strain and S. paradoxus.

The targets file for the GCAT chips should contain the aforementioned columns in the order in which they were mentioned. However, there should be a column that precedes this aforementioned set of columns (Column A,labeled "Location"), which designates which half of the GCAT chip was scanned to create the .gpr file. The rest of the columns will be shifted to the right. If the top half was scanned for a particular chip, then put "Top" in this column in the row corresponding to that chip. If the bottom half was scanned for a particular chip, then put "Bottom" in this column in the row corresponding to that chip. Make sure that the sequence of GCAT chips in the MasterList of .gpr files corresponding to the top of the chips is the same as the sequence of GCAT chips in the list of the .gpr files corresponding to the bottom of the chips. This is necessary because the .gal file for the GCAT chips only corresponded to one-half of the chip; the entire genome was printed twice in two large blocks on the top and bottom half of the chip.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the GCAT chips. The file used for this analysis is GCAT_Targets.csv.

Data Normalization

Microarray data for S. cerevisiae wild type and mutant strains was initially collected using GCAT chips before switching to Ontario chips. The within array normalization procedure for data from the GCAT chips has some differences in comparison to the procedure for data from the Ontario chips. In contrast, microarray data for wild type S. paradoxus has been collected using Ontario chips only.

To normalize data from both the Ontario and GCAT chips, begin with the section "Within Array Normalization for the Ontario Chips". Begin with the same section to normalize data collected with Ontario chips only.

Within Array Normalization for the Ontario Chips

Launch R 3.1.0.
Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips.
Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method. Hit enter after every line of code.
- Alternately, you can run a script containing all of these commands.
- Download the zipped file Ontario Chip Within-Array Normalization corrected 20140721.zip and unzip it to the folder containing the Ontario_Targets.csv file (or Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv) and the GPR files corresponding to the Ontario chips.
- In R, select the menu item File > Source R code..., and select the Ontario Chip Within-Array Normalization corrected 20140721.R script.
- You will be prompted by an Open dialog for the Ontario targets file.

Line-by-line instructions follow:

Load the limma library.

library(limma)

When prompted after entering the line below, select the targets file with information about the GPR files corresponding to the Ontario chips. The file used for this analysis is Ontario_Targets.csv (or Ontario_Targets_wt-dCIN5-dGLN3-dHMO1-dSWI4-dZAP1-Spar_20150317.csv).

Otargets<-read.csv(file.choose())

Read in the .gpr files by designating the column within the targets file (imported as Otargets) that lists all the .gpr file names.

f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(Otargets[,1], source="genepix.median", wt.fun=f)

Extract the column of values from the targets file that indicates for which chips the orientation of the dyes was swapped.

ds<-Otargets$Dyeswap

Perform Loess normalization. This step may take some time depending upon the number of chips that need to be normalized.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

Create a list of the names of the genes on the Ontario chips.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes$Name),mean)
ontID<-rownames(M1)

Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t-<flask number>.

headers<-Otargets$Header

Create a matrix MO, which will store the normalized data after replicate spots on the chips have been averaged. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

n0<-length(MAO$M[1,])
n1<-length(M1)
MO<-matrix(nrow=n1,ncol=n0)

Average all duplicate spots on the chip.

for(i in 1:n0) {MO[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes$Name),mean)}

Create a new matrix MD to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

MD<-matrix(nrow=n1,ncol=n0)

Multiply each row of the data matrix MO by the vector ds, which contains the values from the targets file that indicate for which chips the dye orientation was swapped. Store the output of the calculation in the matrix MD.

for(i in 1:n0) {MD[,i]<-ds[i]*MO[,i]}

Convert matrix MD to a data frame.

MD2<-as.data.frame.matrix(MD)

Set the row names of the data frame to the names of the genes on the Ontario chips.

colnames(MD2)<-headers

Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.

rownames(MD2)<-ontID

Remove the control spots on the chip (i.e. Arabidopsis and 3XSSC).

MD3<-subset(MD2,row.names(MD2)!="Arabidopsis")
MD4<-subset(MD3,row.names(MD3)!="3XSSC")

If you are also normalizing data from GCAT chips, proceed to the next section "Within Array Normalization for the GCAT Chips". If you are normalizing only data from Ontario chips, then proceed to the section "Between Array Normalization of Only Ontario Chip Data".

Within Array Normalization for the GCAT Chips

These instructions assume that you have just completed Within Array Normalization for the Ontario Chips.
If necessary, change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips.
Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method. Hit enter after every line of code.
- Alternately, you can run a script containing all of these commands (and the commands from Between Array Normalization of Merged GCAT and Ontario Chip Data) below.
- Download the zipped file Within-Array Normalization GCAT and Merged Ontario-GCAT Between-Chip Normalization corrected 20140721.zip and unzip it to the folder containing the GCAT_Targets.csv file and the GPR files corresponding to the GCAT chips.
- In R, select the menu item File > Source R code..., and select the Within-Array Normalization GCAT and Merged Ontario-GCAT Between-Chip Normalization corrected 20140721.R script.
- You will be prompted by an Open dialog for the GCAT targets file.

Line-by-line instructions follow:
Read the targets file for the GCAT chips. The file used for this analysis is Ontario_Targets.csv.

Gtargets<-read.csv(file.choose())

Create a subset of the targets file that contains only the GPR files created after scanning the top of the GCAT chips. Read those GPR files into R.

Gtop<-subset(Gtargets,Gtargets$Location %in% 'Top')
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(Gtop, source="genepix.median", wt.fun=f)

Create a subset of the targets file that contains only the GPR files created after scanning the bottom of the GCAT chips. Read those GPR files into R.

Gbottom<-subset(Gtargets,Gtargets$Location %in% 'Bottom')
RB<-read.maimages(Gbottom, source="genepix.median", wt.fun=f)

Bind the data from the top and bottom of the GCAT chip for each chip.

RGG<-rbind(RT,RB)

Extract the column of values from the subset of targets file corresponding to the top of the GCAT chips that indicates which chips were dye-swapped.

dw<-Gtop$Dyeswap

Perform Loess normalization.

MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")

Create a list of the names of the genes on the GCAT chip.

M0<-tapply(MAG$M[,1],as.factor(MAG$genes$ID),mean)
gcatID<-rownames(M0)

Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t-<flask number>.

chips<-Gtop$Header

Create a matrix MG, which will store the normalized data after replicate spots on the chips have been averaged. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

m0<-length(MAG$M[1,])
m1<-length(M0)
MR<-matrix(nrow=m1,ncol=m0)

Average the duplicates spots on the GCAT chips.

for(i in 1:length(chips)) {MR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes$ID),mean)}

Create a new matrix MG to store the data. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

MG<-matrix(nrow=m1,ncol=m0)

Set up a for loop to multiply each row of the data matrix MR by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MG.

for(i in 1:m0) {MG[,i]<-dw[i]*MR[,i]}

Convert matrix MG to a data frame.

MG2<-as.data.frame.matrix(MG)

Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.

colnames(MG2)<-chips

Set the row names of the data frame to the names of the genes on the GCAT chips.

rownames(MG2)<-gcatID

NOTE: In normalizing dSWI4 GCAT chip data collected by two classes of Biology 478 on 4/29/2014, I noticed that the gene names in the GPR files included an "_01" at the end of the proper name, which is unlike the GCAT chip data for other deletion strains that had been previously analyzed. To following line of code was used to remove the "_01":

gcatIDtruncated<-strsplit(gcatID,"_01",fixed=TRUE)

The row names of the data frame MG2 were then set using the following line of code:

 rownames(MG2)<-gcatIDtruncated

The normalization was finished following the remaining part of the normalization protocol as listed below.

Merge the GCAT and Ontario data into a single data frame. Any gene names that only appear on the GCAT chip will have an NA in each column of the row corresponding to that gene in the new data frame.

GO<-merge(MD4,MG2,by="row.names",all=T)

Set the row names of the new data frame to the names of the genes on both the Ontario and GCAT chips.

rownames(GO)<-GO$Row.names

Get rid of the genes on the GCAT chips that are not on the Ontario chips.

GO2<-subset(GO,rownames(GO) %in% ontID)

Create a list to order the merged GCAT and Ontario chip data by strain, then time point, and then flask.

MasterList<-rbind(Otargets[,c('Header','MasterList')],Gtop[,c('Header','MasterList')])
OrderedML<-MasterList[sort.list(MasterList$MasterList),]

Rearrange the columns so that they are ordered by strain (wild type then deletion strains in alphabetical order), then time point, and then flask.

GO3<-GO2[,match(OrderedML$Header,colnames(GO2))]

Write the normalized Ontario and GCAT data to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.

write.table(GO3,"GCAT_and_Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)

At this point, the normalization within each GCAT and Ontario array is complete. To merge the normalized data from both chips and to perform between array normalization, proceed to the section "Between Array Normalization of Merged GCAT and Ontario Chip Data".

Between Array Normalization of Merged GCAT and Ontario Chip Data

Create a new matrix to store the merged data. It should have as many rows as there are genes common to both the GCAT and Ontario chips and as many columns as there are GPR files that have been normalized.

g0<-length(GO3[1,])
g1<-length(GO3[,1])
MADM<-matrix(nrow=g1,ncol=g0)

Scale each chip by its median absolute deviation.

for (i in 1:g0) {MADM[,i]<-GO3[,i]/mad(GO3[,i],na.rm=TRUE)}

Convert matrix MADM to a data frame.

MAD<-as.data.frame.matrix(MADM)

Set the row names of the data frame to names of the genes common to the Ontario and GCAT chips.

rownames(MAD)<-rownames(GO3)

Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t-<flask number> ordered by strain, then time point, and then flask number.

colnames(MAD)<-colnames(GO3)

Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.

write.table(MAD,"GCAT_and_Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)

Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.

Between Array Normalization of Only Ontario Chip Data

NOTE: this only applies if you are normalizing Ontario Chip Data Only. If you are combining GCAT and Ontario data, do not do this section.

Save the Ontario data after performing within array normalization to a CSV file.

write.table(MD4,"Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)

Create a list to order the Ontario chip data by strain, then time point, and then flask.

MasterList<-Otargets[,c('Header','MasterList')]
OrderedML<-MasterList[sort.list(MasterList$MasterList),]

Reorder the columns of the data frame MD4 so that the data is ordered by the time points.

MD5<-MD4[,match(OrderedML$Header,colnames(MD4))]

Create a matrix MADM, which will store the data after performing between array normalization. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

t0<-length(MD5[1,])
t1<-length(MD5[,1])
MADM<-matrix(nrow=t1,ncol=t0)

Scale each chip by its median absolute deviation.

for (i in 1:t0) {MADM[,i]<-MD5[,i]/mad(MD5[,i],na.rm=TRUE)}

Convert matrix MADM to a data frame.

MAD<-as.data.frame.matrix(MADM)

Set the row names of the data frame to names of the unique genes on the Ontario chip.

rownames(MAD)<-rownames(MD5)

Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t-<flask number> ordered by strain, then time point, and then flask number.

colnames(MAD)<-colnames(MD5)

Write the final data set to a CSV file. Make sure to indicate which chips and/or which strains the data was collected from.

write.table(MAD,"Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)

Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.

Visualize the Normalized Data

There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots (M for log fold change and A for intensity) display the ratio of the quantity of the red dye to that of the green dye (log2(red/green)), which will be referred to as the log fold change, for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.

Create MA Plots and Box Plots for the GCAT Chips

First, you will create MA plots for the data before the normalization has occurred. Input the following code in the same window in R that was used to normalize the data.

Create a variable to store the names of the genes on the GCAT chip before the controls have been removed and before the replicates have been averaged.

GCAT.GeneList<-RGG$genes$ID

Calculate the log fold change of each gene before normalization has been performed.

lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))

If you get a message saying "NaNs produced" this is OK, proceed to the next step.
Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.

r0<-length(lg[1,])
rx<-tapply(lg[,1],as.factor(GCAT.GeneList),mean)
r1<-length(rx)
MM<-matrix(nrow=r1,ncol=r0)

Calculate the log fold changes after averaging duplicate genes.

for(i in 1:r0) {MM[,i]<-tapply(lg[,i],as.factor(GCAT.GeneList),mean)}

Create a new matrix MC to store the data. It should have as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

MC<-matrix(nrow=r1,ncol=r0)

Set up a for loop to multiply each row of the matrix MM by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MC.

for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}

Convert the matrix MC to a data frame and set the column names of the data frame to the names of the GCAT chips in the form <strain>_LogFC_t-<flask number>.

MCD<-as.data.frame(MC)
colnames(MCD)<-chips
rownames(MCD)<-gcatID

Calculate the intensity values before normalization has occurred.

la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))

If you get these Warning messages, it's OK:

1: In (RGG$R - RGG$Rb) * (RGG$G - RGG$Gb) :

NAs produced by integer overflow

2: NaNs produced

Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.

r2<-length(la[1,])
ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean)
r3<-length(ri)
AG<-matrix(nrow=r3,ncol=r2)

Calculate the intensity values after averaging duplicate genes.

for(i in 1:r2) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}

Set how many rows and columns the graphs will be partitioned in to. In the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.

par(mfrow=c(3,3))

Plot the log fold changes (labeled "M" on the y axis) against the intensity values (labeled "A" on the x axis) before each chip has been normalized. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.

for(i in 1:r2) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter

Next, create MA plots of the data after within array normalization.

The log fold changes after within array normalization were calculated earlier and stored in the data frame MG2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.

Create a blank matrix with as many columns as there are GPR files and as many rows as there are unique genes.

x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean)
y0<-length(MAG$A[1,])
x1<-length(x0)
AAG<-matrix(nrow=x1,ncol=y0)

Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.

for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}

Set how many rows and columns the graphs will be partitioned in to.

par(mfrow=c(3,3))

Plot the log fold changes (labeled "M" on the y axis) versus the intensity values (labeled "A" on the x axis) for each chip after within array normalization has been performed performed. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.

for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
browser()

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter.

Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after between array normalization has occurred.

Set how many rows and columns the graphs will be partitioned in to.

par(mfrow=c(1,3))

Create a box plot of the log fold changes before normalization has occurred.

boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Specify where the tick marks will appear on the x axis of the plot.

axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Create a box plot of the log fold changes after within array normalization has been performed.

boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Place the tick mark labels along the x axis of the plot.

axis(1,at=xy.coords(chips)$x,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Create a box plot of the log fold changes after between array normalization has been performed.

boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Place the tick mark labels along the x axis of the plot.

axis(1, at=xy.coords(chips)$x,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Create MA Plots and Box Plots for the Ontario Chips

Create a variable to store the names of the genes on the Ontario chip before the controls have been removed and before the replicates have been averaged.

Ontario.GeneList<-RGO$genes$Name

Calculate the log fold change of each gene before normalization has been performed.

lr<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))

Warning message: "NaNs produced" is OK.

Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.

z0<-length(lr[1,])
v0<-tapply(lr[,1],as.factor(Ontario.GeneList),mean)
z1<-length(v0)
MT<-matrix(nrow=z1,ncol=z0)

Calculate the log fold changes after averaging duplicate genes.

for(i in 1:z0) {MT[,i]<-tapply(lr[,i],as.factor(Ontario.GeneList),mean)}

Create a new matrix MI to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

MI<-matrix(nrow=z1,ncol=z0)

Set up a for loop to multiply each row of the matrix MT by the vector of values ds indicating the dye-swapped chips. Store the output to the matrix MI.

for(i in 1:z0) {MI[,i]<-ds[i]*MT[,i]}

Convert the matrix MI to a data frame and set the column names of the data frame to the names of the Ontario chips in the form <strain>_LogFC_t-<flask number>.

MID<-as.data.frame(MI)
colnames(MID)<-headers
rownames(MID)<-ontID

Calculate the intensity values before normalization has occurred.

ln<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))

Warning messages are OK:

1: In (RGO$R - RGO$Rb) * (RGO$G - RGO$Gb) :

NAs produced by integer overflow

2: NaNs produced

Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.

z2<-length(ln[1,])
zi<-tapply(ln[,1],as.factor(Ontario.GeneList),mean)
z3<-length(zi)
AO<-matrix(nrow=z3,ncol=z2)

Calculate the intensity values after averaging duplicate genes.

for(i in 1:z0) {AO[,i]<-tapply(ln[,i],as.factor(Ontario.GeneList),mean)}

Create a list of all of the strains for which there are Ontario chips.

strains<-c('wt','dCIN5','dGLN3','dHMO1','dZAP1')

Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) before each chip has been normalized. "st" designates for which strain to create MA plots. "lt" determines which columns of the data correspond to the selected strain.
Set how many rows and columns the graphs will be partitioned in to with the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
After entering the call browser(), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.

for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AO[,i],MI[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
}

To continue generating plots, type the letter c and press enter.

Next, create MA plots of the data after within array normalization.

The log fold changes after within array normalization were calculated earlier and stored in the data frame MD2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.

Create a blank matrix with as many columns as there are GPR files for the Ontario chips and as many rows as there are unique genes on the Ontario chip.

j0<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean)
k0<-length(MAO$A[1,])
j1<-length(j0)
AAO<-matrix(nrow=j1,ncol=k0)

Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.

for(i in 1:k0) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}

Plot the log fold changes (labeled "M" on the y axis) against the intensities (labeled "A" on the x axis) for each chip after within array normalization has been performed performed.
Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.

for (i in 1:length(strains)) {
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      par(mfrow=c(3,5))
  } else {
      par(mfrow=c(4,5))
  }
  for (i in lt) {
    plot(AAO[,i],MD2[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))
  }
  browser()
}

To continue generating plots, type the letter c and press enter.

Use the following code to generate box plots of the log fold changes before normalization has occurred, after within array normalization has been performed, and after between array normalization has been performed.

"xcoord" sets the coordinates at which the tick mark labels will be placed along the x axis of the plot. The value which is being subtracted may need to be changed depending upon the font size and the number of tick mark labels.
"fsize" sets the font size of the tick mark labels. This value may need to be changed depending upon the number of characters in the tick mark label.
"ft" determines what columns of the data frame MAD (which contains the data from the GCAT and Ontario chips after performing between array normalization) correspond to the strain determined by "st".

for (i in 1:length(strains)) {
  par(mfrow=c(1,3))
  st<-strains[i]
  lt<-which(Otargets$Strain %in% st)
  if (st=='wt') {
      xcoord<-xy.coords(lt)$x-1
      fsize<-0.9
  } else {
      xcoord<-xy.coords(lt)$x-1.7
      fsize<-0.8
  }
  boxplot(MID[,lt],main='Before Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  boxplot(MD2[,lt],main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  ft<-Otargets$MasterList[which(Otargets$Strain %in% st)]
  boxplot(MAD[,ft],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
  axis(1,at=xy.coords(lt)$x,labels=FALSE)
  text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE)
  browser()
}

To continue generating plots, type the letter c and press enter.
Warnings are OK.

Dahlquist:Microarray Data Processing in R

Contents

Input Files for R

Create a targets file with information about the .gpr files

Data Normalization

Within Array Normalization for the Ontario Chips

Within Array Normalization for the GCAT Chips

Between Array Normalization of Merged GCAT and Ontario Chip Data

Between Array Normalization of Only Ontario Chip Data

Visualize the Normalized Data

Create MA Plots and Box Plots for the GCAT Chips

Create MA Plots and Box Plots for the Ontario Chips

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools