Computational Journal

Week 1

May 20, 2011

Some problems occurred while attempting to match each .gpr file to the individual LogFC columns in the original raw data. A single .gpr file was chosen to run a few tests on and try to match with one of the columns; the file selected in this case was t15/t0 for flask 3 of the Dahlquist wild type strain. Flask 3 was chosen because of its dye swap from Red/Green to Green/Red; however, when computing the LogFC for each spot on the chip, it was discovered that the LogFC had been calculated using the Red intensity divided by the Green intensity. This will have to be confirmed with Dr. Dahlquist, to see whether this should be changed at all or whether it is already accounted for in the Excel sheets obtained on May 19th, 2011 (a journal entry for that day does not exist). In any case, using Red/Green for the fold change produced a column that matched the LogFC originally reported in the .gpr file. The next step was to try to reproduce this in the statistics processing software R. The original code we were given was changed to focus on the median intensities, because it had been set up to process the mean foreground intensity over the median background intensity. Once this was done, R successfully created a list of LogFC values, of which the first five were compared to the original .gpr file. Of these five, two were duplicates, leaving three individual genes to focus on in the file. The first five data points given by R matched up with the first five genes in the .gpr file, so we know that this part was functioning correctly. The main problem arose when the LogFC given by the .gpr file and R was compared with the LogFC found in the raw data: the numbers did not match, and after running a separate test on the .gpr file for t30/t0 for the same culture, the same problem occurred.
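
For reference, a minimal sketch of this kind of check in R with limma (not the exact code we were given); the file name is hypothetical, and it assumes the .gpr LogFC column is the log2 of the Red over Green median intensities:

library(limma)

# Read one GenePix file, using the median foreground and background intensities
RG <- read.maimages("wt_flask3_t15.gpr", source="genepix.median")

# Log fold change as Red over Green for each spot, to match the .gpr LogFC column
logFC <- log2(RG$R[,1] / RG$G[,1])
head(logFC, 5)   # first five spots, compared by eye to the values in the .gpr file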

Week 2

May 25, 2011

Spent the morning trying to match up the log fold changes before and after global normalization from the .gpr files. Ideally the graph should show a linear trend if global normalization had subtracted the same constant from every spot. However, some of the files produced a fuzzy region around the origin; this can be seen in the separate PowerPoint with the graphs. The individual red and green intensities were graphed next: the normalized red intensities were plotted against the non-normalized red intensities, and the same was done for the green intensities. These graphs should all be linear as well, but some awkward points occurred. The fuzzy region in the LogFC graph is thought to be a result of the "error" messages scattered throughout the .gpr files, which occur when the intensity ratio is negative, since the log of a negative number does not exist. The number of "error" messages in each file seemed to correlate with the fuzzy section around the origin: as the number of "error"s went up, the fuzziness increased to the point where the plot was almost no longer linear.
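
A minimal sketch of the diagnostic plot described above, assuming the before- and after-normalization log fold changes for one file have been read into vectors lfc.raw and lfc.norm (hypothetical names):

# Plot the LogFC after global normalization against the LogFC before;
# a clean straight line means the same constant was subtracted from every spot
plot(lfc.raw, lfc.norm,
     xlab = "LogFC before global normalization",
     ylab = "LogFC after global normalization")
abline(lm(lfc.norm ~ lfc.raw), col = "red")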

May 26, 2011

This morning was spent troubleshooting a line of code given by Dr. Fitzpatrick, M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean), meant to average the duplicate spots in the .gpr files after they were locally normalized using loess. We had to come back later and fix this code with a for loop so that R would apply it to all of the .gpr files instead of just the one we had specified. The column index of 5 came from yesterday's work, when that was the first column with gene names; however, that was in the raw .gpr file, where the first four columns were just 1's, whereas the MA data frame that was created did not incorporate those columns. Once this was all figured out, I went to lunch and checked on the S. cerevisiae cultures (see lab notebook for details). When we got back, we spent the remainder of the time making sure the loess normalization worked; after talking with Dr. Fitzpatrick for a bit, we determined this was best done by graphing the normalized vs. non-normalized log fold changes after both had been averaged using the tapply function. We spent a short amount of time troubleshooting this, mostly clerical errors made when typing the source code, after which we were able to graph the relationship. It came out roughly linear, showing that the loess normalization was working correctly. With this we are able to go back and do the loess normalization on the data from the rest of the strains.
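
As a toy illustration (not the actual data) of how tapply averages the duplicate spots by gene name:

# Two genes, each spotted twice; tapply averages the values within each gene
m     <- c(1.0, 1.2, 0.4, 0.6)
genes <- c("YAP1", "YAP1", "CIN5", "CIN5")
tapply(m, as.factor(genes), mean)
# CIN5 YAP1
#  0.5  1.1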

May 27, 2011

We ran into a few problems in our first attempts at running the code, even though it had run successfully the previous day. The main problem we found was in the first for loop: the MM must have the [,i] after it, or else the output sheet will only contain one column instead of all the columns that were put into the loop. After a few more typos were taken care of, this was the outcome:

library(limma)

# Read the targets file and the .gpr files, giving flagged spots zero weight
par(mfrow=c(2,1))
targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
plotMA(RG,main="before within array normalization")

# Within-array loess normalization with normexp background correction
MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")

# Average the duplicate spots for each gene, one array (column) at a time;
# MA$genes[,5] holds the gene names
M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
n0<-length(MA$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:20) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}   # change 20 to match the number of .gpr files in the targets file
write.table(MM,"wt_M_avg.csv",sep=",")
plotMA(MA,main="after within array normalization")

# Between-array (scale) normalization, then average the duplicates again
MAB<-normalizeBetweenArrays(MA,method="scale",targets=NULL)
b0<-length(MAB$M[1,])
N2<-tapply(MAB$M[,1],as.factor(MAB$genes[,5]),mean)
b1<-length(N2)
MMB<-matrix(nrow=b1,ncol=b0)
MMB[,1]=N2
for(i in 1:20) {MMB[,i]<-tapply(MAB$M[,i],as.factor(MAB$genes[,5]),mean)}   # again, match the number of .gpr files
write.table(MMB,"between_array_norm.csv",sep=",")

The point of all this code is to take the .gpr files we were given, read them into the R statistics software, create an MA plot before normalization, and then normalize them using the locally weighted scatterplot smoother (loess); note that the "e" in loess may stand for "estimated", just as the "we" in lowess stands for "weighted". After this was done the data was placed into a matrix, because R has trouble translating an MAList into a data frame, and this matrix was averaged for every duplicate gene using the line of code discussed on May 26th. This was all written into an Excel table and saved along with the accompanying picture of the MA plots. The data was then put through a between-array normalization and placed into a separate matrix, and boxplots were made before and after the between-array normalization; these were all saved. One note on the code: all the for loops must be changed to match the number of .gpr files that are put into the targets text file. Work is still progressing on the GCAT chips. First of all, the flask numbering for the GCAT chips, as well as flask 4, must be double-checked because of some discrepancies, and also because we would rather not have to run all the data twelve times. However, I was able to load both targets into R, normalize them, and merge them into a single MAList. This could also be done for the RG lists beforehand; at the moment I am only testing this code to make sure it will accomplish what I need, by testing segments and checking the attributes of the outcomes. Once the flask discrepancies can be sorted out with Dr. Dahlquist, a majority of the "New Master List" can be assembled and the GCAT chips can be focused on from there.
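
A minimal sketch of the merge step, assuming the two GCAT targets have already been read in and within-array normalized into objects MA.gcat1 and MA.gcat2 (hypothetical names); limma provides a cbind method for MALists:

# Combine the two normalized MALists into one object, side by side
MA.gcat <- cbind(MA.gcat1, MA.gcat2)
dim(MA.gcat$M)   # check that the merged object has the expected number of arrays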

Week 3

June 1, 2011

How to use Microsoft Access to match tables:

  1. Open Microsoft Access.
  2. Make sure all columns are labeled.
  3. Save the desired files as tab-delimited text files.
  4. File -> Get External Data -> Import your text files.
  5. Choose the IDs to be the Field Name.
  6. Check the first box so that the first row is field names.
  7. The ID column can be text and is OK with duplicates; Access will kick the file back out if it doesn't like it.
  8. All other columns should be numbers.
  9. Repeat for the other text file.
  10. Choose your own access key.
  11. Create a new query and add both text files.
  12. Click and drag from one ID field name to the other.
  13. Right-click on the line and choose Join Properties.
  14. Choose option three, to keep all "Ontario" data and add only new "GCAT" data; an arrow should appear showing the relationship.
  15. Select all the field names and click and drag them to Field on the Query; it will automatically assign them.
  16. Query -> Make-Table Query and name your table.
  17. Click Run to get the output, then select all and copy and paste the values into an Excel file, making sure the format is correct.

Spent today working through the GCAT chips. Once this all got finished, Dr. Dahlquist showed us how to use Microsoft Access to combine the GCAT and Ontario chips, as well as how to ignore any extra genes on the GCAT chips. After this I spent the afternoon constructing a PowerPoint with all the MA plots and boxplots, as well as a process description and the code. The validation graphs still need to be made and placed on the spreadsheet; I will try to get this done tonight so it is ready for June 2nd.
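
For reference, the join in steps 13-14 above is a left join that keeps every Ontario gene and adds only the matching GCAT values; an equivalent in R (not the procedure we used), with hypothetical data frames ontario and gcat that share an "ID" column:

# Keep all rows of ontario; add gcat columns where the IDs match, NA otherwise
combined <- merge(ontario, gcat, by="ID", all.x=TRUE)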

June 2, 2011

Spent the day going over our results with Dr. Fitzpatrick to fully understand them and make sure everything was scaled correctly. We discovered that the MA plots we had originally output through R were in fact only the first .gpr file for each strain, so we went back to fix all of those and get all of the graphs on the same page. This was solved by running a for loop for our MA plot maker:

for(i in 1:20) {plotMA(MA[,i])}   # plot each of the 20 arrays, not just the first
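
To put all twenty plots on a single page at once, something like the following could be used (a sketch; the 4-by-5 layout is an assumption, not what we actually ran):

# Arrange the 20 MA plots in a 4 x 5 grid on one page
par(mfrow=c(4,5))
for(i in 1:20) {plotMA(MA[,i], main=paste("array", i))}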

Next we decided to put all the boxplots next to each other. This was a particularly tedious stretch, and it was upsetting to discover that we had to write out the whole call:

boxplot(MA$M[,1],MA$M[,2],MA$M[,3],MA$M[,4],MA$M[,5],MA$M[,6],MA$M[,7],MA$M[,8],MA$M[,9],MA$M[,10],MA$M[,11],MA$M[,12],MA$M[,13],MA$M[,14],MA$M[,15],MA$M[,16],MA$M[,17],MA$M[,18],MA$M[,19],MA$M[,20])

We made a correction to this code, changing the MA$M to MAB$M in my case, to plot the boxplots after between-array normalization. After this we attempted to go through the "multtest" tutorial that Dr. Fitzpatrick had given us, but ran into a few problems installing the packages. These problems persisted even after we thought we had solved them all; however, Dr. Dahlquist should have the means to help us install these packages. The last piece of code that I generated was the boxplots for the data before within-array normalization, so we could see the centering effect. This was probably the most annoying code I have written; thankfully I only made one error, which was easily corrected. The code is:

boxplot(log2((RG$R[,1]-RG$Rb[,1])/(RG$G[,1]-RG$Gb[,1])),log2((RG$R[,2]-RG$Rb[,2])/(RG$G[,2]-RG$Gb[,2])),-log2((RG$R[,3]-RG$Rb[,3])/(RG$G[,3]-RG$Gb[,3])),-log2((RG$R[,4]-RG$Rb[,4])/(RG$G[,4]-RG$Gb[,4])),log2((RG$R[,5]-RG$Rb[,5])/(RG$G[,5]-RG$Gb[,5])),log2((RG$R[,6]-RG$Rb[,6])/(RG$G[,6]-RG$Gb[,6])),-log2((RG$R[,7]-RG$Rb[,7])/(RG$G[,7]-RG$Gb[,7])),-log2((RG$R[,8]-RG$Rb[,8])/(RG$G[,8]-RG$Gb[,8])),log2((RG$R[,9]-RG$Rb[,9])/(RG$G[,9]-RG$Gb[,9])),log2((RG$R[,10]-RG$Rb[,10])/(RG$G[,10]-RG$Gb[,10])),-log2((RG$R[,11]-RG$Rb[,11])/(RG$G[,11]-RG$Gb[,11])),-log2((RG$R[,12]-RG$Rb[,12])/(RG$G[,12]-RG$Gb[,12])),log2((RG$R[,13]-RG$Rb[,13])/(RG$G[,13]-RG$Gb[,13])),log2((RG$R[,14]-RG$Rb[,14])/(RG$G[,14]-RG$Gb[,14])),-log2((RG$R[,15]-RG$Rb[,15])/(RG$G[,15]-RG$Gb[,15])),-log2((RG$R[,16]-RG$Rb[,16])/(RG$G[,16]-RG$Gb[,16])),log2((RG$R[,17]-RG$Rb[,17])/(RG$G[,17]-RG$Gb[,17])),log2((RG$R[,18]-RG$Rb[,18])/(RG$G[,18]-RG$Gb[,18])),-log2((RG$R[,19]-RG$Rb[,19])/(RG$G[,19]-RG$Gb[,19])),-log2((RG$R[,20]-RG$Rb[,20])/(RG$G[,20]-RG$Gb[,20])))
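
The same plot could be built programmatically instead of writing out every term; a minimal sketch (not the code we ran), where the dye-swap sign pattern is an assumption read off the call above:

# +1 for regular arrays, -1 for dye-swapped flasks (pattern inferred from the signs above)
swap <- rep(c(1, 1, -1, -1), 5)
M.raw <- sapply(1:20, function(i) {
  swap[i] * log2((RG$R[,i] - RG$Rb[,i]) / (RG$G[,i] - RG$Gb[,i]))
})
boxplot(as.data.frame(M.raw), main="before within array normalization")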

Week 4

June 8, 2011

Spent the morning on a lecture about statistics, the family-wise error rate (FWER), and the false discovery rate (FDR). After a quick intro to MTP we started working through R to get a better understanding of what MTP needed in order to run. This was probably the most tedious point of the day, because the output in R consisted of thousands of NaN's (not a number), which was frustrating to say the least. Once Dr. Fitzpatrick looked at what was going wrong, we were able to step through the errors and output the p-values that we wanted, both raw and adjusted, for all the data as well as an average. We used this code to do it:

library(multtest)

seed<-99                 # seed value (defined here but not used in the call below)
group<-c(rep(0,3))       # one group of three replicates for the one-sample test
# "matrix" is the CSV of log ratios read back into R (genes in rows, replicates in columns);
# method="?" is a placeholder that still needs to be set for the FDR adjustment
m1<-MTP(X=matrix,Y=group,test="t.onesamp",standardize=FALSE,B=100,method="?")

The method still needs to be changed for the FDR; without specifying it, it defaults to one we don't want. The matrix object in this line of code is just the GPR data saved as a CSV Excel sheet and read into R. The test also needs to be one-sample, otherwise it defaults to a two-sample t-test. See the help section in R or type ?MTP and look at the examples for more help. Tomorrow will be the real test, to see if we can get this to work for the rest of the data points; we also need to compare the raw p-values that are output with the p-values calculated from our new normalized master spreadsheet to see how they correlate.
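
A minimal sketch of how the matrix object above could be built, under the assumption that the saved CSV has gene names in its first column:

# Read the saved CSV of log ratios back in and convert it to a numeric matrix
# (the object is named "matrix" to match the MTP call above)
matrix <- as.matrix(read.csv(file.choose(), row.names=1))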

June 9, 2011

Spent most of the day waiting for R to output the adjusted p-values that we wanted. This took a good deal of time because there are 6000 different genes to calculate p-values for. We adjusted the number of iterations so that we could get a more accurate result: with B=100 the p-values that come back are only shown to two decimal places, while B=1000 gives three decimal places, since the bootstrap p-values come in multiples of 1/B. Once all of these were calculated by R, we placed them into a spreadsheet to compare them to the p-values we calculated by hand. After this was done, I spent some time trying to graph the results and compare the number of rejections from each type of adjustment. I managed to get a graph, but I am still researching whether there is a way to specify our data some more and produce a better graph. Another thing that I need to research is the use of the TPPFP and gFWER adjustments: where they are applied, exactly why and in which situations, and how to implement them in R. Hopefully the materials sent by Dr. Fitzpatrick and the paper I need to read for Journal Club will give me a better understanding of what is going on.

June 10, 2011

Spent the day comparing the different MTP results using different adjustment methods, attempting to see a correlation in the data; however, none has been seen yet. This may be due to the methods, but another and possibly more likely cause is that our calculations are not right. We need to make sure we are comparing the same strains, the same time points, and the same methods; otherwise there is a high possibility that no correlation will be seen between the Excel-calculated data and the R-calculated data. After seeing little to no relationship between the data, we started compiling a spreadsheet consisting of the 15 genes/transcription factors that we are focusing on, along with all their time points for every strain and the averages and standard deviations for every time point. While going through the data, some errors were found: first, we need to go back and refill our GLN3 data, because we started out with only a fraction of the GPR files. We also need to compare the CIN5 values we were using to test our methods today with other CIN5 data; it turned out we had compared them to wild type data instead and taken that as no correlation.

Week 5

June 13, 2011

Spent the morning going over the Matlab model that I worked on this previous semester. After this large review we started making the individual input sheets for Matlab from the yeast regulatory sheet we made on Friday. After a few minutes of this we discovered a mistake in the GCAT data in the Master data sheet, so all the Matlab files had to be erased and re-pasted with the corrected data. While going back and fixing the GCAT data we discovered a few more errors in the number of replicates for ZAP1, as well as extra rows due to the controls still being present in the spreadsheet. Just to make sure nothing was wrong with the rest of the data, we went back and remade the master spreadsheet by combining all of the individual files we got from R. Once all this was done we had to remake the yeast regulatory network spreadsheet, with all of the averages and standard deviations. After this we were able to make the Matlab input sheets for all of the deletion strains. We will start on the Matlab simulations tomorrow morning.

June 14, 2011

Spent the day reviewing the yeast regulatory network and the Matlab simulation that I focused on over the past six months. After making another PowerPoint with all of the expression graphs that Matlab outputs for each gene in the network, we went on to graph the relationship between the lse and norm values at different alpha levels. The Matlab iterations took up most of the workday, so I did not spend a lot of time on the actual code, and the concepts were mostly review for me.

June 15, 2011

This morning consisted of running more Matlab simulations to try to find the switching point on the lse vs. norm graph. The wild type graph produced a smooth curve; however, the dGLN3 graph had an awkward jump at the alpha value of 0.002. We chose this point to focus on for more statistical analysis. The afternoon was spent finding the relative error between each alpha value and 0.002, for the optimized production rates, the optimized weights, and the optimized thresholds. Having to compute the relative error for every alpha value in all three categories caused this step to take the rest of the day.
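
For reference, the relative error here is presumably |p(alpha) - p(0.002)| / |p(0.002)| for each optimized parameter p (production rate, weight, or threshold), computed at every alpha value.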

June 17, 2011

Spent the morning comparing the estimated production rates, network weights, and network thresholds with the separate p-values that we simulated the other day. After comparing the graphs and seeing that an alpha value of 0.002 was a good choice and that the trends in the graphs are similar, we decided to rebuild the regulation networks and run the simulations with them. The afternoon consisted of building the 10 different Excel files, one for each strain, for both the old regulation matrix and the new one with the genes in the strain names added in, to see the effect. The first simulation took the rest of the day, and the rest will be done over the coming week.