# Evan Montz Week 8

## Electronic Notebook

• I first downloaded the file, opened it, and re-saved it to make it unique.
• I inserted the "scaled_centered" worksheet into the file.
• Then I copied all the raw data into the new worksheet and added the rows for Average and Standard Deviation.
• Using the "=Average()" and "STDEV()" functions, I was able to calculate the values for the log ratios for each chip.
• It was noticed that there were some cells within the raw data that had no data in them. This yielded a minor error saying that some cells had no data. This error was ignored.
• Next I added in the "scaled_centered" columns for each run of each patient.
• Now using the "=(B4-\$B\$2)/\$B\$3" equation along with the drag down feature, this value was calculated for each cell for every run.
• NOTE: It is vital to use the "\$" symbol while writing this equation because without it, when we used the drag down feature, excel would continue to reference different cells and that was not what we desired.
• Next, I created the new worksheet titled "statistics"
• In this worksheet, I copied the first column along with all the columns that were scaled and centered.
• Then I created the three average columns for each patient and using the =AVERAGE command along with the drag down feature again, I computed the values for each cell for all three patients.
• Next, I created the column that was to display the average of all three averages. The averages of these values were taken using the same method as the previous step, except I took the average of the three averages rather than the scaled and centered data.
• Next, I created the Tstat column and computed the results using the function: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates)) where the number of replicates was 3 for our case. The drag down feature was again used to obtain the values for all cells.
• The last column that was created was the Pvalue column. The values were calculated using the function: =TDIST(ABS(R2),degrees of freedom,2) where the degrees of freedom was 2 for our case. The values were again dragged down to get the rest of the numbers.
• Finally, I created the "forGenMAPP" worksheet.
• Here, the data was manipulated to get the correct number of decimal places for the different values. The statistical columns were also cut and shifted to the beginning of the spreadsheet. I also added the "system code" column after the ID column and added an "N" to each cell using the drag down method.
• Lastly, the file was re-saved as an .xls file and then saved as a .txt file.

## Sanity Check: Number of genes significantly changed

• I clicked on the A1 cell and applied the Autofilter.
• The autofilter was applied to the pvalue column according to the following criterion:
```p<0.05 = 948 genes fit into this criteria
p<0.01 = 235 genes fit into this criteria
p<0.001 = 24 genes fit into this criteria
p<0.0001 = 2 genes fit into this criteria
```
• The pvalue criteria was returned to p<0.05
• The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change greater than zero.
```p<0.05 and "Avg_LogFC_all">0 = 352 genes fit into this criteria
```
• The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change less than zero.
```p<0.05 and "Avg_LogFC_all"<0 = 596 genes fit into this criteria
```
• The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change greater than 0.25 or less than -0.25 (This is more realistic).
```p<0.05 and "Avg_LogFC_all">0.25 = 339 genes fit into this criteria
p<0.05 and "Avg_LogFC_all"<-0.25 = 579 genes fit into this criteria
```
• From what I can tell, our experiment was very similar to that of Merrell et al. (2002). The main difference I see is that they used Statistical Analysis for Microarrays (SAM) program and we used Excel. Assuming that both programs do the same thing, both experiments were very similar.

## Sanity Check: Compare individual genes with known data

• VC0028 (in the table there were two genes with this title)
• fold change = 1.65 and 1.27 respectively
• p value = 0.0474 and 0.0629 respectively
• significantly changed? NO p<0.05 and YES p>0.05 respectively
• VC0941 (in the table there were two genes with this title)
• fold change = 0.09 and -.28 respectively
• p value = 0.6759 and 0.1636 respectively
• significantly changed? YES p>0.05 and YES p>0.05 respectively
• VC0869 (in the table there were five genes with this title)
• fold change = 1.59, 1.95, 2.20, 1.50 and 2.12 respectively
• p value = 0.0463, 0.0227, 0.0020, 0.0174, and 0.0200 respectively
• significantly changed? NO p<0.05, NO p<0.05, NO p<0.05, NO p<0.05 and NO p<0.05 respectively
• VC0051 (in the table there were two genes with this title)
• fold change = 1.92 and 1.89 respectively
• p value = 0.0139 and 0.0160 respectively
• significantly changed? NO p<0.05 and NO p<0.05 respectively
• VC0647 (in the table there were three genes with this title)
• fold change = -1.11, -0.94 and -1.05 respectively
• p value = 0.0003, 0.0125, 0.0051 respectively
• significantly changed? NO p<0.05, NO p<0.05 and NO p<0.05 respectively
• VC0468
• fold change = -0.17
• p value = 0.3350
• significantly changed? YES p>0.05
• VC2350
• fold change = -2.40
• p value = 0.0130
• significantly changed? NO p<0.05
• VCA0583
• fold change = 1.06
• p value = 0.1011
• significantly changed? YES p>0.05

Evan Montz 22:55, 24 October 2010 (EDT)