Tessa A. Morris Week 11

From OpenWetWare
Revision as of 10:25, 26 March 2015 by Tessa A. Morris (talk | contribs) (correct last two)

Electronic Lab Notebook

Date:

3/19/2015

Assignment:

Here

Partner:

Alyssa N Gomes

Purpose:

Analyze microarray data comparing the wild type with a mutant strain (Δgln3 in this experiment).

Methods:

Statistical Analysis Part 1: ANOVA

  1. Download microarray data from Lionshare and save to desktop, changing the name of the file to include initials (TM)
  2. Record number of replicates (change the color of the fill for each different time point to make it easier to see)
  3. Create a new worksheet, label it "stats"
  4. Copy the first two columns from sheet 1, "data"
  5. In Row 1:
    • Columns C-G label: GLN3_xbar_(TIME)
    • Columns H and I label: GLN3_xbar_grand and GLN3_ss_HO.
    • Columns J-N label: GLN3_ss_(TIME)
    • Columns O, P, and Q label: GLN3_SS_full, Fstat and p-value.
  6. For C2 type =AVERAGE( then, in the "data" sheet, highlight the data in Row 2 associated with GLN3 at t15 and close the parenthesis
    • Copy this formula down the column in the "stats" sheet by double-clicking the black plus sign in the bottom right-hand corner of C2
  7. Repeat for the remaining time points (columns D-G)
  8. Record total number of data points
  9. For H2, labeled GLN3_xbar_grand, take the average of C2:G2 and copy this formula down the column (see Step 6)
  10. For I2 type =SUMSQ( then, in the "data" sheet, highlight all of the data in Row 2 associated with GLN3, close the parenthesis, and copy this formula down the column (see Step 6)
  11. Make sure the selection covers all of the time points (C2:V2 in this data set), not just t15
  12. For J2 type =SUMSQ(data!C2:F2)-4*stats!C2^2 and copy this formula down the column (see Step 6)
    • "data!C2:F2" is the data associated with t15 // the number "4" is the number of replicates at that time point // "stats!C2" gets the t15 average from Step 6 // "^2" squares the value
    • Repeat for columns K-N
    • To save time, take note of which columns hold the data for each time point, so the formula from J2 can be copied and then adjusted slightly
  13. For O2 type =SUM(J2:N2) and copy down the column (see Step 6)
  14. For P2 type =((n-5)/5)*(i2-o2)/o2, where n is the total number of data points
  15. For Q2 type =FDIST(P2,5,n-5), where n is the total number of data points
  16. To adjust the p-value to correct for the multiple testing problem:
    • Label R1 "GLN3_Bonferroni_p-value"
    • In R2 type =Q2*6189 and copy down the column (see Step 6)
  17. To see how many of the p-values are less than 0.05:
    • Sort & Filter >> Filter >> drop-down arrow on Q1 (p-value) >> Number Filters >> Less Than >> 0.05 >> OK
    • The number of values less than 0.05 will appear in the bottom left-hand corner of the screen
    • Record this value
  18. To replace corrected p-values that are greater than 1 with the number 1:
    • In S2 type =IF(R2>1,1,R2) and copy down the column
  19. Save the data set (upload to Lionshare and share with the professors, Dr. Dahlquist and Dr. Fitzpatrick, and lab partner, Alyssa N Gomes)
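
For reference, the per-gene calculation in Steps 6-18 can be sketched outside of Excel. The following is only an illustrative Python sketch, not the course's method: the function names and the example values are made up here, and it assumes that GLN3_ss_HO is the sum of squares of all replicates for the gene (the null hypothesis being a mean log fold change of zero at every time point).

```python
from statistics import mean

def f_statistic(groups):
    """F statistic for one gene, following the spreadsheet columns.

    `groups` holds one list of replicate log fold changes per time point
    (t15, t30, t60, t90, t120 in this data set).
    """
    values = [v for g in groups for v in g]
    n = len(values)                       # total number of data points
    k = len(groups)                       # number of time points
    ss_h0 = sum(v * v for v in values)    # column I: SUMSQ of all data
    # Columns J-N: SUMSQ(time point) - count * mean^2; column O is their sum
    ss_full = sum(sum(v * v for v in g) - len(g) * mean(g) ** 2 for g in groups)
    return ((n - k) / k) * (ss_h0 - ss_full) / ss_full   # column P

def bonferroni(p, n_tests):
    """Columns R and S: multiply by the number of tests, then cap at 1."""
    return min(p * n_tests, 1.0)

# Made-up replicate values for a single gene (not taken from the data set)
example = [[0.2, 0.4, 0.3, 0.1],
           [0.9, 1.1, 1.0, 0.8],
           [1.5, 1.4, 1.6, 1.3],
           [0.7, 0.6, 0.9, 0.8],
           [0.1, 0.0, 0.2, -0.1]]
print(f_statistic(example))
print(bonferroni(0.00001, 6189))
```

The p-value in Step 15 is the upper tail of the F distribution with 5 and n-5 degrees of freedom; in Python this would typically come from a statistics library, for example scipy.stats.f.sf(F, 5, n - 5), which is left out here to keep the sketch dependency-free.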

Data & Observations:

  • Alyssa and I were assigned to analyze wild type vs. Δgln3
  • While Alyssa analyzes the wild type, I am going to analyze Δgln3
  • Note about Excel: ID: gene ID; standard name: gene symbol (more user friendly); each column represents one microarray
  • Time 15: 4 replicates // Time 30: 4 replicates // Time 60: 4 replicates // Time 90: 4 replicates // Time 120: 4 replicates
  • Total Number of data points: 20
  • 15: C-F // 30: G-J // 60: K-N // 90: O-R // 120: S-V
  • 1864 out of 6189 genes have a p-value of less than 0.05
  • Data was shared with Dr. Dahlquist, Dr. Fitzpatrick, and Alyssa N Gomes through Lionshare

Date:

3/26/2015

Methods:

Calculate the Benjamini & Hochberg p value Correction

  1. Insert a new worksheet named "B&H".
  2. Create an index column by typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189. We will use this to put the genes back in order at the end of these calculations.
  3. Copy and paste the column of IDs from one of the previous worksheets into column B.
  4. For the following, use Paste special > Paste values. Copy Column Q (the unadjusted p values) from the stats worksheet and paste it into Column C.
  5. Select all of columns A, B, and C. Click the sort button (A to Z) on the toolbar and, in the window that appears, sort by Column C, smallest to largest.
  6. Type the header "Rank" in cell D1. Repeat what you did in step 2 to create a series of numbers in ascending order from 1 to 6189. This is the p value rank, smallest to largest.
  7. Now you can calculate the Benjamini and Hochberg p value correction. Type "dGLN3_B-H_p-value" in cell E1. Type the following formula in cell E2: =(C2*6189)/D2 and press enter. Copy that equation to the entire column using the trick you learned last week.
  8. Type "dGLN3_B-H_p-value" into cell F1.
  9. Type the following formula into cell F2: =IF(E2>1,1,E2) and press enter. Copy that equation to the entire column using the trick you learned last week.
  10. Select columns A through F. Now sort them by your Index in Column A in ascending order.
  11. Copy column F and use Paste special > Paste values to paste it into column T of your stats sheet.
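
The sort, rank, and re-sort steps above can be sketched in a few lines of Python. This is only an illustration of the spreadsheet procedure (the function name is hypothetical): each p value is multiplied by the number of tests, divided by its rank, and capped at 1, and the results are returned in the original gene order.

```python
def bh_spreadsheet(p_values):
    """Benjamini & Hochberg correction as computed in the "B&H" worksheet."""
    m = len(p_values)                                         # 6189 genes here
    # Steps 2-5: keep the original index, then sort by p value (ascending)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    # Steps 6-9: rank 1 is the smallest p value
    for rank, i in enumerate(order, start=1):
        raw = p_values[i] * m / rank                          # column E
        adjusted[i] = min(raw, 1.0)                           # column F
    # Step 10: writing by original index restores the gene order
    return adjusted

# Made-up p values for four genes (not from the data set)
print(bh_spreadsheet([0.01, 0.04, 0.03, 0.5]))
```

Note that this mirrors the worksheet formula exactly. Some statistics packages additionally force the adjusted values to be non-decreasing with rank, so their output can differ slightly from column F for individual genes.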

Sanity Check: Number of genes significantly changed

Before we move on to clustering and the biological analysis of the data, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs.

  • Go to the "stats" worksheet.
  • Select row 1 and choose the menu item Data > Filter > Autofilter (the funnel icon on the Data tab). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  • Click on the drop-down arrow on Column Q. Select "Custom". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
    • How many genes have p < 0.05, and what percentage is that (out of 6189)?
    • How many genes have p < 0.01, and what percentage is that (out of 6189)?
    • How many genes have p < 0.001, and what percentage is that (out of 6189)?
    • How many genes have p < 0.0001, and what percentage is that (out of 6189)?
  • When we use a p value cut-off of p < 0.05, what we are saying is that you would have seen a gene expression change that deviates this far from zero by chance less than 5% of the time.
  • We have just performed 6189 hypothesis tests. Another way to state what we are seeing with p < 0.05 is that we would expect to see a gene expression change for at least one of the timepoints by chance in about 5% of our tests, or about 309 times (0.05 × 6189). Since we have more than 309 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones. To apply a more stringent criterion to our p values, we performed the Bonferroni and Benjamini and Hochberg corrections to these unadjusted p values. The Bonferroni correction is very stringent. The Benjamini-Hochberg correction is less stringent. To see this relationship, filter your data to determine the following:
    • How many genes have p < 0.05 for the Bonferroni-corrected p value, and what percentage is that (out of 6189)?
    • How many genes have p < 0.05 for the Benjamini and Hochberg-corrected p value, and what percentage is that (out of 6189)?
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
  • Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C) is known to be induced by cold shock. Find NSR1 in your dataset. What are its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average log fold change at each of the timepoints in the experiment?
  • You and your partner should compare the numbers you got between the wild type strain and the other strain you have been assigned. You will be reporting this information in both your final paper and final presentation in the course, organized as a table. Use this sample PowerPoint slide to record your data. Create a title for the slide that gives the "message" of the slide. Upload the slide to your individual journal page for this week (you and your partner should have an identical slide with the same filename). This is the first slide of your final presentation in the course.
  • Upload your updated spreadsheet to LionShare (using the same name as before; check the box to "overwrite file"). The e-mail link you provided to us earlier will allow us to download your updated spreadsheet.
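
The filtering and counting in the sanity check above can also be sketched in Python. This is an assumption-laden illustration rather than part of the assignment: the function names are made up, and the p values in the example are placeholders, not the real data.

```python
def count_below(p_values, cutoff):
    """Return (number of genes with p < cutoff, percentage of all genes)."""
    hits = sum(1 for p in p_values if p < cutoff)
    return hits, 100.0 * hits / len(p_values)

def expected_by_chance(n_tests, cutoff):
    """Number of tests expected to pass the cut-off by chance alone."""
    return n_tests * cutoff

# Placeholder p values for four genes (not from the data set)
placeholder = [0.001, 0.02, 0.2, 0.6]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    print(cutoff, count_below(placeholder, cutoff))

# With 6189 tests and a 0.05 cut-off, this is the "about 309" quoted above
print(expected_by_chance(6189, 0.05))
```

The same count_below call works for the unadjusted, Bonferroni-corrected, and B-H-corrected columns, which makes it easy to compare how stringent each correction is.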


Data & Observations:

Sanity check of Δgln3:

  • p less than 0.05: 1864 of 6189 (30.12%)
  • p less than 0.01: 1008 of 6189 (16.29%)
  • p less than 0.001: 404 of 6189 (6.53%)
  • p less than 0.0001: 126 of 6189 (2.04%)
  • B-H less than 0.05: 126 of 6189 (2.04%)
  • Bonferroni less than 0.05: 26 of 6189 (0.42%)

Biomathematical Modeling Navigation

User Page: Tessa A. Morris
Course Page: Biomathematical Modeling