User:Timothee Flutre/Notebook/Postdoc/2012/02/29: Difference between revisions

From OpenWetWare
Revision as of 10:07, 29 February 2012


Entry title

  • In R, extract one row for each level of a column based on the value of a second column:

Let's create a dummy dataset:

x <- data.frame("g"=c("g2","g1","g2","g1"), "s"=c("s1","s2","s3","s4"), p=c(10^-4,10^-3,10^-2,10^-5))

Here is what it looks like:

   g  s     p
1 g2 s1 1e-04
2 g1 s2 1e-03
3 g2 s3 1e-02
4 g1 s4 1e-05

For instance, the "g" column contains gene names, the "s" column contains SNP names, and the "p" column contains P-values of association between genotypes at the SNP and variation in gene expression levels. In such a case, I want to extract the best SNP for each gene, i.e. the one with the lowest P-value.

First, I sort the "g" column by increasing P-value and keep the row indices in sorted order:

v <- x$g[i <- order(x$p)]
v
[1] g1 g2 g1 g2
i
[1] 4 1 2 3

Second, I find the first occurrence of each level in column "g" (in this example, the first occurrence corresponds to the lowest P-value for this gene):

min.occ <- !duplicated(v)
min.occ
[1]  TRUE  TRUE FALSE FALSE

Third, ...:

idx <- setNames(seq_len(nrow(x))[i][min.occ], v[min.occ])
idx
g1 g2
 4  1
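
As far as I can tell, this step maps each first occurrence back to its row index in the original data frame and names it after its gene. Since seq_len(nrow(x))[i] is just i, the expression can probably be shortened. A minimal sketch checking this (idx.long and idx.short are names introduced here for illustration only):

```r
# Same dummy dataset as above
x <- data.frame(g = c("g2", "g1", "g2", "g1"),
                s = c("s1", "s2", "s3", "s4"),
                p = c(1e-4, 1e-3, 1e-2, 1e-5))

i <- order(x$p)            # row indices sorted by increasing P-value
v <- x$g[i]                # gene names in that order
min.occ <- !duplicated(v)  # TRUE at the first (lowest-P) occurrence of each gene

# Original form and shortened form of the index vector
idx.long  <- setNames(seq_len(nrow(x))[i][min.occ], v[min.occ])
idx.short <- setNames(i[min.occ], v[min.occ])

identical(idx.long, idx.short)
```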

Finally, I extract the data I'm interested in:

x[idx,]
   g  s     p
4 g1 s4 1e-05
1 g2 s1 1e-04
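
The steps above can be wrapped into a small function. This is only a sketch: the function name and its arguments are hypothetical, not from any package. It relies on order() leaving ties in their original order, so with tied values the row appearing first in the data frame should be the one kept; rows with NA values are sorted last by default.

```r
# Hypothetical helper: keep, for each level of group.col,
# the row with the smallest value in value.col
best.row.per.level <- function(df, group.col, value.col) {
  i <- order(df[[value.col]])            # row indices, smallest value first
  v <- df[[group.col]][i]                # group labels in that order
  df[i[!duplicated(v)], , drop = FALSE]  # first occurrence of each group
}

x <- data.frame(g = c("g2", "g1", "g2", "g1"),
                s = c("s1", "s2", "s3", "s4"),
                p = c(1e-4, 1e-3, 1e-2, 1e-5))

best <- best.row.per.level(x, "g", "p")
best
```

On the dummy dataset, this should return the same two rows as x[idx,].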

And according to this answer on Stack Overflow (http://stackoverflow.com/a/6037559/597069), this approach seems pretty fast!
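
To get a rough idea of the speed on your own machine, one option is to time the approach against a more naive split-and-rbind version on simulated data. This is a sketch under assumptions: the dataset size, the number of genes and the choice of the alternative are arbitrary, and timings will vary.

```r
set.seed(1)
n <- 20000
# Simulated data (sizes chosen arbitrarily for illustration)
big <- data.frame(g = sample(paste0("g", 1:200), n, replace = TRUE),
                  s = paste0("s", seq_len(n)),
                  p = runif(n))

# Approach from this entry: order() then !duplicated()
t.order <- system.time({
  i <- order(big$p)
  res.order <- big[i[!duplicated(big$g[i])], ]
})

# Naive alternative: split by gene, take the minimum in each piece
t.split <- system.time({
  res.split <- do.call(rbind, lapply(split(big, big$g),
                                     function(d) d[which.min(d$p), ]))
})

# Both should select the same SNPs
same <- setequal(as.character(res.order$s), as.character(res.split$s))

print(t.order)
print(t.split)
```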