User:Timothee Flutre/Notebook/Postdoc/2012/02/29: Difference between revisions
From OpenWetWare
(Autocreate 2012/02/29 Entry for User:Timothee_Flutre/Notebook/Postdoc) |
(→Entry title: R tip) |
==Entry title==
* In R, extract one row for each level of a column based on the value of a second column:
Let's create a dummy dataset:
x <- data.frame("g"=c("g2","g1","g2","g1"), "s"=c("s1","s2","s3","s4"), p=c(10^-4,10^-3,10^-2,10^-5))
Here is what it looks like:
   g  s     p
1 g2 s1 1e-04
2 g1 s2 1e-03
3 g2 s3 1e-02
4 g1 s4 1e-05
For instance, the "g" column indicates gene names, the "s" column indicates SNP names, and the "p" column indicates P-values of association between genotypes at the SNP and variation in gene expression levels. In such a case, I want to extract the best SNP for each gene, i.e. the one with the lowest P-value.
First, I sort the "g" column according to the P-values and I keep the row indices after sorting:
v <- x$g[i <- order(x$p)]
v
[1] g1 g2 g1 g2
i
[1] 4 1 2 3
Second, I find the first occurrence of each level in column "g" (in this example, the first occurrence corresponds to the lowest P-value for this gene):
min.occ <- !duplicated(v)
min.occ
[1] TRUE TRUE FALSE FALSE
Third, ...:
idx <- setNames(seq_len(nrow(x))[i][min.occ], v[min.occ])
idx
g1 g2
 4  1
Finally, I extract the data I'm interested in:
x[idx,]
   g  s     p
4 g1 s4 1e-05
1 g2 s1 1e-04
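The steps above can be wrapped in a small helper so they can be reused on other data frames. This is only a minimal sketch: the function name best_per_group and its arguments (df, group, score) are hypothetical and do not come from the original entry.

```r
# Hypothetical helper (not from the original entry): for each distinct value
# of column `group`, keep the row with the smallest value in column `score`.
best_per_group <- function(df, group, score) {
  i <- order(df[[score]])      # row indices sorted by increasing score
  v <- df[[group]][i]          # group labels in that sorted order
  first <- !duplicated(v)      # TRUE at the first (lowest-score) row of each group
  df[i[first], , drop = FALSE]
}

# Same dummy dataset as in the entry
x <- data.frame(g = c("g2", "g1", "g2", "g1"),
                s = c("s1", "s2", "s3", "s4"),
                p = c(10^-4, 10^-3, 10^-2, 10^-5))
best <- best_per_group(x, "g", "p")
```

As with x[idx,], the rows come back ordered by increasing P-value rather than by gene name. Since order() puts NA values last by default, a row with a missing P-value should only be selected for a gene whose P-values are all missing.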
According to this answer on [http://stackoverflow.com/a/6037559/597069 Stack Overflow], this approach seems pretty fast!
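To check the speed on your own machine, one possible sketch is to time the order()/duplicated() approach against a split()-based alternative on simulated data. Everything here is an assumption made for illustration: the simulated data frame big, its size, and the names t.dup and t.split. No timing result is claimed.

```r
# Simulated data (assumption: 1e5 SNP-gene pairs spread over 1000 genes)
set.seed(1)
n <- 1e5
big <- data.frame(g = sample(paste0("g", 1:1000), n, replace = TRUE),
                  s = paste0("s", seq_len(n)),
                  p = runif(n))

# Approach from this entry: order() followed by duplicated()
t.dup <- system.time({
  i <- order(big$p)
  res.dup <- big[i[!duplicated(big$g[i])], ]
})

# Alternative for comparison: split the row indices by gene,
# then pick the index of the smallest P-value within each gene
t.split <- system.time({
  rows <- sapply(split(seq_len(n), big$g),
                 function(r) r[which.min(big$p[r])])
  res.split <- big[rows, ]
})

# Both approaches should select the same set of rows
same <- setequal(rownames(res.dup), rownames(res.split))
```

Printing t.dup and t.split shows the elapsed times of the two approaches, and same confirms that they agree before the timings are compared.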
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
Revision as of 10:07, 29 February 2012