# User:Timothee Flutre/Notebook/Postdoc/2012/02/29


## Use R to extract the best SNP among all in cis of any given gene

This can be reformulated more generically as "extract one row for each level of a column based on the value of a second column".

Let's create a dummy dataset:

```
x <- data.frame(g=c("g2","g1","g2","g1"), s=c("s1","s2","s3","s4"), p=c(10^-4,10^-3,10^-2,10^-5))
```

Here is what it looks like:

```
   g  s     p
1 g2 s1 1e-04
2 g1 s2 1e-03
3 g2 s3 1e-02
4 g1 s4 1e-05
```

For instance, the "g" column indicates gene names, the "s" column indicates SNP names, and the "p" column indicates P-values of association between genotypes at the SNP and variation in gene expression levels. In such a case, I want to extract the best SNP for each gene, i.e. the one with the lowest P-value.

First, I reorder the "g" column by increasing P-value, and I keep the row indices in that sorted order:

```
i <- order(x$p)  # row indices sorted by increasing P-value
v <- x$g[i]      # gene names in that order
v
# [1] g1 g2 g1 g2
i
# [1] 4 1 2 3
```

Second, I find the first occurrence of each level in column "g" (in this example, the first occurrence corresponds to the lowest P-value for this gene):

```
min.occ <- !duplicated(v)
min.occ
# [1]  TRUE  TRUE FALSE FALSE
```

Third, I retrieve the row indices corresponding to these first occurrences per level:

```
row.idx <- setNames(seq_len(nrow(x))[i][min.occ], v[min.occ])
row.idx
# g1 g2
#  4  1
```

Finally, I extract the data I'm interested in:

```
x[row.idx,]
#    g  s     p
# 4 g1 s4 1e-05
# 1 g2 s1 1e-04
```

This could also be done more simply by indexing with "i" directly, because seq_len(nrow(x))[i] is the same vector as i:

```
x[i[min.occ],]
#    g  s     p
# 4 g1 s4 1e-05
# 1 g2 s1 1e-04
```
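The steps above can be wrapped into a small reusable function. This is only a sketch: the function name "best.row.per.level" and its arguments are made up for illustration and do not come from any package. If several rows of a level are tied for the lowest value, it should keep the one that comes first in the original data, assuming order() leaves ties in their original order.

```r
# Hypothetical helper (name and arguments are illustrative only):
# for each level of column `group`, return the row of `dat`
# having the smallest value in column `value`.
best.row.per.level <- function(dat, group, value) {
  i <- order(dat[[value]])                   # row indices by increasing value
  first.occ <- !duplicated(dat[[group]][i])  # first occurrence of each level
  dat[i[first.occ], , drop = FALSE]
}

# Same dummy dataset as above
x <- data.frame(g = c("g2", "g1", "g2", "g1"),
                s = c("s1", "s2", "s3", "s4"),
                p = c(1e-4, 1e-3, 1e-2, 1e-5))
best <- best.row.per.level(x, group = "g", value = "p")
best
#    g  s     p
# 4 g1 s4 1e-05
# 1 g2 s1 1e-04
```

Rows come back sorted by increasing P-value rather than by gene name, exactly as in the step-by-step version.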

And according to this answer on Stack Overflow (SO), the whole procedure seems pretty fast!
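For comparison, a more direct formulation is to split the data frame by gene and pick the row with the smallest P-value in each piece. The sketch below only checks that both approaches select the same SNPs on the dummy dataset; it makes no claim about timings, and the variable names "res.sort" and "res.split" are mine.

```r
# Same dummy dataset as above
x <- data.frame(g = c("g2", "g1", "g2", "g1"),
                s = c("s1", "s2", "s3", "s4"),
                p = c(1e-4, 1e-3, 1e-2, 1e-5))

# Approach of this entry: sort by P-value, keep the first occurrence per gene
i <- order(x$p)
res.sort <- x[i[!duplicated(x$g[i])], ]

# Alternative: split by gene, take the row with the lowest P-value in each piece
res.split <- do.call(rbind,
                     lapply(split(x, x$g), function(d) d[which.min(d$p), ]))

# split() returns the pieces ordered by gene, so reorder res.sort by gene
# before comparing the selected SNPs
res.sort <- res.sort[order(res.sort$g), ]
same.snps <- identical(as.character(res.sort$s), as.character(res.split$s))
same.snps
# [1] TRUE
```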