User:Jonathan R. I. Coleman/Notebook/Notes and Protocols/2014/06/27

From OpenWetWare
Imputation Clean-up Main project page

Converting from IMPUTE2 .impute2 format to single-value genotype format

Joni Coleman, King's College London (Please send comments to jonathan[dot]coleman[at]kcl[dot]ac[dot]uk)


GUIDANCE FOR READING THIS FILE:

Comments look like this

   Commands (on the UNIX command line) look like this


PROBLEM:


IMPUTE2 reports each genotype as three probabilities, each between 0 and 1, one for each possible genotype (AA, AB, BB). Some downstream applications instead expect a single value between 0 and 2, where 0 = AA, 1 = AB and 2 = BB. This single value is the expected number of copies of the B allele (the allele dosage).


SOLUTION:


Recode the three genotype probabilities from IMPUTE2 into a single value with this basic equation:

[0 * p(AA)] + [1 * p(AB)] + [2 * p(BB)]

which simplifies to:

p(AB) + [2 * p(BB)]
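
As a quick check of the arithmetic, the simplified equation can be applied to a single probability triple on the command line. The probabilities below are made up for illustration, and awk is used only because the implementations on this page use it.

```shell
# Hypothetical probabilities for one individual at one SNP:
# p(AA) = 0, p(AB) = 0.25, p(BB) = 0.75
dosage=$(echo "0 0.25 0.75" | awk '{print $2 + 2 * $3}')
echo "$dosage"   # prints 1.75
```

Here 0.25 + (2 * 0.75) = 1.75, i.e. the genotype lies between AB and BB but closer to BB.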


IMPLEMENTATION IN UNIX (Imputed to Phase3): (Credit: Tommy Carstensen)

   zcat impute2.gen.gz | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

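1. Decompresses the impute2 file and streams it to awk.

2. Prints the chromosome number (NB. "---" for unaltered imputed genotypes in the impute2 file) and the SNP name (in the Phase3 release, this also contains the BP position and alleles).

3. Steps through the genotype probabilities on each line three columns at a time, starting at column 6, and writes one single-value dosage score per individual, or "NA" where the genotype is missing (all three probabilities are zero, "0 0 0").

4. Compresses the output with bgzip (block-compressed gzip) and writes it to dosages.gz.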
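
The steps above can be tried on a small uncompressed example before running on real data. The sketch below feeds two made-up SNPs for two individuals through the same awk program, leaving out zcat and bgzip so that it runs without any input file. The helper name toy_gen, the SNP names and the probabilities are all invented for illustration.

```shell
# toy_gen is a hypothetical helper that prints two made-up impute2 lines:
# five leading columns, then one probability triple per individual.
# The second individual at the second SNP is missing ("0 0 0").
toy_gen() {
  printf '%s\n' \
    '--- 1:1000:A:G 1000 A G 1 0 0 0 0.5 0.5' \
    '--- 1:2000:C:T 2000 C T 0 0 1 0 0 0'
}

# Same awk program as above, reading the toy lines instead of zcat output.
dosages=$(toy_gen | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}')
printf '%s\n' "$dosages"
```

The first SNP should give dosages of 0 and 1.5, and the second SNP 2 and NA, each line starting with the first two columns of the input.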


IMPLEMENTATION IN UNIX (Imputed to Phase1 Integrated Haplotypes): (Credit: Tommy Carstensen)

   zcat impute2.gen.gz | awk '{printf $1"\t"$2"\t"$3"\t"$4"\t"$5; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

1. Implementation is as above, but the SNP name in earlier releases contains only the rs ID, so this version also prints columns 3 to 5 (the BP position and the two alleles) in the output file.
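
One possible way to sanity-check the output of either implementation is to confirm that every line has the same number of tab-separated fields and that every dosage is either NA or a number between 0 and 2. The function below is only a sketch: the name check_dosages is hypothetical, it takes the number of leading annotation columns (2 for the Phase3 command, 5 for the Phase1 command) as its argument, and the strict 0 to 2 range is an assumption that may need loosening if the input probabilities are rounded.

```shell
# check_dosages (hypothetical helper): reads dosage lines on stdin.
# $1 = number of leading annotation columns (2 or 5).
# Exits non-zero if field counts differ or a dosage is outside 0-2.
check_dosages() {
  awk -v lead="$1" -F'\t' '
    NR == 1 { n = NF }   # field count of the first line is the reference
    NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
    {
      for (i = lead + 1; i <= NF; i++)
        if ($i != "NA" && ($i < 0 || $i > 2)) {
          printf "line %d, field %d: %s out of range\n", NR, i, $i
          bad = 1
        }
    }
    END { exit bad + 0 }
  '
}

# Made-up line in the five-leading-column layout, with two individuals.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' --- rs1 1000 A G 0 1.5 | check_dosages 5 && echo "dosages look OK"
```

On a real file the same function could be used as: zcat dosages.gz | check_dosages 5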

See also this excellent R package from the University of Washington: [1]