User:Jonathan R. I. Coleman/Notebook/Notes and Protocols/2014/06/27
Revision as of 02:40, 30 June 2016
Imputation Clean-up
Converting from IMPUTE2 .impute2 format to single-value genotype format

Joni Coleman, King's College London (Please send comments to jonathan[dot]coleman[at]kcl[dot]ac[dot]uk)
Comments look like this

Commands (on the UNIX command line) look like this
[0 * p(AA)] + [1 * p(AB)] + [2 * p(BB)]

which simplifies to:

p(AB) + [2 * p(BB)]
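As a quick check of the formula, the sketch below computes the dosage for a single individual. The three probabilities are made-up values chosen for illustration, not taken from any real impute2 file.

```shell
# Hypothetical genotype probabilities for one individual: p(AA) p(AB) p(BB)
probs="0.1 0.3 0.6"

# Dosage = 0*p(AA) + 1*p(AB) + 2*p(BB)
dosage=$(echo "$probs" | awk '{print $1*0 + $2*1 + $3*2}')
echo "$dosage"
```

Here p(AB) + 2 * p(BB) = 0.3 + 1.2, so the dosage is 1.5.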
IMPLEMENTATION IN UNIX (Imputed to Phase3): (Credit: Tommy Carstensen)

zcat impute2.gen.gz | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

1. Unzips the impute2 file.
2. Prints the first two columns: the chromosome field (NB. "---" for unaltered imputed genotypes in the impute2 file) and the SNP name (in the Phase3 release, this also contains the BP position and alleles).
3. Iterates through the genotype probabilities on each line, three columns at a time starting from column 6, and makes a single-value dosage score for each individual, or "NA" where there is a missing value "0 0 0".
4. Compresses the output file with bgzip (the result is gzip-compatible).
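To see what the Phase3 command produces, here is a minimal sketch that runs the same awk program on two toy lines. The SNP names and probabilities are invented for illustration, and zcat and bgzip are left out so that it runs without any compressed input.

```shell
# Two made-up SNPs, two individuals each:
# columns 1-5 are SNP information, then one probability triplet per individual.
toy_gen='--- rs1:100:A:G 100 A G 1 0 0 0 0 0
--- rs2:200:C:T 200 C T 0.1 0.3 0.6 0 1 0'

# Same awk program as in the command above, fed the toy lines instead of zcat output.
toy_dosages=$(printf '%s\n' "$toy_gen" | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}')
printf '%s\n' "$toy_dosages"
```

Each output line holds the first two columns followed by one tab-separated value per individual. A confident AA call ("1 0 0") gives a dosage of 0, which stays distinct from the missing triplet "0 0 0", reported as NA.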
IMPLEMENTATION IN UNIX (Imputed to Phase1 Integrated Haplotypes): (Credit: Tommy Carstensen)

zcat impute2.gen.gz | awk '{printf $1"\t"$2"\t"$3"\t"$4"\t"$5; for(i=6; i<NF; i+=3) {if($(i+0) == 0 && $(i+1) == 0 && $(i+2) == 0) printf "\tNA"; else printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

1. Implementation is as above, but the SNP name in earlier releases contains only the rs ID, so this also prints columns 3 to 5 (the BP position and the two alleles) to the file.

See also this excellent R package from Uni of Washington: [1]