User:Timothee Flutre/Notebook/Postdoc/2012/05/25: Difference between revisions
(→One-liners with GNU tools: add "sort file with header") |
(→About one-liners in data wrangling: extract substring with GNU awk) |
||
(17 intermediate revisions by 2 users not shown) | |||
Line 2: | Line 2: | ||
|- | |- | ||
|style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | |style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]]<span style="font-size:22px;"> Project name</span> | ||
|style="background-color: #F2F2F2" align="center"| | |style="background-color: #F2F2F2" align="center"|[[File:Report.png|frameless|link={{#sub:{{FULLPAGENAME}}|0|-11}}]][[{{#sub:{{FULLPAGENAME}}|0|-11}}|Main project page]]<br />{{#if:{{#lnpreventry:{{FULLPAGENAME}}}}|[[File:Resultset_previous.png|frameless|link={{#lnpreventry:{{FULLPAGENAME}}}}]][[{{#lnpreventry:{{FULLPAGENAME}}}}{{!}}Previous entry]] }}{{#if:{{#lnnextentry:{{FULLPAGENAME}}}}|[[{{#lnnextentry:{{FULLPAGENAME}}}}{{!}}Next entry]][[File:Resultset_next.png|frameless|link={{#lnnextentry:{{FULLPAGENAME}}}}]]}} | ||
|- | |- | ||
| colspan="2"| | | colspan="2"| | ||
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | ||
== | ==About one-liners in data wrangling== | ||
* ''' | * '''Motivation''': Once we receive raw data, and before drawing robust conclusions, we (almost) always need to reformat them and extract a few key summary statistics. Fortunately, this activity, called [https://en.wikipedia.org/wiki/Data_wrangling data wrangling], is particularly quick and easy on [http://www.gnu.org/gnu/gnu-linux-faq.html GNU/Linux] computers. For instance, using GNU utilities via the [https://en.wikipedia.org/wiki/Command-line_interface command-line interface], we can write a "one-liner", a sequence of tools in which the output of one tool is the input of the next. This is not only easy but also very powerful, as shown below. | ||
* '''Toolbox''': usually available by default on GNU/Linux computers | |||
** [https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29 Bash] | |||
** [https://en.wikipedia.org/wiki/AWK AWK] | |||
** [https://en.wikipedia.org/wiki/Grep grep] | |||
** [https://en.wikipedia.org/wiki/Sed sed] | |||
** [https://en.wikipedia.org/wiki/GNU_Core_Utilities GNU coreutils] (head, tail, cut, uniq, sort, tr, od, ...) | |||
* ''' | * '''Tutorials''': | ||
** [http://en. | ** [http://en.flossmanuals.net/command-line/index/ Introduction to the command-line] | ||
** | ** [http://www.oliverelliott.org/article/computing/tut_unix/ Introduction to Unix] by Oliver Elliott | ||
** | ** [http://www.ibm.com/developerworks/aix/library/au-unixtext/index.html Introduction to text manipulation on UNIX-based systems] by Brad Yoes (IBM) | ||
** | ** [https://github.com/jlevy/the-art-of-command-line The Art of the Command Line] by Joshua Levy | ||
** | ** [http://www.tldp.org/LDP/abs/html/ Advanced Bash-Scripting Guide] by Mendel Cooper | ||
** | ** [http://www.shellcheck.net/ ShellCheck] ([http://explainshell.com/ explainshell]) | ||
** [http://quinlanlab.org/tutorials/cshl2013/bedtools.html tutorial for bedtools] | |||
** [http://www.commentcamarche.net/faq/8386-kit-de-survie-linux kit de survie Linux] (in French) | |||
Line 30: | Line 38: | ||
* '''Use absolute values:''' | * '''Use absolute values:''' | ||
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}' | $ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}' | ||
* '''Summarize numbers:''' with R | |||
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))' | |||
Line 57: | Line 69: | ||
for(i=1;i<=length(a[2]);i++)printf "}"; printf"\n"}' probes.fa > probes.fq | for(i=1;i<=length(a[2]);i++)printf "}"; printf"\n"}' probes.fa > probes.fq | ||
</nowiki> | </nowiki> | ||
* '''Extract sequence from fasta''': | |||
$ echo -e ">chr1\nAAA\n>chr2\nTTT\n>chr3\nGGG\n" | awk 'BEGIN{RS=">"} /chr2/ {print $0}' | |||
Line 63: | Line 80: | ||
$ echo -e "x\ty"; for i in {1..10}; do echo -e $i"\t"$RANDOM; done | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n) | $ echo -e "x\ty"; for i in {1..10}; do echo -e $i"\t"$RANDOM; done | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n) | ||
* '''Get rows from a big file which are also in a small file''': an example of using awk with two input files. The keys of the small file are first loaded into an array in memory, then the big file is parsed line by line and a line is printed only if its key is in the array | |||
$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt | |||
$ echo -e "gene1\tsnp1" > file_subset.txt | |||
$ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt) | |||
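The two-file idiom above can be unpacked as follows. This is only a sketch under a few assumptions: the files are recreated in a temporary directory so that the block is self-contained, the array name <code>keep</code> is made up for illustration, and two choices differ from the one-liner above (the header of the big file is printed rather than removed with <code>sed 1d</code>, and the key is written <code>$1,$2</code> so that awk joins both fields with <code>SUBSEP</code>).

```shell
# Recreate the two example files in a temporary directory (illustrative only).
tmpdir=$(mktemp -d)
printf 'gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1\n' > "$tmpdir/file_all.txt"
printf 'gene1\tsnp1\n' > "$tmpdir/file_subset.txt"

# NR==FNR is true only while awk reads the first file (the small one):
# store its (gene, snp) keys, then skip to the next line.
# For the second file (the big one), print the header (FNR==1) and
# every row whose key was stored.
# keep[$1,$2] joins both fields with SUBSEP, so ("ab","c") and ("a","bc")
# stay distinct, whereas $1$2 would give "abc" in both cases.
matched=$(awk -F'\t' '
  NR==FNR { keep[$1,$2] = 1; next }
  FNR==1 || (($1,$2) in keep)
' "$tmpdir/file_subset.txt" "$tmpdir/file_all.txt")

printf '%s\n' "$matched"
rm -r "$tmpdir"
```

Note that <code>NR==FNR</code> also holds for the second file when the first one is empty, so this idiom assumes a non-empty small file.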
* '''Get length of each sequence in a fasta file''': | |||
$ awk 'BEGIN{RS=">"} {n=split($0,a,"\n"); if(n==0) next; seqlen=0; for(i=2;i<=n;++i){seqlen += length(a[i])}; printf "%s\t%d\n", a[1], seqlen}' sequences.fa | |||
* '''Get the bases 6 to 9 of each sequence in a fastq file''': assuming that each read spans exactly 4 lines | |||
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9 | |||
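The same extraction can also be done in a single awk call, with <code>substr</code> replacing <code>cut</code>. The sketch below runs on two made-up reads (hypothetical data, not from any real file) and relies on the same assumption of exactly 4 lines per read.

```shell
# Two made-up reads, each on exactly 4 lines (hypothetical data).
fastq='@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTTTTGGGGC
+
IIIIIIIIII'

# The second line of every 4-line record is the sequence;
# substr($0, 6, 4) takes 4 characters starting at position 6, i.e. bases 6 to 9.
bases=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 2 { print substr($0, 6, 4) }')

printf '%s\n' "$bases"
```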
* '''Reverse-complement a DNA sequence''': | |||
$ echo "AAATGAGCC" | rev | tr ATGC TACG | |||
* '''Identify a non-breaking space''': | |||
** download additional file 4 (Table S3) of [http://dx.doi.org/10.1186/s12870-016-0754-z this] article | |||
** open it with LibreOffice Calc | |||
** save it as "Text CSV" with "Character set = Unicode (UTF-8)", "Field delimiter = {Tab}", "Text delimiter = " (i.e. empty), and keep "Save cell content as shown" as checked | |||
** try the following commands (see the [http://ascii-code.com/ ASCII table] for the byte values; 302 240 in octal is the UTF-8 encoding of the non-breaking space): | |||
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | od -An -c -b | |||
L i s z t e s 302 240 f e h e r \n | |||
114 151 163 172 164 145 163 302 240 146 145 150 145 162 012 | |||
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b | |||
L i s z t e s f e h e r \n | |||
114 151 163 172 164 145 163 040 146 145 150 145 162 012 | |||
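The <code>\xC2\xA0</code> escape in the sed command above is a GNU extension. A possibly more portable sketch builds the two bytes with <code>printf</code> and octal escapes instead. The input line is made up to mimic the cell shown in the od output, and the variable names are illustrative only.

```shell
# UTF-8 encoding of the non-breaking space: bytes C2 A0 (octal 302 240).
nbsp=$(printf '\302\240')

# Made-up input line mimicking the cell shown in the od output.
line=$(printf 'Lisztes\302\240feher')

# Count the lines that contain a non-breaking space (byte-wise, hence LC_ALL=C).
count=$(printf '%s\n' "$line" | LC_ALL=C grep -c "$nbsp")

# Replace every non-breaking space with a regular space.
cleaned=$(printf '%s\n' "$line" | LC_ALL=C sed "s/$nbsp/ /g")

printf '%s line(s) with a non-breaking space; cleaned: %s\n' "$count" "$cleaned"
```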
* '''Extract substring based on regex''': uses regex capture groups through the three-argument form of match(), which is specific to GNU awk | |||
$ echo "project_all-lanes/H3NHKBBXX_7/demultiplex/H3NHKBBXX_7_A3-30-10-10_R1.fastq.gz" | awk '{match($0, /([a-zA-Z0-9-]*)_(R[12])/, a); print a[1]}' | |||
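Where GNU awk is not available, a capture group in sed can give the same result. The following is a sketch, assuming a sed that supports extended regular expressions via <code>-E</code> (GNU sed and BSD sed both do), and the variable names are made up. The regex is slightly different from the one above: a leading <code>.*_</code> anchors the group on the last underscore-delimited token before <code>_R1</code> or <code>_R2</code>.

```shell
path="project_all-lanes/H3NHKBBXX_7/demultiplex/H3NHKBBXX_7_A3-30-10-10_R1.fastq.gz"

# Keep only the first capture group: the token between the preceding "_"
# and the "_R1" or "_R2" suffix.
sample=$(printf '%s\n' "$path" | sed -E 's/.*_([a-zA-Z0-9-]+)_R[12].*/\1/')

printf '%s\n' "$sample"
```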
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Latest revision as of 05:58, 12 April 2018
About one-liners in data wrangling
$ for i in {1..10}; do echo $i; done | sed 3,6d
$ for i in {1..20}; do echo $i; done | sed -n 3,5p
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))'
$ echo -e "gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05" > dat.txt
gene snp pvalue
g1 s1 0.3
g1 s2 0.002
g2 s2 0.7
g2 s3 0.05
$ cat dat.txt | sed 1d | sort -k1,1 -k3,3 | awk '{print $3"\t"$2"\t"$1}' | uniq -f2
g1 s2 0.002
g2 s3 0.05
$ subgroups=("s1" "s2" "s3" "s4"); for i in {0..2}; do let a=$i+1; for j in $(seq $a 3); do s1=${subgroups[$i]}; s2=${subgroups[$j]}; echo $s1 $s2; done; done
$ awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); printf "@"a[1]"\n"a[2]"\n+\n"; \
for(i=1;i<=length(a[2]);i++)printf "}"; printf"\n"}' probes.fa > probes.fq
$ echo -e ">chr1\nAAA\n>chr2\nTTT\n>chr3\nGGG\n" | awk 'BEGIN{RS=">"} /chr2/ {print $0}'
$ echo -e "x\ty"; for i in {1..10}; do echo -e $i"\t"$RANDOM; done | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n)
$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt
$ echo -e "gene1\tsnp1" > file_subset.txt
$ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt)
$ awk 'BEGIN{RS=">"} {n=split($0,a,"\n"); if(n==0) next; seqlen=0; for(i=2;i<=n;++i){seqlen += length(a[i])}; printf "%s\t%d\n", a[1], seqlen}' sequences.fa
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9
$ echo "AAATGAGCC" | rev | tr ATGC TACG
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | od -An -c -b
L i s z t e s 302 240 f e h e r \n
114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
$ cat 12870_2016_754_MOESM4_ESM.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
L i s z t e s f e h e r \n
114 151 163 172 164 145 163 040 146 145 150 145 162 012
$ echo "project_all-lanes/H3NHKBBXX_7/demultiplex/H3NHKBBXX_7_A3-30-10-10_R1.fastq.gz" | awk '{match($0, /([a-zA-Z0-9-]*)_(R[12])/, a); print a[1]}'