User:Timothee Flutre/Notebook/Postdoc/2012/05/25: Difference between revisions
From OpenWetWare
(→One-liners with GNU tools: add tuto cmd-line) |
(→About one-liners in data wrangling: add one-liner R-summary) |
||
(9 intermediate revisions by the same user not shown) | |||
Line 6: | Line 6: | ||
| colspan="2"| | | colspan="2"| | ||
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit above this line unless you know what you are doing. ##### --> | ||
== | ==About one-liners in data wrangling== | ||
* '''Toolbox''': often available by default on many computers | * '''Motivation''': After receiving raw data and before drawing robust conclusions, we (almost) always need to reformat them and extract a few key summary statistics. Fortunately, this activity, called [https://en.wikipedia.org/wiki/Data_wrangling data wrangling], is particularly quick and easy on [http://www.gnu.org/gnu/gnu-linux-faq.html GNU/Linux] computers. For instance, using GNU utilities via the [https://en.wikipedia.org/wiki/Command-line_interface command-line interface], we can write a "one-liner", a pipeline of tools in which the output of one tool is the input of the next. This is not only easy but also very powerful, as shown below. | ||
* '''Toolbox''': often available by default on many computers with GNU/Linux | |||
** [https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29 Bash] | ** [https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29 Bash] | ||
** [https://en.wikipedia.org/wiki/AWK AWK] | ** [https://en.wikipedia.org/wiki/AWK AWK] | ||
Line 14: | Line 16: | ||
** [https://en.wikipedia.org/wiki/Sed sed] | ** [https://en.wikipedia.org/wiki/Sed sed] | ||
** [https://en.wikipedia.org/wiki/GNU_Core_Utilities GNU coreutils] (head, tail, cut, uniq, sort, tr, ...) | ** [https://en.wikipedia.org/wiki/GNU_Core_Utilities GNU coreutils] (head, tail, cut, uniq, sort, tr, ...) | ||
* '''Tutorials''': | * '''Tutorials''': | ||
** [http://en.flossmanuals.net/command-line/index/ Introduction to the command-line] | ** [http://en.flossmanuals.net/command-line/index/ Introduction to the command-line] | ||
** [http://www.oliverelliott.org/article/computing/tut_unix/ Introduction to Unix] by Oliver Elliott | |||
** [http://www.ibm.com/developerworks/aix/library/au-unixtext/index.html Introduction to text manipulation on UNIX-based systems] by Brad Yoes (IBM) | ** [http://www.ibm.com/developerworks/aix/library/au-unixtext/index.html Introduction to text manipulation on UNIX-based systems] by Brad Yoes (IBM) | ||
** [https://github.com/jlevy/the-art-of-command-line The Art of the Command Line] by Joshua Levy | |||
** [http://www.tldp.org/LDP/abs/html/ Advanced Bash-Scripting Guide] by Mendel Cooper | |||
** [http://www.shellcheck.net/ ShellCheck] ([http://explainshell.com/ explainshell]) | |||
** [http://quinlanlab.org/tutorials/cshl2013/bedtools.html tutorial for bedtools] | |||
** [http://www.commentcamarche.net/faq/8386-kit-de-survie-linux kit de survie Linux] (en français) | |||
Line 31: | Line 38: | ||
* '''Use absolute values:''' | * '''Use absolute values:''' | ||
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}' | $ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}' | ||
* '''Summarize numbers:''' with R | |||
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))' | |||
Line 71: | Line 82: | ||
$ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt) | $ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt) | ||
* '''Get length of each sequence in a fasta file''': | |||
$ awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; printf "%s\t%d\n", a[1], seqlen}' sequences.fa | |||
* '''Get the bases 6 to 9 of each sequence in a fastq file''': provided that each read spans exactly 4 lines | |||
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9 | |||
* '''Reverse-complement a DNA sequence''': | |||
$ echo "AAATGAGCC" | rev | tr ATGC TACG | |||
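The reverse-complement one-liner above only translates uppercase A, T, G and C; since <code>tr</code> passes every other character through unchanged, lowercase bases would be reversed but not complemented. A minimal sketch of a variant that also handles lowercase input is given here; the function name <code>revcomp</code> is hypothetical, and ambiguity codes such as N are deliberately left as-is.

```shell
# Hypothetical helper: reverse-complement each line read from stdin.
# Assumes one sequence per line; characters outside ACGTacgt (e.g. N)
# are reversed but otherwise left untouched.
revcomp() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo "AAATGAGCC" | revcomp   # prints GGCTCATTT
echo "aacgN" | revcomp
```

Wrapping the pipeline in a function keeps the one-liner reusable, e.g. at the end of a longer pipeline that extracts sequences from a file.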
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Revision as of 11:43, 12 October 2015
About one-liners in data wrangling
$ for i in {1..10}; do echo $i; done | sed 3,6d
$ for i in {1..20}; do echo $i; done | sed -n 3,5p
$ for i in {-5..5}; do echo $i; done | awk 'function abs(x){return (((x < 0.0) ? -x : x) + 0.0)} {print abs($1)}'
$ for i in {1..10}; do echo $i; done | Rscript -e 'summary(read.table("stdin"))'
$ echo -e "gene\tsnp\tpvalue\ng1\ts1\t0.3\ng1\ts2\t0.002\ng2\ts2\t0.7\ng2\ts3\t0.05" > dat.txt
$ cat dat.txt
gene snp pvalue
g1 s1 0.3
g1 s2 0.002
g2 s2 0.7
g2 s3 0.05
$ cat dat.txt | sed 1d | sort -k1,1 -k3,3 | awk '{print $3"\t"$2"\t"$1}' | uniq -f2
0.002 s2 g1
0.05 s3 g2
$ subgroups=("s1" "s2" "s3" "s4"); for i in {0..2}; do let a=$i+1; for j in $(seq $a 3); do s1=${subgroups[$i]}; s2=${subgroups[$j]}; echo $s1 $s2; done; done
$ awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); printf "@%s\n%s\n+\n", a[1], a[2]; for(i=1;i<=length(a[2]);i++)printf "}"; printf "\n"}' probes.fa > probes.fq
$ { echo -e "x\ty"; for i in {1..10}; do echo -e $i"\t"$RANDOM; done; } | (read -r; printf "%s\n" "$REPLY"; sort -k2,2n)
$ echo -e "gene\tsnp\tpvalue\ngene1\tsnp1\t0.002\ngene2\tsnp2\t0.8\ngene2\tsnp3\t0.1" > file_all.txt
$ echo -e "gene1\tsnp1" > file_subset.txt
$ awk 'NR==FNR{a[$1$2]++;next;}{x=$1$2;if(x in a)print $0}' file_subset.txt <(sed 1d file_all.txt)
$ awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; printf "%s\t%d\n", a[1], seqlen}' sequences.fa
$ zcat reads.fq.gz | awk '(NR % 4 == 2)' | cut -c 6-9
$ echo "AAATGAGCC" | rev | tr ATGC TACG
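The FASTA length one-liner above calls <code>length()</code> on an array, which some awk implementations do not accept. As a hedged alternative, the sketch below accumulates the length line by line, so it needs neither <code>split</code> nor a custom record separator. The function name <code>fasta_lengths</code> and the toy input are made up for illustration.

```shell
# Hypothetical helper: print "<header><TAB><length>" for each FASTA record.
# Reads the files given as arguments, or stdin when none are given.
fasta_lengths() {
  awk '
    /^>/ {                                   # header line: flush previous record
      if (name != "") printf "%s\t%d\n", name, len
      name = substr($0, 2); len = 0; next
    }
    { len += length($0) }                    # sequence line: add its length
    END { if (name != "") printf "%s\t%d\n", name, len }
  ' "$@"
}

# Toy input, for illustration only
printf '>seq1\nACGT\nAC\n>seq2 desc\nGGG\n' | fasta_lengths
```

On the toy input this should report a length of 6 for <code>seq1</code> and 3 for <code>seq2 desc</code>. Records with an empty header are skipped in this sketch.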