Schumer lab: Useful Unix/perl/awk commands
Very useful resources
Links to resources for quick file manipulation and common bioinformatic tasks:
1) The FAS scriptome is an invaluable resource for all sorts of file manipulation [[1]]
2) Stephen Turner's collection of one-liners [[2]]
3) Heng Li's collection of one-liners [[3]]
Useful general one-liners
Trim the header off of a file
tail -n +2 filename > newfile
Remove a particular column from your file
In this example, column 10:
cut -f 10 --complement filename > newfile
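As a rough illustration, the two commands above can be chained on a small tab-delimited file. The file names and contents here are made up for the example, and `--complement` is assumed to be available (it is a GNU coreutils option, so it may be missing from BSD/macOS `cut`).

```shell
# Build a tiny tab-delimited example file (hypothetical data).
tmpdir=$(mktemp -d)
printf 'chr\tpos\tref\talt\n'  > "$tmpdir/example.tsv"
printf 'chr1\t100\tA\tG\n'    >> "$tmpdir/example.tsv"
printf 'chr1\t250\tC\tT\n'    >> "$tmpdir/example.tsv"

# Drop the header line, then drop column 3 (requires GNU cut).
tail -n +2 "$tmpdir/example.tsv" | cut -f 3 --complement > "$tmpdir/no_col3.tsv"

# Show the result: two data rows with columns chr, pos, alt.
cat "$tmpdir/no_col3.tsv"
```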
Use awk to duplicate or modify a column
Duplicate the second column of a file:
awk -v OFS='\t' '$2=$2"\t"$2' myfile > mynewfile
Rewrite the first column of a file as first column _ second column:
awk -v OFS='\t' '$1=$1"_"$2' myfile > mynewfile
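The one-liners above use the assignment itself as the awk pattern, so the line is printed as a side effect. A slightly more explicit sketch of the same two operations puts the assignment in an action block and prints on purpose. It assumes tab-delimited input (hence the added `-F'\t'`), and the example file is hypothetical.

```shell
# Small tab-delimited example file (hypothetical data).
tmpdir=$(mktemp -d)
printf 'chr1\t100\tA\nchr2\t200\tC\n' > "$tmpdir/example.tsv"

# Duplicate column 2: assign inside an action block, then print explicitly.
awk -F'\t' -v OFS='\t' '{ $2 = $2 OFS $2; print }' \
    "$tmpdir/example.tsv" > "$tmpdir/dup.tsv"

# Rewrite column 1 as column1_column2.
awk -F'\t' -v OFS='\t' '{ $1 = $1 "_" $2; print }' \
    "$tmpdir/example.tsv" > "$tmpdir/joined.tsv"

cat "$tmpdir/dup.tsv" "$tmpdir/joined.tsv"
```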
Use awk to select rows
Select all rows where a particular column exactly matches a certain word:
awk -F"\t" '$3 == "transcript" { print}' myfile > mynewfile
Select all rows where a particular column does not match a certain word:
awk -F"\t" '$2 != "N" { print }' myfile > mynewfile
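To make the difference between exact and partial matching concrete, here is a small sketch on made-up data. `==` and `!=` compare the whole field, while `~` tests whether the field contains a pattern. The file and the variable names (`exact`, `partial`, `other`) are only for illustration.

```shell
# Three-row example file (hypothetical data); column 3 holds the feature type.
tmpdir=$(mktemp -d)
printf 'chr1\tsrc\tgene\nchr1\tsrc\ttranscript\nchr2\tsrc\ttranscript_region\n' \
    > "$tmpdir/example.tsv"

# Exact match: column 3 is exactly "transcript".
exact=$(awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$tmpdir/example.tsv")

# Partial match: column 3 contains "transcript" anywhere.
partial=$(awk -F'\t' '$3 ~ /transcript/ { n++ } END { print n + 0 }' "$tmpdir/example.tsv")

# Negated exact match: column 3 is anything other than "transcript".
other=$(awk -F'\t' '$3 != "transcript" { n++ } END { print n + 0 }' "$tmpdir/example.tsv")

echo "exact=$exact partial=$partial other=$other"   # exact=1 partial=2 other=2
```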
Use perl to find and replace
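Find and replace an exact string match (the -i flag edits the file in place, and any regex metacharacters in the search string must be escaped):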
perl -pi -e 's/find/replace/g' filename
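You can also find and replace using regular expressions. This example removes any run of non-underscore characters that comes directly before "find" at the start of a line: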
perl -pi -e 's/^[^_]*find/find/g' filename
Split or shuffle your file
Split your file into a set of files containing n lines (in this case 20):
split -d -e -l 20 myfile_name myfile_name_split
Randomly sample n lines from your file:
shuf -n 10000 myfile_name > my_sampled_file
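A quick way to see what these two commands produce is to run them on a file of known size. This sketch assumes GNU coreutils (`split -d`, `shuf`, `seq`). The 50-line input and the `chunk_` prefix are invented for the example, and the `-e` flag from the command above is left out because it is not needed here.

```shell
# 50-line example file containing the numbers 1..50 (hypothetical data).
tmpdir=$(mktemp -d)
seq 1 50 > "$tmpdir/numbers.txt"

# Split into 20-line pieces with numeric suffixes:
# chunk_00 and chunk_01 get 20 lines each, chunk_02 gets the remaining 10.
split -d -l 20 "$tmpdir/numbers.txt" "$tmpdir/chunk_"

# Draw 5 lines at random, without replacement.
shuf -n 5 "$tmpdir/numbers.txt" > "$tmpdir/sample.txt"

wc -l "$tmpdir"/chunk_* "$tmpdir/sample.txt"
```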
Count unique lines in a file
sort myfile.txt | uniq | wc -l
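One detail worth keeping in mind is that `uniq` only collapses duplicate lines that are adjacent, so unsorted input can be overcounted. A minimal sketch on made-up data shows the difference; the file and variable names are hypothetical.

```shell
# Example file where the duplicate lines are not adjacent (hypothetical data).
tmpdir=$(mktemp -d)
printf 'apple\nbanana\napple\n' > "$tmpdir/fruit.txt"

# Without sorting, the two "apple" lines are not neighbours, so nothing collapses.
adjacent_only=$(uniq "$tmpdir/fruit.txt" | wc -l)

# Sorting first brings duplicates together, giving the true unique count.
truly_unique=$(sort "$tmpdir/fruit.txt" | uniq | wc -l)

echo "uniq alone: $adjacent_only, sort then uniq: $truly_unique"
```

Here the unsorted pipeline reports 3 lines and the sorted one reports 2.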
Sort a file numerically
In this example, sort using the value in the second column:
sort -nk 2 my_file.bed > my_sorted_file.bed
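For BED-like files it is sometimes clearer to spell out exactly which column forms the sort key. The sketch below uses `-k2,2n` to restrict the numeric key to column 2 only, and then shows a common variant that sorts by chromosome name first and position second. The file contents are invented for the example.

```shell
# Small BED-like example file (hypothetical data).
tmpdir=$(mktemp -d)
printf 'chr1\t300\t400\nchr1\t20\t50\nchr2\t100\t150\n' > "$tmpdir/example.bed"

# Numeric sort on column 2 only.
sort -k2,2n "$tmpdir/example.bed" > "$tmpdir/by_start.bed"

# Sort by chromosome name, then numerically by start position.
sort -k1,1 -k2,2n "$tmpdir/example.bed" > "$tmpdir/by_chrom_start.bed"

cat "$tmpdir/by_start.bed"
```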
Manipulating bam files
Calculate the average coverage of a bam file
module load biology
module load samtools
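samtools depth my_bam_file | awk '{sum+=$3} END { print sum/NR}'
By default samtools depth skips positions with zero coverage, so this is the mean over covered positions only; add -a to include every position in the average.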
Calculate the number of mapped reads in a bam file
samtools view -c -F 260 my_bam_file
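The value 260 passed to `-F` is the sum of two SAM flag bits: 4 (read unmapped) and 256 (secondary alignment). Any record with either bit set is excluded from the count. The sketch below only reproduces that bit arithmetic in plain shell and does not call samtools; the helper `is_counted` is a hypothetical name used for illustration.

```shell
# Combine the two SAM flag bits that -F 260 filters out.
exclude_mask=$(( 4 | 256 ))   # 4 = unmapped, 256 = secondary alignment
echo "$exclude_mask"          # prints 260

# Hypothetical helper: would a read with this FLAG value be counted?
is_counted() {
    if [ $(( $1 & $exclude_mask )) -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

is_counted 99     # paired, properly mapped read: yes
is_counted 4      # unmapped read: no
is_counted 256    # secondary alignment: no
is_counted 2048   # supplementary alignment: yes, still counted
```

Since supplementary alignments (flag bit 2048) pass this filter, a read with a split alignment can contribute more than one record to the count.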