User:Morgan G. I. Langille/Notebook/Unknown Genes/2010/09/23

From OpenWetWare

Filtering pfam vs metagenomic sample counts

  • 9810 pfams (out of 11K?) have at least one protein in at least one of the samples from the "Camera Proteins" dataset.
  • However, when calculating correlations or ecological distance measures, pfams with very low counts appear to be highly correlated.
  • To start filtering out these low-count pfams, I plotted the sum of each pfam's counts across all samples (ranging from 1 to 209446; the maximum is ABC_trans, of course) against its Shannon diversity index.
  • The plot doesn't give a clear cutoff for either the row sum or the diversity index: e.g. requiring row sum > 50 removes many pfams that have a high diversity index, and vice versa for a diversity cutoff.
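
A toy illustration of the low-count problem noted above, using made-up counts rather than the CAMERA data: two hypothetical pfams that each occur in a single sample correlate perfectly, even though the result rests on only three proteins in total.

```r
# Made-up counts across 8 samples; these are not values from the dataset
pfam_a <- c(1, 0, 0, 0, 0, 0, 0, 0)
pfam_b <- c(2, 0, 0, 0, 0, 0, 0, 0)

# Pearson correlation is driven entirely by the one shared non-zero sample
low_count_cor <- cor(pfam_a, pfam_b)
print(low_count_cor)
```

Any filter on total counts is meant to drop rows like these before correlations are calculated.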

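# R code

library(vegan)

# Rows are pfams, columns are samples
x <- read.table("camera_proteins_vs_pfam.txt")

# Shannon diversity index per pfam (diversity() works row-wise by default)
shannon_diversity <- diversity(x)

# Total count per pfam across all samples
row.sums <- rowSums(x)

plot(log10(row.sums), shannon_diversity)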

# To filter a vector or list in R, use the "Filter" function

length(Filter(function(v) v > 100, row.sums))

# This shows that 4018 pfams remain with a row sum greater than 100
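
One way to see how the two candidate cutoffs disagree is to cross-tabulate them. The sketch below is only an illustration: it uses a made-up count matrix, a small base-R Shannon helper (natural log, which is also the default in vegan's diversity()), and cutoff values of 100 and 1 that are assumptions, not chosen thresholds.

```r
# Toy count matrix: rows are pfams, columns are samples (made-up numbers)
counts <- matrix(
  c(200,   0,   0,   0,    # abundant but confined to one sample
     40,  40,  40,  40,    # abundant and evenly spread
      2,   2,   2,   2,    # rare but evenly spread
      1,   0,   0,   0),   # rare and confined to one sample
  nrow = 4, byrow = TRUE,
  dimnames = list(c("pfamA", "pfamB", "pfamC", "pfamD"), paste0("s", 1:4))
)

# Hypothetical helper: Shannon index H = -sum(p * log(p)) over non-zero proportions
shannon <- function(v) {
  p <- v[v > 0] / sum(v)
  -sum(p * log(p))
}

row_sums    <- rowSums(counts)
row_shannon <- apply(counts, 1, shannon)

# Illustrative cutoffs (assumptions, not values from the notebook)
sum_cutoff <- 100
div_cutoff <- 1

# Count pfams passing each combination of the two filters
cutoff_table <- table(sum_ok = row_sums > sum_cutoff,
                      div_ok = row_shannon > div_cutoff)
print(cutoff_table)
```

The off-diagonal cells count pfams that one filter keeps and the other drops. On the real data, the same table built from row.sums and shannon_diversity would show how much the two cutoffs disagree.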