9810 Pfams (out of 11K?) have at least one protein in at least one of the samples from the "Camera Proteins" dataset.
However, when calculating correlations or ecological distance measures, Pfams with very low counts appear to be highly correlated.
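A toy illustration of the problem, using made-up counts rather than anything from the dataset: two rare Pfams that each show up in the same single sample get a perfect Pearson correlation, even though it rests on one observation. The vector names are hypothetical.

```r
# Made-up counts for two rare Pfams across six samples (illustrative only)
rare_a <- c(0, 0, 1, 0, 0, 0)
rare_b <- c(0, 0, 2, 0, 0, 0)

# Both rows are non-zero only in the same sample, so the Pearson
# correlation is perfect despite resting on a single observation
r <- cor(rare_a, rare_b)
print(r)
```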
To start filtering out these low-count Pfams, I plotted the sum of each Pfam's counts across all samples. The sums range from 1 to 209446 (ABC_trans, of course).
The plot doesn't really give a clear cutoff for either the row sum or the diversity index: e.g. requiring a row sum > 50 will remove many Pfams that have a high diversity index, and vice versa for a diversity cutoff.
# R code
library(vegan)
# Rows are Pfams, columns are samples
x <- read.table("camera_proteins_vs_pfam.txt")
# Shannon diversity per row (the default index for vegan's diversity())
shannon_diversity <- diversity(x)
row.sums <- rowSums(x)
plot(log10(row.sums), shannon_diversity)
# To filter a vector in R, use the "Filter" function
length(Filter(function(s) s > 100, row.sums))
# This tells us that 4018 Pfams are still left with a row sum greater than 100
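One way to see how the two candidate filters disagree is to cross-tabulate them. The sketch below uses a made-up count matrix and a small base-R `shannon` helper (hypothetical, written to mirror the natural-log Shannon index) so that it runs without the real data. The row-sum cutoff of 50 is the one mentioned above; the diversity cutoff of 1 is an arbitrary assumption for illustration.

```r
# Toy count matrix (rows = Pfams, columns = samples); all values are made up
counts <- matrix(c(
  30, 30, 30, 30,   # high total, evenly spread
  200, 0, 0, 0,     # high total, single sample
  5, 5, 5, 5,       # low total, evenly spread
  1, 0, 0, 0        # low total, single sample
), nrow = 4, byrow = TRUE,
dimnames = list(c("pfam_a", "pfam_b", "pfam_c", "pfam_d"), paste0("s", 1:4)))

# Hypothetical helper: Shannon index with natural log, ignoring zero counts
shannon <- function(v) {
  p <- v[v > 0] / sum(v)
  -sum(p * log(p))
}

row_sums <- rowSums(counts)
row_shannon <- apply(counts, 1, shannon)

min_sum <- 50      # row-sum cutoff from the text
min_shannon <- 1   # assumed diversity cutoff, for illustration only

# Cross-tabulate which Pfams pass each filter
agreement <- table(sum_ok = row_sums > min_sum,
                   diversity_ok = row_shannon > min_shannon)
print(agreement)

# Keep only the Pfams that pass both filters
keep <- row_sums > min_sum & row_shannon > min_shannon
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

In this toy case each filter alone keeps a Pfam that the other rejects, which is the same disagreement seen in the plot. Requiring both is the strictest option, and whether that suits the real data is a judgment call.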