Wilke:Information theory

From OpenWetWare

THE WILKE LAB


Concepts from information theory, such as entropy and mutual information, can provide useful insights in bioinformatics applications. One downside of these quantities is that they tend to be difficult to estimate without bias. For example, the naive (maximum-likelihood) estimator for entropy,

H = − Σi (ni / N) ln(ni / N),

where ni are the counts of observations of type i and N is the total number of observations, tends to underestimate the true entropy for small samples.
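To see this bias in practice, here is a small simulation sketch (not part of the original page): it draws repeated small samples from a uniform distribution over four symbols, whose true entropy is ln 4 ≈ 1.386, and averages the naive estimate over many trials. The averaged estimate comes out systematically below the true value.

```python
import math
import random

def naive_entropy(counts):
    """Naive (maximum-likelihood) entropy estimate from a list of counts."""
    N = sum(counts)
    return math.log(N) - sum(n * math.log(n) for n in counts if n > 0) / N

random.seed(42)
K, N, trials = 4, 10, 2000   # 4 equiprobable symbols, 10 observations per trial
true_H = math.log(K)         # entropy of the uniform distribution, ln 4

mean_est = 0.0
for _ in range(trials):
    sample = [random.randrange(K) for _ in range(N)]
    counts = [sample.count(k) for k in range(K)]
    mean_est += naive_entropy(counts) / trials

print("true entropy:", true_H)
print("mean naive estimate:", mean_est)  # systematically below true_H
```

The expected shortfall is roughly (K − 1)/(2N) nats, which is exactly the kind of correction the estimators below apply.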

However, estimators of entropy are by now relatively well understood, and we have a number of different estimators to choose from. The following Python code implements four entropy estimators discussed in Grassberger 2003, "Entropy Estimates from Insufficient Samplings," arXiv:physics/0307138.


#!/usr/bin/python

import math # for the log function
from scipy.special import polygamma # needed only by Entropy_H_psi(z)

EulerGamma = 0.57721566490153286060 # Euler Gamma constant

def Gi( n ):
	'''Helper function needed for entropy estimation.
Defined by Grassberger 2003. http://arxiv.org/abs/physics/0307138
'''
	if n == 0:
		return 0
	if n == 1:
		return -EulerGamma-math.log(2)
	if n == 2:
		return 2-EulerGamma-math.log(2)
	if (n % 2) == 1:
		return Gi( n-1 )
	return Gi( n-2 ) + 2./(n-1)



def Entropy_H_naive( z ):
	'''Naive entropy estimator. Generally biased. Should be avoided.
	
z is a list of integer numbers, e.g., z=[1,4,3,2,2,6].
It counts the number of times a particular observation has
been made.
'''
	N = sum(z) # total number of observations
	return math.log( N ) - (1./N) * sum([n*math.log(n) for n in z])


def Entropy_H_Miller( z ):
	'''Miller entropy estimator. Simplest correction to the
naive estimator.
	
z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts
the number of times a particular observation has been made.
'''
	N = sum(z) # total number of observations
	return math.log( N ) - (1./N) * sum([n*(math.log(n)-0.5/n) for n in z])


def Entropy_H_psi( z ):
	'''Entropy estimator according to Grassberger 1988.
Better than the Miller estimator. Based on the digamma function psi.
	
z is a list of integer numbers, e.g., z=[1,4,3,2,2,6].
It counts the number of times a particular observation has
been made.
'''
	N = sum(z) # total number of observations
	return math.log( N ) - (1./N) * \
		sum([n*(polygamma(0,n)+(-1.)**n/(n*(n+1))) for n in z])


def Entropy_H_G( z ):
	'''Best entropy estimator according to Grassberger 2003,
http://arxiv.org/abs/physics/0307138
	
z is a list of integer numbers, e.g., z=[1,4,3,2,2,6]. It counts
the number of times a particular observation has been made.
'''
	N = sum(z) # total number of observations
	return math.log( N ) - (1./N) * sum([n*Gi(n) for n in z])


# ***********************************************
#         code to test the four functions
# ***********************************************
z = [1,1,1,1,1,1,1,2,2,2,3,3,5,6,7,8,10,24,147]
print("z =", z)
print("Entropy estimates:", Entropy_H_naive( z ), Entropy_H_Miller( z ),
	Entropy_H_psi( z ), Entropy_H_G( z ))

Example output:

z = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5, 6, 7, 8, 10, 24, 147]
Entropy estimates: 1.47054411571 1.51257951394 1.52893481177 1.5321868862
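As a quick sanity check on this output: the Miller estimator simply adds K/(2N) to the naive estimate, where K is the number of occupied bins. For this z, K = 19 and N = 226, so the correction is 19/452 ≈ 0.0420, which matches the difference between the first two numbers above. A minimal standalone check:

```python
import math

z = [1]*7 + [2]*3 + [3]*2 + [5, 6, 7, 8, 10, 24, 147]
N = sum(z)   # 226 total observations
K = len(z)   # 19 occupied bins

H_naive = math.log(N) - sum(n * math.log(n) for n in z) / N
H_miller = math.log(N) - sum(n * (math.log(n) - 0.5 / n) for n in z) / N

# The two estimates differ by exactly K/(2N):
print(H_miller - H_naive, K / (2.0 * N))
```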