User:Lindenb/Notebook/UMR915/20100901

From OpenWetWare

Jump to: navigation, search

20100831        Top        20100902       


Contents

PP interactions

Map UCSC knownGene to EMBL strings

java program scanning XML of http://string.embl.de/ might be too slow. trying using flat files. First download human data:

  curl -s "http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz" | gunzip -c | egrep '^9606\.' | egrep ' 9606\.' > ~/jeter.links

how many interactions ?

 wc jeter.links 
 2577772   7733316 118577512 jeter.links

how many proteins ?

  tr " " "\n" < jeter.links |  egrep '^9606\.' |sort | uniq | wc -l
  17369


does STRING only use ensembl identifiers ? YES:

 tr " " "\n" < jeter.links | egrep '^9606\.' | grep -v ENSP
 (nothing)

cleanup 'links', remove '9606.'

  sed 's/9606\.E/E/g' jeter.links | tr " " " " > A; mv A jeter.links


download mapping ENSP - knownGene:

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -P 3306 -D hg18 -e 'select distinct K.name,E.protein from knownToEnsembl as K, ensGtp as E where E.transcript=K.value and E.protein like "ENSP%" ' > jeter.ensembl
 wc -l jeter.ensembl 
 59310

sort this mapping

 sort -t '  ' -k 2,2 jeter.ensembl | grep ENSP > a ; mv a jeter.ensembl

sort links on first column

 sort -t '  ' -k1,1 jeter.links > a; mv a jeter.links

join links with ensembl for first column, sort result

 join -t '  ' -1 1 -2 2 jeter.links jeter.ensembl | cut -d '        ' -f 2- | sort -t '     ' -k 1,1 > a
 mv a jeter.links
 head jeter.links 
 ENSP00000000233	151	uc002lfr.1
 ENSP00000000233	151	uc003frn.1
 ENSP00000000233	154	uc001ndt.1
 ENSP00000000233	154	uc001ndu.1
 ENSP00000000233	155	uc001rzo.1
 ENSP00000000233	155	uc001rzp.1
 ENSP00000000233	157	uc002qtg.1
 ENSP00000000233	160	uc003fri.1
 ENSP00000000233	160	uc003frj.1
 ENSP00000000233	162	uc001nnb.1

join for second protein:

 join -t '  ' -1 1 -2 2 jeter.links  jeter.ensembl  | cut -d '      ' -f 2- > jeter.strings.txt
 wc jeter.strings.txt
 7896492  23689476 205308792 jeter.strings.txt

result:

 head jeter.strings.txt 
 151	uc002lfr.1	uc003vmb.1
 151	uc002lfr.1	uc010llb.1
 151	uc003frn.1	uc003vmb.1
 151	uc003frn.1	uc010llb.1
 154	uc001ndt.1	uc003vmb.1
 154	uc001ndt.1	uc010llb.1
 154	uc001ndu.1	uc003vmb.1
 154	uc001ndu.1	uc010llb.1
 155	uc001rzo.1	uc003vmb.1
 155	uc001rzo.1	uc010llb.1

put in mysql

 CREATE TABLE `kg2kg` (
 `confidence` smallint(6) NOT NULL,
 `kg1` varchar(11) NOT NULL,
 `kg2` varchar(11) NOT NULL,
 KEY `kg1` (`kg1`),
 KEY `kg2` (`kg2`)
 ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ;

then

   mysql (...) -e 'load data local infile "jeter.strings.txt" into table kg2kg'

Hum may be too large... , another option

Load missing tables from UCSC

load http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGtp.txt.gz and http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownToEnsembl.txt.gz from UCSC

 mysql -D hg18 < ensGtp.sql
 mysql -D hg18 < knownToEnsembl.sql
 mysql -D hg18 -e 'load data local infile "ensGtp.txt" into table ensGtp'
 mysql -D hg18 -e 'load data local infile "knownToEnsembl.txt" into table knownToEnsembl'

Download Links

 curl -s http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz | gunzip  -c  |\
  egrep '^9606\.' | egrep ' 9606\.' | sed 's/9606\.//g' | tr " " "       " > jeter.links


 CREATE TABLE `ensp2ensp` (
 `protein1` varchar(16) NOT NULL,
 `protein2` varchar(16) NOT NULL,
 `confidence` smallint(6) NOT NULL,
 KEY `protein1` (`protein1`),
 KEY `protein2` (`protein2`)
 ) ENGINE=InnoDB DEFAULT CHARSET=latin1

load

 mysql -u root -D emblstring -e 'load data local infile "jeter.links" into table ensp2ensp'

Fact knownGene to ENSEMBL is a 1:1 relation

select  K.name,count(name) as C from knownToEnsembl as K, ensGtp as E where E.transcript=K.value  group by K.name having C!=1;
Empty set (0.78 sec)
Personal tools