User:Lindenb/Notebook/UMR915/20100901
From OpenWetWare
PP interactions
Map UCSC knownGene to EMBL strings
java program scanning XML of http://string.embl.de/ might be too slow. trying using flat files. First download human data:
curl -s "http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz" | gunzip -c | egrep '^9606\.' | egrep ' 9606\.' > ~/jeter.links
how many interactions ?
wc jeter.links 2577772 7733316 118577512 jeter.links
how many proteins ?
tr " " "\n" < jeter.links | egrep '^9606\.' |sort | uniq | wc -l 17369
does STRING only use ensembl identifiers ? YES:
tr " " "\n" < jeter.links | egrep '^9606\.' | grep -v ENSP (nothing)
cleanup 'links', remove '9606.'
sed 's/9606\.E/E/g' jeter.links | tr " " " " > A; mv A jeter.links
download mapping ENSP - knownGene:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -P 3306 -D hg18 -e 'select distinct K.name,E.protein from knownToEnsembl as K, ensGtp as E where E.transcript=K.value and E.protein like "ENSP%" ' > jeter.ensembl wc -l jeter.ensembl 59310
sort this mapping
sort -t ' ' -k 2,2 jeter.ensembl | grep ENSP > a ; mv a jeter.ensembl
sort links on first column
sort -t ' ' -k1,1 jeter.links > a; mv a jeter.links
join links with ensembl for first column, sort result
join -t ' ' -1 1 -2 2 jeter.links jeter.ensembl | cut -d ' ' -f 2- | sort -t ' ' -k 1,1 > a mv a jeter.links
head jeter.links ENSP00000000233 151 uc002lfr.1 ENSP00000000233 151 uc003frn.1 ENSP00000000233 154 uc001ndt.1 ENSP00000000233 154 uc001ndu.1 ENSP00000000233 155 uc001rzo.1 ENSP00000000233 155 uc001rzp.1 ENSP00000000233 157 uc002qtg.1 ENSP00000000233 160 uc003fri.1 ENSP00000000233 160 uc003frj.1 ENSP00000000233 162 uc001nnb.1
join for second protein:
join -t ' ' -1 1 -2 2 jeter.links jeter.ensembl | cut -d ' ' -f 2- > jeter.strings.txt wc jeter.strings.txt 7896492 23689476 205308792 jeter.strings.txt
result:
head jeter.strings.txt 151 uc002lfr.1 uc003vmb.1 151 uc002lfr.1 uc010llb.1 151 uc003frn.1 uc003vmb.1 151 uc003frn.1 uc010llb.1 154 uc001ndt.1 uc003vmb.1 154 uc001ndt.1 uc010llb.1 154 uc001ndu.1 uc003vmb.1 154 uc001ndu.1 uc010llb.1 155 uc001rzo.1 uc003vmb.1 155 uc001rzo.1 uc010llb.1
put in mysql
CREATE TABLE `kg2kg` ( `confidence` smallint(6) NOT NULL, `kg1` varchar(11) NOT NULL, `kg2` varchar(11) NOT NULL, KEY `kg1` (`kg1`), KEY `kg2` (`kg2`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
then
mysql (...) -e 'load data local infile "jeter.strings.txt" into table kg2kg'
Hum may be too large... , another option
Load missing tables from UCSC
load http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/ensGtp.txt.gz and http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownToEnsembl.txt.gz from UCSC
mysql -D hg18 < ensGtp.sql mysql -D hg18 < knownToEnsembl.sql mysql -D hg18 -e 'load data local infile "ensGtp.txt" into table ensGtp' mysql -D hg18 -e 'load data local infile "knownToEnsembl.txt" into table knownToEnsembl'
Download Links
curl -s http://string.embl.de:8080/newstring_download/protein.links.v8.3.txt.gz | gunzip -c |\ egrep '^9606\.' | egrep ' 9606\.' | sed 's/9606\.//g' | tr " " " " > jeter.links
CREATE TABLE `ensp2ensp` ( `protein1` varchar(16) NOT NULL, `protein2` varchar(16) NOT NULL, `confidence` smallint(6) NOT NULL, KEY `protein1` (`protein1`), KEY `protein2` (`protein2`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8
load
mysql -u root -D emblstring -e 'load data local infile "jeter.links" into table ensp2ensp'
Fact knownGene to ENSEMBL is a 1:1 relation
select K.name,count(name) as C from knownToEnsembl as K, ensGtp as E where E.transcript=K.value group by K.name having C!=1; Empty set (0.78 sec)