PerlMonks  

Homologene BioPerl

by ZWcarp (Beadle)
on Dec 02, 2011 at 16:29 UTC
ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

I have been trying to take a list of human accession numbers for genes mutated in cancer tumor samples and see via HomoloGene whether they have homologues in Drosophila Melanogaster, because I have a collaborator with a genetic screen assay set up in this species. The batch submission for Entrez doesn't seem to be working for me. Someone suggested that BioPerl might have a module that could do something like this, and that it would be easier than dealing with Entrez batch queries. Does anyone have any idea which module it would be, or how to use it for this? I have not been able to find one. Thank you for your time.

Re: Homologene BioPerl
by Marshall (Prior) on Dec 02, 2011 at 17:23 UTC
Re: Homologene BioPerl
by erix (Vicar) on Dec 02, 2011 at 18:30 UTC

    There is a bioperl module that knows how to talk to NCBI's E-Utilities: see http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook (it mentions homologene - I suppose it works, but I haven't tried it). You can also use the EUtilities directly. Both approaches have a slight learning curve.

    Another, third approach is to download homologene into a local database. The NCBI E-Utilities work well, but working with homologene, I find it handier (and faster) to have all data locally, and use the file provided by NCBI in:

    ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/
    The file 'homologene.data' there, when stored in a database, looks like this (just showing 10 random rows):
     homologene_group_id | tax_id | geneid  |     symbol      | protein_gi | protein_accession
    ---------------------+--------+---------+-----------------+------------+-------------------
                       3 |   9606 |      34 | ACADM           |  187960098 | NP_001120800.1
                       3 |   9598 |  469356 | ACADM           |  114557331 | XP_524741.2
                       3 |   9615 |  490207 | ACADM           |   73960161 | XP_547328.2
                       3 |   9913 |  505968 | ACADM           |  115497690 | NP_001068703.1
                       3 |  10090 |   11364 | Acadm           |    6680618 | NP_031408.1
                       3 |  10116 |   24158 | Acadm           |    8392833 | NP_058682.1
                       3 |   7955 |  406283 | acadm           |   47085823 | NP_998254.1
                       3 |   7227 |   38864 | CG12262         |   24660351 | NP_648149.1
                       3 |   7165 | 1276346 | AgaP_AGAP005662 |   58387602 | XP_315683.2
                       3 |   6239 |  181757 | acdh-10         |   17569725 | NP_510788.1

    What you want is to look up your human gene or accession (human: tax_id=9606), take the group_id, and see if there is a Drosophila melanogaster (fly: tax_id=7227) record within the same group id.

    In case you have basic database skills, here is a way to load that file into a postgresql database:

    #!/bin/sh

    wget ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data;

    < homologene.data psql -c "
    drop table if exists my_homologene_data;
    create table my_homologene_data (
       homologene_group_id    integer
    ,  tax_id                 integer
    ,  geneid                 integer
    ,  symbol                 text
    ,  protein_gi             integer
    ,  protein_accession      text
    );
    copy my_homologene_data from stdin csv delimiter E'\t';
    ";

    echo "select count(*) from my_homologene_data" | psql;

    The records that have the same group id are homologs.

    select * from my_homologene_data where homologene_group_id = 31015

    With that group id, you can easily construct links into specific NCBI homologene pages too:

    http://www.ncbi.nlm.nih.gov/homologene/?term=31015
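
    As a minimal sketch of that link construction (the function name homologene_url is made up for illustration; only the URL pattern comes from the link above), a small shell helper could look like this:

```shell
#!/bin/sh
# Hypothetical helper (not part of any NCBI tool): print the HomoloGene
# page URL for a numeric group id, using the URL pattern shown above.
homologene_url() {
    case $1 in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input rather than build a bad link.
            echo "not a numeric group id: $1" >&2
            return 1
            ;;
    esac
    printf 'http://www.ncbi.nlm.nih.gov/homologene/?term=%s\n' "$1"
}

homologene_url 31015
```

    Since the group id is the first column of homologene.data, the ids could be fed to this helper from something like cut -f1 on that file.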

    hth

    P.S. Re zoological nomenclature: in the binomial name Drosophila melanogaster, 'melanogaster' is the epitheton and must *always* be lower case; only genus names must be capitalised.

      I can't thank you enough for your help.

      So I tried to do what you are saying by downloading the database. I got that far, and I have the homologene.data file:

      [zwc2101|login2] ~/TALL/ref/HomoloGene $ head homologene.data
      3   9606    34        ACADM             187960098   NP_001120800.1
      3   9598    469356    ACADM             114557331   XP_524741.2
      3   9615    490207    ACADM             73960161    XP_547328.2
      3   9913    505968    ACADM             115497690   NP_001068703.1
      3   10090   11364     Acadm             6680618     NP_031408.1
      3   10116   24158     Acadm             8392833     NP_058682.1
      3   7955    406283    acadm             47085823    NP_998254.1
      3   7227    38864     CG12262           24660351    NP_648149.1
      3   7165    1276346   AgaP_AGAP005662   58387602    XP_315683.2
      3   6239    181757    acdh-10           17569725    NP_510788.1

      I must say though that my "database" skills are nonexistent; I am good, however, at parsing and basic unix/perl matching etc. So from what you are saying, the group id number will be the same for each gene (including its homologs, should they be named differently), and I just need to find everything that has the group id for each human accession number, and then see if any of those lines match the Drosophila tax ID? Thanks again, you have helped me tremendously!

      Here's my code so far for this:

      #!/usr/bin/perl -w
      use strict;

      open (FILEHANDLE, "$ARGV[0]") || die("Could not open HomoloGene file");
      my @homo = <FILEHANDLE>;
      close (FILEHANDLE);
      open (FILEHANDLE, "$ARGV[1]") || die("Could not open input file");
      my @file = <FILEHANDLE>;
      close (FILEHANDLE);

      chomp(@homo, @file);   # strip newlines once, up front

      foreach my $line (@homo) {
          (my $GroupID) = split(/\t/, $line);
          foreach my $Gene (@file) {
              # Human gene name and human tax ID (second column)
              if ($line =~ m/\t\Q$Gene\E\t/ && $line =~ m/^\d+\t9606\t/) {
                  foreach my $LINE (@homo) {
                      (my @drosophila) = split(/\t/, $LINE);
                      # Same gene group ID and Drosophila tax ID
                      if ($LINE =~ m/^$GroupID\t7227\t/) {
                          # [3] is the Drosophila gene name in the group ID determined above
                          print $Gene . "\t" . $drosophila[3] . "\n";
                      }
                  }
              }
          }
      }

        See also the NCBI explanatory files in:

        ftp://ftp.ncbi.nih.gov/pub/HomoloGene

        Especially the README file, which says:

        ----------------------------------------------------------
        homologene.data is a tab delimited file containing the following columns:

        1) HID (HomoloGene group id)
        2) Taxonomy ID
        3) Gene ID
        4) Gene Symbol
        5) Protein gi
        6) Protein accession
        -----------------------------------------------------------

        So yes, you search for your human accession in column 6, then look at what value column 1 has (the homologene group id), and then look up whether there is a row which has both taxonomy_id=7227 (=D.melanogaster) *AND* that homologene group id.
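
        For someone more at home with unix tools than with a database, the lookup just described can be sketched with plain sh and awk. This is only an illustration under some assumptions: the function name find_fly_homologs and the one-accession-per-line list file are made up here, and the ".N" version suffix is stripped before comparing so that a list entry like NP_001120800 still matches NP_001120800.1 in column 6. The sample rows are taken from the head output earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: find_fly_homologs DATA_FILE ACCESSION_FILE
# Prints: human_accession <TAB> group_id <TAB> fly_symbol <TAB> fly_accession
find_fly_homologs() {
    awk -F '\t' '
        # Pass 1 (accession list): remember each accession without its ".N" version.
        pass == 1 {
            acc = $1
            sub(/\.[0-9]+$/, "", acc)
            if (acc != "") wanted[acc] = 1
            next
        }
        # Pass 2 (homologene.data): for human rows (column 2 is 9606) whose
        # accession (column 6) is wanted, remember the group id (column 1).
        # If several wanted accessions share a group, only the last is kept.
        pass == 2 {
            acc = $6
            sub(/\.[0-9]+$/, "", acc)
            if ($2 == 9606 && (acc in wanted)) group[$1] = $6
            next
        }
        # Pass 3 (homologene.data again): fly rows (column 2 is 7227)
        # that sit in one of the remembered groups.
        pass == 3 && $2 == 7227 && ($1 in group) {
            print group[$1] "\t" $1 "\t" $4 "\t" $6
        }
    ' pass=1 "$2" pass=2 "$1" pass=3 "$1"
}

# Small demonstration on three rows of group 3.
tmp=$(mktemp -d)
printf '3\t9606\t34\tACADM\t187960098\tNP_001120800.1\n'   > "$tmp/homologene.data"
printf '3\t10090\t11364\tAcadm\t6680618\tNP_031408.1\n'   >> "$tmp/homologene.data"
printf '3\t7227\t38864\tCG12262\t24660351\tNP_648149.1\n' >> "$tmp/homologene.data"
printf 'NP_001120800\n' > "$tmp/accessions.txt"

find_fly_homologs "$tmp/homologene.data" "$tmp/accessions.txt"
rm -rf "$tmp"
```

        On those sample rows this prints a single line pairing NP_001120800.1 with the fly gene CG12262 (NP_648149.1) in group 3. Accessions with no human row in the file, or whose group has no 7227 row, simply produce no output.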

        Btw (if you want more data), the 'Gene ID' can be handy too, as it gives you access to the whole of Entrez and lets you construct URLs into the main gene page, etc. More data 'addressable' via 'gene id' is in the files in:

        ftp://ftp.ncbi.nih.gov/gene/DATA

        (esp. gene_info and gene2accession)

        (btw, I do /not/ see any Homologene records for your NP_001124398, so maybe your bioperl script does work after all, if you give it a human accession with known data in homologene)
