BioGeek has asked for the wisdom of the Perl Monks concerning the following question:

Hey, I'm in the process of writing my first bioinformatics script (see BioGeek's scratchpad).
(For those interested in the biological background of what I'm trying to accomplish:

For only very few genetic disorders the exact molecular cause is known. There are however a big number of inheritable diseases where researchers have been able to pinpoint -via various techniques- some stretch on one or more chromosomes where suspect to find the gene(s) responsible for the disease. Each of these stretches (called a locus) still contains several hundred genes, and testing each and every gene on it is near impossible because it would be much to time-consuming and expensive. So, several prioritization techniques have been invented to rank the candidate disease genes on each locus. A good prioritization technique would give as a result very low values for most of the genes on a locus, and high values for the few genes it thinks are involved in the disease pathway. A very strong indication that a gene is responsible for a disease would be when several independent prioritization techniques each give the gene a high score.
).

It consists of 4 parts:
1) Asks for user input.
#usr/bin/perl -T use strict; use warnings; use diagnostics; use LWP; use DBI; use Data::Dumper; ######################## # Let user make a choice ######################## TOP: print "Typ the number of the disease you want the\n"; print "genes to be calculated and compared of.\n"; print "1\tAlzheimer disease (5, 6)\n"; print "2\tBardet Biedl (syndrome 1, 2, 3, 5)\n"; print "3\tBreast cancer (type 4)\n"; print "4\tdiabetes mellitus, noninsulin-dependent (3)\n"; print "5\tHypertension (essential, susceptibility to, 2)\n"; print "6\tInflammatory bowel disease (7)\n"; print "7\tSystemic lupus erythematosus (susceptibility to, 3)\n"; print "8\tmuscular dystrophy, limb-girdle (autosomal recessive) (type +1D, 1H)\n"; print "9\tparkinsons disease (type 3, 4, autosomal dominant, Lewy body +)\n"; print "10\tprostate cancer (susceptibility to)\n";

2) depending on this input, a number of subroutines are fired up.
######################## # Define for which keywords to go looking in the different databases ######################## my $choice = <STDIN>; chomp($choice); if ($choice == 1) { #get_g2d('Alzheimer disease-5'); #get_g2d('Alzheimer disease 6'); get_pocus('alz'); #get_dgp('alz 10q'); #get_dgp('alz 12p11.23-q13.12'); } elsif ($choice == 2) { #get_g2d('Bardet-Biedl syndrome 1'); #get_g2d('Bardet-Biedl syndrome 2'); #get_g2d('Bardet-Biedl syndrome 3'); #get_g2d('Bardet-Biedl syndrome 5'); get_pocus('bb'); #get_dgp('bb 11q13'); #get_dgp('bb 16q21'); #get_dgp('bb 2q31'); #get_dgp('bb 3p13-p12'); } elsif ($choice == 3) { #get_g2d('Breast cancer, type 4'); get_pocus('bc'); #get_dgp('bc 13q21'); } elsif ($choice == 4) { #get_g2d('Diabetes mellitus, noninsulin-dependent, 3'); get_pocus('niddm'); #get_dgp('niddm 20q12-q13'); } elsif ($choice == 5) { #get_g2d('Hypertension, essential, susceptibility to, 2'); get_pocus('h'); #get_dgp('h 15q'); } elsif ($choice == 6) { #get_g2d('Inflammatory bowel disease 7'); get_pocus('ibd'); #get_dgp('ibd 1p36'); } elsif ($choice == 7) { #get_g2d('Systemic lupus erythematosus, susceptibility to, 3') +; get_pocus('sle'); #get_dgp('sle 4p16-15.2'); } elsif ($choice == 8) { #get_g2d('Muscular dystrophy, limb-girdle, type 1D'); #get_g2d('Muscular dystrophy, limb-girdle, type 2H'); get_pocus('md'); #get_dgp('md 7q'); #get_dgp('md 9q31-q33'); } elsif ($choice == 9) { #get_g2d('Parkinson disease, type 3'); #get_g2d('Parkinson disease 4, autosomal dominant, Lewy body') +; #get_pocus('pd'); #get_dgp('pd 4p15'); } elsif ($choice == 10) { #get_g2d('Prostate cancer, susceptibility to'); get_pocus('pc'); #get_dgp('pc 143000001-160000000'); } else { print "Type a number between 1 and 10.\n"; goto TOP }

3) One of the subroutines opens a locally stored file (available here), parses it, and returns a list with EnsembleID's (an international standard for gene name's) and their associated score's.
#################### # open file with results from POCUS database, parse it, # and retrieve gene-name and score. #################### my @results=(); sub get_pocus { my @disease_name_pocus = @_; print "Getting the genes and rankings from the POCUS analysis +for:\n"; print "$disease_name_pocus[0]\n"; open (POCUS, "/home/jeroen/POCUS/results_100.out") || die "cou +ldn't open the file: $!\n"; while (<POCUS>) { push @results, [ split ]; } open (MARKER, ">marker.list.txt") || die $!; for (my $i = 0; $i < scalar @results; $i++) { if ($results[$i]->[0] eq $disease_name +_pocus[0]) { if ($results[$i]->[1] ne $resul +ts[($i+1)]->[1]){ print MARKER "$results[$i]->[1 +]\t$results[$i]->[4]\n"; } } } } close(MARKER);

4) This information is then used to query the Ensembl database, and retrieve for each gene the start and stop positions on the chromosone, and their strand (can be + or -).
################## # Make connection with Ensembl, and retrieve start, end and rank. ################## my $datasource = "DBI:mysql:homo_sapiens_core_23_34e:ensembldb.ensembl +.org"; my $dbh=(); my $sth=(); # database and statement handles my @rows; # to get results or queries my $row_count; # to see how much we get back my $username = "anonymous"; my $password = ""; my $marker = 'marker_list.txt'; # connect to database: $dbh = DBI->connect($datasource, $username, $password, {RaiseError => +1}); open (LIST, "<$marker") or die ("Couldn't open file $marker: $!.\n"); while (<LIST>){ my ($ensemblID, $score) = $_; # select all sequence entries: $row_count = 0; $sth = $dbh->prepare (qq{ SELECT gene_stable_id.stable_id, gene.seq_region_start, gene.s +eq_region_end, gene.seq_region_strand FROM gene, gene_stable_id WHERE gene_stable_id.gene_id = gene.gene_id AND gene_stable_id.stable_id ="$ensemblID" }); $sth->execute (); # read results: while (@rows = $sth->fetchrow_array () ) { $row_count++; print join ("\t", @rows), "\n"; } unless ( $row_count ) { print "This gene doesn't appear to be placed on the current as +sembly.\n"; } # tidy up: close(LIST); $sth->finish (); $dbh->disconnect (); }
The first three parts work as advertised, but the last part gives me an Can't call method "finish" on an undefined value at get_pocus.pl line 168 error. Any clues as to what I might be doing wrong?

Also, I'd like my final output to be in the format:
$chromosome\t$start\t$end\t$ensemblID\t$score\t$strand\n
chr22 1000 5000 ENSG00000141485 960 +
chr22 2000 6000 ENSG00000145848 200 -

As noted, this is my very first script, so code will probably be hairy. Any remarks on how to streamline the code, or tips on how to improve it are appreciated very much.
Update: moved finish() within the while loop (thanks gellyfish). This resolves the DBI error. The question about the output, and the request for constructive feedback still stand tough.
Update 2:Fixed a typo (s/marker.list.txt/marker_list.txt/).

Replies are listed 'Best First'.
Re: Can't call on undefined value (DBI) + constructive feedback asked.
by graff (Chancellor) on Aug 03, 2004 at 14:55 UTC
    There are a few things about your code as posted that seem odd (i.e. not good):
    • You open MARKER for output inside the "get_pocus" subroutine, but close it outside the subroutine.
    • You connect to the database before going into the  while (<LIST>) loop at the end, but disconnect inside the while loop.
    • You prepare a statement on each iteration of that while loop with a quoted parameter in the "where" condition, whereas you should prepare the statement before the while loop with a "?" placeholder, and just call "execute" on that statement in each iteration, passing the parameter value to the "execute" call.

    Also, it looks like the usage is just a matter of providing a single number between 1 and 10. It would make more sense to have that as a command-line arg:

    my $Usage = <<ENDUSE; Usage: $0 N where N is a number between 1 and 10: 1 : Alzheimer disease 2 : Bardet Biedl 3 : Breast cancer ... ENDUSE die $Usage unless ( @ARGV == 1 and $ARGV[0] =~ /^\d+$/ and $ARGV[0] > 0 and #ARGV[0] < 11 ); my $choice = shift; ...
    This way, it's easier to run the script either interactively or from another process, because the information that the script needs from the user is part of the command line.
Re: Can't call on undefined value (DBI) + constructive feedback asked.
by knoebi (Friar) on Aug 03, 2004 at 14:53 UTC
    For the points one 1 and 2 you could also use a hash:

    my %hash = ( 1 => { desc=>'Alzheimer disease (5, 6)', keyword=>'alz'}, 2 => { desc=>'Something', keyword=>'bb' }, ...);
    And then you can loop over this hash for the print statement and also you don't need this long if elsif switch.

    ciao knoebi

Re: Can't call on undefined value (DBI) + constructive feedback asked.
by dragonchild (Archbishop) on Aug 03, 2004 at 14:58 UTC
    $dbh = DBI->connect($datasource, $username, $password, {RaiseError => +1}) or die $DBI::errstr;

    RaiseError doesn't affect the actual connect() call, just any method called on the $dbh afterwards.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Are you sure? In my experience this code will not call my die statement because RaiseError is on....
      use DBI; eval { # this connect will fail $dbh = DBI->connect('dbi:Oracle:dev', 'user', 'pass', {RaiseError = +> 1, PrintError => 0}) || die "my die"; print "after connect\n"; }; if ($@) { print "$@\n"; }
      The output when running this code is...
      # perl test.pl DBI connect('dev','user',...) failed: ORA-01017: invalid username/pass +word; logon denied (DBD: login failed) at test.pl line 5
      But this code will call the die statement because RaiseError is off...
      use DBI; eval { # this connect will fail $dbh = DBI->connect('dbi:Oracle:dev', 'user', 'pass', {RaiseError = +> 0, PrintError => 0}) || die "my die"; print "after connect\n"; }; if ($@) { print "$@\n"; }
      This output when running this code is...
      # perl test.pl my die at test.pl line 5.
      Update: The DBI Changes file says this
       Changes in DBI 0.91,	10th December 1997
        ...
        DBI->connect(..., { RaiseError=>1 }) now croaks if connect fails.
        ...
      
Re: Can't call on undefined value (DBI) + constructive feedback asked.
by jZed (Prior) on Aug 03, 2004 at 16:43 UTC
    $sth->finish() is *only* needed when you have not fetched all available rows in the statement handle.

    Since your "results.100.out" file and your "marker.list.txt" file appear to be "tab delimited" files, you may want to consider using DBD::CSV which can easily handle files with tabs as field separators and anything as a record separator. That way, you'd be able to use DBI and SQL to search those files, eliminate the parsing parts of your code, gain the benefits of file locking (if relevant), and gain some speed (if relevant).