PerlMonks  

Matching multiple substrings of a string to arrays and printing those that match

by rarenas (Acolyte)
on Apr 03, 2018 at 19:32 UTC

rarenas has asked for the wisdom of the Perl Monks concerning the following question:

Good evening wise monks,

I wrote this Perl script to help filter the raw data from a PubMed article reader (called ppaxe, by Sergio Castillo). Basically, ppaxe reads thousands of PubMed articles for me and searches for possible interactions between proteins/genes. Some of the resulting lines have verbs that do not actually indicate an interaction, and some have multiple verbs, of which some indicate an interaction and others do not.

My Perl script needs to filter out any line that does not have a verb that indicates an interaction. I have a file of approved verbs, a file of discarded verbs and my ppaxe results file. I put my verb lists into arrays and used index instead of the exists function for matching. I am not allowed to use regex, so that the next generation that takes over can understand the program better.

When I run my Perl program it just prints the whole data file without actually filtering anything. Can anyone help me correct my program and show me what I am doing wrong?

Thanks so much,

#!/usr/bin/perl
# discard_lines_by_verbs.pl
use strict;
use warnings;

die "Please use suitable files" if (@ARGV != 3);

my $dis_verbs = shift @ARGV;
my $apr_verbs = shift @ARGV;
my $ppaxe = shift @ARGV;

open(my $in1, "<", "$dis_verbs") or die "error reading $dis_verbs. $!";
open(my $in2, "<", "$apr_verbs") or die "error reading $apr_verbs. $!";
open(my $in3, "<", "$ppaxe") or die "error reading $ppaxe. $!";

my @dis_dic;
my @apr_dic;

while (my $f1_line = <$in1>) {
    chomp($f1_line);
    @dis_dic = $f1_line;
}

while (my $f2_line = <$in2>) {
    chomp($f2_line);
    @apr_dic = $f2_line;
}

while (my $f3_line = <$in3>) {
    chomp($f3_line);
    if ( index($f3_line, @apr_dic) != -1 ) {
        print "$f3_line\n";
    }
    elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {
        print "$f3_line\n";
    }
    else {
        next;
    }
}

close($in1);
close($in2);
close($in3);

These files are small test versions:

approved_verbs_test:

ACTIVATES
ADPRIBOSYLATED
ALTERS
ARGINYLATED
ASSOCIATES
BINDS

discarded_verbs_test:

ARE
ASK
ASSESS
BASED
BECAME
IS

sample_ppaxe_data:

RPSA AKT1 18628488 0.634 BINDS,ALTERS
RUNX2 DKK1 22960397 0.746 ADPRIBOSYLATED,ALTERS
ARHGAP31 RASA1 17158447 0.56 ASSOCIATES
ARHGAP31 RNASE1 17158447 0.602 BECOME
RASA1 RNASE1 17158447 0.554 BASED
NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED
VTN PRAP1 27189837 0.582 IS
MAPK8 RHOD 11414711 0.698 ARGINYLATED,BINDS
IL2 SETBP1 8398987 0.556 BINDS
S100A8 S100A9 20105291 0.596 ASSESS

Desired outcome:

RPSA AKT1 18628488 0.634 BINDS,ALTERS
RUNX2 DKK1 22960397 0.746 ADPRIBOSYLATED,ALTERS
ARHGAP31 RASA1 17158447 0.56 ASSOCIATES
NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED
MAPK8 RHOD 11414711 0.698 ARGINYLATED,BINDS
IL2 SETBP1 8398987 0.556 BINDS

Replies are listed 'Best First'.
Re: Matching multiple substrings of a string to arrays and printing those that match
by choroba (Cardinal) on Apr 03, 2018 at 19:48 UTC
    @dis_dic = $f1_line;
    This assigns the $f1_line to the first element of the array, and deletes anything else. You probably meant push instead.
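To make the difference concrete, here is a minimal sketch comparing list assignment with push. The in-memory @lines array is a hypothetical stand-in for the lines read from the verb file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines of a verb file.
my @lines = ("ARE", "ASK", "ASSESS");

my @assigned;
my @pushed;
for my $line (@lines) {
    @assigned = $line;      # list assignment: replaces the whole array each time
    push @pushed, $line;    # appends, so earlier lines are kept
}

print "after assignment: ", scalar(@assigned), " element(s): @assigned\n";
print "after push:       ", scalar(@pushed), " element(s): @pushed\n";
```

With assignment only the last line survives, which matches what the original script ends up holding in each array.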

    index($f3_line, @apr_dic && @dis_dic)
    index's second argument is a substring, not an array, and definitely not a boolean AND of the sizes of two arrays (@arr1 && @arr2 returns 0 if @arr1 is empty, and size of @arr2 otherwise).
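A small sketch of what index actually receives may help here. It assumes each array holds a single element, as the original assignment leaves them, and uses one line from the sample data. Because the array is reduced to its element count, the search is for the string "1", which would also explain why every sample line gets printed: each of them contains the digit 1 somewhere.

```perl
use strict;
use warnings;

my @apr_dic = ("BINDS");   # one element, as left by the original assignment
my @dis_dic = ("IS");
my $line    = "VTN PRAP1 27189837 0.582 IS";

# index() puts its arguments in scalar context, so an array becomes its
# element count. This searches for the string "1", not for "BINDS".
my $pos_array = index($line, @apr_dic);
my $pos_digit = index($line, "1");

# In scalar context, @a && @b yields 0 for an empty @a, else the size of @b.
my $both = (@apr_dic && @dis_dic);

print "index with array: $pos_array\n";
print "index with '1':   $pos_digit\n";
print "(\@apr_dic && \@dis_dic) = $both\n";
```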

    Update: You might need to remove leading, trailing, or repeated commas from the last column after processing by the following code:

Re: Matching multiple substrings of a string to arrays and printing those that match
by toolic (Bishop) on Apr 03, 2018 at 19:46 UTC
    Your dis and apr arrays only have one element each in them, corresponding to the last line of each file. You should push into each array. For example, change:
    @dis_dic = $f1_line;

    to:

    push @dis_dic, $f1_line;

    Also, your use of index looks strange. Re-read the docs. You could use grep there. But, I think using hashes is better than arrays in this case.
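As a rough sketch of the hash idea: the helper name has_approved_verb is hypothetical, and the code assumes the verbs sit in the last whitespace-separated column, joined by commas, as in the sample data. split does take a pattern, but a single space and a literal comma involve no regex syntax that a later maintainer would have to decode.

```perl
use strict;
use warnings;

# Verbs from approved_verbs_test; the real script would read them from the
# file, one per line.
my %approved = map { $_ => 1 }
    qw(ACTIVATES ADPRIBOSYLATED ALTERS ARGINYLATED ASSOCIATES BINDS);

# Hypothetical helper: true if any verb in the last column is approved.
sub has_approved_verb {
    my ($line) = @_;
    my @fields = split ' ', $line;          # whitespace-separated columns
    return 0 unless @fields;                # skip empty lines
    my @verbs = split ',', $fields[-1];     # comma-separated verbs
    for my $verb (@verbs) {
        return 1 if exists $approved{$verb};
    }
    return 0;
}

for my $line ("NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED",
              "VTN PRAP1 27189837 0.582 IS") {
    print "$line\n" if has_approved_verb($line);
}
```

Unlike a substring search over the whole line, the hash lookup only matches complete verbs in the verb column.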

      For reading the arrays of verbs one might also do something like:

      my @dis_dic;
      open(my $in1, "<", "$dis_verbs") or die "error reading $dis_verbs. $!";
      chomp(@dis_dic = <$in1>);
      close($in1);

      Besides needing grep or List::Util qw(any) to test index for an array of substrings, I am confused by the logic/specification of:
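For illustration, testing index against every verb in an array could look roughly like this. The helper name contains_any is hypothetical, the data comes from the sample files, and any assumes a reasonably recent List::Util. Note that this is a substring test over the whole line, so a verb that happened to occur inside a gene name would also count.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Verbs from approved_verbs_test; the real script would read them from the file.
my @apr_dic = qw(ACTIVATES ADPRIBOSYLATED ALTERS ARGINYLATED ASSOCIATES BINDS);

# Hypothetical helper: true if at least one verb occurs as a substring of $line.
sub contains_any {
    my ($line, @verbs) = @_;
    return any { index($line, $_) != -1 } @verbs;
}

my @sample = (
    "ARHGAP31 RASA1 17158447 0.56 ASSOCIATES",
    "ARHGAP31 RNASE1 17158447 0.602 BECOME",
    "NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED",
);

for my $line (@sample) {
    # grep in scalar context counts the matching verbs instead of stopping early.
    my $hits = grep { index($line, $_) != -1 } @apr_dic;
    print "$line ($hits approved)\n" if contains_any($line, @apr_dic);
}
```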

      elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {

      I am not sure whether the intent is to test for both approved and discarded verbs or for either one. There is no point in testing for the 'both' case, since those lines were already printed by the approved test. Similarly, if the intent is 'either', I think we just need to grep/any the discarded verbs. Do you need discarded verbs removed from the matching lines as in choroba's post?

      Ron
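If the discarded verbs should indeed be removed from the lines that are kept (the desired outcome in the question does not show this, so it is only an assumption), one possible sketch is to rebuild the verb column from the verbs that survive. Joining the kept verbs again means no leading, trailing or repeated commas are left behind. The helper name strip_discarded is hypothetical, the output is assumed to be acceptable with single spaces between columns, and verbs that appear in neither list (such as BECOME in the sample) pass through unchanged.

```perl
use strict;
use warnings;

# Verbs from discarded_verbs_test; the real script would read them from the file.
my %discarded = map { $_ => 1 } qw(ARE ASK ASSESS BASED BECAME IS);

# Hypothetical helper: returns the line with discarded verbs dropped from the
# last column, or nothing (undef in scalar context) when no verb is left.
sub strip_discarded {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return unless @fields;
    my @kept = grep { !exists $discarded{$_} } split(',', pop @fields);
    return unless @kept;
    return join(' ', @fields, join(',', @kept));
}

for my $line ("NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED",
              "VTN PRAP1 27189837 0.582 IS") {
    my $clean = strip_discarded($line);
    print "$clean\n" if defined $clean;
}
```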
