Matching multiple substrings of a string to arrays and printing those that match

rarenas has asked for the wisdom of the Perl Monks concerning the following question:

Good evening wise monks,

I wrote this Perl script to help filter the raw data from a PubMed article reader (called ppaxe, by Sergio Castillo). Basically, ppaxe reads thousands of articles on PubMed for me and searches for possible interactions between proteins/genes. The results include lines whose verbs do not actually indicate an interaction, as well as lines with multiple verbs, only some of which indicate one.

My Perl script needs to filter out any line that does not have a verb that indicates an interaction. I have a file of approved verbs, a file of discarded verbs, and my ppaxe results file. I put my verb lists into arrays and used the index function instead of exists for matching. I am not allowed to use regex, so that the next generation that takes over can understand the program more easily.

When I run my Perl program, it just prints the whole data file without actually filtering anything. Can anyone help me correct my program and teach me what I am doing wrong?

Thanks so much,

#!/usr/bin/perl
# discard_lines_by_verbs.pl
use strict;
use warnings;

die "Please use suitable files" if (@ARGV != 3);
my $dis_verbs = shift @ARGV;
my $apr_verbs = shift @ARGV;
my $ppaxe = shift @ARGV;

open(my $in1, "<", "$dis_verbs")
  or die "error reading $dis_verbs. $!";
open(my $in2, "<", "$apr_verbs")
  or die "error reading $apr_verbs. $!";
open(my $in3, "<", "$ppaxe")
  or die "error reading $ppaxe. $!";

my @dis_dic;
my @apr_dic;

while (my $f1_line = <$in1>) {
  chomp($f1_line);
  @dis_dic = $f1_line;
}

while (my $f2_line = <$in2>) {
  chomp($f2_line);
  @apr_dic = $f2_line;
}

while (my $f3_line = <$in3>) {
  chomp($f3_line);
  if ( index($f3_line, @apr_dic) != -1 ) {
    print "$f3_line\n";
  }
  elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {
    print "$f3_line\n";
  }
  else {
    next;
  }
}
close($in1);
close($in2);
close($in3);

These files are small test versions:

approved_verbs_test:

ACTIVATES
ADPRIBOSYLATED
ALTERS
ARGINYLATED
ASSOCIATES
BINDS

discarded_verbs_test:

ARE
ASK
ASSESS
BASED
BECAME
IS

sample_ppaxe_data:

RPSA    AKT1    18628488    0.634    BINDS,ALTERS
RUNX2    DKK1    22960397    0.746    ADPRIBOSYLATED,ALTERS
ARHGAP31    RASA1    17158447    0.56    ASSOCIATES
ARHGAP31    RNASE1    17158447    0.602    BECOME
RASA1    RNASE1    17158447    0.554    BASED
NOS1    NOS3    19799911    0.628    ARGINYLATED,BASED
VTN    PRAP1    27189837    0.582    IS
MAPK8    RHOD    11414711    0.698    ARGINYLATED,BINDS
IL2    SETBP1    8398987    0.556    BINDS
S100A8    S100A9    20105291    0.596    ASSESS

Desired outcome:

RPSA    AKT1    18628488    0.634    BINDS,ALTERS
RUNX2    DKK1    22960397    0.746    ADPRIBOSYLATED,ALTERS
ARHGAP31    RASA1    17158447    0.56    ASSOCIATES
NOS1    NOS3    19799911    0.628    ARGINYLATED,BASED
MAPK8    RHOD    11414711    0.698    ARGINYLATED,BINDS
IL2    SETBP1    8398987    0.556    BINDS


Replies are listed 'Best First'.
Re: Matching multiple substrings of a string to arrays and printing those that match by choroba (Cardinal) on Apr 03, 2018 at 19:48 UTC
`@dis_dic = $f1_line;`
This assigns the $f1_line to the first element of the array, and deletes anything else. You probably meant push instead.

`index($f3_line, @apr_dic && @dis_dic)`
index's second argument is a substring, not an array, and definitely not a boolean AND of the sizes of two arrays (`@arr1 && @arr2` returns 0 if @arr1 is empty, and size of @arr2 otherwise).

Update: You might need to remove leading, trailing, or repeated commas from the last column after processing by the following code:
Read more... (1127 Bytes)

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
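
To make the point about `index` concrete, here is a minimal editorial sketch (not taken from the reply above). Because `index` expects scalar arguments, an array passed as its second argument is evaluated in scalar context and becomes its element count. With the one-element arrays produced by the original assignment, the script would therefore search each line for the string "1", which every line of sample_ppaxe_data happens to contain. The variable names mirror the question; the single-element arrays and the tab-separated sample line are assumptions made for illustration.

```perl
use strict;
use warnings;

# One element each, as left behind by "@apr_dic = $f2_line" in the question.
my @apr_dic = ('BINDS');
my @dis_dic = ('IS');

# A line modelled on sample_ppaxe_data (tab-separated here by assumption).
my $line = "IL2\tSETBP1\t8398987\t0.556\tBINDS";

# An array in scalar context yields its element count, not its contents.
my $needle = @apr_dic;

# index() imposes scalar context on its arguments, so these two calls
# perform the same search: for the string form of the count.
my $pos_a = index($line, @apr_dic);
my $pos_b = index($line, $needle);

# "&&" returns its right operand when the left is true; in scalar
# context that is the element count of @dis_dic.
my $pos_c = index($line, @apr_dic && @dis_dic);

print "needle=$needle a=$pos_a b=$pos_b c=$pos_c\n";
```

All three positions agree and none is -1, so both branches of the original `if`/`elsif` succeed for reasons unrelated to the verbs.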
Re: Matching multiple substrings of a string to arrays and printing those that match by toolic (Bishop) on Apr 03, 2018 at 19:46 UTC
Your dis and apr arrays only have one element each in them, corresponding to the last line of each file. You should push into each array. For example, change:

`@dis_dic = $f1_line;`

to:

`push @dis_dic, $f1_line;`

Also, your use of index looks strange. Re-read the docs. You could use grep there. But, I think using hashes is better than arrays in this case.
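
The hash suggestion could be sketched roughly as follows. This is an editorial illustration under stated assumptions, not code from the thread: the verbs are assumed to sit in the last whitespace-separated column with no trailing whitespace, separated by commas; the helper name `has_approved_verb` is hypothetical; and the approved verbs are given inline rather than read from a file. A line is kept when at least one of its verbs is approved, which matches the desired outcome in the question.

```perl
use strict;
use warnings;

# Inline stand-in for the approved-verbs file (assumption for the sketch).
my %approved = map { $_ => 1 }
    qw(ACTIVATES ADPRIBOSYLATED ALTERS ARGINYLATED ASSOCIATES BINDS);

# Hypothetical helper: true if at least one verb in the last column is approved.
sub has_approved_verb {
    my ($line) = @_;

    # The last column starts after the final space or tab.
    # rindex returns -1 when the character is absent.
    my $sp  = rindex($line, ' ');
    my $tab = rindex($line, "\t");
    my $cut = $sp > $tab ? $sp : $tab;
    my $verbs = substr($line, $cut + 1);

    # split treats ',' as a pattern, but here it is just a literal comma.
    for my $verb (split ',', $verbs) {
        return 1 if exists $approved{$verb};
    }
    return 0;
}

my @lines = (
    "NOS1    NOS3    19799911    0.628    ARGINYLATED,BASED",
    "VTN    PRAP1    27189837    0.582    IS",
    "ARHGAP31    RNASE1    17158447    0.602    BECOME",
);
print "$_\n" for grep { has_approved_verb($_) } @lines;
```

Unlike a substring search with `index`, a hash lookup compares whole verbs, so an approved verb that happens to appear inside a longer word or a gene name would not count as a match.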
Re^2: Matching multiple substrings of a string to arrays and printing those that match by mr_ron (Chaplain) on Apr 03, 2018 at 21:24 UTC
For reading the arrays of verbs one might also do something like:

`my @dis_dic; open(my $in1, "<", "$dis_verbs") or die "error reading $dis_verbs. $!"; chomp(@dis_dic = <$in1>); close($in1);`

Besides needing `grep` or `List::Util qw(any)` to test `index` for an array of substrings, I am confused by the logic/specification of:

`elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {`

I am not sure whether the intent is to test for both approved and discarded verbs or either one. There is no point testing for the 'both' case since we already printed for approved and similarly if looking for either I think we just need to grep/any discarded. Do you need discarded verbs removed from the matching lines as in choroba's post?

Ron
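
A minimal editorial sketch of the two ideas in this reply, reading a whole file with `chomp(@array = <$fh>)` and testing `index` against every element with `any`, might look like this. The in-memory filehandle, the shortened verb list, and the helper name `line_is_approved` are assumptions made so the example runs on its own; `any` requires List::Util 1.33 or newer.

```perl
use strict;
use warnings;
use List::Util qw(any);    # any() needs List::Util 1.33 or newer

# In-memory filehandle standing in for the approved-verbs file (assumption).
my $approved_text = "ACTIVATES\nALTERS\nBINDS\n";
open(my $in, '<', \$approved_text)
    or die "cannot open in-memory file: $!";
chomp(my @apr_dic = <$in>);    # read every line, then strip all newlines
close($in);

# Hypothetical helper: true if any approved verb occurs as a substring
# of the line. index() is called once per verb with a scalar needle.
sub line_is_approved {
    my ($line) = @_;
    return any { index($line, $_) != -1 } @apr_dic;
}

for my $line ("IL2\tSETBP1\t8398987\t0.556\tBINDS",
              "VTN\tPRAP1\t27189837\t0.582\tIS") {
    print "$line\n" if line_is_approved($line);
}
```

Since only approved verbs decide whether a line is printed, the discarded list is not consulted here, which is consistent with the observation above that the 'both' case is already covered by the approved test.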

