Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

Need Speed:Search Tab-delimited File for pairs of names

by mnnb (Initiate)
on Dec 16, 2013 at 16:37 UTC ( #1067356=perlquestion: print w/replies, xml ) Need Help??
mnnb has asked for the wisdom of the Perl Monks concerning the following question:

I have the following code to search a tab-delimited text file (the first column only) for PAIRS of names. The pairs of names are passed on the command line. There is a match when both names of any pair are found in that 1st column. A match 'CODE' starts a s '' (empty string). If the 1st pair matches it adds an 'A' to the match CODE. If the 2nd pair matches it adds an 'B' to the match CODE. 3rd pair a 'C', 4th pair a 'D', 5th pair an 'E'. So, depending on the pair of names matched, the match code could be any combo of the five letters ABCDE: any 1,2,3,4 or all five letter or none. If a match is found, the match code and the line are sent to the output file.

I simply need a way to make this overall process as fast as possible. Help on any area: string search, concatenation or whatever.

One other important point/question. I must search using string literals, not regexes. Would a language like C, C++ or C# have faster string literal searching functionality as opposed to Perl? If so is there a source of info on how to go about this. Thanks in advance.

# use strict; # use warnings; $start_run = time(); use v5.10; use Win32::OLE; use autodie; # -------------------------------------------------------------------- +--------------------------------------------------------- sub name_search(@_, $search_string) { # say $search_string ; $found_code = '' ; if(((index $search_string, $ARGV[0]) >= 0) && ((index $search_stri +ng, $ARGV[1]) >= 0)) {$found_code = 'A' ;} if($#ARGV > 2) { if(((index $search_string, $ARGV[2]) >= 0) && ((index $search_ +string, $ARGV[3]) >= 0)) {$found_code .= 'B' ;} if($#ARGV > 4) { if(((index $search_string, $ARGV[4]) >= 0) && ((index $sea +rch_string, $ARGV[5]) >= 0)) {$found_code .= 'C' ;} if($#ARGV > 6) { if(((index $search_string, $ARGV[6]) >= 0) && ((index +$search_string, $ARGV[7]) >= 0)) {$found_code .= 'D' ;} if($#ARGV > 8) { if(((index $search_string, $ARGV[8]) >= 0) && ((in +dex $search_string, $ARGV[9]) >= 0)) {$found_code .= 'E' ;} } } } } return $found_code ; } $print_string = "" ; # Create header for output file. $print_string .= "\t\t" . 'A: ' . $ARGV[0] . " " . $ARGV[1] . "\n" + . "\t\t" . 'B: ' . $ARGV[2] . " " . $ARGV[3] . " +\n" . "\t\t" . 'C: ' . $ARGV[4] . " " . $ARGV[5] . " +\n" . "\t\t" . 'D: ' . $ARGV[6] . " " . $ARGV[7] . " +\n" . "\t\t" . 'E: ' . $ARGV[8] . " " . $ARGV[9] . " +\n" . 'CODE' . "\t" . 'NAME' . "\t" . 'RUNNER' . "\t +" . 'INFO1' . "\t" . 'INFO2' . "\t" . 'INFO3' . "\ +t" . 'INFO4' . "\t" . 'INFO5' . "\t" . 'INFO6' . "\ +t" . 'INFO7' . "\t" . 'INFO8' . "\t" . 'INFO9' . "\t" . 'INFO10' . " +\t" . 'INFO11' . "\t" . 'INFO12' . "\n"; #print $print_string; my @line ; $found_tag = ''; #=pod open (my $data, "<", 'SearchTable.txt'); while(<$data>){ chomp ; @line = split( /\t/, $_ ); $search_string = $line[0] ; $found_tag = &name_search(@_, $search_string) ; # say $found_tag u +nless $found_tag eq '' ; if($found_tag ne '') { $print_string .= $found_tag . "\t" . $line[0] . "\t" . $line[1 +] . "\t" . $line[2] . "\t" . $line[3] . "\t" . $line[4] . "\t" . $line[5] . "\t" . $line[6] + . "\t" . $line[7] . "\t" . $line[8] . "\t" . $line[9] . "\t" . $line[10] . "\t" . $line[1 +1] . "\t" . $line[12] . "\t" . $line[13] . "\n" ; } } close($data); open (OUT1, ">nameS_RECORDS.txt") or die; print OUT1 $print_string; close(OUT1) ; $run_time = time() - our $start_run; print "\n\nJob took $run_time seconds\n";

Replies are listed 'Best First'.
Re: Need Speed:Search Tab-delimited File for pairs of names
by kcott (Chancellor) on Dec 16, 2013 at 19:21 UTC

    G'day mnnb,

    Welcome to the monastery.

    I suspect you're very new to Perl and have guessed at most of the code you've posted here. Various inconsistencies also suggest you've borrowed code from other sources without understanding what they do.

    Commenting out strict and warnings is a big mistake: it doesn't fix the reported issues; you've simply stuck your head in the sand and pretended they're not there.

    I suggest you read perlintro. Follow the links you find there for further information on the various topics that are relevant to your current task and ask here if you don't understand.

    From your problem description, here's how I might have tackled it.

    #!/usr/bin/env perl use strict; use warnings; my @match_letters = 'A' .. 'E'; my @argv = qw{q w a s z x qwe wer zxcvb xcvbn}; my @re_pairs = map { [ qr{$argv[$_ * 2]}, qr{$argv[$_ * 2 + 1]} ] } 0 +.. 4; while (<DATA>) { my $first_col = (split /\t/)[0]; my $match_code = ''; for my $i (0 .. 4) { if ($first_col =~ $re_pairs[$i][0] && $first_col =~ $re_pairs[ +$i][1]) { $match_code .= $match_letters[$i]; } } print "$match_code: $_" if length $match_code; } __DATA__ qwerty blah1 asdfgh blah2 zxcvbn blah3


    AD: qwerty blah1 B: asdfgh blah2 CE: zxcvbn blah3

    As you can see, I've dummied up file and command line input and have only produced basic output. This may not be exactly what you want but should provide some direction: note, for instance, that I've only captured the first column not every tab-separated element; used a for loop instead of your deeply nested if statements; and, printed the original line read rather than attempting to recreate it from the split elements.

    [For subsequent questions, please follow the guidelines in "How do I post a question effectively?": a better question gets better answers.]

    -- Ken

      Hi Ken,

      your code is obviously much shorter and cleaner than the original post, but using regexes rather than the index function is rather unlikely to improve performance, which is the OP's primary request. Or did I miss something?

        The journey to a better program (for some definition of 'better', in this case faster) begins with a program that works and that one can understand. As suggested elsewhere, the OP code is a spaghetti monster that dare not enable strictures and warnings lest it reveal a host of naughty practices and lurking bugs.

        kcott's shorter and cleaner code, assuming it actually does what mnnb wants, is much more likely to be a good starting point for improvement. I haven't studied it closely, but it seems to me that the regexes, if insufficiently speedy, could fairly easily be replaced by the use of index. In any event, while the use of regexes will not improve performance, it is also unlikely, IMHO, to significantly degrade it versus index in this case. But only benchmarking will determine the trade-offs.

        Update: Minor wording changes; no semantic change.

        You are quite correct in that I haven't addressed mnnb's primary request; however, I did state my intention: "This may not be exactly what you want but should provide some direction: ...".

        There were so many issues with the posted code (e.g. "sub name_search(@_, $search_string) { ... }" and "$run_time = time() - our $start_run;") that I chose not to attempt to make this code (in its present form) faster as that didn't seem like a useful exercise.

        Beyond that, I can only echo what ++AnomalousMonk wrote in the first response to your comment.

        -- Ken

      Commenting out strict and warnings is a big mistake: it doesn't fix the reported issues;
      Why not? I can't diagnose this, because I'm not running on Windows, but it seems like you're chanting boilerplate at the questioner. Which of his mistakes would be caught by adding those strings?

        The bad sub prototype and the interesting use of 'our'. That is assuming that variables are declared appropriately.

        True laziness is hard work
Re: Need Speed:Search Tab-delimited File for pairs of names
by hdb (Monsignor) on Dec 16, 2013 at 19:00 UTC

    You should be analyzing your parameters before you start your search to make life easier, e.g. like this:

    use strict; use warnings; sub name_search { join '', map {$_->[0]} grep { 0<=index $_[0], $_->[1] and 0<=index $ +_[0], $_->[2] } @{$_[1]}; } my @criteria; my $letter = 65; push @criteria, [ chr( $letter++ ), splice @ARGV, 0, 2 ] while @ARGV > += 2; print name_search 'John Smith', \@criteria; print "\n";

    Also, have a look at join. You could write, assuming you open your output file before the loop:

    while(<$data>) { chomp; my @line = split /\t/; my $found_tag = name_search $line[0], \@criteria; print OUT1 join( "\t", $found_tag, @line[0..13] )."\n" if $found_tag +; } close $data;
Re: Need Speed:Search Tab-delimited File for pairs of names
by oiskuu (Hermit) on Dec 16, 2013 at 21:49 UTC

    Addressing the question of literal matching: note that regex allow you to safely embed variables using \Q and \E, e.g.

    my @pats = map qr/\Q$_\E/, @ARGV;
    The above prepares @ARGV as patterns, disabling interpretation of meta-characters.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1067356]
Approved by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2018-06-20 18:10 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (117 votes). Check out past polls.