PerlMonks  

You seek a performance improvement over your awk solution. For there to be an improvement, there must be room to improve. I think you mentioned in a follow-up comment that your list of names is in the order in which they will appear in the files you are parsing. This is useful, as you can save the small amount of time it might have taken to import the list of names into a hash. So let's do a little profiling to find the floor, which is how long it takes just to read both files and do nothing else:

use strict;
use warnings;
use Time::HiRes qw(time);

open my $name_infh, '<', 'path/to/names/list' or die $!;
open my $haystack_infh, '<', 'path/to/tab/delimited/list' or die $!;

my $t0 = time();
while (<$name_infh>) {}
while (<$haystack_infh>) {}
printf "Elapsed time: %.3f\n", time() - $t0;

Now run that on the largest input file you've got and see how long it takes. If that alone takes too long, you can stop right there: simply reading the files is a lower bound for any approach that reads them in full, so no Perl (or any other language) solution will meet your time requirements unless you change the requirements, for example by processing the streams more frequently, or overnight when it doesn't matter.

If it is fast enough, then you could take the next step by implementing a solution in Perl that is similarly linear in its computational complexity:

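use strict;
use warnings;

open my $name_infh, '<', 'path/to/names/list'
    or die "Unable to open names list: $!\n";
open my $haystack_infh, '<', 'path/to/tab/del/file'
    or die "Unable to open haystack file: $!\n";

# Start with the first wanted name; bail out if the list is empty.
my $name = <$name_infh>;
defined $name or die "Names list is empty\n";
chomp $name;

while (my $line = <$haystack_infh>) {
    # Only the first tab-delimited field is compared; the rest is payload.
    my ($test_name, $payload) = split /\t/, $line, 2;
    if ($name eq $test_name) {
        print "We have a winner: $test_name => $payload";
        # Advance to the next wanted name; stop once the list is exhausted.
        $name = <$name_infh>;
        last if !defined $name;
        chomp $name;
    }
}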

This operates under the assumption that there will be exactly one match for each name in your list, and that your names list is in the correct order. If those assumptions are incorrect, then read your names list into a hash to start with; this will incur only a slight penalty -- so slight it's probably not worth maintaining your names list in any particular order to begin with. If it's not in order, just do this:

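# Load the wanted names into a hash for constant-time lookups.
my %want;
while (<$name_infh>) {
    chomp;
    $want{$_}++;
}

while (my $line = <$haystack_infh>) {
    my ($test_name, $payload) = split /\t/, $line, 2;
    if (exists $want{$test_name}) {
        # Each name is reported once, so drop it after the first hit.
        delete $want{$test_name};
        print "We have a winner: $test_name => $payload";
        # Stop reading as soon as every name has been found.
        last if !keys %want;
    }
}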

This last solution is still a linear time solution, as was the previous one, but it doesn't care about the order in which the names appear. It still makes one assumption: you're only looking for each name one time. If that assumption isn't correct, remove the delete line and the last if !keys %want line, and every matching line will be printed.
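
For the case where a name may match more than once, here is a minimal sketch of the hash version with the delete and the early exit taken out. The sub name matching_lines and the sample data are made up for illustration, and it reads from in-memory filehandles so it runs without any files on disk; swap in your real paths when you try it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every haystack
# line whose first tab-delimited field appears in the names list.
# There is no delete and no early exit, so repeated names all match.
sub matching_lines {
    my ($name_fh, $haystack_fh) = @_;

    # Load the wanted names into a hash for constant-time lookups.
    my %want;
    while (my $name = <$name_fh>) {
        chomp $name;
        $want{$name} = 1;
    }

    # Single pass over the haystack; keep every line that matches.
    my @hits;
    while (my $line = <$haystack_fh>) {
        my ($test_name) = split /\t/, $line, 2;
        push @hits, $line if exists $want{$test_name};
    }
    return @hits;
}

# Made-up sample data, read through in-memory filehandles.
my $names    = "alice\nbob\n";
my $haystack = "alice\t1\ncarol\t2\nbob\t3\nalice\t4\n";

open my $name_fh,     '<', \$names    or die "Cannot open names: $!\n";
open my $haystack_fh, '<', \$haystack or die "Cannot open haystack: $!\n";

my @hits = matching_lines($name_fh, $haystack_fh);
print @hits;
```

The trade-off is that the whole haystack is always read to the end, since there is no longer a point at which you know you are done.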

At any rate, if the initial profiling check determined that the sheer act of reading the files takes longer than you have, you'll have to come up with a different strategy that doesn't involve sitting around waiting for large files to load.


Dave


In reply to Re: Extacting lines where one column matches a name from a list of names by davido
in thread Extacting lines where one column matches a name from a list of names by mr_clean
