Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Extacting lines where one column matches a name from a list of names

by mr_clean (Initiate)
on Sep 13, 2019 at 01:07 UTC ( #11106109=perlquestion: print w/replies, xml ) Need Help??

mr_clean has asked for the wisdom of the Perl Monks concerning the following question:

Hello, wise monks.

I have some tab delimited data:

name1 \t data \t moredata \t evenmoredata \t
name2 \t data \t moredata \t evenmoredata \t
etc.

And a list of names which are interesting;

name1 name2 name3 name4

I have been using awk to extract rows from the tab delimited file where column 0 matches my name, and looping through my list of names in order to extract all matching rows. But as my data grows, this task gets very time consuming (6 hours for 200 mb worth of data; and I want to analyse files that are 10 gb!)

So I would like to user perl (I have no perl experience, previously worked in bash).

My plan:

Convert each entry in the data file to a key (the name column) and corresponding hashes (all the other data in each column).
Do a regex match between my list of names, and my hash keys.
Use the matching hashes to extract the keys.

But I have no idea where to start.

Any advice?

  • Comment on Extacting lines where one column matches a name from a list of names

Replies are listed 'Best First'.
Re: Extacting lines where one column matches a name from a list of names
by davido (Cardinal) on Sep 13, 2019 at 15:50 UTC

    You seek a performance improvement over your awk solution. For there to be an improvement, there must be room to improve. I think you mentioned in a follow-up comment that your list of names is in the order in which they will appear in the files you are parsing. This is useful, as you can save the small amount of time it might have taken to import the list of names into a hash. So lets do a little profiling:

    use strict; use warnings; use Time::HiRes qw(time); open my $name_infh, '<', 'path/to/names/list' or die $!; open my $haystack_infh, '<', 'path/to/tab/delimited/list' or die $!; my $t0 = time(); while(<$name_infh>) {} while(<$haystack_infh>) {} printf "Elapsed time: %-.03f\n", time-$t0;

    Now run that on your input file; the largest one you've got, and see how long it takes. If it takes too long, you can stop right there because there is no Perl (or any other language) solution that will meet your time requirements unless you change the requirements by processing streams more frequently, or overnight when it doesn't matter, etc.

    If it is fast enough, then you could take the next step by implementing a solution in Perl that is similiarly linear in its computational complexity:

    use strict; use warnings; open my $name_infh => '<', 'path/to/names/list' or die "Unable to open names list: $!\n"; open my $haystack_infh => '<', 'path/to/tab/del/file' or die "Unable to open haystack file: $!\n"; my $name = <$name_infh>; chomp $name; while (my $line = <$haystack_infh>) { my ($test_name, $payload) = split /\t/, $line, 2; if ($name eq $test_name) { print "We have a winner: $test_name => $payload"; $name = <$name_infh>; last if !defined $name; chomp $name; } }

    This operates under the assumption that there will be exactly one match for each name in your list, and that your names list is in the correct order. If those assumptions are incorrect, then read your names list into a hash to start with; this will incur only a slight penalty -- so slight it's probably not worth maintaining your names list in any particular order to begin with. If it's not in order, just do this:

    my %want; while(<$name_infh>) { chomp; $want{$_}++; } while (my $line = <$haystack_infh>) { my ($test_name, $payload) = split /\t/, $line, 2; if (exists $want{$test_name}) { delete $want{$test_name}; print "We have a winner: $test_name => $payload"; last if ! keys %want; } }

    This last solution is still a linear time solution, as was the previous one, but is more flexible on the order in which things happen. It still makes one assumption; you're only looking for each name one time. You can remove the delete and the last if lines if that assumption isn't correct.

    At any rate, if the initial profiling check determined that the sheer act of reading the files takes longer than you have, you'll have to come up with a different strategy that doesn't involve sitting around waiting for large files to load.


    Dave

Re: Extacting lines where one column matches a name from a list of names
by Marshall (Abbot) on Sep 13, 2019 at 02:25 UTC
    The first issue is that Perl Monks is a forum to help folks learn about Perl, thereby enabling those folks to get better at doing it themselves. This is not a general code writing service. However, if you were able to make some demonstrated attempt at this yourself, you probably will get lots of help. So one issue is how to get started? Do you have any programming experience at all? I am unsure of the very best current books on Beginning Perl, but I remember an O'Reilly book by that name. I hope other Monks can make additional suggestions?

    Your Plan is a bit confusing to me owing to your use of the term "hash keys" which I associate with something very specific and which might not be what you meant.

    In general a data search problem of this type is done by keeping your "list of names" in memory in an efficiently searchable form. A hash table would often be used for this. Is name1 in my list of names? can be answered very quickly.

    Things become more complex if some amount of "sort of matches" is allowed. The question: Does name1 "look like" something in my "list of names" can be complex or computationally expensive.I'd have to have some example data to make a concrete recommendations.

    So, if an exact match to one of words in your list is required, then a simple hash table of your names would suffice. Read a line of data, decide if the name matches and if so, do "something" with it, otherwise skip that line (do nothing). Read next line, rinse repeat.

    Please give some more detail. Then we can discuss "What to do" in more detail (the processing algorithm). Along the way, you will need to quite a bit of learning on your own about "How to do it". A good and perhaps stepwise plan should be of interest to you along with some books and other material to read in order for you to get started.

    Update: I guess one starting point would be to try to translate your awk code into Perl. The enormous execution time suggests to me that you have a very inefficient algorithm for determining if a name is relevant or not? How you are currently making that decision is one main point of mine above.

      I understand. I believe that using hash tables would speed things up significantly. I don't need sort of matches, because my data has been designed such that my list of names exactly match the corresponding entries within the table (i.e. regex).

      The list of names looks like this:

      1000567/1
      1000567/2
      1000574/1
      1000574/2

      And the total data (from which I want to extract the matching columns) looks like this:

      I have removed the tabs and replaced them with commas, so that each entry is in one string
      1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0
      122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0
      641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0


      Separately, I have the first column as a matching "hash file"
      1083978/2
      122854/1
      641613/1

      So now, I will use the hash file and the comma-separated data to create hash and key pairs. I will put the hashes in a table. And finally, I will compared my hashes to my list of names using regex matching. For the hashes that match, I will fetch the keys and replace commas with tabs, thus giving me my original data. Does this seem sensible?
        Does this seem sensible? No, it does not.

        I think you want to select particular lines of interest from the input for further processing? This simple code does that.

        #!/usr/bin/perl use strict; use warnings; use Inline::Files; # Just for this Demo # so that this is a single runnable file # Monk node_id=11106111 my %nameValid; # initialize Name Hash... # Name the hash by what the Value Means, # not by what the hash key means while (my $name = <LISTOFNAMES>) { chomp $name; # shorter idioms exist $nameValid{$name}=1; # but this is just fine } while (my $line = <DATA>) { chomp $line; my ($name,$data) = split ",",$line,2; if ( exists $nameValid{$name} ) { # name is in "list" do something # I have no idea what to do? # so I just print that line print "$name => $data\n"; } } # prints the one line of interest: # 122854/1 => 284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0 __LISTOFNAMES__ 1000567/1 1000567/2 122854/1 1000574/1 1000574/2 __DATA__ 1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0 122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0 641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0
        Please see Wiki Article on regex (Regular Expression).
        You may be able to use the -f option on grep to do what you want?

        One feature of your example data here is that the field of interest (the "name" field) is always the first field in the record, i.e., always at the start of a string read from a file. (Update: This approach assumes that the  $rx_sep field separator pattern cannot possibly appear in a "name" field!) This anchor can be very useful. If you build a regex to match all the names "of interest" (see haukex's article Building Regex Alternations Dynamically), it's a one-pass process to read all records in a file and match and extract only those records of interest.

        c:\@Work\Perl\monks>perl use strict; use warnings; my @names = qw(1000567/1 1000567/2 122854/1 1000574/2); my $rx_sep = qr{ , }xms; # adjust to match real field separator my ($rx_interesting) = map qr{ \A (?: $_) (?= $rx_sep) }xms, join q{ | }, map quotemeta, # proper! reverse sort # order! @names ; print "$rx_interesting \n"; # for debug my @data = ( '1083978/2,284224,284292,chrX,255,+,284224,284292,255,0,0,1,68,0', '122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0,1,53,0', '641613/1,284224,284290,chrX,255,+,284224,284290,255,0,0,1,66,0', ); while (my $datum = shift @data) { print "interesting: >$datum< \n" if $datum =~ $rx_interesting; } __END__ (?msx-i: \A (?: 122854\/1 | 1000574\/2 | 1000567\/2 | 1000567\/1) (?= +(?msx-i: , )) ) interesting: >122854/1,284224,284277,chrX,255,+,284224,284277,255,0,0, +1,53,0<
        Note that  $rx_sep has to be adjusted to match whatever field separator your data records actually use.

        Update 1: I didn't notice that haukex already suggested this approach here. Oh well... At least you have a worked example :)

        Update 2: I've noticed a stupid mistake in my code as originally posted. A part of the sequence of operations to build  $rx_interesting was incorrectly given as
            reverse sort
            map quotemeta,
        The code has been corrected. The error (quotemeta-ing before sort-ing) should have made no difference in this particular application, but there are corner cases (update: in other potential applications) in which it would (although I'm unable to think of a good example of such a case ATM).


        Give a man a fish:  <%-{-{-{-<

Re: Extacting lines where one column matches a name from a list of names
by haukex (Chancellor) on Sep 13, 2019 at 05:25 UTC
    Do a regex match between my list of names, and my hash keys.

    This is only necessary if you need regex features, like partial matches - if that's necessary, see Building Regex Alternations Dynamically for one solution. If you want to match exact strings, then simply looking up the first column in a hash built from the list of names should be best, as Marshall showed.

      The OP wrote in a reply to me:
      "I don't need sort of matches, because my data has been designed such that my list of names exactly match the corresponding entries within the table (i.e. regex)."
      I think one of the problems is that the OP is using terms like "regex" in ways that don't seem to make sense. I don't think that the OP really means "regex".

      I am working on an approximate pattern matching problem currently. Right now I take an input, then generate a series of regex expressions that are or'd together. This indeed does work, but it is slow because that generated regex has to be compared against a large number of tokens. Right now I suspect that some sort of tree structure can be generated such that I can determine the approximate matches much, much faster. I absolutely know that this is true with a subset of my problem, something a bit more than N1. I am matching with a subset of N2. That subset of N2 being defined currently by my regex generator.

      I have a colleague who is willing to look at the problem. I am working on a spec that is detailed enough so that he can understand what needs to be done without having to have any pre-knowledge of my application. That's more difficult than it sounds - the what's and why's are complex.

      If anybody has experience with a similar sounding problem, send me a message.

Re: Extacting lines where one column matches a name from a list of names
by Anonymous Monk on Sep 13, 2019 at 03:24 UTC
    I have been using awk ... But I have no idea where to start.

    If you have perl installed, save your awk commands to a file, then type: a2p file

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11106109]
Approved by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (4)
As of 2019-10-16 04:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?