Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Best way to find patterns in csv file?

by punch_card_don (Curate)
on Nov 30, 2004 at 20:31 UTC ( #411265=perlquestion: print w/replies, xml ) Need Help??

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Monadic Monks,

I have a text file that looks like this:

record_id, datum_1, datum_2, ... , datum_30;

where record_id is integer, and 99% of datums are integers, although about 1% may be text or decimals. Datums can be null. There are 1.5-million records.

Then I have a collection of about 35,000 patterns I have to search for. That is, find all records that have, for example, datum_1 = x and datum_8 = y and datum_20 = z, regardless of what might be in other columns. A single record may contain several patterns, so each line has to be searched for each pattern

I realize this is just mimicing the functionality of a database, (select record_id from theTable where datum_1 = x and datum_8 = y and datum_20 = z) but I was wondering if there's a very efficient way of doing this directly on the file without setting up a database and without scanning 1.5-million lines 35,000 times (I wonder how long 50-billion line scans would take?). I've thought about this most of the day and come up with nothing promising....

Replies are listed 'Best First'.
Re: Best way to find patterns in csv file?
by tmoertel (Chaplain) on Nov 30, 2004 at 23:54 UTC
    (Update: Added explanation of why this method is efficient.)

    I would divide this problem into smaller pieces:

    1. Specify the patterns.
    2. Convert the pattern specifications into an internal representation that facilitates efficient pattern matching.
    3. Scan the CSV file for matches using the internal representation.

    I might write the specifications like follows and stuff them into the __DATA__ section or a configuration file:

    pattern 1: 1=2 2=3 0=-1 pattern 2: 1=2 3=4 pattern 3: 3=4 5=2
    The first line says, "Pattern 1 requires that column 1 have the value 2, column 2 have the value 3, and column 0 have the value -1." The other lines read similarly. (There is no requirement for the values to be numbers. Any value that has a string representation will work in the following implementation.)

    Next, I would write some code to convert the specs into an efficient internal representation. I might use the following representation, which has two parts:

    # PART 1: Per-column pattern-matching values # when this column ... # has this value ... # count it as a column match toward these pats ... 0 => { -1 => [ 'pattern 1' ] }, 1 => { 2 => [ 'pattern 1', 'pattern 2' ] }, 2 => { 3 => [ 'pattern 1' ] }, 3 => { 4 => [ 'pattern 2', 'pattern 3' ] }, 5 => { 2 => [ 'pattern 3' ] } # PART 2: Pattern sizes # this pattern ... # requires this many column matches ... 'pattern 1' => 3, 'pattern 2' => 2, 'pattern 3' => 2
    The idea behind the representation is that we can decompose each pattern into a set of per-column patterns that must all be matched in order to satisfy the original pattern. For example, pattern 1 can be broken into three per-column patterns:

    • column 1 == 2
    • column 2 == 3
    • column 0 == -1

    When scanning a line of the CSV file, if column 1 contains a 2, we can count it as a partial match toward pattern 1. If at the end of the line we have three such partial matches, we know we have satisfied all of pattern 1 and thus have a complete match.

    What makes this process efficient is that our internal representation merges all of the shared per-column patterns. Thus when we encountered that value of 2 in column 1 earlier, we would instantly know to count it toward both pattern 1 and pattern 2. The only extra piece we need is to keep track of the counts for the current line in the CSV file, and we can do that via a tiny hash. Thus we can process the file line by line, in passing, with only a tiny bit of overhead.

    In other words, this method ought to be fast, easy on the memory, and suitable for processing files larger than your hard drive.

    The code to convert the specs into the above representation is short:

    my %pattern_colvals; my %pattern_sizes; # convert pattern specs into internal representation while (<DATA>) { my ($pattern, $spec) = split /:\s*/, $_, 2; my @col_patterns = map [split /=/], split ' ', $spec; $pattern_sizes{$pattern} = @col_patterns; for (@col_patterns) { my ($col, $match_val) = @$_; push @{$pattern_colvals{$col}{$match_val}}, $pattern; } }
    With the hard work done, we can now parse the CSV file (on STDIN) and look up matches against the internal representation:
    # parse input CSV file and check for matches against patterns while (<>) { chomp; my @col_vals = split /,/; # use real CSV parsing instead my ($col, %col_matches); for (@col_vals) { for (@{ $pattern_colvals{$col++}{$_} }) { print "line $. matches $_$/" if ++$col_matches{$_} == $pattern_sizes{$_}; } } }

    Note that the inner-most loop almost always disappears because the hash lookup results in undef unless there is a match for the current column.

    Let us try it out:

    $ { echo -1,2,3,4,5,6 echo -1,2,4,0,0,9 echo 0,0,0,4,3,2 } | perl pm-patterns.pl line 1 matches pattern 1 line 1 matches pattern 2 line 3 matches pattern 3
    I hope that this helps.

    Cheers,
    Tom

Re: Best way to find patterns in csv file?
by stajich (Chaplain) on Nov 30, 2004 at 21:06 UTC
    A database like DBD::SQLite is pretty easy to use and setup and won't require sweeping through all the rows every time if you setup indexes. It seems it would do most of what you wanted to do. If you know some of the columns will always be numeric you can use the appropriate datatypes. Here I've just made a 4 data column table. Note that you can also do some tweaking to make SQLite a little faster by delaying table writes, etc - There is info on the module page. Here
    #!/usr/bin/perl -w use strict; use DBI; # this is just silly code to show how one might encode the # the columns with values we want to match on, of course you # might want fancier things like ranges, etc so consider it # just an example # so the 1st pattern should match rows where 10 is the # 1st data column and 20 in the second my @patterns = ([10,20,undef,undef], [undef,undef,'green',undef]); my $dbname = 'testdb'; unlink($dbname); # this is for testing so make sure we start # with clean file every time my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname","","", ); my $table = $dbh->do(<<EOF CREATE TABLE record ( record_id int(10) PRIMARY KEY, datum1 int, datum2 int, datum3 varchar(64), datum4 double(10,3) ); KEY i_d1 (datum1); KEY i_d2 (datum2); KEY i_d3 (datum3); KEY i_d4 (datum4); EOF ); my $stmt = $dbh->prepare("INSERT INTO record VALUES (?,?,?,?,?)"); # data is from the stuff at the bottom, just for # example so the script can be selfcontained while(<DATA>){ chomp; my ($id,@datum) = split(/,/,$_); $stmt->execute($id,@datum); } for my $pat ( @patterns ) { my ($i,$str,@vals) = (1); for my $p ( @$pat ) { if( defined $p ) { $str .= ' AND ' if defined $str; $str .= 'datum'.$i.'=?'; push @vals, $p; } $i++; } if( $str && @vals ) { # skip empty pattern rows my $query = $dbh->prepare("SELECT * FROM record WHERE $str"); $query->execute(@vals); while( my $r = $query->fetchrow_arrayref ) { print join(",",@$r),"\n"; } } print "--\n"; # some sort of pattern search delimiter } # this is just embedding example data so this can be a # self contained example. __DATA__ 1,20,40,blue,800.0 2,10,20,yellow,612.23 3,1,17,green,9.89 4,77,50,red,5.102 5,8,33,orange,34.21 6,10,20,silver,998.8
      I second the idea of using a database (and DBD::SQLite unless you already have a "real" RDBMS readily available). 1.5 million rows is a lot of stuff after all. A database will pay off especially if you plan to do this query on a regular basis.

      Now, if you really want to work on the CSV, I personally would run it through grep first. Suppose you need datum1 = 99 and datum2 = 130 and datum4=5. You can cut down the size of the file by throwing out all lines that do not contain "99" and "130" and "5", which is probably a lot.

      grep 99 data.csv | grep 130 | grep 5 > filtered.csv
      After this preprocessing, a CSV module from CPAN might be able to handle the rest.

      Of course, this simple grep only works if you have AND queries (not OR). But you do, right?

      Update: I just saw that you have 35000 patterns to check. Forget the CSV file, use a database (and index the columns that appear in most patterns)

Re: Best way to find patterns in csv file?
by Velaki (Chaplain) on Nov 30, 2004 at 20:42 UTC

    If your 35,000 sort patterns clump, then you might want to sort them into a tree, so when you sweep through the 1.5 million lines (only once) by walking the tree to find each incremental match, you can hold partial matches, and immediately discard those that don't meet the essential criteria.

    Just a thought,
    -v
    "Perl. There is no substitute."
      Ya, I've been thinking about groupings. Given the patterns
      datum_1 = x and datum_2 = y and datum_3 = z datum_1 = x and datum_2 = a and datum_3 = d datum_1 = x and datum_20 = g and datum_13 = j
      first find alll lines for which datum_1 = x, the just work on those for ll other patterns that begin with datum_1 = x. Doing so, I figure I can reduce the number of line sans from 40-billion to ~400-million given the average number of repeat datums in patterns. A nice reduction, but still interested in better. I'll probably end up just putting it all in a database and let the machine do the heavy lifting...
Re: Best way to find patterns in csv file?
by Jenda (Abbot) on Nov 30, 2004 at 21:14 UTC

    The best way depends on the patterns. Anyway I think you should first preprocess your patterns so that you don't have to compare the same values over and over again. Let's say you had just three columns and these four patterns:

    p1: c1 = 5, c3 = 8
    p2: c1 = 5, c3 = 4
    p3: c2 = 4, c3 = 1
    p4: c2 = 3, c3 = 5
    
    so you would build a structure like this:
    %patterns = ( 5 => { undef => { 8 => ['p1'], 4 => ['p2'] } }, undef => { 4 => { 1 => ['p3'] }, 3 => { 5 => ['p4'] } } );
    and whenever you'd need to find all patterns that a match the current row you'd traverse this structure and find all leafs to which you can get through the keys matching the value of the columns or undefs. Eg. if the current row is 5,3,8 you'd first select both $patterns{5} and $patterns{undef()}, then after next step $patterns{5}{undef()} and $patterns{undef()}{3} and at last would end up with just @{$patterns{5}{undef()}{8}}=(p1). Not sure I make much sense, so let's try an example:
    my @row = the data, without the record_id!; my @matching = (\%patterns); while (@row and @matching) { my $value = shift(@row); my @next_matching; foreach my $branch (@matching) { if (exists $branch->{$value}) { # there is a matching pattern if (ref($branch->{$value}) eq 'HASH') { # the pattern requ +ires some more values to match push @next_matching, $branch->{$value}; } else { # the pattern matched completely push @found_patterns, @{$branch->{$value}}; } } if (exists $branch->{undef()}) { # there is a patter that does +n't care about this value if (ref($branch->{undef()}) eq 'HASH') { # the pattern req +uires some more values to match push @next_matching, $branch->{undef()}; } else { # the pattern matched completely push @found_patterns, @{$branch->{undef()}}; } } } @matching = @next_matching; } # and now all the matching patterns are in @found_patterns # the code is untested and only intended as an example

    Actually you may need to use a different value to mean "don't care" in the structure if some patterns require that some columns equal null and null != "".

    I would use Text::CSV_XS to do the actual reading and parsing of the text file.

    Jenda
    We'd like to help you learn to help yourself
    Look around you, all you see are sympathetic eyes
    Stroll around the grounds until you feel at home
       -- P. Simon in Mrs. Robinson

Re: Best way to find patterns in csv file?
by Yendor (Pilgrim) on Nov 30, 2004 at 20:39 UTC
Re: Best way to find patterns in csv file?
by jZed (Prior) on Nov 30, 2004 at 23:00 UTC
    The 16 line script below will take an arbitrary amount of data and run an arbitrary number of match criteria against it, only cycling over the data once. It uses a DATA section, but just replace the second argument to adTie() if the data and patterns are stored in files.

    The script outputs:

      record #2 contains pattern #1
      record #3 contains pattern #2
      record #4 contains pattern #1
      record #4 contains pattern #2
    
    #!perl -w use strict; use AnyData; my($data,$patterns) = split /\n~\n/, join'',<DATA>; my $dataHash = adTie( 'CSV',[$data],'r',{cols=>'id,d1,d2,d3,d4,d5'} + ); my $patternHash = adTie( 'CSV',[$patterns],'r',{cols=>'id,criteria'} ) +; while (my $row = each %$dataHash) { while (my $pattern = each %$patternHash) { my $criteria = $pattern->{criteria}; for my $datum(qw(d1 d2 d3 d4 d5)) { $criteria =~ s/$datum/$row->{$datum}/g; } next unless eval $criteria; printf "record #%s contains pattern #%s\n",$row->{id},$pattern +->{id}; } } __DATA__ 1,9,9,9,9,9 2,3,5,9,9,9 3,9,9,7,8,4 4,3,5,7,8,4 ~ 1,d1==3 and d2==5 2,d3==7 and d4==8 and d5==4
Re: Best way to find patterns in csv file?
by tall_man (Parson) on Dec 01, 2004 at 05:50 UTC
    If you could arrange the 35,000 patterns into a sequence of tests in the right order, you could save a lot of work. Say that pattern 1 requires a value of 1 in column 3, and pattern 2 requires a 2 in column 3. Suppose you tested that first. If the value was 1, then pattern 2 is eliminated. If the value is 2, pattern 1 is eliminated. If it is neither 1 or 2, both are eliminated. So you should look for columns that are tested in many patterns. Such columns are also good candidates for indexes (using BerkeleyDB, for example).

    A sequence of tests like this is called a decision tree. Creating a perfectly optimal one is an NP-complete problem, but there are methods for making fairly good ones. You might check out the module AI::DecisionTree for ideas.

Re: Best way to find patterns in csv file?
by punch_card_don (Curate) on Dec 01, 2004 at 15:39 UTC
    Holy cow! Thanks everyone. I'm reading in detail and will impliment....

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://411265]
Approved by zejames
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (3)
As of 2022-09-28 02:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer my indexes to start at:




    Results (124 votes). Check out past polls.

    Notices?