Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
(Update: Added explanation of why this method is efficient.)

I would divide this problem into smaller pieces:

  1. Specify the patterns.
  2. Convert the pattern specifications into an internal representation that facilitates efficient pattern matching.
  3. Scan the CSV file for matches using the internal representation.

I might write the specifications like follows and stuff them into the __DATA__ section or a configuration file:

pattern 1: 1=2 2=3 0=-1 pattern 2: 1=2 3=4 pattern 3: 3=4 5=2
The first line says, "Pattern 1 requires that column 1 have the value 2, column 2 have the value 3, and column 0 have the value -1." The other lines read similarly. (There is no requirement for the values to be numbers. Any value that has a string representation will work in the following implementation.)

Next, I would write some code to convert the specs into an efficient internal representation. I might use the following representation, which has two parts:

# PART 1: Per-column pattern-matching values # when this column ... # has this value ... # count it as a column match toward these pats ... 0 => { -1 => [ 'pattern 1' ] }, 1 => { 2 => [ 'pattern 1', 'pattern 2' ] }, 2 => { 3 => [ 'pattern 1' ] }, 3 => { 4 => [ 'pattern 2', 'pattern 3' ] }, 5 => { 2 => [ 'pattern 3' ] } # PART 2: Pattern sizes # this pattern ... # requires this many column matches ... 'pattern 1' => 3, 'pattern 2' => 2, 'pattern 3' => 2
The idea behind the representation is that we can decompose each pattern into a set of per-column patterns that must all be matched in order to satisfy the original pattern. For example, pattern 1 can be broken into three per-column patterns:

  • column 1 == 2
  • column 2 == 3
  • column 0 == -1

When scanning a line of the CSV file, if column 1 contains a 2, we can count it as a partial match toward pattern 1. If at the end of the line we have three such partial matches, we know we have satisfied all of pattern 1 and thus have a complete match.

What makes this process efficient is that our internal representation merges all of the shared per-column patterns. Thus when we encountered that value of 2 in column 1 earlier, we would instantly know to count it toward both pattern 1 and pattern 2. The only extra piece we need is to keep track of the counts for the current line in the CSV file, and we can do that via a tiny hash. Thus we can process the file line by line, in passing, with only a tiny bit of overhead.

In other words, this method ought to be fast, easy on the memory, and suitable for processing files larger than your hard drive.

The code to convert the specs into the above representation is short:

my %pattern_colvals; my %pattern_sizes; # convert pattern specs into internal representation while (<DATA>) { my ($pattern, $spec) = split /:\s*/, $_, 2; my @col_patterns = map [split /=/], split ' ', $spec; $pattern_sizes{$pattern} = @col_patterns; for (@col_patterns) { my ($col, $match_val) = @$_; push @{$pattern_colvals{$col}{$match_val}}, $pattern; } }
With the hard work done, we can now parse the CSV file (on STDIN) and look up matches against the internal representation:
# parse input CSV file and check for matches against patterns while (<>) { chomp; my @col_vals = split /,/; # use real CSV parsing instead my ($col, %col_matches); for (@col_vals) { for (@{ $pattern_colvals{$col++}{$_} }) { print "line $. matches $_$/" if ++$col_matches{$_} == $pattern_sizes{$_}; } } }

Note that the inner-most loop almost always disappears because the hash lookup results in undef unless there is a match for the current column.

Let us try it out:

$ { echo -1,2,3,4,5,6 echo -1,2,4,0,0,9 echo 0,0,0,4,3,2 } | perl line 1 matches pattern 1 line 1 matches pattern 2 line 3 matches pattern 3
I hope that this helps.


In reply to Re: Best way to find patterns in csv file? by tmoertel
in thread Best way to find patterns in csv file? by punch_card_don

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2022-09-30 15:01 GMT
Find Nodes?
    Voting Booth?
    I prefer my indexes to start at:

    Results (126 votes). Check out past polls.