
From one beginner to others . . .

by greenhorn (Sexton)
on Jul 15, 2000 at 13:52 UTC (#22692=perlmeditation)

I assume this belongs in "Meditations," deep though it ain't. :) Hope the forced line-breaks at about 80 characters don't make a mess on folks' screens...

This is intended as a (useful, I hope) pep-talk to others like me--newcomers to Perl.
The punchline: when the bishops, popes, and minor and major deities here stress
"there's more than one way to do it," and when they post a surprising range of
solutions when answering someone's questions, take the hint: try different solutions,
yourself. Maybe lots of them. In time the payoff could well be markedly improved performance.

Case in point: yesterday I needed a script to extract only certain lines from a
plaintext file exported from a database (comma-separated values). If field 4
contained certain text, the record in question should be printed; otherwise,
skip the record and read the next one.

My first thought was to use split to create an array of each record's fields,
then compare the contents of the array's fourth element with the arg(s)
the user had provided on the command line (the script could test for
any of several strings the user might provide as filters). All lines would
have to be read; there's no predicting how (or if) these databases have been
sorted before they're saved to CSV format.

It all seemed very straightforward. The script, reading a 13,576-line file,
produced the desired results in about 3.5 seconds. I wasn't about to give myself
a Hero Of The People medal for that, but I could live with it. I figured I
was done. No, wait . . .

It occurred to me: what if I were to get the contents of field 4 by using
a regular expression, instead? The RE was a bit unpleasant-looking.
This sort of thing: /^[^,]+,[^,]*,   etc. etc.
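
For readers who want to see the two approaches side by side, here is a minimal sketch. The sample record is made up (the real data was never posted), the sub names are hypothetical, and neither version copes with quoted fields that contain commas.

```perl
use strict;
use warnings;

# Hypothetical sample record; the original CSV data is not available.
my $line = "1001,smith,2000-07-14,MAPI,ok\n";

# Approach 1: split the whole record and take the fourth field.
sub field4_split {
    my ($rec) = @_;
    chomp $rec;
    my @fields = split /,/, $rec;
    return $fields[3];
}

# Approach 2: step over three fields with a regex and capture the fourth.
# The pattern follows the shape quoted in the post: a non-empty first
# field, then fields that may be empty.
sub field4_regex {
    my ($rec) = @_;
    my ($field) = $rec =~ /^[^,]+,[^,]*,[^,]*,([^,]*)/;
    return $field;
}

print field4_split($line), "\n";
print field4_regex($line), "\n";
```

Both subs return the same field for this record; which one is faster depends on the data, which is the point of the story.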

I looked at it a while and thought: Ridiculous. Long-ish regular expression--
big performance hit; "split" must be faster.

W r o n g.  The routine using the RE ran in about 3/4 of a second--roughly 4.5
times faster. Surprised the hell out of me. And here I'd almost dumped the
second approach as "obviously" less efficient and "therefore" slower.

So my lesson for the day, reduced to one unscientific-sounding bromide,
was: Assume nothing; try stuff. Nirvana awaits (or, if not that, then
possible improvements in execution speed :).

Replies are listed 'Best First'.
(jcwren) RE: From one beginner to others . . .
by jcwren (Prior) on Jul 15, 2000 at 16:20 UTC
    While your point is flawlessly made, I'm curious about one thing: Did you go back and re-run the program using split? If the file gets cached, you may be seeing artificial improvements. This is where the Benchmark module can really help in a comparison like this.

    Without seeing your data, I imagine you'd take one line that matched your criteria, and one that didn't. Then write a comparison using both a passing and failing case with your data, duplicated to use split one way, and the regexp the other. This eliminates the file system as a possible source of artificial improvements.
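
A rough sketch of that kind of comparison, with the file system taken out of the picture. The two records are invented stand-ins for one passing and one failing line, and the sub names are hypothetical; `cmpthese` is a real Benchmark function but has to be imported explicitly.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical records standing in for the real CSV data:
# one whose fourth field matches, one whose fourth field does not.
my @lines = (
    "1001,smith,2000-07-14,MAPI,ok,a,b,c,d,e,f,g",
    "1002,jones,2000-07-14,SMTP,ok,a,b,c,d,e,f,g",
);

sub count_with_split {
    my $count = 0;
    for my $line (@lines) {
        my @record = split /\s*,\s*/, $line;
        $count++ if lc($record[3]) eq 'mapi';
    }
    return $count;
}

sub count_with_regex {
    my $count = 0;
    for my $line (@lines) {
        my ($field) = $line =~ /^[^,]+,[^,]*,[^,]*,\s*([^,]+)\s*,/;
        $count++ if defined $field && lc($field) eq 'mapi';
    }
    return $count;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    split => \&count_with_split,
    regex => \&count_with_regex,
});
```

Because the data lives in memory, any difference in the table comes from the two extraction methods rather than from disk caching.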

    There was a discussion about whether split was faster than a regexp or not, rather recently. I don't recall the outcome, but that node may be worth hunting up.


      is split optimized?

      The gist was that split is faster on smaller lines if you want the first field, but a regex was faster for longer lines. We didn't go into a quest for further fields. However you could get the code we used and modify it for such.


(Ovid) RE: From one beginner to others . . .
by Ovid (Cardinal) on Jul 15, 2000 at 21:53 UTC
    I know that gryng has stated that regex is faster than a split for long lines. However, his test case was $testlarge = "a " x 100000;. That's a two-hundred thousand character line. I'm highly inclined to doubt that most people are going to be working with lines of that length. In the real world (no offense to gryng intended), where we're not splitting one hundred thousand repetitions of "a ", we usually use much smaller chunks of data. For those, split is probably the best bet.

    You might want to consider using Benchmark to figure out what works best. I looked at your regex and realized that after you added the capturing parens and assigned the $digit variables, you were looking at a significant performance hit. You can use Benchmark to analyze these things a bit more carefully.

        #!/usr/bin/perl -w
        use strict;
        use Benchmark;
        use vars qw($myvar @results $a $b $c $d);

        $myvar = "one,two,three,four";
        timethese(1000000, {
            Regex => '($a=$1, $b=$2, $c=$3, $d=$4) if $myvar =~ /^([^,]+),([^,]+),([^,]+),([^,]+)$/',
            Split => '@results = split /,/, $myvar'
        });

    This produced:

        Benchmark: timing 1000000 iterations of Regex, Split...
             Regex: 27 wallclock secs (28.06 usr + 0.00 sys = 28.06 CPU)
             Split: 16 wallclock secs (16.19 usr + 0.00 sys = 16.19 CPU)

    In this case, split was clearly the winner. I'd be interested in seeing some sample data and a code snippet to see how you're getting a regex to outperform a split. The structure of the data is everything when it comes to crafting an efficient regex.


      I managed to figure out at least one way to use Benchmark, will wonders never cease. :)
      (Don't know if I have used it in the best possible way, though.)
      Result of running the script shown below:

          Benchmark: timing 30 iterations of REGEXP, SPLIT...
              REGEXP: 10 wallclock secs ( 9.73 usr + 0.25 sys =  9.98 CPU)
               SPLIT: 47 wallclock secs (47.20 usr + 0.28 sys = 47.48 CPU)

      The script didn't pass muster with "-w" when I was trying to print matching lines to "NUL" (file handle hadn't been opened). I changed it simply to count matching lines. "-w" is now happy. (Note to self: something else for later study: how to print only to "NUL" w/out complaint from "-w".)
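
On the "print to NUL" question, one possible approach (a sketch, not necessarily what the poster ended up doing) is to open a real filehandle on the platform's null device, so the handle is actually open when it is printed to. `File::Spec->devnull` is a standard-library call; the variable names are just for illustration, and the lexical filehandle assumes Perl 5.6 or later.

```perl
use strict;
use warnings;
use File::Spec;

# File::Spec->devnull returns the platform's null device
# ("/dev/null" on Unix, "nul" on Windows).
my $null = File::Spec->devnull;
open my $sink, '>', $null or die "Cannot open $null: $!";

# Printing to an opened handle discards the output; print returns true on success.
my $ok = print {$sink} "this line goes nowhere\n";

close $sink or die "Cannot close $null: $!";
```

Counting matches instead, as the poster did, also keeps output overhead out of the timings.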

      The source (CSV) file is 13,576 lines long (1,703,397 bytes). Each record has 12 fields; the average length per record is 124 characters. The task is to print only lines whose fourth fields contain "MAPI".


          use strict;
          use Benchmark;

          timethese( 30, {
              REGEXP => 'UsingRegExp',
              SPLIT  => 'UsingSplit'
          } );

          sub UsingRegExp {
              my $file = 'r:\csv\test.csv';
              my $field;
              my $count = 0;
              open FH, $file or die "\n $file: $!\n";
              while ( <FH> ) {
                  # WANT 4TH FIELD. (NOTE: SOME FIELDS _MIGHT_ BE EMPTY.)
                  ($field) = /^[^,]+,[^,]*,[^,]*,\s*([^,]+)\s*,/;
                  $count++ if lc($field) eq "mapi";   # IGNORE CASE
              }
              close FH or die "\n $file: $!\n";
          }

          sub UsingSplit {
              my $file = 'r:\csv\test.csv';
              my @record;
              my $count = 0;
              open FH, $file or die "\n $file: $!\n";
              while ( <FH> ) {
                  @record = split /\s*,\s*/;
                  $count++ if lc($record[3]) eq "mapi";   # IGNORE CASE
              }
              close FH or die "\n $file: $!\n";
          }

        I prefer this to whiles:

            map { $r = [split (regex here)]; $c += (lc($r->[9]) eq "mapi") } <FH>;

        I haven't tested that this works, but I hope you get my drift.

        Brother Frankus.

      Hmm. And just when I thought it was safe to go back into the water, I added one more thing into the mix. Both subroutines contain: while ( <FH> ) { .. and below each such line, I added: next if $_ !~ /mapi/io; ("MAPI" is the string being tested for.)

      It's the crudest possible test, of course. But why bother with further processing if the string doesn't appear anywhere in the current line?
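
A small sketch of that pre-filter, using invented in-memory records rather than the real file. The third record is there to show that the cheap test only rejects lines; lines that survive it still need the proper field check.

```perl
use strict;
use warnings;

# Hypothetical records in place of the real CSV file.
my @lines = (
    "1001,smith,2000-07-14,MAPI,ok",
    "1002,jones,2000-07-14,SMTP,ok",
    "1003,mapi-user,2000-07-14,SMTP,ok",   # contains "mapi", but not in field 4
);

my $count = 0;
for (@lines) {
    # Cheap rejection: skip lines that do not contain the string at all.
    next unless /mapi/i;

    # Only the surviving lines pay for the full field extraction.
    my @record = split /\s*,\s*/;
    $count++ if lc($record[3]) eq 'mapi';
}

print "$count\n";
```

The /o modifier from the original statement is left out here; for a constant pattern like this one it should not be needed.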

      With that statement added, there was a considerable difference in the benchmarked results:


          Benchmark: timing 30 iterations of REGEXP, SPLIT...
              REGEXP: 14 wallclock secs (14.35 usr + 0.19 sys = 14.54 CPU)
               SPLIT: 17 wallclock secs (16.01 usr + 0.28 sys = 16.29 CPU)

      I should have read my own pep-talk about trying stuff...
      Looks as if those forced line-breaks at 80 chars in the original post weren't such a good idea after all. Sorry. :(

      The question about caching is a good one. I can't say for certain that caching didn't play a part in the performance boost (or in that case I guess I should say a perceived performance boost).

      I later re-created the business end of the routine on another computer and timed both approaches. The results were similar. Haven't yet benchmarked it. I'm still, argh, a bit hazy on exactly how to use Benchmark. But never mind "hazy"--I will make my way through the haze and try it. (To date the two approaches have been timed using only the 4nt command processor's own timer function--far from exact, to be sure.)

      One noticeable difference between the regular expression I used and the one you used in your example here: I had only one set of parens in it. I don't know if this is likely to make a big difference in performance.

      Thanks for the feedback, folks.

        Using only one set of capturing parens will improve the performance of your regex, as it will not be forced to save as many captured substrings. In playing around with this, I managed to optimize the split by breaking it into a minimal number of segments. In all cases, with my example, split significantly outperformed the regex.

            #!/usr/bin/perl -w
            use strict;
            use Benchmark;
            use vars qw($myvar $result $a $b $c $d);

            $myvar = "one,two,three,four";
            timethese(1000000, {
                # Only one capture group here, so $2..$4 are undef after a match.
                Regex  => '$a=$1, $b=$2, $c=$3, $d=$4 if $myvar =~ /^[^,]+,([^,]+),[^,]+,[^,]+$/',
                Split1 => '$result = (split /,/, $myvar)[1]',
                Split2 => '$result = (split /,/, $myvar, 4)[1]',
                Split3 => '$result = (split /,/, $myvar, 3)[1]'
            });

            Benchmark: timing 1000000 iterations of Regex, Split1, Split2, Split3...
                 Regex: 26 wallclock secs (25.75 usr + 0.00 sys = 25.75 CPU)
                Split1: 16 wallclock secs (16.31 usr + 0.00 sys = 16.31 CPU)
                Split2: 16 wallclock secs (16.15 usr + 0.00 sys = 16.15 CPU)
                Split3: 13 wallclock secs (12.74 usr + 0.00 sys = 12.74 CPU)

        Note the whopping improvement in performance of Split3. In my benchmark, it's approximately twice as fast as the regex.
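
For anyone unsure what the third argument to split does, here is a short illustration. The first string is the one from the benchmark; the second record is made up to show how the same idea would apply to a fourth field.

```perl
use strict;
use warnings;

my $myvar = "one,two,three,four";

# With a LIMIT of 3, split produces at most three fields;
# the third holds the unsplit remainder of the string.
my @parts  = split /,/, $myvar, 3;     # ("one", "two", "three,four")
my $second = $parts[1];                # "two"

# To isolate the fourth field of a longer (hypothetical) record,
# a LIMIT of 5 is the smallest that works: four fields plus the remainder.
my $record = "a,b,c,MAPI,e,f,g,h";
my $fourth = (split /,/, $record, 5)[3];   # "MAPI"

print "$second $fourth\n";
```

The limit has to be one more than the number of the field you want; otherwise the wanted field ends up glued to the rest of the line.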

