PerlMonks  

fast string parser: regex versus substr

by mce (Curate)
on Nov 20, 2003 at 16:09 UTC (#308607, perlquestion)
mce has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I want to find a really fast way to parse a string. I have found two methods so far, but there must be a better way to do it.
The string is position-delimited, i.e. from the 3rd to the 16th position it contains one field, and so forth. I want only the text in these fields, not the blanks.
Let me show you:

our @data=<DATA>;

# some code comes here...

sub dosubstr {
    foreach my $i ( 0..$#data ) {
        my $line=$data[$i];
        my $jcpu=substr($line,2,16);
        my $j=substr($line,18,48);
        my $s=substr($line,290,16);
        $jcpu =~ s/\s//g;
        $s =~ s/\s//g;
        $j =~ s/\s//g;
        # warn "$jcpu $j $s";
        # ..store the values in a hash, but that is not important here
    }
}

sub doregex {
    foreach my $i ( 0..$#data ) {
        my $line=$data[$i];
        $line =~ m/^04(?=(\S+)).{16}(?=(\S+)).{40}.{216}.{16}(?=(\S+)).{16}/ ;
        my $jcpu=$1;
        my $j=$2;
        my $s=$3;
        # warn "$jcpu $j $s";
        # ..store the values in a hash, but that is not important here
    }
}

__DATA__
04A12345 RELEASE A12345 RELEASE A12345
04FTOP DD_BUIL+ FTOP DD_REKL+ FTOP
04FTOP DD_PLAN+ FTOP DD_REKL+ FTOP

Now, in a simple benchmark, the doregex function is 3 times faster than dosubstr. But it just looks so complex, doesn't it?
So I am asking my fellow monks for their wisdom on how to make it faster.
I am dealing with thousands of lines of data, and every second counts, as my operators don't like to wait for webpages :-)
Thanks in advance,
Update: fixed the substr values in response to Abigail-II's comment.
---------------------------
Dr. Mark Ceulemans
Senior Consultant
BMC, Belgium

Re: fast string parser: regex versus substr
by liz (Monsignor) on Nov 20, 2003 at 16:18 UTC
Re: fast string parser: regex versus substr
by dragonchild (Archbishop) on Nov 20, 2003 at 16:19 UTC
    Read up on unpack. For fixed-width data, it's the fastest Perl has to offer.
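
As a minimal sketch of what unpack does with fixed-width data: the record below is hypothetical (a 2-character type code followed by a 16-character and a 40-character field), and the variable names are chosen for illustration only. The `A` format extracts the given number of characters and strips trailing spaces, so no separate whitespace cleanup is needed for right-padded fields.

```perl
use strict;
use warnings;

# Hypothetical record: the layout and values are assumptions for this sketch.
my $line = sprintf '%-2s%-16s%-40s', '04', 'A12345', 'RELEASE';

# A2  -> 2 characters, trailing spaces stripped
# A16 -> 16 characters, trailing spaces stripped
# A40 -> 40 characters, trailing spaces stripped
# Whitespace inside the template is ignored and only aids readability.
my ($type, $jcpu, $j) = unpack 'A2 A16 A40', $line;

print "type=[$type] jcpu=[$jcpu] j=[$j]\n";
# prints: type=[04] jcpu=[A12345] j=[RELEASE]
```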

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: fast string parser: regex versus substr
by Abigail-II (Bishop) on Nov 20, 2003 at 16:22 UTC
    Well, at least one of the approaches is incorrect. I uncommented the warnings, turned them into prints and added the following to your program:
        dosubstr;
        print "---\n";
        doregex;
    Running your program gives me:
        A12345 RELEASE A1 2345RELEASEA12345
        FTOP DD_BUIL+ FT OPDD_REKL+FTOP
        FTOP DD_PLAN+ FT OPDD_REKL+FTOP
        ---

    Abigail

Re: fast string parser: regex versus substr
by Art_XIV (Hermit) on Nov 20, 2003 at 16:47 UTC

    unpack, as the other monks have stated, is almost undoubtedly the way to go if your data is fixed-width.

    It's a moot point, but your benchmarks may have been spurious since you didn't s/// whitespace in the doregex function like you did w/ dosubstr.
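
The difference in whitespace handling can be seen on a single field. This is only a sketch: the 16-character field with an interior space is a made-up value, not taken from the original data. It compares stripping all whitespace (as dosubstr does), capturing through a lookahead (as doregex does), and unpack's `A` format.

```perl
use strict;
use warnings;

# Hypothetical 16-character field containing an interior space.
my $field = sprintf '%-16s', 'DD BUILD';

# dosubstr style: remove every whitespace character, interior ones too.
(my $stripped = $field) =~ s/\s//g;

# doregex style: the lookahead captures only the first run of non-blanks.
my ($first_word) = $field =~ /^(?=(\S+)).{16}/;

# unpack 'A16' removes trailing whitespace only.
my ($trimmed) = unpack 'A16', $field;

print "[$stripped] [$first_word] [$trimmed]\n";
# prints: [DDBUILD] [DD] [DD BUILD]
```

So the three approaches only agree when a field has no interior or leading blanks. Note also that the lookahead form requires a non-blank character at the start of each field; otherwise the whole match fails.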

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: fast string parser: regex versus substr
by mce (Curate) on Nov 20, 2003 at 16:55 UTC
    Thanks all,

    I came up with:

        sub dopack {
            foreach my $i ( 0..$#data ) {
                my $line=$data[$i];
                my ($jcpu,$j,$s)=(unpack('@2A16A40A232A16', $line))[0,1,3];
                warn "$jcpu $j $s";
                # ..store the values in a hash, but that is not important here
            }
        }

    And it won the benchmark competition :-)

    I never really understood pack and unpack, but I am getting to like them.
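
For anyone wanting to repeat such a comparison, here is a hedged sketch using the core Benchmark module. The records are synthetic (made-up values in a 306-character layout: "04", a 16-character field, a 40-character field, 232 filler characters, a 16-character field), and the sub names via_substr, via_regex and via_unpack are hypothetical. The timings will depend on the Perl version and the data, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic records; the values are invented for this sketch.
my @data = map {
    sprintf '04%-16s%-40s%-232s%-16s', "CPU$_", "JOB$_", '', "SCHED$_"
} 1 .. 1000;

sub via_substr {
    my @out;
    for my $line (@data) {
        my $jcpu = substr $line, 2, 16;
        my $j    = substr $line, 18, 40;
        my $s    = substr $line, 290, 16;
        s/\s+//g for $jcpu, $j, $s;    # strip the padding
        push @out, "$jcpu $j $s";
    }
    return \@out;
}

sub via_regex {
    my @out;
    for my $line (@data) {
        # Each lookahead captures the leading non-blank run of a field.
        my ($jcpu, $j, $s) =
            $line =~ /^04(?=(\S+)).{16}(?=(\S+)).{40}.{232}(?=(\S+)).{16}/
            or next;
        push @out, "$jcpu $j $s";
    }
    return \@out;
}

sub via_unpack {
    my @out;
    for my $line (@data) {
        # @2 jumps to offset 2, x232 skips the filler.
        my ($jcpu, $j, $s) = unpack '@2 A16 A40 x232 A16', $line;
        push @out, "$jcpu $j $s";
    }
    return \@out;
}

# Check that all three agree before timing them.
my $expected = join "\n", @{ via_substr() };
for my $sub (\&via_regex, \&via_unpack) {
    die "results differ" unless join("\n", @{ $sub->() }) eq $expected;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    substr => \&via_substr,
    regex  => \&via_regex,
    unpack => \&via_unpack,
});
```

Verifying that the candidates return identical results first avoids the problem raised above, where the functions being compared were not doing the same work.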
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    BMC, Belgium

      my ($jcpu,$j,$s)=(unpack('@2A16A40A232A16', $line))[0,1,3];
      I think you'd be better off using the "x" template to ignore the third field (with index 2):
      my ($jcpu,$j,$s)= unpack('@2A16A40x232A16', $line);

      You can replace the leading '@2' with 'x2', too. Or, just the reverse, replace the 'x232' with '@290'. I don't think it'll make much difference speedwise.
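
A quick way to check that these template variants are interchangeable is to run them against the same record. This is a sketch on a synthetic line (the values are borrowed from the sample data, but the padding is an assumption based on the field widths in the templates):

```perl
use strict;
use warnings;

# Synthetic 306-character record: "04", 16 + 40 chars, 232 filler, 16 chars.
my $line = sprintf '04%-16s%-40s%-232s%-16s', 'FTOP', 'DD_BUIL+', '', 'FTOP';

my @templates = (
    '@2A16A40x232A16',    # absolute jump to offset 2, relative skip of 232
    'x2A16A40x232A16',    # relative skip of 2 instead of '@2'
    '@2A16A40@290A16',    # absolute jump to offset 290 instead of 'x232'
);

my @results = map { join '|', unpack($_, $line) } @templates;

print "$templates[$_] => $results[$_]\n" for 0 .. $#templates;
# each line ends with: FTOP|DD_BUIL+|FTOP
```

Here `x` skips forward relative to the current position, while `@` moves to an absolute offset from the start of the string, which is why 'x232' and '@290' (2 + 16 + 40 + 232) land on the same character.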
