Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Matching simple patterns - is there a faster way?

by mrguy123 (Hermit)
on Aug 09, 2012 at 06:41 UTC ( #986426=perlquestion: print w/ replies, xml ) Need Help??
mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, as mentioned before, I am working on a project that checks many thousands of links. One of the tests I do is try to match a very simple pattern to the HTML page fetched by the link. A very simple example (taken from the Alpaca book) to demonstrate:
$_ = "yabba dabba doo"; if (/abba/){ print "It matched!\n"; }
My question is, if the pattern is indeed only letters like the example above (no wildcards, character classes etc.), is regex matching the fastest way to go? Maybe grep matching or substring matching would be quicker?
Or is the regex engine smart enough to do faster matching for simple patterns?

Since I am checking 1000s of HTML pages and in some cases I want to match quite a few patterns anything that could make this run a bit faster would really help me out
Also I'm quite curious about the answer to this question
Any ideas?

UPDATE: After trying kcott's helpful suggestion, I get this results (code below in one of my answers), so it seems regex is still faster:
Rate index regex index 997/s -- -67% regex 3004/s 201% --




Everybody seems to think I'm lazy
I don't mind, I think they're crazy


Comment on Matching simple patterns - is there a faster way?
Select or Download Code
Re: Matching simple patterns - is there a faster way?
by kcott (Abbot) on Aug 09, 2012 at 06:50 UTC

    You may find index to be a suitable alternative.

    You can use Benchmark to see if this, or other alternatives, are faster.

    -- Ken

Re: Matching simple patterns - is there a faster way?
by Erez (Curate) on Aug 09, 2012 at 07:26 UTC

    is regex matching the fastest way to go?

    Depends. In Perl it probably is, being a well implemented algorithm. In general it's not, since the regex algorithm isn't as fast as the one you find in Unix grep and similar, but is more robust and allows for a lot of clever and powerful matching.

    But that's obviously the wrong question, given that a: this looks like a premature optimization, and b: Perl isn't that fast to begin with, due to a lot of stuff that goes behind the scenes which allows for all the flexibility and DWIMability. A better, I believe, course would be to profile the entire program run, see what are the bottlenecks and then figure how to better iron them out, rather than assuming the regex engine will be your biggest problem, rather than, say, Perl's IO.

    Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert

      Hi Erez, you're correct that the bottleneck is not the regex matching (it is the HTML downloading). However, if I can make the regex matching a bit faster it will be helpful
      Also, as I mentioned before, this is I think an interesting question that I would like to hear the Monks opinion on

      BTW, I enjoyed your talk on "The view from under the bridge" in the Israeli Perl Workshop earlier this year
Re: Matching simple patterns - is there a faster way?
by Anonymous Monk on Aug 09, 2012 at 07:29 UTC

    Any ideas?

    The regex engine is always slower, but it really doesn't matter, 1000s of files isn't that much

    So for now regex matching is still faster. I am doing more testing to make sure this is the correct result.

    Post your code, your benchmark might be flawed ( Compare to this benchmark Re: regex alternation )

Re: Matching simple patterns - is there a faster way?
by tobyink (Abbot) on Aug 09, 2012 at 07:38 UTC

    Realistically if you're downloading thousands of HTML pages and matching them against a regexp, it's unlikely to be the regexp that's slowing you down.

    The first place to look at optimizing is the HTTP layer itself. When you request the pages are you sending an Accept-Encoding header? In many cases, this will allow the pages to be transmitted with compression. (Many of the Perl HTTP libs will automatically handle the decompression for you.)

    If you're using LWP::UserAgent, perhaps try switching to Furl::HTTP which is somewhat faster. It doesn't have quite as many features as LWP::UserAgent, but it might have enough for you.

    Could you try requesting pages in parallel rather than one at a time? If so, take a look at AnyEvent::HTTP.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      You're indeed correct that the HTML download is the bottleneck, and I am doing a lot of work to optimize that (including parallel processing)
      However, if I can make the regexes a bit faster it will also help, and to be honest I am also curious about the answer to this question
Re: Matching simple patterns - is there a faster way?
by flexvault (Parson) on Aug 09, 2012 at 09:43 UTC

    mrguy123,

    I benchmarked the code replacing the 'print' with '{ }':

    Rate index regex regex 1562500/s -- -34% index 2380952/s 52% --
    Your not testing print, are you?

    What is funny, is that I ran the code 3 times, and the exact same result occurred all 3 times. I usually take the middle one!

    "Well done is better than well said." - Benjamin Franklin

      Hi, please see my benchmark test
      use strict; use Benchmark qw(cmpthese); # A big, sorted list (475254 elements) my @list = ('a'..'zzzz'); ##Put the list into a string my $string = join(/ /, @list); $string .= "\n"; cmpthese(0, { # Looking for a random item in @list with grep index => sub { # A random element in the list my $pattern = 'fqsu fqsv fqsw fqsx fqsy'; my $result = index($string, $pattern); }, # Looking for a random item in @list linearly (with foreach) regex => sub { my $pattern = 'fqsu fqsv fqsw fqsx fqsy'; $string =~ /$pattern/; }, } );
      I also did another test where I read a proper HTML file and received similar results. Can you please tell me what you did different?


      Everybody seems to think I'm lazy
      I don't mind, I think they're crazy

        mrguy123,

        As requested ( I typed this in from the code ):

        sub index { if ( index('yabba dabba doo','abba' ) >= 0 ) { } } sub regex { $_ = 'yabba dabba doo'; if ( /abba/ ) { } } cmpthese ( 500_000, { regex => sub { &regex }, index => sub { &index }, }, );
        I have found 'index' usually faster than 'regex' in this type of situation, but 'regex' has more power when needed.

        Regards...Ed

        "Well done is better than well said." - Benjamin Franklin

      Hi flexvault, I am assuming you did the benchmark on the 'yabba dabba doo' example, which makes perfect sense because that is what I posted.
      However, the real data is much larger and that may cause the difference in the benchmark
      You can see my code example below for regex/index matching with larger amount of data

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://986426]
Approved by Ratazong
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2014-11-26 04:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (162 votes), past polls