Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: Word replace - notetab light vs perl

by demerphq (Chancellor)
on Oct 06, 2005 at 07:10 UTC ( [id://497837]=note: print w/replies, xml ) Need Help??


in reply to Word replace - notetab light vs perl

I think this could come down to a few things. One of the most important IMO is the size of buffer being used. For the perl variant you read a line and replace it. Assuming your file is a "standardish" text file its going to have rows of around a hundred chars. IIRC in substitutions perl allocates a buffer slightly larger than is needed so that it doesnt have to grow the output buffer on every replace, so its quite possible that on a single line it only actually does the malloc/realloc once. Even if it does it multiple times on most O/S'es this wont involve a buffer copy as the malloc implementation will most likely be able to simply resize the allocated buffer.

Making things more interesting your regex is not actually going to be handled by the regex engine. It will be handled by a subset of it that is used to implement index(). More specifically it will be done via boyer-moore matching. This means that it can potentially determine if the string is present in the line in less char comparisons than there are in the line. In fact s/word/words/ is more like the following code:

sub replace { my ($str,$find,$repl,$pos)=@_; $pos||=0; while ((my $index=index($str,$find,$pos))>=0) { substr($str,$index,length($find))=$repl; $pos+=length($repl); } return $str; }

Contrast this situation to a word processor scenario. Most likely it doesn't do the boyer-moore optimisation, most likely its working on buffers that are larger than a single line, and there are good odds that the lines are stored in a way that makes the substitution less efficient anyway as word processors tend not to store their data in a way that would not be efficient for S&R but would be efficient for insert/delete operations. IE, if each char were actually a link in a double linked list each insertion/deletion event would require updating four pointers, a cost that the perl s/// doesnt have to incur.

It wouldn't surprise me if you found considerable difference in run time when you worked against a single buffer containing the full file, or used a different pattern for your comparisons.

---
$world=~s/war/peace/g

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://497837]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (2)
As of 2024-03-19 06:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found