http://www.perlmonks.org?node_id=497648

Hi monks,

Thought this was interesting...

I have a text file of about 35,000 lines. Every line contains a word that I wanted to replace. I opened the file in NoteTab Light and conveniently used its Replace function to do the replacement. A total of 35,000 replacements were made. It took a while, and that got me curious, because I'm running a Pentium 4 @ 3.20 GHz with 512 MB RAM.

I decided to time the process and found out that it took about 35 seconds - I had no idea how to time it automatically so I used an online stopwatch.

I was naturally curious how fast it would be done in Perl. Amazingly, Perl took less than 2 seconds, around 1 second in fact. It was hard to believe, so I opened the new file to verify the results. Yes, there were indeed 35,000 lines, and all the occurrences of the word were replaced.

Amazing!

open(FH, "wrongs") or die $!;
open(FH2, ">wrongs2") or die $!;
while ($line = <FH>) {
    $line =~ s/wrongs/wrongs3/;
    print FH2 "$line";
}
close(FH);
close(FH2);
P.S: If I hadn't done it first in NoteTab Light, I wouldn't have any notion how "fast" Perl's 1 sec is.

Update

Tried doing it in Notepad (the one that comes with Windows) and it hung!

Update2

Modified my Perl code to count the number of replacements as well as added benchmarking:

use Benchmark;
$start = new Benchmark;
open(FH, "wrongs") or die $!;
open(FH2, ">wrongs2") or die $!;
while ($line = <FH>) {
    # s/// returns the number of substitutions made (false if none)
    $counted = $line =~ s/wrongs/wrongs4/;
    $counted2++ if $counted;
    print FH2 "$line";
}
close(FH);
close(FH2);
$end = new Benchmark;
# calculate difference
$diff = timediff($end, $start);
print "replaced: $counted2 The operation took: ", timestr($diff, 'all');

# output:
# replaced: 35000 The operation took: 0 wallclock secs ( 0.23 usr 0.03 sys + 0.00 cusr 0.00 csys = 0.27 CPU)
Is there a better way to count the total number of replacements?

Replies are listed 'Best First'.
Re: Word replace - notetab light vs perl
by revdiablo (Prior) on Oct 05, 2005 at 16:10 UTC

    I'm more amazed that it took 35 seconds in NoteTab Light than I am that it only took 2 seconds in perl. 35,000 lines really isn't so much that it should take that long. That's only 1000 lines per second, which I would say is pretty abysmal.

    Update: just out of curiosity, I tried doing a single-word replacement on a 35,000 line file (where every line has the word being replaced) in vim. Like your case, I don't know any way to get vim to time it for me, but based on wallclock seconds it took 7-8 seconds. On an UltraSparc running at a whopping 350 MHz with an enormous 256 MB of RAM. On the same machine, it took perl 1.137s.

    This reiterates how horribly slowly NoteTab does something that seems relatively simple. :-)

    Update: I agree strongly with ww's sentiment about not starting an editor war. That certainly wasn't my intention, and hopefully it wasn't seen that way.

      I am very surprised at the slow execution cited by the OP, as it does not match my experience, even when doing a regex replacement using NoteTab Pro on 5000 +/- files of ca. 75 - 2K lines apiece.

      But, FWIW (and not to get into a war over "best" editors), NoteTab Lite does NOT currently use Perl syntax for REs and -- AFAIK -- does NOT use either the Perl or the PCRE regex engine.

      But, for further "FWIW," doing so has been suggested (to favorable response) for the next version (at least for NoteTab Pro and Standard, < US$20 and ca. US$ 12, respectively).

      I don't have any info on a release date, but expect it will be announced fairly soon.

      Disclaimer-- node by a happy NoteTab Pro user
      A correction. I did it again and Perl took 1 sec or less.
Re: Word replace - notetab light vs perl
by BrowserUk (Patriarch) on Oct 05, 2005 at 16:49 UTC

    Take a look at TextPad: the same 35,000-line test takes less than a second, and there is no learning curve. If you can use Notepad, you'll feel right at home in TextPad immediately, and when you need more, you'll probably find it on a menu somewhere; all you'll need to remember is the keyboard shortcut, and that comes quickly.

    It'll cost you a one-time fee of $30/£16.50 if you decide to keep it, but it is well worth the cost.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" may be good enough for the now, and perfection may be unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Thanks, BrowserUk :)

      I'm quite used to NoteTab Light. I like its tab feature, where you can open numerous files and then access them through the tabs. Even after you exit the program, the tabs are remembered.

      The other editor I'm using is emacs (Windows version). I know only a couple of commands, so at the moment it isn't particularly powerful in my hands.

        Textpad has tabs too, if you like them. Have them at the top, or bottom, or left or right of the screen.

        Or a selector pane if you prefer. Turn it on and off with a keystroke.

        Or both. Or Neither.

        Like I say, you'll feel right at home :)


      I've gone through a few editors now, including my current TextPad (though I haven't used Notetab). I find ConText quite useful (lots of features, free) and UltraEdit (lots of features, $35 or so) quite good.

        I first used TextPad when I pulled a copy, 2.x I think, from the internet maybe 10 years ago, when I was on a customer's site where they wouldn't let me connect my laptop to their precious network and they had nothing except what came installed with the OS. I was there for 6 weeks and it did everything I required of it without my ever having to read the help. That's about the best recommendation I can give any piece of software, and I stuck with it.

        I've tried many others. Hell, I've got 4 or 5 others I've installed in the last couple of years on this machine, but I know TextPad inside out and it gets the job done.

        I came to the conclusion that anything more programmable than that is a double-edged sword. I used TECO for 3 years in college, then EDT for two in my first programming job, then E3 (an IBM internal editor that became a non-IBM commercial product) for 6 or so after that, and then LPEX. Each was very programmable, and I expended a lot of effort in configuring and tailoring each one to my tastes.

        I found two problems with them.

        1. It is very easy to become dependent upon your editor and your configuration, and to become lost and frustrated when it isn't available. Especially when the chips are down and you're up against some kind of deadline or crisis.
        2. It's very easy to become distracted by perfecting your configuration to solve yet another trifling problem. I can remember more than one occasion when I've expended valuable time trying to get two or more macros or custom commands to work perfectly together to solve some problem that could have been solved more simply by a few manual steps, taking considerably less time than it took to automate the (often once-in-a-blue-moon) task.

        My primary requirements for an editor are

        • It shouldn't get in my way or make me think about it rather than the task I am trying to perform with it.
        • It shouldn't do anything I didn't explicitly ask it to do.
        • It shouldn't leave me high and dry if my portable's battery dies, or the plug gets kicked out, or, dog forbid, my code crashes the system.
        • It shouldn't throw anything away. Undo is the greatest timesaver ever invented.

        You'll notice those are all "should nots" rather than shoulds or musts. I reject most of the other highly rated editors on one or more of those criteria.

        If the editor succeeds in not violating those, the rest is gravy.



        I used ConText (0.9x) for some time. Initially, I liked it very much. I ignored the rare file corruption(s). When *somebody else* put the fact bluntly (on an editor forum or newsgroup), I had to be sensible about it.

        Hopefully the ConText project has solved that problem by now (it has been a long time since then), as it was/is otherwise a good editor.

Re: Word replace - notetab light vs perl
by b10m (Vicar) on Oct 05, 2005 at 18:32 UTC

    This is why I like sed ;-)

    $ sed -i 's/wrongs/wrongs3/' wrongs
    --
    b10m

    All code is usually tested, but rarely trusted.
      I usually do it like this:

      $ perl -i -pe 's/wrongs/wrongs3/' wrongs

      This is good if you don't have a recent version of GNU sed (only recent versions have the -i option). I use it often enough that I have an alias: pp="perl -i -pe".

Re: Word replace - notetab light vs perl
by terra incognita (Pilgrim) on Oct 05, 2005 at 17:06 UTC
    One thing that really affects performance in this area is how often the screen is updated. Windows Notepad is really poor in this regard, in that it updates the display after every replace. jEdit, which is my default editor, does the replace in a buffer and then writes that buffer to the screen. By not forcing a refresh after every replace, you cut the time down considerably.

    It sounds like NoteTab Lite is doing the same as Windows Notepad, or perhaps slightly better by working in blocks of lines at a time. Either way, not the best method.

Re: Word replace - notetab light vs perl
by blazar (Canon) on Oct 06, 2005 at 09:56 UTC
    Your post has already been discussed at length; I'd just like to add (maybe you didn't know, and may be glad to know) that instead of
    open(FH, "wrongs") or die $!;
    open(FH2, ">wrongs2") or die $!;
    while ($line = <FH>) {
        $line =~ s/wrongs/wrongs3/;
        print FH2 "$line";
    }
    close(FH);
    close(FH2);
    you can use some command line switches and arguments plus shell redirection (this particular one works in any common shell that I know of, including command.com):
    perl -lpe 's/wrongs/wrongs3/' wrongs >wrongs2
    or
    perl -lpi -e 's/wrongs/wrongs3/' wrongs
    if you want in-place editing. But under Windows, AFAIK, you will have to do (something like):
    perl -lpe "s/wrongs/wrongs3/" wrongs >wrongs2
    and
    perl -lpi.bak -e "s/wrongs/wrongs3/" wrongs
    respectively, instead; i.e., you must use double quotes for quoting, and you cannot do in-place editing without a backup.
    Modified my Perl code to count the number of replacements as well as added benchmarking:
    The code above can easily be adapted to show the count:
    perl -lpi -e '$c++ if s/wrongs/wrongs3/; END{warn "count=$c\n"}' wrongs
    For the benchmark, it's not just that easy. That's what Benchmark.pm is for, and indeed it works by repeating the code to be tested a suitable number of times. But if the task takes long enough, then bash's time command will suffice: here's a test on a file that's 3494270 lines long:
    $ time perl -lpe '$c++ if s/sex/cool/; END{warn "count=$c\n"}' gse1.log >/dev/null
    count=2840

    real    0m19.062s
    user    0m16.736s
    sys     0m0.940s
    Unfortunately it's not available under command.com or cmd.exe, as far as I know.

    Last, you may also want to count the number of substitutions when the /g global modifier is given. In that case

    perl -lpi -e '$c+=s/wrongs/wrongs3/g; END{warn "count=$c\n"}' wrongs
    or else you may use the "highly experimental" (?{ code }) regex feature:
    perl -lpi -e 's/wrongs(?{$c++})/wrongs3/g; END{warn "count=$c\n"}' wrongs
    but there's really no need for it here...
      Thanks, blazar!

      I've never used command line Perl, so "-lpi" is alien to me. It's about time I familiarised myself with the basic switches.

Re: Word replace - notetab light vs perl
by demerphq (Chancellor) on Oct 06, 2005 at 07:10 UTC

    I think this could come down to a few things. One of the most important IMO is the size of buffer being used. For the Perl variant, you read a line and replace it. Assuming your file is a "standardish" text file, it's going to have rows of around a hundred chars. IIRC, in substitutions Perl allocates a buffer slightly larger than is needed so that it doesn't have to grow the output buffer on every replace, so it's quite possible that on a single line it only actually does the malloc/realloc once. Even if it does it multiple times, on most OSes this won't involve a buffer copy, as the malloc implementation will most likely be able to simply resize the allocated buffer.

    Making things more interesting, your regex is not actually going to be handled by the regex engine. It will be handled by the subset of it that is used to implement index(). More specifically, it will be done via Boyer-Moore matching. This means that it can potentially determine whether the string is present in the line in fewer char comparisons than there are chars in the line. In fact, s/word/words/ is more like the following code:

    sub replace {
        my ($str, $find, $repl, $pos) = @_;
        $pos ||= 0;
        while ((my $index = index($str, $find, $pos)) >= 0) {
            substr($str, $index, length($find)) = $repl;
            # resume the search after the replacement, so a $repl that
            # contains $find (as "words" contains "word") can't loop forever
            $pos = $index + length($repl);
        }
        return $str;
    }

    Contrast this situation with a word-processor scenario. Most likely it doesn't do the Boyer-Moore optimisation, most likely it's working on buffers that are larger than a single line, and there are good odds that the lines are stored in a way that makes the substitution less efficient anyway, as word processors tend to store their data in a way that is efficient for insert/delete operations but not for search-and-replace. I.e., if each char were actually a link in a doubly linked list, each insertion/deletion event would require updating four pointers, a cost that the Perl s/// doesn't have to incur.

    It wouldn't surprise me if you found a considerable difference in run time if you worked against a single buffer containing the full file, or used a different pattern for your comparisons.

    ---
    $world=~s/war/peace/g

Re: Word replace - notetab light vs perl
by Courage (Parson) on Oct 06, 2005 at 14:41 UTC
    Your code could still be optimized to run two orders of magnitude faster.

    If you match/replace the entire contents as a single string with one s///g expression, instead of your "while" construct, you'll see a dramatic speed increase.

    You'll be surprised, many times over, at how fast Perl sometimes is....

    Best regards,
    Courage, the Cowardly Dog

      Thanks, Courage :)

      Did you mean the following?

      use Benchmark;
      $start = new Benchmark;
      open(FH, "wrongs") or die $!;
      open(FH2, ">wrongs2") or die $!;
      undef $/;                 # slurp mode: read the whole file at once
      $lines = <FH>;
      $counted2 = $lines =~ s/wrongs/wrongs4/g;
      print FH2 "$lines";
      $/ = "\n";
      close(FH);
      close(FH2);
      $end = new Benchmark;
      # calculate difference
      $diff = timediff($end, $start);
      print "replaced: $counted2 The operation took: ", timestr($diff, 'all');
        yes, you learn fast. :)
        Regular expressions have been speed-optimized a lot. Do RTFS a bit, and you'll get an idea of what I have in mind.

        A drawback of such optimization: no one can now improve that code, because it is highly complicated.

        Well, this is not the only place where Perl has good speed.

        Best regards,
        Courage, the Cowardly Dog