PerlMonks
Python regex faster than Perl?

by dave93 (Acolyte)
on Mar 26, 2025 at 14:08 UTC

dave93 has asked for the wisdom of the Perl Monks concerning the following question:

I've got the two following snippets:

PERL:

my $fn = shift;
exit 1 if not defined $fn;

my $input = do {
    open my $fh, "<", $fn or die "open failed";
    local $/;
    <$fh>;
};

my $count = () = $input =~ m/mul\(\d{1,3},\d{1,3}\)/g;
print "Found $count matches.\n";

PYTHON:

import re
import sys

if len(sys.argv) < 2:
    exit(1)

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")

with open(sys.argv[1], "r") as f:
    input = f.read()

count = len(mul_re.findall(input))
print(f"Found {count} matches.")

Which do more or less the same thing. Running both scripts on the same file, I found that the Python version runs a full second faster: 1.375s, as opposed to Perl's 2.466s.

I had always thought that Perl's regex and parsing performance was particularly strong compared to other languages, so this was a shocker for me. What am I doing wrong? How can I make the Perl version run as fast as the Python one?

Thanks. -- Dave

Replies are listed 'Best First'.
Re: Python regex faster than Perl? - Chunking
by marioroy (Prior) on Mar 26, 2025 at 21:00 UTC

    Recently, I tried diffing two large files, only for the UNIX diff command to choke the OS: the diff utility slurps both files, requiring 2x the memory.

    I thought I'd provide chunking variants and measure the time taken for a 99 MB file.

    Perl

    #!/usr/bin/env perl

    #use v5.20;
    #use feature qw(signatures);
    #no warnings qw(experimental::signatures);

    use v5.36;
    use autodie;

    exit 1 if not @ARGV;

    sub read_file ($fh, $chunk_size = 65536) {
        # Return the next chunk, including to the end of line.
        read($fh, my $chunk, $chunk_size);
        if (length($chunk) && substr($chunk, -1) ne "\n") {
            return $chunk if eof($fh);
            $chunk .= readline($fh);
        }
        return $chunk;
    }

    my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
    my $filename    = shift;
    my $count       = 0;

    if (open(my $fh, '<', $filename)) {
        while (length(my $chunk = read_file($fh))) {
            $count += () = $chunk =~ m/$mul_pattern/g;
        }
    }

    print "Found $count matches.\n";

    Python

    #!/usr/bin/env python
    import re, sys

    if len(sys.argv) < 2:
        sys.exit(1)

    def read_file(file, chunk_size=65536):
        """ Lazy function generator to read a file in chunks,
            including to the end of line. """
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            if not chunk.endswith('\n'):
                chunk += file.readline()
            yield chunk

    mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
    filename = sys.argv[1]
    count = 0

    try:
        with open(filename, "r") as file:
            for chunk in read_file(file):
                count += len(mul_re.findall(chunk))
    except Exception as e:
        print(e, file=sys.stderr)
        sys.exit(1)

    print(f"Found {count} matches.")

    Results

    Perl    0.463s  Found 200246 matches.
    Python  0.250s  Found 200246 matches.
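As a sanity check on the chunking idea, one can verify that extending each chunk to the next newline never splits a match (assuming, as in this thread's input, that no match spans a line break). The following self-contained sketch uses a made-up in-memory sample and a deliberately tiny chunk size; it is illustrative, not marioroy's code:

```python
import io
import re

def read_file(file, chunk_size=64):
    """Yield chunks extended to the next newline, mirroring the reader above."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")

# Synthetic input: 500 lines, each containing two mul(...) occurrences.
text = ("xyzabcd" + "mul(123,456)" + "junkmul(7,89)\n") * 500

slurp_count = len(mul_re.findall(text))
chunk_count = sum(len(mul_re.findall(c)) for c in read_file(io.StringIO(text)))

print(slurp_count, chunk_count)
assert slurp_count == chunk_count == 1000  # chunk boundaries lose no matches
```

Because every yielded chunk ends at a newline (or EOF), the chunks partition the input exactly and the per-chunk counts sum to the slurped count.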

      Here I try 10x the input size, i.e. a ~1 GB input file.

      # choroba's input generator, 990 MB
      # https://perlmonks.org/?node_id=11164445
      # perl gen_input.pl > big
      use strict;
      use warnings;

      for (1..100_000_000) {
          print int(rand 2)
              ? "xyzabcd"
              : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")");
          print "\n" unless int rand 10;
      }

      Non-chunking consumes more than 1 GB of memory (~1.1 GB); chunking consumes significantly less (~10 MB).

      Non-chunking: > 1 GB memory

      Perl    4.395s  Found 1999533 matches.
      Python  2.262s  Found 1999533 matches.

      Chunking: ~ 10 MB memory

      Perl    4.422s  Found 1999533 matches.
      Python  2.247s  Found 1999533 matches.
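The memory difference can be seen from Python itself with tracemalloc. A rough, self-contained sketch follows; note that tracemalloc only counts Python-level allocations, and the ~6 MB synthetic input here is tiny next to the 1 GB file above, so the peaks are indicative only:

```python
import io
import re
import tracemalloc

def read_file(file, chunk_size=65536):
    """Chunked reader extended to line boundaries, as in the post above."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
text = ("xyzabcd" * 3 + "mul(12,345)\n") * 200_000  # roughly 6 MB of input

def peak_of(fn):
    """Run fn and return (its result, peak traced allocation in bytes)."""
    tracemalloc.start()
    result = fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

slurp_count, slurp_peak = peak_of(
    lambda: len(mul_re.findall(io.StringIO(text).read())))
chunk_count, chunk_peak = peak_of(
    lambda: sum(len(mul_re.findall(c)) for c in read_file(io.StringIO(text))))

assert slurp_count == chunk_count == 200_000
print(f"slurp peak ~{slurp_peak // 1024} KiB, chunked peak ~{chunk_peak // 1024} KiB")
```

The slurping variant's peak includes the whole text plus the full findall result list; the chunked variant's peak stays near one chunk.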

        Perl parallel demonstration

        Seeing Python win by 2x made me consider a parallel variant using MCE, but more so to find out whether this scales as the number of workers increases.

        #!/usr/bin/env perl
        # time NUM_THREADS=3 perl pcount.pl big
        use v5.36;
        use autodie;
        use MCE;

        exit 1 if not @ARGV;

        my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
        my $filename    = shift;
        my $count       = 0;

        sub reduce_count ($worker_count) {
            $count += $worker_count;
        }

        my $mce = MCE->new(
            max_workers => $ENV{NUM_THREADS} // MCE::Util::get_ncpu(),
            chunk_size  => 65536 * 16,
            use_slurpio => 1,
            gather      => \&reduce_count,
            user_func   => sub {
                my ($mce, $slurp_ref, $chunk_id) = @_;
                my $count = () = $$slurp_ref =~ m/$mul_pattern/g;
                $mce->gather($count);
            }
        )->spawn;

        $mce->process({ input_data => $filename });
        $mce->shutdown;

        print "Found $count matches.\n";

        This calls for use_slurpio for best performance: no line-by-line processing behind the scenes. The MCE gather option is set to a reduce function that tallies the counts, and chunk_size is increased to reduce IPC among the workers. The input file is read serially.

        Results

        Found 1999533 matches.
        1: 4.420s
        2: 2.263s   (needs 2 workers to reach Python performance)
        3: 1.511s
        4: 1.154s
        5: 0.940s
        6: 0.788s
        7: 0.680s
        8: 0.600s
        9: 0.538s

        Python parallel demonstration

        Now I wonder about parallelism in Python. We can reuse the chunking function introduced in the prior example.

        #!/usr/bin/env python
        # time NUM_THREADS=3 python pcount.py big
        import os, re, sys
        from multiprocessing import Pool, cpu_count

        if len(sys.argv) < 2:
            sys.exit(1)

        def read_file(file, chunk_size=65536*16):
            """ Lazy function generator to read a file in chunks,
                including to the end of line. """
            while True:
                chunk = file.read(chunk_size)
                if not chunk:
                    break
                if not chunk.endswith('\n'):
                    chunk += file.readline()
                yield chunk

        def process_chunk(chunk):
            """ Worker function to process chunks in parallel. """
            return len(mul_re.findall(chunk))

        mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
        num_processes = int(os.getenv('NUM_THREADS') or cpu_count())
        p = Pool(num_processes)
        file_name = sys.argv[1]

        try:
            with open(file_name, "r") as file:
                results = p.map(process_chunk, read_file(file))
            p.close()
            p.join()
        except Exception as e:
            print(e, file=sys.stderr)
            sys.exit(1)

        print(f"Found {sum(results)} matches.")

        Results

        Found 1999533 matches.
        1: 3.131s
        2: 1.824s
        3: 1.408s
        4: 1.178s
        5: 1.187s
        6: 1.187s
        7: 1.172s
        8: 1.008s
        9: 0.995s
Re: Python regex faster than Perl?
by ikegami (Patriarch) on Mar 28, 2025 at 10:58 UTC

    Two notes.

    One, Python's re engine is not nearly as capable as Perl's. I've only written a tiny amount of Python, but I've already hit its limitations repeatedly; enough that I now go straight to Python's third-party regex module, which is much closer to Perl's.

    Two, Perl's regex engine is famously good at failing fast (i.e. detecting early that a pattern can't match). Otherwise, I've always heard of it coming out slower in comparisons.

Re: Python regex faster than Perl?
by Arunbear (Prior) on Mar 28, 2025 at 12:04 UTC
    This reminded me of an ancient thread (Re: Interesting Perl/Java regexp benchmarking), and sure enough, Perl is faster in the negative case, e.g.
    % hyperfine --warmup 3 'perl re.pl sample.txt' 'python3 re.py sample.txt'
    Benchmark 1: perl re.pl sample.txt
      Time (mean ± σ):     232.9 ms ±   2.2 ms    [User: 132.1 ms, System: 99.5 ms]
      Range (min … max):   230.8 ms … 238.4 ms    12 runs

    Benchmark 2: python3 re.py sample.txt
      Time (mean ± σ):     373.3 ms ±   5.6 ms    [User: 246.1 ms, System: 125.4 ms]
      Range (min … max):   365.5 ms … 383.7 ms    10 runs

    Summary
      perl re.pl sample.txt ran
        1.60 ± 0.03 times faster than python3 re.py sample.txt
    All I changed was the separator for the number pair in the regex to ";" i.e.
    mul\(\d{1,3};\d{1,3}\)
    I used choroba's method to generate the sample (but with 10_000_000 lines).
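The same negative-case experiment can be sketched in pure Python with timeit instead of hyperfine. The in-memory generator below imitates choroba's script at a much smaller scale, so absolute numbers aren't comparable with the thread's figures; the ";" pattern is the never-matching one from Arunbear's test:

```python
import random
import re
import timeit

random.seed(42)

# In-memory imitation of choroba's generator (far smaller than the real file).
parts = []
for _ in range(200_000):
    parts.append("xyzabcd" if random.random() < 0.5
                 else f"mul({random.randrange(5000)},{random.randrange(5000)})")
text = "".join(parts)

pos_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")   # matches often
neg_re = re.compile(r"mul\(\d{1,3};\d{1,3}\)")   # ";" never occurs: negative case

t_pos = timeit.timeit(lambda: len(pos_re.findall(text)), number=3)
t_neg = timeit.timeit(lambda: len(neg_re.findall(text)), number=3)

print(f"positive: {t_pos:.3f}s  negative: {t_neg:.3f}s")
assert len(neg_re.findall(text)) == 0
```

Running the equivalent Perl one-liner over the same generated text would give the other half of the comparison.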
Re: Python regex faster than Perl?
by choroba (Cardinal) on Mar 26, 2025 at 15:49 UTC
    I'm not sure what your input was, but I generated mine with the following one-liner:
    perl -wE 'for (1..10000) { print int(rand 2) ? "xyzabcd" : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")"); print "\n" unless int rand 10 }' > 1

    I then ran the two programs. Note that the one labelled "PYTHON" is in fact the Perl one, and vice versa.

    These were my results:

    $ time python3 1.py 1
    Found 200 matches.

    real    0m0.027s
    user    0m0.018s
    sys     0m0.005s

    $ time 1.pl 1
    Found 200 matches.

    real    0m0.006s
    user    0m0.004s
    sys     0m0.000s

    YMMV.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Thank you for your input, choroba. I've run each of the programs on input generated by your command and it seems that the Perl version does indeed best the Python one. My original input was an expanded form of the Day 3, Advent of Code 2024 input.

      So I modified your snippet to run 10,000,000 iterations, which resulted in a 99 MiB file for me, similar in size to my original input. In doing that, I found that the Perl version again runs slower: 0.626s to Python's 0.416s.

      Can you replicate this? It indicates to me that perhaps Python's regex implementation scales better.

      -- edit: Or perhaps Python's interpreter has a larger startup cost and its regex implementation is indeed faster in this case...
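Interpreter startup can be separated from regex speed by timing a do-nothing run. Here is a small sketch that measures only Python's own startup (it uses sys.executable so it runs anywhere; timing perl the same way assumes perl is on PATH, hence the commented hint):

```python
import subprocess
import sys
import time

def startup_time(argv, runs=5):
    """Average wall-clock time to spawn an interpreter that does nothing."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(argv, check=True)
    return (time.perf_counter() - start) / runs

# Measure Python's own startup; sys.executable is guaranteed to exist.
py_ms = startup_time([sys.executable, "-c", "pass"]) * 1000
print(f"python startup: ~{py_ms:.1f} ms per run")
# If perl is installed, the same helper can time it: startup_time(["perl", "-e", "1"])
```

Subtracting the per-run startup from the `time` figures above gives a fairer view of the regex work itself.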

        I can confirm the behaviour. On my machine, it's 0.582s for Perl and 0.403s for Python.


        I can also confirm perl 0.612s vs python 0.406s

        I have measured the time for reading the file only; perl is twice as fast as python:

        Perl:
        real    0m0.064s
        user    0m0.016s
        sys     0m0.047s

        Python:
        real    0m0.123s
        user    0m0.046s
        sys     0m0.075s
        I find speed differences of 20% negligible.

        I seem to remember¹ that Perl's regex engine got slower because of all the extra features that were added.

        Real "regexes" would be implemented as a high-performance state machine, but perl now uses op-codes to support those extra features.

        YMMV on whether the slowdown is worth it...

        Keep in mind that Ruby is/was half as fast as Perl, and not many cared, because of the perceived benefits.

        In a world of scalable cloud computing, speed is no longer the decisive factor it used to be.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

        ¹) I've been using this phrase often lately; I've posted too much here and grew tired of double-checking everything that has already been written dozens of times.

        So a cloud of doubt might motivate others to add detailed information. If there's no interest, then it was the right decision not to invest more time.

Re: Python regex faster than Perl?
by tybalt89 (Monsignor) on Mar 26, 2025 at 19:00 UTC

    Hmmm...

    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=11164441
    use strict;
    use warnings;
    use Time::HiRes qw( time );

    my $input = 'foobarmul(123,456)' x 3e6;
    my $start;

    $start = time;
    my $count1 = () = $input =~ /mul\(\d{1,3},\d{1,3}\)/g;
    printf "  list context count1 %d time %.3f\n", $count1, time - $start;

    $start = time;
    my $count2 = 0;
    ++$count2 while $input =~ /mul\(\d{1,3},\d{1,3}\)/g;
    printf "scalar context count2 %d time %.3f\n", $count2, time - $start;

    Outputs:

      list context count1 3000000 time 1.025
    scalar context count2 3000000 time 0.860
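The Python analogue of tybalt89's scalar-context trick is counting with finditer instead of materializing every match with findall. Whether it's faster varies by workload, but it avoids building the full list of matched substrings; this sketch (with a made-up repeat count) just shows the two forms agree:

```python
import re

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
text = "foobarmul(123,456)" * 100_000

# findall builds a list of every matched substring, then we take its length.
count1 = len(mul_re.findall(text))

# finditer yields match objects lazily; sum(1 for ...) never stores them all.
count2 = sum(1 for _ in mul_re.finditer(text))

print(count1, count2)
assert count1 == count2 == 100_000
```

Wrapping each line in timeit would give the Python-side timing comparison.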
      True, but still slower than python :-(
      python            Found 200013 matches.   real 0m0.515s   user 0m0.475s   sys 0m0.040s
      perl orig         Found 200013 matches.   real 0m0.725s   user 0m0.681s   sys 0m0.044s
      perl line by line Found 200013 matches.   real 0m0.906s   user 0m0.874s   sys 0m0.032s
      perl tybalt       Found 200013 matches.   real 0m0.611s   user 0m0.583s   sys 0m0.028s
Re: Python regex faster than Perl?
by duelafn (Parson) on Mar 27, 2025 at 11:59 UTC

    In the past, whenever I benchmarked perl against python in practical applications that I cared about, perl would win. However, python has been on a major performance binge for the past several years.

    I've since stopped benchmarking perl vs python (I've picked up rust, which I use when I care about performance), but I imagine that python has sped up relative to perl and is likely faster in more situations than it once was (depending, of course, on exactly what you are doing and how you code your algorithms).

    Good Day,
        Dean

      There's been a decision at $WORK to move things to python, so I've been (reluctantly) picking back up the minimal amount I know, and there are some "nice" things in their ecosystem. I've never really used PDL, but pandas is pretty handy for CSV/spreadsheet-y stuff, not to mention polars, which peps things up with rust under the hood while being mostly interoperable where it's not a direct drop-in replacement. The language is still bletcherous, but thanks to hylang you can write in a lisp and hide some of the warts. We're stuck at 3.9 for $REASONS, but I'm interested in what things look like when we eventually catch up to 3.13.

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Python regex faster than Perl?
by Fletch (Bishop) on Mar 26, 2025 at 15:59 UTC

    One performance nit (for both versions, really): if you're going for speed, reading and processing line by line will probably be faster than slurping the entire file into memory, regardless of language (unless there's some magic, e.g. mmap-ing the file contents into a language-level string).

    Edit: Huh, I may be out of date now. My recollection was that the rule of thumb said the "best" (FSVO "best", aiming for speed) way to do something like this was line by line, or reads at the underlying FS's block size.


      Interestingly, no. I modified the code in the following way:
      my $count = 0;
      $count += () = /mul\(\d{1,3},\d{1,3}\)/g while <$fh>;

      Slurping took 0.582s; line-by-line processing took 0.680s.

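For comparison, the same slurp-versus-lines experiment can be sketched in Python. The input here is a small in-memory sample, so the timings are indicative only and the claim of either approach winning is left open:

```python
import io
import re
import timeit

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
line = "xyzabcdmul(12,345)xyzabcd\n"
text = line * 200_000  # synthetic stand-in for the generated file

def count_slurp():
    # Read everything, then scan once.
    return len(mul_re.findall(io.StringIO(text).read()))

def count_lines():
    # Iterate line by line, scanning each line separately.
    return sum(len(mul_re.findall(l)) for l in io.StringIO(text))

assert count_slurp() == count_lines() == 200_000
print(f"slurp:        {timeit.timeit(count_slurp, number=3):.3f}s")
print(f"line by line: {timeit.timeit(count_lines, number=3):.3f}s")
```

Note the line-by-line variant only counts correctly because no match spans a newline in this input, the same assumption the Perl snippet makes.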
Re: Python regex faster than Perl?
by dave93 (Acolyte) on Mar 27, 2025 at 13:24 UTC

    Thanks for the input, everyone. I'm quite new to Perl, I picked it up only a few months ago, as a replacement scripting language for bash. So I thought there must be something I'm missing. But it seems this might just be the way it is.

    It's quite the shame that Perl has been dethroned in the dynamic-language game when it comes to regex performance. Python's re module even implements Perl regex!

    Thanks. -- Dave

      <nitpick>It implements a subset.</nitpick> You can't use (?{ CODE }), for example, but unless you're doing dark wizardry (or you're Abigail, BIRM . . .) most things will be close.

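As a concrete illustration of the subset point: Perl can embed code in a pattern with (?{ ... }), while Python's re rejects the construct at compile time. This snippet (with a made-up pattern) is easy to verify:

```python
import re

# Perl can run code mid-match via (?{ ... }); Python's re refuses the syntax.
try:
    re.compile(r"mul\((?{ print 'seen' })\d+\)")
    supported = True
except re.error as exc:
    supported = False
    print("re.error:", exc)

assert not supported
```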

      Writing off a language on the basis of a single flavor of benchmark yielding modest time differences isn't really sensible. In the vast majority of cases finding a language you are comfortable using and that does what you need fast enough to suit the task at hand is much more important than finding a tool that runs a specific task a little faster.

      Note that "comfortable using" extends beyond just you sitting in a dark room coding. It extends to being able to get help when you need it. For that, Perl and PerlMonks lead the field. Sure, there are sites like Stack Overflow, but I haven't found a PerlMonks equivalent for any other language.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

        I agree with you, but this comes off as quite a defensive response :P. You don't need to sell me on Perl, I already am sold.

        Perl is stronger at regex, and that's the perception of it that I've had since long before I even thought to learn it. In a bullet-point list of Perl's strengths, it'll always come up. It also seems that Perl has had a reputation, in the past, of performing faster than Python (and Ruby). Given this, I'm sure you'd understand why I found it surprising that Python performs better in all the simple regex cases I tried.

        edit: I'm not sure about community support for Python, though it does seem stronger to me than you give it credit for. But I'll say that I certainly haven't found Perl lacking and I appreciate PerlMonks.

        Thanks. -- David

      > Python's re module even implements Perl regex!

      Nope. There are a lot of features you won't find in Python or PCRE (and occasionally vice versa.)

      There are also examples where sed dethrones Perl; it's a question of use case.

      There are also pathological regexes (with nested quantifiers) that never terminate in one engine but quickly yield results in another.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery
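Rolf's point about nested quantifiers is easy to demonstrate in Python: a pattern like (a+)+b against a string of a's that almost matches forces exponential backtracking in re. (Perl's engine has its own optimizations and may behave differently on the same pattern.) The sizes below are kept small so the sketch terminates quickly:

```python
import re
import time

evil = re.compile(r"(a+)+b")  # nested quantifier: classic catastrophic backtracking

def time_fail(n):
    """Time a guaranteed-to-fail match against n 'a's followed by 'c'."""
    s = "a" * n + "c"
    start = time.perf_counter()
    assert evil.match(s) is None  # engine tries every way to split the run of 'a's
    return time.perf_counter() - start

for n in (14, 18, 22):
    print(f"n={n}: {time_fail(n):.4f}s")  # time roughly doubles per extra 'a'
```

Each additional 'a' roughly doubles the number of ways to partition the run, so the failure time grows exponentially with n.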

      Hold your horses, Dave! I am not convinced that Python's regex engine has been conclusively shown to be faster than Perl's, with benchmarks that take into account the second point made by ikegami and the point made by Arunbear. Unless there is a report somewhere?
