dave93 has asked for the wisdom of the Perl Monks concerning the following question:
I've got the two following snippets:
PERL:
my $fn = shift;
exit 1 if not defined $fn;
my $input = do {
    open my $fh, "<", $fn or die "open failed";
    local $/;
    <$fh>
};
my $count = () = $input =~ m/mul\(\d{1,3},\d{1,3}\)/g;
print "Found $count matches.\n";
PYTHON:
import re
import sys
if len(sys.argv) < 2: exit(1)
mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
with open(sys.argv[1], "r") as f:
    input = f.read()
count = len(mul_re.findall(input))
print(f"Found {count} matches.")
These do more or less the same thing. Running both scripts on the same file, I found that the Python version runs a full second faster: 1.375s, as opposed to Perl's 2.466s.
I had always thought that Perl's regex and parsing performance was particularly strong compared to other languages, so this was a shocker for me. What is it that I am doing wrong? How can I make the Perl version run as fast as the one in Python?
Thanks. -- Dave
Re: Python regex faster than Perl? - Chunking
by marioroy (Prior) on Mar 26, 2025 at 21:00 UTC
Recently, I tried diffing two large files only for the UNIX diff command to choke the OS. That's because the diff utility slurps both files, requiring 2x memory consumption.
I thought to provide chunking variants and measure the time taken for a 99 MB file.
Perl
#!/usr/bin/env perl
#use v5.20;
#use feature qw(signatures);
#no warnings qw(experimental::signatures);
use v5.36;
use autodie;
exit 1 if not @ARGV;
sub read_file ($fh, $chunk_size=65536) {
    # Return the next chunk, reading through to the end of the line.
    read($fh, my $chunk, $chunk_size);
    if (length($chunk) && substr($chunk, -1) ne "\n") {
        return $chunk if eof($fh);
        $chunk .= readline($fh);
    }
    return $chunk;
}
my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
my $filename = shift;
my $count = 0;
if (open(my $fh, '<', $filename)) {
    while (length(my $chunk = read_file($fh))) {
        $count += () = $chunk =~ m/$mul_pattern/g;
    }
}
print "Found $count matches.\n";
Python
#!/usr/bin/env python
import re, sys
if len(sys.argv) < 2: sys.exit(1)
def read_file(file, chunk_size=65536):
    """
    Lazy generator that reads a file in chunks,
    extending each chunk to the end of the line.
    """
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk
mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
filename = sys.argv[1]
count = 0
try:
    with open(filename, "r") as file:
        for chunk in read_file(file):
            count += len(mul_re.findall(chunk))
except Exception as e:
    print(e, file=sys.stderr)
    sys.exit(1)
print(f"Found {count} matches.")
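One subtlety worth spelling out: extending each chunk to the end of the line is what keeps a mul(...) from being split across two chunks, since a match never spans a newline in this input. A small self-contained sketch (my own check, not from the thread) verifies this with a deliberately tiny chunk size:

```python
import io
import re

def read_file(file, chunk_size=65536):
    """Lazy generator: read chunks, each extended to the end of the line."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
data = "xyzmul(1,2)\nmul(12,345)abcmul(9,9)\n"
# chunk_size=4 forces many chunk boundaries; the end-of-line extension
# keeps every mul(...) intact within a single chunk.
count = sum(len(mul_re.findall(c)) for c in read_file(io.StringIO(data), 4))
print(count)  # → 3
```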
Results
Perl 0.463s
Found 200246 matches.
Python 0.250s
Found 200246 matches.
# choroba's input generator, 990 MB
# https://perlmonks.org/?node_id=11164445
# perl gen_input.pl > big
use strict;
use warnings;
for (1..100_000_000) {
    print int(rand 2)
        ? "xyzabcd"
        : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")");
    print "\n" unless int rand 10;
}
Non-chunking consumes more than 1 GB of memory (~1.1 GB); chunking consumes significantly less (~10 MB).
Non-chunking: > 1 GB memory
Perl 4.395s
Found 1999533 matches.
Python 2.262s
Found 1999533 matches.
Chunking: ~ 10 MB memory
Perl 4.422s
Found 1999533 matches.
Python 2.247s
Found 1999533 matches.
#!/usr/bin/env perl
# time NUM_THREADS=3 perl pcount.pl big
use v5.36;
use autodie;
use MCE;
exit 1 if not @ARGV;
my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
my $filename = shift;
my $count = 0;
sub reduce_count ($worker_count) {
    $count += $worker_count;
}
my $mce = MCE->new(
    max_workers => $ENV{NUM_THREADS} // MCE::Util::get_ncpu(),
    chunk_size  => 65536 * 16,
    use_slurpio => 1,
    gather      => \&reduce_count,
    user_func   => sub {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my $count = () = $$slurp_ref =~ m/$mul_pattern/g;
        $mce->gather($count);
    }
)->spawn;
$mce->process({ input_data => $filename });
$mce->shutdown;
print "Found $count matches.\n";
This calls for slurpio for best performance: no line-by-line processing behind the scenes. The MCE gather option is set to a reduce function that tallies the counts, and chunk_size is increased to reduce IPC among the workers. The input file is read serially.
Results
Found 1999533 matches.
1: 4.420s
2: 2.263s (needs 2 workers to reach Python performance)
3: 1.511s
4: 1.154s
5: 0.940s
6: 0.788s
7: 0.680s
8: 0.600s
9: 0.538s
Python parallel demonstration
Now I wonder about parallel in Python. We can reuse the chunk function introduced in the prior example.
#!/usr/bin/env python
# time NUM_THREADS=3 python pcount.py big
import os, re, sys
from multiprocessing import Pool, cpu_count
if len(sys.argv) < 2: sys.exit(1)
def read_file(file, chunk_size=65536*16):
    """
    Lazy generator that reads a file in chunks,
    extending each chunk to the end of the line.
    """
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

def process_chunk(chunk):
    """
    Worker function to process chunks in parallel.
    """
    return len(mul_re.findall(chunk))
mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
num_processes = int(os.getenv('NUM_THREADS') or cpu_count())
p = Pool(num_processes)
file_name = sys.argv[1]
try:
    with open(file_name, "r") as file:
        results = p.map(process_chunk, read_file(file))
    p.close()
    p.join()
except Exception as e:
    print(e, file=sys.stderr)
    sys.exit(1)
print(f"Found {sum(results)} matches.")
Results
Found 1999533 matches.
1: 3.131s
2: 1.824s
3: 1.408s
4: 1.178s
5: 1.187s
6: 1.187s
7: 1.172s
8: 1.008s
9: 0.995s
Re: Python regex faster than Perl?
by ikegami (Patriarch) on Mar 28, 2025 at 10:58 UTC
Two notes.
One, Python's re engine is not nearly as capable as Perl's. I've only written a tiny amount of Python, but I've already hit its limitations repeatedly; enough so that I now go straight to Python's third-party regex module, which is much closer to Perl's.
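As a hedged illustration of one such limitation (this specific pattern is my own example, not from the thread): the stdlib re module rejects variable-length lookbehind at compile time, while the third-party regex package accepts it.

```python
import re

# Variable-length lookbehind is one feature the stdlib "re" module rejects;
# the third-party "regex" package can handle it.
try:
    re.compile(r"(?<=foo|foobar)baz")
    print("compiled")
except re.error as e:
    print("re refused:", e)
```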
Two, Perl's regex engine is famously good at failing fast (i.e. detecting that a pattern can't match). I've always heard of it coming out slower in comparisons otherwise.
Re: Python regex faster than Perl?
by Arunbear (Prior) on Mar 28, 2025 at 12:04 UTC
% hyperfine --warmup 3 'perl re.pl sample.txt' 'python3 re.py sample.txt'
Benchmark 1: perl re.pl sample.txt
  Time (mean ± σ):     232.9 ms ±   2.2 ms    [User: 132.1 ms, System: 99.5 ms]
  Range (min … max):   230.8 ms … 238.4 ms    12 runs
Benchmark 2: python3 re.py sample.txt
  Time (mean ± σ):     373.3 ms ±   5.6 ms    [User: 246.1 ms, System: 125.4 ms]
  Range (min … max):   365.5 ms … 383.7 ms    10 runs
Summary
  perl re.pl sample.txt ran
    1.60 ± 0.03 times faster than python3 re.py sample.txt
All I changed was the separator for the number pair in the regex to ";" i.e.
mul\(\d{1,3};\d{1,3}\)
I used choroba's method to generate the sample (but with 10_000_000 lines).
Re: Python regex faster than Perl?
by choroba (Cardinal) on Mar 26, 2025 at 15:49 UTC
I'm not sure what your input was, but I generated mine with the following one-liner:
perl -wE 'for (1..10000) { print int(rand 2) ? "xyzabcd" : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")"); print "\n" unless int rand 10 }' > 1
I then ran the two programs. These were my results:
time python3 1.py 1
Found 200 matches.
real 0m0.027s
user 0m0.018s
sys 0m0.005s
$ time 1.pl 1
Found 200 matches.
real 0m0.006s
user 0m0.004s
sys 0m0.000s
YMMV.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Thank you for your input, choroba. I've run each of the programs on input generated by your command and it seems that the Perl version does indeed best the Python one. My original input was an expanded form of the Day 3, Advent of Code 2024 input.
So I modified your snippet to run 10_000_000 iterations, which resulted in a 99 MiB file for me. This is similar in size to the original input I used. In doing that, I found that the Perl version once again runs slower: 0.626s to Python's 0.416s.
Can you replicate this? It indicates to me that perhaps Python's regex implementation scales better.
-- edit: Or perhaps Python's interpreter has a larger startup cost and its regex implementation is indeed faster in this case...
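The startup-cost hypothesis is easy to check in isolation by timing an empty program. A rough sketch (my own, not from the thread) for the Python side, using sys.executable so it measures whichever interpreter is running it; the analogous Perl measurement would be `time perl -e ''`:

```python
import subprocess
import sys
import time

# Rough measurement of interpreter startup: run an empty program a few
# times and average the wall-clock time per launch.
runs = 5
t0 = time.perf_counter()
for _ in range(runs):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
elapsed = (time.perf_counter() - t0) / runs
print(f"avg startup: {elapsed * 1000:.1f} ms")
```

This measures process launch plus interpreter initialisation, which is a fixed cost that matters more for small inputs than for 99 MB files.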
I can confirm the behaviour. On my machine, it's 0.582s for Perl and 0.403s for Python.
Perl:
real 0m0.064s
user 0m0.016s
sys 0m0.047s
Python:
real 0m0.123s
user 0m0.046s
sys 0m0.075s
I find speed differences of 20% negligible.
I seem to remember¹ that Perl's regex engine got slower because of all the extra features that were added.
Strictly regular expressions can be implemented as a high-performance state machine, but perl now uses op-codes to cover those super features.
YMMV if it's worth the slow down...
Keep in mind that Ruby is/was half as fast as Perl and not many cared because of the observed "benefits".
In a world of scalable cloud computing, speed isn't anymore the decisive factor it used to be.
¹) I've been using this phrase often lately; I've posted too much here and got tired of double-checking everything that has already been written dozens of times.
So a cloud of doubt might motivate others to add detailed information. Otherwise there is no interest, and it was the right decision not to invest more time.
Re: Python regex faster than Perl?
by tybalt89 (Monsignor) on Mar 26, 2025 at 19:00 UTC
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11164441
use warnings;
use Time::HiRes qw( time );
my $input = 'foobarmul(123,456)' x 3e6;
my $start;
$start = time;
my $count1 = () = $input =~ /mul\(\d{1,3},\d{1,3}\)/g;
printf " list context count1 %d time %.3f\n", $count1, time - $start;
$start = time;
my $count2 = 0;
++$count2 while $input =~ /mul\(\d{1,3},\d{1,3}\)/g;
printf "scalar context count2 %d time %.3f\n", $count2, time - $start;
Outputs:
list context count1 3000000 time 1.025
scalar context count2 3000000 time 0.860
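For what it's worth, a similar list-vs-lazy choice exists on the Python side (my own sketch, not from the thread): findall materialises every match in a list, while finditer yields them one at a time. Which variant wins depends on the workload, since finditer constructs a Match object per hit:

```python
import re
import time

text = "foobarmul(123,456)" * 300_000
pat = re.compile(r"mul\(\d{1,3},\d{1,3}\)")

t0 = time.perf_counter()
count1 = len(pat.findall(text))              # builds a list of all matches
print(f"findall  count {count1} time {time.perf_counter() - t0:.3f}")

t0 = time.perf_counter()
count2 = sum(1 for _ in pat.finditer(text))  # iterates lazily, no list
print(f"finditer count {count2} time {time.perf_counter() - t0:.3f}")
```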
True, but still slower than python :-(
python
Found 200013 matches.
real 0m0.515s
user 0m0.475s
sys 0m0.040s
perl orig
Found 200013 matches.
real 0m0.725s
user 0m0.681s
sys 0m0.044s
perl line by line
Found 200013 matches.
real 0m0.906s
user 0m0.874s
sys 0m0.032s
perl tybalt
Found 200013 matches.
real 0m0.611s
user 0m0.583s
sys 0m0.028s
Re: Python regex faster than Perl?
by duelafn (Parson) on Mar 27, 2025 at 11:59 UTC
In the past, whenever I benchmarked Perl against Python in practical applications that I cared about, Perl would win. However, Python has been on a major performance binge for the past several years.
I've since stopped benchmarking Perl vs Python (I've picked up Rust, which I use when I care about performance), but I imagine that Python has sped up relative to Perl and is likely faster in more situations than it once was (depending, of course, on exactly what you are doing and how you code your algorithms).
There's been a decision at $WORK to move things to Python, so I've been (reluctantly) picking back up the minimal amount I know, and there are some "nice" things in their ecosystem. I've never really used PDL, but pandas is pretty handy for CSV/spreadsheet-y stuff, not to mention polars, which peps things up with Rust under the hood while being mostly interoperable where it's not a direct drop-in replacement. The language is still bletcherous, but thanks to hylang you can write in a lisp and hide some of the warts. We're stuck at 3.9 for $REASONS, but I'm interested in what things look like when we eventually catch up to 3.13.
The cake is a lie.
”…write in a lisp and hide some of the warts.”
Good deal, warts for parentheses 🤪😎. But thanks for the link. Very nice.
Re: Python regex faster than Perl?
by Fletch (Bishop) on Mar 26, 2025 at 15:59 UTC
One performance nit (for both versions, really): if you're going for speed, reading and processing line by line is probably going to be faster than slurping the entire file into memory, regardless of language (unless there's some magic, e.g. mmap-ing the file contents into a language-level string).
Edit: Huh, I may be out of date now. My recollection was that the rule of thumb was that the "best" (FSVO best, aiming for speed) way to do something like this was line-by-line, or reads at the underlying FS's block size.
Interestingly, no. I modified the code in the following way:
my $count = 0;
$count += () = /mul\(\d{1,3},\d{1,3}\)/g while <$fh>;
Slurping took 0.582s, line-by-line processing takes 0.680s.
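For reference, the line-by-line variant on the Python side might look like this (a sketch; count_matches is a made-up helper name, not code from the thread):

```python
import re

MUL_RE = re.compile(r"mul\(\d{1,3},\d{1,3}\)")

def count_matches(path):
    """Count pattern hits, reading the file one line at a time."""
    count = 0
    with open(path) as f:
        for line in f:
            count += len(MUL_RE.findall(line))
    return count
```

The same trade-off applies: one regex call per line adds per-call overhead that a single call over a slurped string avoids.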
Re: Python regex faster than Perl?
by dave93 (Acolyte) on Mar 27, 2025 at 13:24 UTC
Thanks for the input, everyone. I'm quite new to Perl, I picked it up only a few months ago, as a replacement scripting language for bash. So I thought there must be something I'm missing. But it seems this might just be the way it is.
It's quite a shame that Perl has been dethroned in the dynamic-language game when it comes to regex performance. Python's re module even implements Perl regex!
Thanks. -- Dave
Writing off a language on the basis of a single flavor of benchmark yielding modest time differences isn't really sensible. In the vast majority of cases finding a language you are comfortable using and that does what you need fast enough to suit the task at hand is much more important than finding a tool that runs a specific task a little faster.
Note that "comfortable using" extends beyond just you sitting in a dark room coding. It extends to being able to get help when you need it. For that, Perl and PerlMonks lead the field. Sure, there are sites like Stack Overflow, but I haven't found a PerlMonks equivalent site for any other language.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
I agree with you, but this comes off as quite a defensive response :P. You don't need to sell me on Perl, I already am sold.
Perl is known for being strong at regex; that's the perception of it I had long before I even thought to learn it. In a bullet-point list of Perl's strengths, it'll always come up. It also seems that Perl has had a reputation, in the past, of performing faster than Python (and Ruby). Given this, I'm sure you'd understand why I found it surprising that Python performs better in all the simple regex cases I tried.
edit: I'm not sure about community support for Python, though it does seem stronger to me than you give it credit for. But I'll say that I certainly haven't found Perl lacking and I appreciate PerlMonks.
Thanks. -- David
> Python's re module even implements Perl regex!
Nope. There are a lot of features you won't find in Python or PCRE (and occasionally vice versa).
There are also examples where sed dethrones Perl; it's a question of use case.
There are also pathological regexes (with nested quantifiers) which never terminate in one engine but quickly yield results in another.
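A textbook illustration of such a pathological case (my own example, not from this thread): with nested quantifiers, a backtracking engine's time to report failure grows exponentially with input length, whereas a DFA-based engine such as RE2 stays linear.

```python
import re
import time

pat = re.compile(r"(a+)+b")   # nested quantifiers: the classic trouble spot
times = {}
for n in (14, 17, 20):
    s = "a" * n               # no "b", so the match must fail... slowly
    t0 = time.perf_counter()
    assert pat.search(s) is None
    times[n] = time.perf_counter() - t0
    print(f"n={n}: {times[n]:.4f}s")
```

Each step of 3 in n multiplies the failure time by roughly 2**3, because the engine tries every way of splitting the run of a's between the two quantifiers before giving up.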