http://www.perlmonks.org?node_id=689516


in reply to One for the weekend: challenge

I have solved it but I have passed 3 stages in order to make it run fast. The code has been around 60 lines in all of the stages. I have decided to play on the strong sides of perl and I have used regular expressions.

So my first version took me one and a half hour to write and debug the logic. The code was rewriting regexes and it was pathetically slow. It was running on the set for 5 minutes on my home PC (Pentium 4).

Next version took me another one and an half hour to convert it to use pre-compiled regular expressions. I have used some black magic in order to make exhaustive pattern matching. The interesting line is this :

 qr{^($re1)(?{ push @match, [$-[0], $+[0]] })(*FAIL)};

where $re1 is gigantic alternative (123|2322|...). It was running on the set for 32 seconds - not bad but I was expecting more.

I have read that there is "trie optimization" of the regexes in last version of perl. It take me another 2 hours to make optimizations and to debug why the hell I am not getting trie optimized regexes for the gigantic alternative I was constructing from the dictionary. Experimentally I have found that there is a limit between 11000 and 13000 alternatives so I decided to partition the constructed regex in 10 smaller ones. And this one worked gaining much more speed. It was running on the set of numbers for 2.2 seconds. Now I an happy.

In summary I was developing the code for 5 hours. It consists of 65 lines of code that follows. It runs on my PC (Pentium 4 @ 3GHz) for 2.2 seconds and 1.6 sec on my office PC (Intel Dual Core E2160 @ 1.80GHz) and eats around 22M of ram

#!/usr/bin/perl use strict; my %dic; #load dictionary open DIC,"< dictionary.txt" or die "$!"; while (<DIC>){ chomp; my $w = uc($_); $w =~ s/"//g; if ($w =~ y/EJNQRWXDSYFTAMCIVBKULOPGHZ/01112223334455666777888999/ +){ $w =~ s/[^0-9]//; push @{$dic{$w}},$_; } } close DIC; # Partition the dictionary in 10 regular expressions. # I do not know why the trie regex optimisation is not # supported for very large alternations (> 12000 entries). # Changing ${^RE_TRIE_MAXBUFF} does not help my @rex; my @match; for my $i (0..9){ use re 'eval'; my $re1 = join ('|', grep { /^$i/ } keys %dic); $rex[$i] = qr{^($re1)(?{ push @match, [$-[0], $+[0]] })(*FAIL)}; } my $num; sub print_num { my $n = shift; $n =~ s/\ +/ /g; # remove duplicate spaces $n =~ s/^\ //; # remove leading space $n =~ s/\ $//; # remove trailing space print "$num: $n\n"; } sub process_num { my ($n,$rest,$last) = @_; if (length($rest)==0){ print_num($n); return; } my $i = 0; @match=(); for my $r (@rex){ # match all regexes $rest =~ /$r/; # the result is stored in } # @match global variable my @matches = @match; # and is copied locally for my $x (@matches){ my $match = substr($rest,$x->[0],$x->[1]-$x->[0]); my $tail = substr($rest,$x->[1]); for my $w (@{$dic{$match}}){ # we could have multiple words my $m = $n; # for one number $m =~ s/$match/ $w /; process_num($m,$tail,0); } $i++; } # permit only one digit if there is no match if (($i==0)&&($last==0)) { $rest=~s/.//; process_num($n,$rest,1); } } # Load input numbers while (<>){ chomp; my $n = $_; $n =~ s/[^0-9]//g; # we are interested only in didits if (length($n)){ # if we have number here $num = $_; # num is the original number process_num($n,$n,0); } }

I have learned a little bit more about regular expressions, their optimization and extension.

Replies are listed 'Best First'.
Re^2: One for the weekend: challenge
by ysth (Canon) on Jun 01, 2008 at 06:25 UTC
      Thanks. Yes I have figured out how to workaround this limitation. It will be great if it is fixed some day.
Re^2: One for the weekend: challenge
by BrowserUk (Patriarch) on Jun 01, 2008 at 16:02 UTC