Re: One for the weekend: challenge

in reply to One for the weekend: challenge

I have solved it but I have passed 3 stages in order to make it run fast. The code has been around 60 lines in all of the stages. I have decided to play on the strong sides of perl and I have used regular expressions.

So my first version took me one and a half hour to write and debug the logic. The code was rewriting regexes and it was pathetically slow. It was running on the set for 5 minutes on my home PC (Pentium 4).

Next version took me another one and an half hour to convert it to use pre-compiled regular expressions. I have used some black magic in order to make exhaustive pattern matching. The interesting line is this :

qr{^($re1)(?{ push @match, [$-[0], $+[0]] })(*FAIL)};

where $re1 is gigantic alternative (123|2322|...). It was running on the set for 32 seconds - not bad but I was expecting more.

I have read that there is "trie optimization" of the regexes in last version of perl. It take me another 2 hours to make optimizations and to debug why the hell I am not getting trie optimized regexes for the gigantic alternative I was constructing from the dictionary. Experimentally I have found that there is a limit between 11000 and 13000 alternatives so I decided to partition the constructed regex in 10 smaller ones. And this one worked gaining much more speed. It was running on the set of numbers for 2.2 seconds. Now I an happy.

In summary I was developing the code for 5 hours. It consists of 65 lines of code that follows. It runs on my PC (Pentium 4 @ 3GHz) for 2.2 seconds and 1.6 sec on my office PC (Intel Dual Core E2160 @ 1.80GHz) and eats around 22M of ram

#!/usr/bin/perl

use strict;

my %dic;

#load dictionary
open DIC,"< dictionary.txt" or die "$!";
while (<DIC>){
    chomp;
    my $w = uc($_);
    $w =~ s/"//g;
    if ($w =~ y/EJNQRWXDSYFTAMCIVBKULOPGHZ/01112223334455666777888999/
+){
        $w =~ s/[^0-9]//;
        push @{$dic{$w}},$_;
    }

}
close DIC;

# Partition the dictionary in 10 regular expressions.
# I do not know why the trie regex optimisation is not 
# supported for very large alternations (> 12000 entries).
# Changing ${^RE_TRIE_MAXBUFF} does not help

my @rex;
my @match;
for my $i (0..9){
    use re 'eval';
    my $re1  =  join ('|', grep { /^$i/ } keys %dic);
    $rex[$i] = qr{^($re1)(?{ push @match, [$-[0], $+[0]] })(*FAIL)};
}

my $num;

sub print_num {
    my $n = shift;
    $n =~ s/\ +/ /g;        # remove duplicate spaces
    $n =~ s/^\ //;          # remove leading space
    $n =~ s/\ $//;          # remove trailing space
    print "$num: $n\n";
}

sub process_num {
    my ($n,$rest,$last) = @_;
    if (length($rest)==0){
        print_num($n);
        return;
    }
    my $i = 0;
    @match=();
    for my $r (@rex){       # match all regexes
        $rest =~ /$r/;      # the result is stored in
    }                       # @match global variable
    my @matches = @match;   # and is copied locally
    for my $x (@matches){
        my $match = substr($rest,$x->[0],$x->[1]-$x->[0]);
        my $tail =  substr($rest,$x->[1]);
        for my $w (@{$dic{$match}}){   # we could have multiple words 
            my $m = $n;                # for one number
            $m =~ s/$match/ $w /;
            process_num($m,$tail,0);
        }
        $i++;
    }
    # permit only one digit if there is no match
    if (($i==0)&&($last==0)) {
        $rest=~s/.//;
        process_num($n,$rest,1);
    }
}

# Load input numbers
while (<>){
    chomp;
    my $n = $_;
    $n =~ s/[^0-9]//g;          # we are interested only in didits
    if (length($n)){            # if we have number here
        $num = $_;              # num is the original number
        process_num($n,$n,0);
    }
}
[download]

I have learned a little bit more about regular expressions, their optimization and extension.

Comment on Re: One for the weekend: challenge Select or Download Code

Replies are listed 'Best First'.
Re^2: One for the weekend: challenge by ysth (Canon) on Jun 01, 2008 at 06:25 UTC
See 5.10.0 regex slowdown; in particular, Re^4: 5.10.0 regex slowdown. -- Online Fortune Cookie Search	[reply]
Re^3: One for the weekend: challenge by karavelov (Monk) on Jun 01, 2008 at 06:48 UTC
Thanks. Yes I have figured out how to workaround this limitation. It will be great if it is fixed some day.	[reply]
Re^2: One for the weekend: challenge by BrowserUk (Patriarch) on Jun 01, 2008 at 16:02 UTC
This is a correct and efficient solution. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]

In Section Meditations