PerlMonks  

Merged list with regex matches

by tcf03 (Deacon)
on Sep 04, 2007 at 15:33 UTC ( #636961=perlquestion )
tcf03 has asked for the wisdom of the Perl Monks concerning the following question:

I need to produce a merged list. List1 contains all of List2, but in a different format. I have used a regex to cull out what I need, but it's an impractical approach, as I'm iterating over List2 again and again. So, is there a more elegant way to do this? I'm fairly stumped.

#!/usr/bin/perl
use strict;
use warnings;
#
my @CS_CHECKS = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );
my @B_CHECKS  = ( '1234', '12345', '123990', '12399' );
my %seen;

for my $cn ( @CS_CHECKS ) {
    for my $bn ( @B_CHECKS ) {
        if ( $cn =~ /^(0|5|6)0+\Q$bn\E$/ ) {
            print "$bn = $cn\n";
        }
    }
}

Ted
--
"That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
  --Ralph Waldo Emerson

Re: Merged list with regex matches
by Fletch (Chancellor) on Sep 04, 2007 at 15:49 UTC

    Build a single regex out of the components of @B_CHECKS (possibly using Regexp::Assemble or the like, or even a crude join( "|", map { "\Q$_\E" } @B_CHECKS)) which captures the variant part and use that instead.

Re: Merged list with regex matches
by Sidhekin (Priest) on Sep 04, 2007 at 15:49 UTC

    On the assumption that every $cn will match at most one $bn (or that you don't care about more than the first), you could precompile a regex and use the result of the match:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @CS_CHECKS = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );
    my @B_CHECKS  = ( '1234', '12345', '123990', '12399' );

    my $re = do {
        my $x = join '|', map "\Q$_", @B_CHECKS;
        qr/^[056]0+($x)$/
    };

    $_ =~ $re and print "$1 = $_\n" for @CS_CHECKS;

    Of course, the regex engine still runs through the alternatives (up until one matches), but at least it looks more elegant. :)

    (Oh, and it may be faster too, what with not having to recompile the regex again and again. But don't quote me on that.)

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

Re: Merged list with regex matches
by dorko (Parson) on Sep 04, 2007 at 15:49 UTC
    Can you clean up your "dirty" list then use the two lists with something like List::Compare to create an array with your results? (I think what you want is the intersection of the two lists.)

    Cheers,

    Brent

    -- Yeah, I'm a Delt.
Re: Merged list with regex matches
by ikegami (Pope) on Sep 04, 2007 at 15:51 UTC

    Fundamentally you will need two nested loops, even if they are not obviously visible. (E.g. one could be hidden in a regexp.) However, it is possible to optimise the loops.

    Possible optimisation one: Don't compile the same regexp over and over again:

    for my $bn ( @B_CHECKS ) {
        my $re = qr/^(?:0|5|6)0+\Q$bn\E$/;
        for my $cn ( @CS_CHECKS ) {
            if ( $cn =~ $re ) {
                print "$bn = $cn\n";
            }
        }
    }

    Possible optimisation two: Create a single regexp:

    my ($re) =
        map qr/^(?:0|5|6)0+($_)$/,
        join '|',
        map quotemeta,
        @B_CHECKS;

    for my $cn ( @CS_CHECKS ) {
        if ( $cn =~ $re ) {
            print "$1 = $cn\n";
        }
    }

    Possible optimisation three: Create a single regexp using Regexp::List

    use Regexp::List qw( );

    my $re = Regexp::List->new->list2re(@B_CHECKS);
    $re = qr/^(?:0|5|6)0+($re)$/;

    for my $cn ( @CS_CHECKS ) {
        if ( $cn =~ $re ) {
            print "$1 = $cn\n";
        }
    }

    Possible optimisation four: Get rid of the regexp entirely, and do hash lookups. This will only work (as is) if $bn will never match /^(?:0|5|6)0+/.

    my %B_CHECKS = map +($_ => 1), @B_CHECKS;

    for my $cn ( @CS_CHECKS ) {
        (my $bn = $cn) =~ s/^(?:0|5|6)0+//;
        if ( $B_CHECKS{$bn} ) {
            print "$bn = $cn\n";
        }
    }

    Update: Added #3
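
A minimal sketch of how the precondition for option four might be checked before relying on it. The helper names `leading_zero_entries` and `match_by_hash` are hypothetical and not from the thread. The sketch assumes the prefix is always one of 0, 5 or 6 followed by one or more zeros, as in the original regex. It flags any entry that begins with 0, because the greedy `0+` would swallow that zero along with the prefix. It also does the hash lookup only when the substitution actually removed a prefix, so a value without a prefix is never reported.

```perl
use strict;
use warnings;

# Hypothetical helper: entries that start with '0' cannot survive the
# prefix strip, because the greedy 0+ would eat their leading zero too.
sub leading_zero_entries {
    return grep { /^0/ } @_;
}

# Hypothetical helper: strip the prefix from each candidate, then look the
# remainder up in a hash built from the wanted list.
sub match_by_hash {
    my ( $cs_ref, $wanted_ref ) = @_;
    my %wanted = map { $_ => 1 } @$wanted_ref;
    my @out;
    for my $cn (@$cs_ref) {
        my $bn = $cn;
        # s/// returns false when no prefix was present; skip those.
        next unless $bn =~ s/^(?:0|5|6)0+//;
        push @out, "$bn = $cn" if $wanted{$bn};
    }
    return @out;
}

my @CS_CHECKS = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );
my @B_CHECKS  = ( '1234', '12345', '123990', '12399' );

my @risky = leading_zero_entries(@B_CHECKS);
warn "entries starting with 0: @risky\n" if @risky;

print "$_\n" for match_by_hash( \@CS_CHECKS, \@B_CHECKS );
```

On the sample data from the question this should report the same two pairs as the nested loops, with one hash lookup per `$cn` instead of one regex match per pair.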

      #4 looks like what I need. I will need to double-check that $bn will not match /^(0|5|6)0+/. These lists are fairly large, so no matter what I do, it's going to be tough to speed up.
      Thanks!
      Ted
      --
      "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
        --Ralph Waldo Emerson
Re: Merged list with regex matches
by mr_mischief (Prior) on Sep 04, 2007 at 16:24 UTC
    You could approach the regular expression from a different direction. Rather than building the regular expression over and over with the values from one list, just grab what you need in every case. Then, use that to check set membership.

    This particular method trades the complexity of the big regexen in the previous replies for the complexity of a hash and an extra conditional in the loop (which is only conditionally executed). I haven't benchmarked which way will be faster. Since we're talking about the regex engine, the version of perl and your data set would make a difference in the speed and memory use of the large regex anyway.

    When you are wanting to check membership in a set, a hash is much faster than an array. You could even have your loop update which members have been seen and how many times if you like. I have that commented out below since it's not part of your stated problem.

    #!/usr/bin/perl
    # This particular code is tested.
    use strict;
    use warnings;
    #
    my @CS_CHECKS = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );
    my %B_CHECKS  = ( '1234' => 0, '12345' => 0, '123990' => 0, '12399' => 0 );
    my %seen;

    for my $cn ( @CS_CHECKS ) {
        if ( $cn =~ /^(?:0|5|6)0+(.*)$/ ) {
            my $bn = $1;
            if ( exists $B_CHECKS{ $bn } ) {
                print "$bn = $cn\n";
                # $B_CHECKS{ $bn }++;
            }
        }
    }
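
A minimal sketch of the "seen" counting mentioned above, with the counter enabled. The lower-case variable names and the `@unmatched` report are illustrative additions, not part of the original post. It assumes you want to know which entries of the second list never turned up in the first.

```perl
use strict;
use warnings;

my @cs_checks = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );

# Every wanted value starts with a count of zero.
my %b_checks = map { $_ => 0 } ( '1234', '12345', '123990', '12399' );

for my $cn (@cs_checks) {
    # Capture whatever follows the prefix, then count it if it is wanted.
    next unless $cn =~ /^(?:0|5|6)0+(.*)$/;
    $b_checks{$1}++ if exists $b_checks{$1};
}

# Entries whose count is still zero were never found in @cs_checks.
my @unmatched = grep { $b_checks{$_} == 0 } sort keys %b_checks;
print "never matched: @unmatched\n";
```

Keeping the counts in the same hash that does the membership test means the report costs no extra pass over the first list.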

    BTW, it's not enforced by the language, but all-capital symbol names are usually reserved by convention for constants. Also, your text states you're building a list, but your code is checking to see if your lists match. I'm guessing your code is meant to check the lists after you build list 2 to make sure everything did get included?

Re: Merged list with regex matches
by bart (Canon) on Sep 05, 2007 at 11:20 UTC
    $cn =~ /^(0|5|6)0+\Q$bn\E$/
    You never know with demerphq's future optimizations in regexp speed, but I'm quite sure that for now,
    /^[056]0+/
    will be faster than
    /^(?:0|5|6)0+/

    Benchmark code:

    use Benchmark 'cmpthese';

    my @list = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );

    cmpthese -3, {
        charclass => sub { grep /^[056]0+/, @list },
        alt       => sub { grep /^(?:0|5|6)0+/, @list },
    };

    Benchmark result:

    Benchmark: running alt, charclass, each for at least 3 CPU seconds...
           alt:  3 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 505561.00/s (n=1516683)
     charclass:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 534972.53/s (n=1713517)
                  Rate       alt charclass
    alt       505561/s        --       -5%
    charclass 534973/s        6%        --

    So it is indeed 5-6% faster, depending how you look at it.
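
Before swapping the alternation for the character class, it may be worth confirming that both forms select the same items. This is a minimal sketch using the sample list from the benchmark; the variable names `@by_class` and `@by_alt` are illustrative only.

```perl
use strict;
use warnings;

my @list = ( '00012345', 'D123470', '0000123450', '0000023456', '50000123990' );

# Run both prefix patterns over the same data.
my @by_class = grep { /^[056]0+/ } @list;
my @by_alt   = grep { /^(?:0|5|6)0+/ } @list;

# Compare the two result lists as joined strings.
print "@by_class" eq "@by_alt" ? "same matches\n" : "different matches\n";
```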

Node Type: perlquestion [id://636961]
Approved by Sidhekin
Front-paged by almut