comment on

I got tired of remaking regex (of email addresses) for use by procmail and mutt by hand. Then I wrote the following. The worst case is every string is simply joined, which is still better than manual operation. Ideally one of three modules -- Regexp::(Assemble|Optimizer|Trie) -- will be used to generate the regex when available.

Mind that ...

/i flag is used; feel free to make it optional yourself;
duplicates in input are not considered;
end-spaces are removed.

#!perl

use strict; use warnings;

our $VERSION = '0.07';

#  or-re.pl - Given a list of strings, prints a OR'd regex; prints tes
+t results
#  for the input.
#
#  End-spaces & duplicate strings are removed before generating the re
+gex.

scalar @ARGV
  or die qq[Give strings to make one "or" regex.\n];

my @list = prepare( @ARGV );

my $re = build_re( @list );
printf "Regex...\n  %s\n\n" , $re;

re_test( $re , @list );

exit;

sub re_test
{
  my ( $re , @list ) = @_;

  ref $re or $re = qr/$re/;

  print "Test ...\n";
  for my $it ( @list )
  {
    printf "  %s  %14s\n"
      , $it
      , ( $it =~ $re ? 'matches' : 'does NOT match' )
      ;
  }
  return;
}

sub prepare
{
  scalar @_ or return;

  my %seen;
  return map
          { s/^\s+//;
            s/\s+$//;
            !$seen{ $_ }++ ? $_ : ()
          }
          @_;
}

sub build_re
{
  scalar @_ or return '';

  my @arg = @_;

  my %mod_map =
    ( 'Regexp::Assemble'  => \&via_assemble
    , 'Regexp::Optimizer' => \&via_optimizer
    , 'Regexp::Trie'      => \&via_trie
    , 'mine'              => sub { return decorate( simpleton( @_ ) ) 
+}
    );
  my @order =
    ( 'Regexp::Assemble'
    , 'Regexp::Optimizer'
    , 'Regexp::Trie'
    );

  my $maker = 'mine';
  for my $mod ( @order )
  {
    _load_module( $mod ) or next;

    $maker = $mod;
    last;
  }
  return $mod_map{ $maker }->( @arg )
}

sub decorate { return qq/\\b(?i:$_[0])\\b/ ; }

sub simpleton
{
  return
    join '|'
      , map
        { quotemeta( $_ ) }
        sort
        { length $b <=> length $a
            ||
          lc $a cmp lc $b
        }
        @_ ;
}

sub via_assemble
{
  my $maker = Regexp::Assemble->new(  'chomp'  => 1
                                    , 'reduce' => 1
                                    , 'modifiers'  => 'i'
                                    , 'anchor_word' => 1
                                    );

  $maker->add( map quotemeta( $_ ) , @_ );

  # R:A::as_string() method eschews flags given. This preserves it at 
+the cost
  # of extraneous syntax elsewhere.
  return $maker->re() . '';
}

sub via_optimizer
{
  my $re = simpleton( @_ );
  my $maker = Regexp::Optimizer->new();
  return $maker->as_string( qr/\b(?:$re)\b/i );
}

sub via_trie
{
  my $maker = Regexp::Trie->new();
  #  Cannot add as list, unlike R::Assemble.
  $maker->add( $_ ) for @_;
  return $maker->regexp() . '';
}

sub _load_module
{
  my ( $mod ) = @_;

  local $@;
  eval qq/ require $mod; 1; /;
  $@ and do { warn "Could not load $mod: $@\n"; return; };

  warn "#  Using $mod ...\n";
  return 1;
}
[download]

In reply to Generate a single "or regex" from given strings by Anonymous Monk

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Come for the quick hacks, stay for the epiphanies.
	PerlMonks