I got tired of remaking regex (of email addresses) for use by procmail and mutt by hand. Then I wrote the following. The worst case is every string is simply joined, which is still better than manual operation. Ideally one of three modules -- Regexp::(Assemble|Optimizer|Trie) -- will be used to generate the regex when available.
Mind that ...
- /i flag is used; feel free to make it optional yourself;
- duplicates in input are not considered;
- end-spaces are removed.
#!perl
use strict; use warnings;
our $VERSION = '0.07';
# or-re.pl - Given a list of strings, prints a OR'd regex; prints tes
+t results
# for the input.
#
# End-spaces & duplicate strings are removed before generating the re
+gex.
scalar @ARGV
or die qq[Give strings to make one "or" regex.\n];
my @list = prepare( @ARGV );
my $re = build_re( @list );
printf "Regex...\n %s\n\n" , $re;
re_test( $re , @list );
exit;
sub re_test
{
my ( $re , @list ) = @_;
ref $re or $re = qr/$re/;
print "Test ...\n";
for my $it ( @list )
{
printf " %s %14s\n"
, $it
, ( $it =~ $re ? 'matches' : 'does NOT match' )
;
}
return;
}
sub prepare
{
scalar @_ or return;
my %seen;
return map
{ s/^\s+//;
s/\s+$//;
!$seen{ $_ }++ ? $_ : ()
}
@_;
}
sub build_re
{
scalar @_ or return '';
my @arg = @_;
my %mod_map =
( 'Regexp::Assemble' => \&via_assemble
, 'Regexp::Optimizer' => \&via_optimizer
, 'Regexp::Trie' => \&via_trie
, 'mine' => sub { return decorate( simpleton( @_ ) )
+}
);
my @order =
( 'Regexp::Assemble'
, 'Regexp::Optimizer'
, 'Regexp::Trie'
);
my $maker = 'mine';
for my $mod ( @order )
{
_load_module( $mod ) or next;
$maker = $mod;
last;
}
return $mod_map{ $maker }->( @arg )
}
sub decorate { return qq/\\b(?i:$_[0])\\b/ ; }
sub simpleton
{
return
join '|'
, map
{ quotemeta( $_ ) }
sort
{ length $b <=> length $a
||
lc $a cmp lc $b
}
@_ ;
}
sub via_assemble
{
my $maker = Regexp::Assemble->new( 'chomp' => 1
, 'reduce' => 1
, 'modifiers' => 'i'
, 'anchor_word' => 1
);
$maker->add( map quotemeta( $_ ) , @_ );
# R:A::as_string() method eschews flags given. This preserves it at
+the cost
# of extraneous syntax elsewhere.
return $maker->re() . '';
}
sub via_optimizer
{
my $re = simpleton( @_ );
my $maker = Regexp::Optimizer->new();
return $maker->as_string( qr/\b(?:$re)\b/i );
}
sub via_trie
{
my $maker = Regexp::Trie->new();
# Cannot add as list, unlike R::Assemble.
$maker->add( $_ ) for @_;
return $maker->regexp() . '';
}
sub _load_module
{
my ( $mod ) = @_;
local $@;
eval qq/ require $mod; 1; /;
$@ and do { warn "Could not load $mod: $@\n"; return; };
warn "# Using $mod ...\n";
return 1;
}
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.