
Lookahead/Lookbehind Regular Expression...

by mogmismo (Novice)
on Dec 19, 2013 at 16:55 UTC (#1067836=perlquestion)
mogmismo has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to match all abbreviations in a string. In this string:

"a history of u.s. coast guard aviation."

I would like "u.s." to become "us" without removing any other periods. Another example would be converting "M.C. Esher" to "MC Esher".

So far, I can remove the middle dot with:

$string =~ s/(?<=\w)\.(?=\w)//g;

But I can't figure out how to chain several lookahead and lookbehind assertions so that the trailing dot is removed as well. I've tried this, and it fails:

$string =~ s/(?<=\w)\.(?=\w)\.(?=\s)//g;

Any ideas, Monks?

Re: Lookahead/Lookbehind Regular Expression...
by Your Mother (Chancellor) on Dec 19, 2013 at 17:37 UTC

    This problem is impossible to solve without either perfect sentence delimiters or a grammar parser, and even then, not all text is grammatical. One example to show the difficulty: "I will be the M.C. Usher will be singing."
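
    To make the ambiguity concrete, here is a minimal sketch. The helper name `strip_pair_dots` and its pattern are illustrative assumptions, not something prescribed in this thread. Whether "M.C." ends the first sentence or not, the input string is the same, so a pattern has nothing to tell the two readings apart, and the possible sentence boundary disappears along with the dots.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a two-letter dotted abbreviation such as "M.C." to "MC".
sub strip_pair_dots {
    my ($text) = @_;
    $text =~ s/\b(\w)\.(\w)\./$1$2/g;
    return $text;
}

# Two sentences ("I will be the M.C." / "Usher will be singing.") or one?
my $ambiguous = "I will be the M.C. Usher will be singing.";
print strip_pair_dots($ambiguous), "\n";
# prints: I will be the MC Usher will be singing.
```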

Re: Lookahead/Lookbehind Regular Expression...
by wind (Priest) on Dec 19, 2013 at 17:35 UTC
    It sounds like you just want to remove any period that follows a single word character. There may be rare exceptions, but the following accomplishes that basic feat:

    use strict;
    use warnings;
    my $str = "a history of u.s. coast guard aviation. M.C. Esher";
    $str =~ s/(?<=\W\w)\.//g;
    print $str, "\n";

    I'd also be tempted to force capitalization on those abbreviations, but that might not be what you're after:

    # Remove periods and capitalize, so we have US instead of us.
    $str =~ s/(?<=\W)(\w)\./\U$1/g;

    - Miller
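
    On the chained pattern from the question: lookarounds are zero-width, so in `\.(?=\w)\.` the character after the first dot would have to be both a word character and a literal dot, which can never happen. A lookaround-only variant is sketched below. The helper name `strip_abbrev_dots` is hypothetical, and the sketch assumes an abbreviation is a run of at least two single letters, each followed by a dot. Each alternative removes one dot and uses fixed-length lookbehind to inspect its neighbours.

```perl
use strict;
use warnings;

# Hypothetical helper; name and pattern are illustrative assumptions.
sub strip_abbrev_dots {
    my ($text) = @_;
    $text =~ s/
          (?<=\b\w) \. (?=\w\.)   # dot after a lone letter, with another letter-dot pair following
        | (?<=\b\w\.\w) \.        # dot that closes a run of at least two letter-dot pairs
    //gx;
    return $text;
}

print strip_abbrev_dots("a history of u.s. coast guard aviation. M.C. Esher"), "\n";
# prints: a history of us coast guard aviation. MC Esher
```

    Unlike `(?<=\W\w)`, the `\b` in the lookbehind also lets an abbreviation at the very start of the string match, and a lone initial such as "P." keeps its dot.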
Re: Lookahead/Lookbehind Regular Expression...
by hdb (Prior) on Dec 19, 2013 at 17:15 UTC

    I would prefer a two-step approach that avoids look(ahead|behind|around) assertions: first find runs of a letter followed by a dot, then strip the dots from each run as it is replaced. Like this:

    use strict;
    use warnings;
    my $str = "a history of u.s. coast guard aviation. M.C. Esher";
    $str =~ s/\W\K((\w\.)+)/ $1 =~ s{\.}{}gr /ge;
    print "$str\n";

    The \W\K at the beginning is required to avoid matching "am.c.". The pattern will also match a single occurrence of a letter followed by a dot, but that is easily fixed by requiring two or more repetitions.

    Update: removed the unnecessary non-capturing ?: from the regex.
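
    The "two or more repetitions" fix could look like the following minimal sketch. The helper name `collapse_dotted_runs` is hypothetical; the only change to the pattern is the `{2,}` quantifier (with the group made non-capturing so it can sit inside `$1`). Note that `\K` needs Perl 5.10 and the `/r` modifier needs 5.14.

```perl
use strict;
use warnings;
use 5.014;    # \K needs Perl 5.10; the /r modifier needs 5.14

# Hypothetical helper name, not from the original post.
sub collapse_dotted_runs {
    my ($text) = @_;
    # {2,} demands at least two letter-dot pairs, so a lone "P." is left alone.
    # As in the original, \W\K means a non-word character must precede the run.
    $text =~ s/\W\K((?:\w\.){2,})/ $1 =~ s{\.}{}gr /ge;
    return $text;
}

print collapse_dotted_runs("a history of u.s. coast guard aviation. M.C. Esher and P. Picasso"), "\n";
# prints: a history of us coast guard aviation. MC Esher and P. Picasso
```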

Re: Lookahead/Lookbehind Regular Expression...
by educated_foo (Vicar) on Dec 20, 2013 at 00:51 UTC
    I wouldn't bother with look-ahead/behind on this problem:
    $_ = 'u.s. asdf.g. and i. M.C. Escher and P. Picasso';
    s/\b(\w)\.(\w)\./$1$2/g;
    # $_ eq 'us asdf.g. and i. MC Escher and P. Picasso'
