PerlMonks  

Lookahead/Lookbehind Regular Expression...

by mogmismo (Novice)
on Dec 19, 2013 at 16:55 UTC (#1067836, perlquestion)
mogmismo has asked for the wisdom of the Perl Monks concerning the following question:

So, I'm trying to match, in a string, all values that are abbreviations. In this string:

"a history of u.s. coast guard aviation."

I would like the "u.s." to become "us" without replacing any other periods. Another example would be converting "M.C. Esher" to "MC Esher".

So far, I can remove the middle dot with:

$string =~ s/(?<=\w)\.(?=\w)//g;

But I can't figure out how to do a lookahead/lookbehind/lookahead/lookbehind... I've tried this, and it fails:

$string =~ s/(?<=\w)\.(?=\w)\.(?=\s)//g;

Any ideas, Monks?

Re: Lookahead/Lookbehind Regular Expression...
by Your Mother (Chancellor) on Dec 19, 2013 at 17:37 UTC

    This is an impossible problem to solve without either perfect sentence delimiters or a grammar parser, and even then, not all text is grammatical. One example to show the difficulty: "I will be the M.C. Usher will be singing."

Re: Lookahead/Lookbehind Regular Expression...
by wind (Priest) on Dec 19, 2013 at 17:35 UTC
    It sounds like you're just wanting to remove any periods that follow just one word character. I'm sure there might be a rare exception to this, but the following would accomplish that basic feat:
        use strict;
        use warnings;

        my $str = "a history of u.s. coast guard aviation. M.C. Esher";
        $str =~ s/(?<=\W\w)\.//g;
        print $str, "\n";
    I'd also be tempted to force capitalization on those abbreviations, but that might not be what you're after:
        # Remove periods and capitalize, so we have US instead of us.
        $str =~ s/(?<=\W)(\w)\./\U$1/g;
    - Miller
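
    One case the `(?<=\W\w)` lookbehind appears to miss is an abbreviation at the very start of the string, where no non-word character precedes the first letter. The sketch below compares the pattern from this reply with a hypothetical variant that uses `\b` inside the lookbehind; the sub names and the sample string are illustrative assumptions, not part of the original reply. Both versions still strip the period after any one-letter word, so a sentence ending in "plan a." would lose its final period.

```perl
use strict;
use warnings;

# Pattern from the reply above: requires a non-word character before the letter.
sub strip_after_nonword {
    my ($text) = @_;
    $text =~ s/(?<=\W\w)\.//g;
    return $text;
}

# Hypothetical variant: \b also matches at the start of the string,
# so a leading abbreviation is handled as well.
sub strip_after_boundary {
    my ($text) = @_;
    $text =~ s/(?<=\b\w)\.//g;
    return $text;
}

my $sample = "M.C. Esher drew u.s. maps.";
print strip_after_nonword($sample),  "\n";   # leading "M." keeps its period
print strip_after_boundary($sample), "\n";   # leading "M." is stripped too
```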
Re: Lookahead/Lookbehind Regular Expression...
by hdb (Prior) on Dec 19, 2013 at 17:15 UTC

    I would prefer a two-step approach that avoids look(ahead|behind|around) assertions: first find a run of one or more letter-plus-dot pairs, then remove the dots inside each match. Like this:

        use strict;
        use warnings;

        my $str = "a history of u.s. coast guard aviation. M.C. Esher";
        $str =~ s/\W\K((\w\.)+)/ $1 =~ s{\.}{}gr /ge;
        print "$str\n";

    The \W\K at the beginning is required to avoid matching "am.c.". It will also match a single occurrence of a letter followed by a dot, but that can easily be fixed by requiring two or more repetitions.

    Update: removed the unnecessary non-capturing ?: from the regex.
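
    The "two or more repetitions" fix mentioned above could look roughly like the following. This is a minimal sketch, assuming the only change needed is swapping `+` for `{2,}`; the sub name and the sample string are made up for illustration. Note that the `/r` modifier on the inner substitution needs Perl 5.14 or later, and `\K` needs 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: strip dots only from runs of two or more letter-dot pairs.
sub strip_abbreviation_dots {
    my ($text) = @_;
    # {2,} skips a lone "a." while still matching "u.s." and "M.C.";
    # the inner s{\.}{}gr returns a dot-free copy of the match.
    $text =~ s/\W\K((\w\.){2,})/ $1 =~ s{\.}{}gr /ge;
    return $text;
}

print strip_abbreviation_dots("plan a. then u.s. coast guard, by M.C. Esher"), "\n";
```

    Because the pattern still begins with `\W`, an abbreviation at the very start of the string is left alone, just as in the original version.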

Re: Lookahead/Lookbehind Regular Expression...
by educated_foo (Vicar) on Dec 20, 2013 at 00:51 UTC
    I wouldn't bother with look-ahead/behind on this problem:
        $_ = 'u.s. asdf.g. and i. M.C. Escher and P. Picasso';
        s/\b(\w)\.(\w)\./$1$2/g;
        # $_ eq 'us asdf.g. and i. MC Escher and P. Picasso'
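
    None of the replies spell out why the second attempt in the question fails, so here is one reading of it. A lookahead does not consume anything: after the first `\.`, `(?=\w)` asserts that the next character is a word character, but the pattern then immediately demands a literal `\.` at that same position, so it can never match. The sketch below shows this, together with a hypothetical variant that consumes the second letter and puts it back; the sub names are illustrative, and the variant keeps the question's `(?<=\w)` and `(?=\s)` conditions as they were.

```perl
use strict;
use warnings;

# The attempt from the question: (?=\w) and the following \. both apply
# to the same character, so the pattern never matches.
sub original_attempt {
    my ($text) = @_;
    $text =~ s/(?<=\w)\.(?=\w)\.(?=\s)//g;
    return $text;
}

# Hypothetical variant: capture the letter between the two dots and restore it.
sub consuming_variant {
    my ($text) = @_;
    $text =~ s/(?<=\w)\.(\w)\.(?=\s)/$1/g;
    return $text;
}

my $sample = "a history of u.s. coast guard aviation.";
print original_attempt($sample),  "\n";   # string comes back unchanged
print consuming_variant($sample), "\n";   # "u.s." becomes "us"
```

    Like the question's own pattern, the variant only fires when whitespace follows the abbreviation, and its `(?<=\w)` lookbehind would also rewrite something like "asdf.g. ", so the `\b` version above is the stricter choice.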

Node Type: perlquestion [id://1067836]
Approved by marto
Front-paged by toolic