PerlMonks  

Re: Fine tuning a reg exp

by runrig (Abbot)
on Feb 23, 2012 at 15:52 UTC


in reply to Fine tuning a reg exp

See perlre. All uppercase can be matched with:

/^[[:upper:]]+$/
Update: The POSIX class is unnecessary here and doesn't help much. [A-Z] does just as well unless you need to consider Unicode.
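A minimal sketch of the difference that update alludes to, assuming Perl 5 and made-up sample strings (the helper names is_upper_posix and is_upper_ascii are hypothetical, not from the post). The two patterns agree on ASCII input and only diverge once an uppercase letter outside A-Z shows up:

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; each returns 1 on match, 0 otherwise.
sub is_upper_posix { my ($s) = @_; return $s =~ /^[[:upper:]]+$/ ? 1 : 0 }
sub is_upper_ascii { my ($s) = @_; return $s =~ /^[A-Z]+$/       ? 1 : 0 }

# "\x{0394}" is GREEK CAPITAL LETTER DELTA, an uppercase letter outside A-Z.
# Labels are plain ASCII so printing them never emits wide characters.
my @samples = (
    [ 'ABU',        'ABU' ],
    [ 'Ibrahim',    'Ibrahim' ],
    [ 'Delta+ELTA', "\x{0394}ELTA" ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    printf "%-12s posix=%d ascii=%d\n",
        $label, is_upper_posix($text), is_upper_ascii($text);
}
```

For the all-ASCII samples both checks give the same answer; the Greek sample should match only the POSIX class.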


Re^2: Fine tuning a reg exp
by markjrouse (Initiate) on Feb 23, 2012 at 17:01 UTC
    Thanks all, this is all really useful. I think I've got this bit now, but using the same text as an example:
    ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT] AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a. JAMIAT AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA; a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near Pushtoon Garhi Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar Sabaqah School, Jalalabad, Afghanistan [SDGT]
    I'm now trying to match first names. What regexp is needed to match
    Ibrahim Ali Muhammad
    The reason is that I'm trying to add tags to a text document, so that I can manipulate it like this:
    $line =~ s/regexp\<name\>$1\<\/name\>/;
    I want to achieve this:
    ABU BAKR, <name>Ibrahim Ali Muhammad</name> (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT] AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS; a.k.a. JAMIAT AYAT-UR-RHAS AL ISLAMIA; a.k.a. JAMIAT IHYA UL TURATH AL ISLAMIA; a.k.a. LAJNAT UL MASA EIDATUL AFGHANIA) Grand Trunk Road, near Pushtoon Garhi Pabbi, Peshawar, Pakistan; Cheprahar Hadda, Mia Omar Sabaqah School, Jalalabad, Afghanistan [SDGT]

      Is there anything (punctuation, perhaps? placement with other words and terms?) that will consistently distinguish a name from any other proper noun in your text? For example, how can your script consistently distinguish between "Ibrahim Ali Muhammad" and "Grand Trunk Road" and "Pushtoon Garhi Pabbi", since all use the same capitalization scheme? You might have to define some more complicated criteria for recognizing names. Or will names only be in the headings of each entry, i.e. toward the beginning?

      In general, you would want:

      $line =~ s{($regexp)}{<name>$1</name>}g;

      The 'g' flag may or may not be needed, depending on what you're doing. If there's more than one name in a line, that would catch it. If there's only one name, you don't need it. The parentheses () match the name in your line and place it in $1, so you can put the tags around it in your replacement expression. Using curly brackets {} instead of / to mark your regexp avoids having to escape your slashes ("leaning toothpick syndrome," I think someone called it -- it can get confusing!). Any other characters could be used to delimit your regexp if you'd prefer. What I have above is equivalent to this:

      $line =~ s/($regexp)/<name>$1<\/name>/g;
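      To see the two delimiter styles and the effect of the 'g' flag side by side, here is a small sketch. The pattern in $regexp (one capitalized word) and the helper names are placeholders made up for illustration, not a real name matcher:

```perl
use strict;
use warnings;

# Placeholder pattern, assumed for illustration only: one capitalized word.
my $regexp = qr/[A-Z][a-z]+/;

# Curly-bracket delimiters: the slash in </name> needs no escaping.
sub tag_braces {
    my ($line) = @_;
    $line =~ s{($regexp)}{<name>$1</name>}g;
    return $line;
}

# Slash delimiters: same substitution, but the slash must be escaped.
sub tag_slashes {
    my ($line) = @_;
    $line =~ s/($regexp)/<name>$1<\/name>/g;
    return $line;
}

# Without the 'g' flag only the first match on the line is replaced.
sub tag_first_only {
    my ($line) = @_;
    $line =~ s{($regexp)}{<name>$1</name>};
    return $line;
}

my $sample = "Ibrahim Ali";
print tag_braces($sample),     "\n";   # <name>Ibrahim</name> <name>Ali</name>
print tag_slashes($sample),    "\n";   # same as above
print tag_first_only($sample), "\n";   # <name>Ibrahim</name> Ali
```

      Note that with this placeholder pattern each word gets its own pair of tags; wrapping a whole multi-word name in one pair needs a pattern that spans the spaces.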
        Yes, names should only be at the beginning of each line. Yeah, it gets tricky because names look like: ", names space|comma", but then so do other elements that are not names.
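        If names really do only appear at the start of a line, one possible approach is to anchor on the all-uppercase surname and its comma. The following is only a sketch under that assumption: tag_given_names is a hypothetical helper, and the pattern assumes given names are capitalized words separated by single spaces, which will not hold for every entry.

```perl
use strict;
use warnings;

# Hypothetical helper: tags the given names that directly follow an
# all-uppercase surname and a comma at the very start of the line.
sub tag_given_names {
    my ($line) = @_;
    $line =~ s{
        ^( [[:upper:]] [[:upper:]' -]* , [ ] )   # uppercase surname, comma, space
        (  [[:upper:]] [[:lower:]]+              # first capitalized word
           (?: [ ] [[:upper:]] [[:lower:]]+ )*   # further capitalized words
        )
    }{$1<name>$2</name>}x;
    return $line;
}

my $person = "ABU BAKR, Ibrahim Ali Muhammad (a.k.a. AL-LIBI, Abd al-Muhsin) (individual) [SDGT]";
my $group  = "AFGHAN SUPPORT COMMITTEE (ASC) (a.k.a. AHYA UL TURAS) [SDGT]";

print tag_given_names($person), "\n";
print tag_given_names($group),  "\n";   # no surname-comma prefix, so left unchanged
```

        Because of the ^ anchor, the alias "AL-LIBI, Abd al-Muhsin" later in the line is not tagged, and a name containing a lowercase particle such as "al-" would need a looser pattern.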
Re^2: Fine tuning a reg exp
by tchrist (Pilgrim) on Feb 23, 2012 at 23:17 UTC
    /^[[:upper:]]+$/
    I've never understood why people use that instead of the much easier to type, read, and use \p{upper}. Can you tell me why?
      Because the former works in a shell and sed, too?
