
This node has been constructed to assist the new programmer in understanding the use of Regular Expressions.

Regular expressions are also called PATTERNS. Patterns are used to locate text strings within text lines. (Text lines are not the only thing you can search, but for this tutorial, we are searching text lines.)

In this node, I will be specifically dealing with some basic building blocks of Regular Expressions. These are Pattern Searches, Alternative (or Alternation) Search Patterns, Substitution Operators, Concatenation and Pattern-Matching Character Classes.

Here is a basic example of searching for a pattern in a sentence:

    # Example 1
    #
    $_ = "I logged in as brusimm and found that I had email.";   # Test string
    if (/brus/) {                                                # Pattern: brus
        print "There, our pattern showed up in our string.\n";   # Tell user
    } else {                                                     # or
        print "No match in our string for our pattern.\n";       # As you'll see...
    }                                                            # end

The variable $_ actually had what we were looking for, and so it printed the line. If you were to change (/brus/) to (/crus/), and rerun the script, crus is not there to be found, and our 'no match' print statement would print to screen because of that.

Notice the forward slashes around /brus/. The full form of the pattern match operator is m//, and when you write the m you can pick almost any delimiter, e.g. m{brus} or m,brus,. With forward slashes, the m can be left off, so I've introduced you to the shortcut of typing less while getting more! You will mostly encounter pattern searches written as //, but if you ever see m//, you will know what it is.
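To see that the delimiter really is just a matter of taste, here is a small sketch. It assumes a named variable, $line, and uses the binding operator =~ to point the pattern at it instead of at $_; the variable names are mine, not part of the examples above.

```perl
use strict;
use warnings;

my $line = "I logged in as brusimm and found that I had email.";

# The same pattern written with three different delimiters.
my $slashes = ($line =~ /brus/)  ? 1 : 0;   # shortcut form, the m is implied
my $braces  = ($line =~ m{brus}) ? 1 : 0;   # explicit m with braces
my $commas  = ($line =~ m,brus,) ? 1 : 0;   # explicit m with commas

print "slashes=$slashes braces=$braces commas=$commas\n";
```

All three tests look for the same four letters, so they should agree with each other on any input.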

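In Example 1, our variable, $_, is just a single text line. If you were searching a whole file, you would wrap the "if" in a "while" loop that reads one line at a time, like this:

    while (<>) {           # read the input one line at a time into $_
        if (/brus/) {
            print $_;      # print only the lines that match
        }
    }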
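If you want to try the loop without a file on disk, one option is an in-memory filehandle. This is only a sketch: the three lines in $text are made up, and the @hits array is my own addition so the matches can be inspected afterwards.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a file: three lines held in one string.
my $text = "brusimm logged in\nsomeone else logged in\nmail for brusimm\n";

# Opening a reference to a string gives a filehandle that reads from it.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @hits;
while (<$fh>) {            # each line lands in $_
    if (/brus/) {
        push @hits, $_;    # keep the matching line
    }
}
close $fh;

print @hits;
```

With a real file you would drop the open and go back to while (<>), passing the file name on the command line.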

Another way to search for strings is with an Alternative (Alternation) search pattern, which is written with the vertical bar (|) and can be read as "or". It is one way to look for multiple terms. In Example 1, if you replace /brus/ with /but/, you get the output corresponding to "No match". If we then replace /but/ with /but|brus/, we are rewarded with the 'success' print statement, because 'but' OR 'brus' showed up in the string we searched.

I want to search for one or more terms

    # Example 2
    #
    $_ = "I logged in as brusimm and found that I had email.";   # Test string
    if (/but|brus/) {                                            # PatternS: but or brus
        print "There, our pattern showed up in our string.\n";   # Tell user
    } else {
        print "No match in our string for our pattern.\n";       # As you'll see...
    }                                                            # end

This worked great for me when I had a large list of names, and I went looking for each line where one of several names that I was interested in showed up!
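The list-of-names case might look something like the sketch below. The log lines and names are invented for illustration, and grep is used here simply as a compact way to keep the lines that match.

```perl
use strict;
use warnings;

# Hypothetical log lines; the names are made up for illustration.
my @lines = (
    "alice logged in",
    "brusimm logged in",
    "carol logged out",
    "dave logged in",
);

# Keep every line that mentions either of the two names.
my @wanted = grep { /alice|brus/ } @lines;

print "$_\n" for @wanted;
```

Adding another name is just a matter of adding another | and another term to the pattern.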

Let's say I consumed way too much coffee & might have mistyped the test string:

    # Example 3
    #
    $_ = "I logged in as bruuuusimm and found that I had email.";   # Our test sentence
    if (/brus/) {                                                   # Our pattern of brus
        print "There, brus showed up.\n";                           # if pattern is found, print it!
    }                                                               # end

My search pattern would not work. It's too explicit. BUT if we were to add an asterisk after the "u", telling Perl we weren't sure of the number of u's in the name, we might have better success finding the pattern. The asterisk means "zero or more of the character just before it". So rather than
if (/brus/) {
the line could be written as:
if (/bru*s/) {

and then we would have our print line show up.
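A quick way to get a feel for what u* accepts is to run the pattern against a few candidate strings. The candidates below are my own picks, not from the tutorial; note in particular that "zero or more" includes zero.

```perl
use strict;
use warnings;

# Which of these candidate strings does /bru*s/ match?
my %matched;
for my $candidate ("brs", "brus", "bruuuus", "bras") {
    $matched{$candidate} = ($candidate =~ /bru*s/) ? 1 : 0;
}

# brs, brus and bruuuus match; bras does not, because an "a"
# sits between the "r" and the "s".
print "$_ => $matched{$_}\n" for sort keys %matched;
```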

Substitution Operators

Let's say we know there were many instances in a file, where bruuuusimm occurred and we know we need to fix it. Really, we need to fix it!

The instruction would look something like this example line:

if (s/bru*s/brus/) {

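Here, the first place in the line where the pattern matches (bruuus, for example) is replaced with brus. To replace every match in the line rather than just the first, add the g modifier after the last slash: s/bru*s/brus/g. Notice the "s" before the FIRST forward slash. It turns the pattern match into the substitution operator, s///, which takes the pattern to find, then the replacement text. Check it out with this script: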

    # example 4
    #
    $_ = "I logged into bruuuusimm and saw I had email.\n";   # original error
    print $_;                                                  # printing proof of error
    if (s/bru*s/brus/) {                                       # fixing it
        print $_;                                              # proving we fixed it.
    }

In example 4, the variable started out with the misspelling, and we printed it to prove that. Then, after running our replacement, we printed it again to show that the replacement actually happened.
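One detail worth checking for yourself: s/// stops after the first match in a string unless the g modifier is appended. The sketch below runs both forms on two copies of a made-up sentence containing two misspellings; the variable names are mine.

```perl
use strict;
use warnings;

my $once = "bruuusimm wrote to bruuuusimm.";
my $all  = $once;                              # work on two copies

my $first_only = ($once =~ s/bru*s/brus/);     # no g: first match only
my $count      = ($all  =~ s/bru*s/brus/g);    # g: every match in the string

print "$once\n";
print "$all ($count replacements)\n";
```

The substitution returns the number of replacements it made, which is why it can sit inside an if as in example 4.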

Something we also need to think about is the issue of unintended matches. Because u* also matches zero u's, if we were doing a replacement within a large body of text and the strings brs or abbrs were in that text, those occurrences would also be changed (to brus and abbrus), so we need to be aware of that scenario.

MONKS: I'm looking for additional reference material to show how to lock in the pattern I am using here. (Bruce)
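One possible way to tighten the pattern, offered as a sketch rather than the definitive answer: use + ("one or more") instead of * so that at least one u is required, and put \b (a word boundary) in front so the match has to start at the beginning of a word. The sample text is invented.

```perl
use strict;
use warnings;

my $loose = "brs, abbrs and bruuuusimm";
my $tight = $loose;                  # same text, two different patterns

$loose =~ s/bru*s/brus/g;            # u* allows zero u's, so brs is changed too
$tight =~ s/\bbru+s/brus/g;          # \b: start of a word; u+: at least one u

print "loose: $loose\n";
print "tight: $tight\n";
```

Whether this is tight enough depends on what else is in your text; a correctly spelled brus still matches, but it is replaced with itself, so nothing changes.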

If we wanted to match a single character, for example, an 'a', our pattern would be /a/.

    # example 5
    #
    $_ = "I logged into brusimm and saw I had email.\n";
    if (/a/) {
        print $_;
    }

The line printed. If you were to replace /a/ with /z/, the line would not print because there is no "z" (Corresponding match) in the line.

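One related caveat involves the newline (\n): the wildcard dot (.), which stands for any single character, does not match a newline by default. A literal character such as /a/ has no such restriction. (That's for another day.)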

Additionally, if we wanted to combine two or more string values into a single string, that can be accomplished with the concatenation operator, a period (.), in your code line.

    # example 6
    #
    print "Hello" . ' ' . "world";   # Same as 'Hello world'
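The same operator works with variables, and Perl also has .= for appending to a string you already have. A small sketch, with variable names of my own choosing:

```perl
use strict;
use warnings;

my $greeting = "Hello" . ' ' . "world";   # three pieces joined into one string
my $line     = $greeting;
$line       .= "!";                       # .= appends to an existing string

print "$line\n";
```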

Pattern-Matching Character Classes

A pattern-matching character class is written as a pair of open and closed square brackets with a set of characters inside the brackets.

The important thing to remember is that only one of these characters needs to be present in the string being searched for the pattern to match.

What that means is if you run the following code:

    # example 7
    #
    $_ = "I logged into brusimm and saw I had FIVE email.\n";
    if (/[xyz]/) {
        print $_;
    }

Nothing will print from the IF query because there are no x's, y's or z's in the sentence. But, if you replace /[xyz]/ with /[abc]/, the IF query prints because at least one of the listed characters was found.

NOW, let's be careful here. I input lowercase letters. Had I input /[ABC]/, there would be no output to print, because case matters. Hmm.

So let's try this: Instead of "abc", let's replace it with a lowercase "f", and run that example. As you see, nothing happens. If you replace the lowercase "f" with an uppercase "F", we should get a printout because in my test sentence, the number FIVE is spelled out, with an UPPERCASE "F".

Now if you were looking for "f", and not sure of the case, one way to implement the search would be the following: /[fF]/. We are now saying, match either the upper or the lower case version of this letter.
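Another way, which this tutorial does not otherwise cover, is the i modifier after the closing slash, which tells Perl to ignore case for the whole pattern. The sketch below compares the three approaches on the test sentence from example 7, using a named variable and =~ in place of $_.

```perl
use strict;
use warnings;

my $line = "I logged into brusimm and saw I had FIVE email.\n";

my $lower_only = ($line =~ /[f]/)  ? 1 : 0;   # lowercase f only
my $either     = ($line =~ /[fF]/) ? 1 : 0;   # the class lists both cases
my $ignorecase = ($line =~ /f/i)   ? 1 : 0;   # i modifier: case is ignored

print "lower_only=$lower_only either=$either ignorecase=$ignorecase\n";
```

For a single letter the two approaches come to the same thing; the i modifier starts to pay off when the pattern is a whole word.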

But wait, I do not want to type out the whole alphabet or a whole series of numbers to find something. My time is way too short to do that because I'm working on tutorials! Is there a shorter way, (Unlike this node) to do this?

Yep.. instead of /[abc]/ in example 7, you could put /[a-c]/. (Hmm, in this case that's not less typing, but hopefully you get the point?)

So let's look at this example:

    # example 8
    #
    $_ = "I make a black pencil line.\n";
    if (/[q-zQ-Z]/) {
        print $_;
    }

Here I am looking for both upper and lower case letters from q to z. But the IF statement does not print. Oh yeah, there are no letters like that. Let's replace the "q" with an "a". There, that's better. This can also work for single digit numbers. I'll let you try that on your own.
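If you want something to compare your own attempt against, here is one way the digit range might be tried. The sample lines are made up (the first is the sentence from example 8 without its newline), and grep is used to collect the lines that match.

```perl
use strict;
use warnings;

# Hypothetical sample lines.
my @lines = ("I make a black pencil line.", "I made 3 lines.", "Quite so.");

my @with_digit  = grep { /[0-9]/ }    @lines;   # any single digit, 0 through 9
my @with_q_to_z = grep { /[q-zQ-Z]/ } @lines;   # q..z in either case

print "digit:  $_\n" for @with_digit;
print "q to z: $_\n" for @with_q_to_z;
```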

Now say you wanted to match characters that are NOT in a certain set. Humor me for a second: with the search pattern /[q-z]/ in example 8, you get no printout.. BUT, if you modify your search pattern to the following: /[^q-z]/, now it does print! Basically, a caret (^) as the first character inside the brackets says match any single character that IS NOT in this list! How's them apples?! This is called a NEGATED CHARACTER CLASS. Be careful, though: /[^0-9]/ matches any line that contains at least one non-digit character, so it does not find sentences with no numbers. To find those, negate the whole match instead: !/[0-9]/.
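The difference between "contains a character that is not a digit" and "contains no digits at all" is easy to mix up, so here is a sketch that puts the two side by side. The sample strings are invented.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @lines = ("Room 101", "12345", "no digits here");

# Negated class: true if at least one character is NOT a digit.
my @has_non_digit = grep { /[^0-9]/ } @lines;

# Negated match: true only if there is no digit anywhere in the string.
my @no_digits = grep { !/[0-9]/ } @lines;

print "has a non-digit: $_\n" for @has_non_digit;
print "has no digits:   $_\n" for @no_digits;
```

"Room 101" lands in the first list but not the second, which is exactly the case where the two tests disagree.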

You may also want to check out muba's node on Regexp's Do's & Dont's.

Other Sources for this subject:

pulling vowels from a sentence,
perldoc notes,
Perl.com, More in depth look
Perltut
and, CPAN

That concludes this tutorial. Source of my information is Learning Perl, 3rd & 4th Editions, by Schwartz, Phoenix & Foy.

END PROPOSED TUTORIAL


In reply to Regular Expressions Tutorial, the Basics (for BEGINNERS) by brusimm
