PerlMonks  

Regular Expressions Tutorial, the Basics (for BEGINNERS)

by brusimm (Pilgrim)
on Jan 22, 2007 at 18:04 UTC ( [id://595955]=perlmeditation )

This node has been constructed to assist the new programmer in understanding the use of Regular Expressions.

Regular expressions are also called PATTERNS. Patterns are used to locate text strings within text lines. (Text lines are not the only thing you can search with a pattern, but for this tutorial, we are searching text lines.)

In this node, I will deal specifically with some basic building blocks of Regular Expressions: Pattern Searches, Alternative (or Alternation) Search Patterns, Substitution Operators, Concatenation, and Pattern-Matching Character Classes.

Here is a basic example of searching for a pattern in a sentence:

    # Example 1
    #
    $_ = "I logged in as brusimm and found that I had email.";   # Test string
    if (/brus/) {                                                 # Pattern: brus
        print "There, our pattern showed up in our string.\n";   # Tell user
    } else {                                                      # or
        print "No match in our string for our pattern.\n";       # As you'll see...
    }                                                             # end

The string in $_ contains what we were looking for, so the first print statement ran. If you were to change (/brus/) to (/crus/) and rerun the script, crus is not there to be found, and our 'no match' print statement would print to screen instead.

Notice the forward slashes around /brus/. This is a shortcut for the match operator, m//. When you write the m, you can pick other delimiters, e.g. m{brus} or m,brus,. With forward slashes the m is optional, so I’ve started you on the shortcut of typing less while getting more! You will mostly encounter pattern searches written as //, but if you ever see m//, you will know what it is.
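To make the delimiter point concrete, here is a minimal sketch that runs the same pattern through four spellings of the match operator. The counter $hits is just a name made up for this illustration.

```perl
use strict;
use warnings;

$_ = "I logged in as brusimm and found that I had email.";

# The pattern is brus every time; only the delimiters change.
my $hits = 0;
$hits++ if /brus/;     # bare slashes, the m is left out
$hits++ if m/brus/;    # explicit m with slashes
$hits++ if m{brus};    # braces as delimiters
$hits++ if m,brus,;    # commas as delimiters

print "$hits of 4 forms matched\n";
```

All four forms should behave identically, so pick whichever keeps the pattern readable (braces are handy when the pattern itself contains slashes).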

In Example 1, our variable $_ holds just a single text line. If you were searching a whole file, you would wrap the "if" in a "while" loop that reads one line at a time into $_, like this:

    while (<>) {
        if (/brus/) {
            print $_;
        }
    }

Another way to search for strings is with an Alternative (Alternation) search pattern, which is represented by the vertical bar (|) and can be read as "or". It is one way to look for multiple terms. In Example 1, if you replace /brus/ with /but/, you get the output corresponding to "No match". If you then replace /but/ with /but|brus/, you are rewarded with the 'success' print statement, because 'but' OR 'brus' shows up in the string we searched.

I want to search for one or more terms

    # Example 2
    #
    $_ = "I logged in as brusimm and found that I had email.";   # Test string
    if (/but|brus/) {                                             # PatternS: but or brus
        print "There, our pattern showed up in our string.\n";   # Tell user
    } else {
        print "No match in our string for our pattern.\n";       # As you'll see...
    }                                                             # end

This worked great for me when I had a large list of names, and I went looking for each line where one of several names that I was interested in showed up!
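As a rough sketch of that "list of names" use, the loop below keeps only the lines that contain one of the two terms. The three sample lines are invented for the illustration; a real script would read them from a file.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for a larger list.
my @lines = (
    "brusimm logged in",
    "somebody else logged in",
    "but nobody read the email",
);

my @matched;
for (@lines) {
    push @matched, $_ if /but|brus/;   # keep lines containing "but" OR "brus"
}

print "$_\n" for @matched;
```

Only the first and third lines are kept here: the first contains "brus" and the third contains "but".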

Let’s say I consumed way too much coffee & might have mistyped the test string:

    # Example 3
    #
    $_ = "I logged in as bruuuusimm and found that I had email.";   # Our test sentence
    if (/brus/) {                             # Our pattern of brus
        print "There, brus showed up.\n";     # if pattern is found, print it!
    }                                         # end

My search pattern would not work. It’s too explicit. BUT if we add an asterisk after the "u", we tell Perl to accept any number of u’s at that spot (zero or more), and we have better success finding the pattern. So rather than
if (/brus/) {
the line could be written as:
if (/bru*s/) {

and then our print line shows up.
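A small sketch of what the asterisk accepts, using a few made-up test strings. The hash %found exists only for this illustration.

```perl
use strict;
use warnings;

my %found;

# u* means "zero or more u's", so every one of these contains a match.
for ("brs", "brus", "bruuuus") {
    $found{$_} = /bru*s/ ? 1 : 0;
}

# A different letter between the r and the s is not covered by u*.
for ("bras") {
    $found{$_} = /bru*s/ ? 1 : 0;
}

printf "%-8s %s\n", $_, $found{$_} ? "match" : "no match" for sort keys %found;
```

Note that "brs" matches as well, since zero u's is an acceptable count.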

Substitution Operators

Let’s say we know there were many instances in a file, where bruuuusimm occurred and we know we need to fix it. Really, we need to fix it!

The instruction would look something like this example line:

if (s/bru*s/brus/) {

Each time this line runs, the first bruuus found in $_ is replaced with brus; to replace every occurrence on a line, add the g modifier: s/bru*s/brus/g. Notice the “s” before the FIRST forward slash. It is not part of the pattern: it is the substitution operator, written s/PATTERN/REPLACEMENT/, and it is what turns a plain match into a replacement. Check it out with this script:

    # example 4
    #
    $_ = "I logged into bruuuusimm and saw I had email.\n";   # original error
    print $_;                                                  # printing proof of error
    if (s/bru*s/brus/) {                                       # fixing it
        print $_;                                              # proving we fixed it.
    }

In example 4, $_ started out holding the misspelled name, and we printed it to prove that. Then, after running our replacement, we printed it again to see that the replacement actually happened.
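When a line holds more than one misspelling, the difference between a plain s/// and one with the g modifier shows up. This is a minimal sketch: the two variable names and the test string are invented, and =~ is the operator that applies a substitution to a named variable instead of $_.

```perl
use strict;
use warnings;

my $first_only = "bruuusimm wrote to bruuuusimm";
my $every      = "bruuusimm wrote to bruuuusimm";

$first_only =~ s/bru*s/brus/;    # no modifier: only the first match is replaced
$every      =~ s/bru*s/brus/g;   # g modifier: every match in the string is replaced

print "$first_only\n";   # brusimm wrote to bruuuusimm
print "$every\n";        # brusimm wrote to brusimm
```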

Something we also need to think about is the issue of unintended matches. Because u* also accepts zero u’s, if we were doing a replacement within a large body of text and the strings brs or abbrs were in it, those occurrences would be changed too (to brus and abbrus), so we need to be aware of that scenario.

MONKS: I'm looking for additional reference material to show how to lock in the pattern I am using here. (Bruce)
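One possible way to tighten the pattern, offered as a sketch rather than the definitive fix: require at least two u's (uu+, so correctly spelled names are left alone) and insist that the b starts a word (\b, Perl's word-boundary assertion). Neither construct is covered in this tutorial, and the test string is made up.

```perl
use strict;
use warnings;

my $loose = "brs abbrs bruuusimm";
my $tight = "brs abbrs bruuusimm";

$loose =~ s/bru*s/brus/g;      # u* allows zero u's, and the b may sit mid-word
$tight =~ s/\bbruu+s/brus/g;   # word boundary first, then at least two u's

print "$loose\n";   # brus abbrus brusimm
print "$tight\n";   # brs abbrs brusimm
```

Whether this is tight enough depends on the text being cleaned, so it is worth testing any such pattern against a sample of the real data first.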

If we wanted to match a single character, for example, an 'a', our pattern would be /a/.

    # example 5
    #
    $_ = "I logged into brusimm and saw I had email.\n";
    if (/a/) {
        print $_;
    }

The line printed. If you were to replace /a/ with /z/, the line would not print because there is no "z" (Corresponding match) in the line.

Be aware that the related wildcard character, the period (.), which matches any single character, does not match a newline (\n). (That's for another day.)

Additionally, if we wanted to combine 2 string values into a single string, that can be accomplished with the concatenation operator, a period (.), in your code line.

    # example 6
    #
    print "Hello" . ' ' . "world";   # Same as 'Hello world'

Pattern-Matching Character Classes

A pattern-matching character class is written as a pair of open and closed square brackets with a set of characters between them.

The important thing to remember is that only one of these characters needs to be present in the searched string for the pattern to match.

What that means is if you run the following code:

    # example 7
    #
    $_ = "I logged into brusimm and saw I had FIVE email.\n";
    if (/[xyz]/) {
        print $_;
    }

Nothing will print from the IF query because there are no x’s, y’s or z’s in the sentence. But if you replace /[xyz]/ with /[abc]/, the line prints, because at least one of the listed characters (here both "a" and "b") appears in the sentence.

NOW, let’s be careful here. I input lowercase letters. Had I input /[ABC]/, there would be no output to print, because case matters. Hmm.

So let’s try this: instead of "abc", put a lowercase "f" in the brackets, /[f]/, and run that example. As you see, nothing prints. If you replace the lowercase "f" with an uppercase "F", we get a printout, because in my test sentence the number FIVE is spelled out with an UPPERCASE "F".

Now if you were looking for "f" and were not sure of the case, one way to implement the search would be /[fF]/. We are now saying: match either the upper or the lower case version of this letter.
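The case experiments above can be run side by side. This sketch uses the sentence from example 7; the last line also tries the i modifier, which is standard Perl but is not covered in this tutorial, so treat it as an aside.

```perl
use strict;
use warnings;

$_ = "I logged into brusimm and saw I had FIVE email.\n";

my $lower  = /[f]/  ? 1 : 0;   # 0: there is no lowercase f in the sentence
my $upper  = /[F]/  ? 1 : 0;   # 1: FIVE starts with an uppercase F
my $either = /[fF]/ ? 1 : 0;   # 1: the class accepts either case
my $nocase = /f/i   ? 1 : 0;   # 1: the i modifier ignores case for the whole pattern

print "lower=$lower upper=$upper either=$either nocase=$nocase\n";
```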

But wait, I do not want to type out the whole alphabet or a whole series of numbers to find something. My time is way too short to do that because I'm working on tutorials! Is there a shorter way, (Unlike this node) to do this?

Yep. Instead of /[abc]/ in example 7, you could put /[a-c]/. (Hmm, in this case that’s not less typing, but hopefully you get the point?)

So let’s look at this example:

    # example 8
    #
    $_ = "I make a black pencil line.\n";
    if (/[q-zQ-Z]/) {
        print $_;
    }

Here I am looking for both upper and lower case letters from q to z. But the IF statement does not print. Oh yeah, there are no letters like that in the sentence. Let’s replace the lowercase "q" with an "a", giving /[a-zQ-Z]/. There, that’s better. This also works for single digits, e.g. /[0-9]/. I’ll let you try that on your own.

Now say you wanted to match characters that are NOT in a certain set. Humor me for a second: with the search pattern /[q-z]/ in example 8, you get no printout. BUT, if you modify your search pattern to /[^q-z]/, now it does print! A caret (^) as the first character inside the brackets says: match any single character that IS NOT in this set. How’s them apples?! This is called a NEGATED CHARACTER CLASS. Be careful, though: /[^0-9]/ matches any line that contains at least one character that is not a digit, which is not the same as a line with no numbers. To find lines with no digits at all, negate the whole match instead: if (!/[0-9]/).
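The difference between "contains a non-digit" and "contains no digits" is easy to check with a few made-up lines. The two hashes are names invented for this sketch.

```perl
use strict;
use warnings;

my (%has_non_digit, %digit_free);

for ("no digits here", "route 66", "12345") {
    $has_non_digit{$_} = /[^0-9]/ ? 1 : 0;   # at least one character that is not a digit
    $digit_free{$_}    = !/[0-9]/ ? 1 : 0;   # no digit anywhere in the line
    printf "%-16s non-digit present: %d, digit-free: %d\n",
        $_, $has_non_digit{$_}, $digit_free{$_};
}
```

"route 66" is the interesting case: the negated class matches it (it has letters), yet it is clearly not free of numbers.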

You may also want to check out muba's node on Regexp's Do's & Dont's.

Other Sources for this subject:

pulling vowels from a sentence,
perldoc notes,
Perl.com, More in depth look
Perltut
and, CPAN

That concludes this tutorial. Source of my information is Learning Perl, 3rd & 4th Editions, by Schwartz, Phoenix & Foy.

END PROPOSED TUTORIAL

Replies are listed 'Best First'.
Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS)
by Not_a_Number (Prior) on Jan 22, 2007 at 19:43 UTC
    If we wanted to match a single character, our pattern would be /a./. The control character is a '.' after the letter in question.

    No! /a./ means match two characters, an 'a' followed by any other character. To match a single lower-case 'a', the pattern is simply /a/ - try it and see:

        $_ = 'He lives in Liberia';
        print 'with dot: ', $_ if /a./;      # No output
        print 'without dot: ', $_ if /a/;    # Outputs 'without dot: He lives in Liberia'

    Also, in your last paragraph, you've inverted your square brackets and slashes...

    Update:

    To clarify, the dot (period) is not a 'control character', as you seem to imagine, but a 'wildcard character'. To quote from perlretut (which I highly recommend you read, and include in your list of other sources):

    The period '.' matches any character but "\n"

    Update2: s/perlreftut/perlretut, sorry!

Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS)
by ww (Archbishop) on Jan 22, 2007 at 19:11 UTC
    brusimm, this seems a good start; even meritorious!

          ...sufficiently so that I hope I will not offend with observations on a few things that seem to me to be shortcomings. So, if you care,

      Additionally, within your statement of "Notice the control character of “s” before the forward slash", you are using multiple slashes in the statement, and each of them is preceded by an 's'. This could be misleading.

      Great start, however.

      --MidLifeXis

Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS)
by Solo (Deacon) on Jan 22, 2007 at 19:38 UTC

    s/bru*s/brus/

    will also change brs to brus and abbrs to abbrus.

    You might want to touch on unintended matches.

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS)
by demerphq (Chancellor) on Jan 24, 2007 at 13:32 UTC

    It's a little strange to me how few regexp tutorials start with the basics and move on from there. Maybe I'm too close to the trees, or maybe I'm too advanced to see what a beginner would need, but it strikes me that omitting the basics is a bad start.

    There are five fundamental building blocks of a regular expression. They are "characters", "concatenation", "alternation", "grouping", and "kleene closure"

    Characters are literal characters that must be matched. A character is matched by finding the leftmost occurring equivalent in the input string.

    Concatenation is the principle that two characters are concatenated together when not separated by an operator. Concatenation is implied in a pattern (there is no special operator for it) and has the lowest precedence of all operators except for alternation.

    Alternation is the way to say "match this subpattern or that subpattern". It is denoted by putting a | symbol in between the two subpatterns. Alternation has the lowest precedence of all the operators.

    Grouping is a way to combine multiple components into a self-contained subpattern. Alternation is often placed into a grouping construct. In Perl, grouping is denoted by putting the subpattern in parentheses.

    Kleene closure is a special pattern that matches 0 or more subpatterns in a string. This is denoted by a postfix * operator, or in less technical terms by placing a * after the subpattern.

    It turns out that many of the common tasks one would wish to perform with a regex are quite clumsy when restricted to such a sparse language. Therefore various extensions have been made which allow common constructs to be written more elegantly.

    It's common to want to match 1 or more subpatterns. While this can be expressed using Kleene closure alone, it can be clumsy, therefore the postfix plus operator is provided. P+ is defined to match the same thing as PP*.

    It's common to want to match any one of several characters at a given point in a string. Therefore the "character class" bracket construct is provided. [ABC] matches the same text that (A|B|C) matches. Note that this is restricted to single characters and not longer subpatterns.

    The ability to optionally match something is a common requirement. Therefore the ? postfix operator is provided. P? matches the same thing as (P|) matches. (P or nothing)
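    The three equivalences above can be spot-checked with a short script. This is only a sketch: it uses qr// to store patterns and the ^ and $ anchors to force a whole-string match, neither of which is covered in this thread, and the list of inputs is an arbitrary sample rather than a proof.

```perl
use strict;
use warnings;

# Each pair: [extension, the same thing spelled with the basic building blocks].
my @pairs = (
    [ qr/^ab+$/,   qr/^abb*$/    ],   # P+    versus PP*
    [ qr/^[abc]$/, qr/^(a|b|c)$/ ],   # [ABC] versus (A|B|C)
    [ qr/^ab?$/,   qr/^a(b|)$/   ],   # P?    versus (P|)
);

my $disagreements = 0;
for my $input ("", "a", "b", "c", "ab", "abb", "abc") {
    for my $pair (@pairs) {
        my ($short, $long) = @$pair;
        my $s = $input =~ $short ? 1 : 0;
        my $l = $input =~ $long  ? 1 : 0;
        $disagreements++ if $s != $l;   # count inputs where the two forms differ
    }
}

print "disagreements: $disagreements\n";
```

    If the equivalences hold for these inputs, the count stays at zero.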

    Anyway, just some thoughts for you. Obviously it all could use more polishing, but it's basic material that I think makes it easier to understand regexes.

    ---
    $world=~s/war/peace/g

Re: RFC - Regular Expressions Tutorial, the Basics (for BEGINNERS)
by muba (Priest) on Jan 24, 2007 at 02:57 UTC

    Ouch.

    Nice work - but take the replies above in consideration, there are quite some valid points there.

    This also reminds me that I'm still working on a similar project </shameless plug>

Re: Regular Expressions Tutorial, the Basics (for BEGINNERS)
by targetsmart (Curate) on Jan 28, 2009 at 13:02 UTC
    The above tutorial is interesting. Is there any other tutorial on Perl regular expressions that has more examples and provides in-depth coverage (from novice to professional)? Even after some years of experience in Perl, I think there are some advanced concepts in regular expressions which I must learn. (I am from a sed background.)
    I have read the perlre and perlretut manpages, but I would be happy to have a tutorial with more examples and exercises.
    Please point out places like books and online tutorials (especially ones dealing with Perl regular expressions).

    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
        But I doubt whether it gives comprehensive coverage (basic to advanced concepts + examples + exercises). Any other suggestions?

