Extended Patterns of Regular Expressions

by OM_Zen (Scribe)
on Feb 19, 2003 at 22:47 UTC (#236866, perltutorial)

EXTENDED PATTERNS IN REGULAR EXPRESSIONS - Reference to Tutorial

The section on extended patterns in perlre reads more as a reference than as a tutorial. Perl programmers who aspire to learn the language thoroughly (I am one of them) find perlre difficult for exactly that reason, so this node attempts to explain just the extended patterns of regular expressions.

Extended patterns are so named because they extend the existing regular expression syntax, which is already similar to that of awk and sed. An extended pattern is written as an opening parenthesis, a question mark, and then the extension itself, for example:

(?=\|)

perlre gives two reasons for choosing the question mark to introduce extended patterns.

The first is that question marks are rare in older regular expressions; the second is that a question mark makes one stop and think. The same holds, I think, when a question mark is used to make a quantifier non-greedy: one stops to think.

Some extended patterns are considered experimental and may be removed at any time, so I shall mainly discuss the others.

(?imsx-imsx)

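This extended pattern embeds one or more pattern-match modifiers in the regular expression itself. Letters before the hyphen switch a modifier on; letters after it switch a modifier off.

For example, (?i) makes the match case-insensitive from that point to the end of the enclosing group (or to the end of the pattern, if it is not inside a group).

This construct can be combined with the other extended patterns.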
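A minimal sketch of switching a modifier on and off inside a pattern. The sample strings and the variable names ($on, $off, @report) are made up for illustration and do not come from perlre.

```perl
use strict;
use warnings;

# "perl" is case-sensitive; everything after (?i) is case-insensitive.
my $on = qr/^perl(?i)monks$/;

# The /i flag covers the whole pattern; (?-i) switches it off again,
# so "MONKS" must appear exactly as written.
my $off = qr/^perl(?-i)MONKS$/i;

my @report;
for my $text ('perlMONKS', 'PERLmonks', 'PerlMONKS') {
    my $hit_on  = $text =~ $on  ? 'yes' : 'no';
    my $hit_off = $text =~ $off ? 'yes' : 'no';
    push @report, "$text on=$hit_on off=$hit_off";
}
print "$_\n" for @report;
```

Only the first string satisfies both patterns, because it is the only one with a lower-case "perl" and an upper-case "MONKS".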

Practical Use:

This kind of extended pattern is useful for matching dynamic file contents or user input, where we do not know or control the contents in advance.

As a practical scenario, assume an Oracle instance with a schema that has a column named cust-code. The column value has two parts: a fixed set of characters, followed by free text that names the sub-contractors a customer has.

Suppose one part of the string must match exactly, case included, while the other part must match regardless of case. The pattern to be matched is then a combination of case-sensitive and case-insensitive text.

To enforce such a pattern in the regular expression we could use:

my $cust_code = "SILICONEXSiliconex-wafers";
my $pattern = "SILICONEX((?i)siliconex-wafers)";
if ($cust_code =~ /$pattern/) {
    print "[$&]\n";
}
__END__
[SILICONEXSiliconex-wafers]


SILICONEX is the case-sensitive text that begins the pattern. The rest, ((?i)siliconex-wafers), is the extended pattern: it matches siliconex-wafers in any mix of upper and lower case. A cust_code of "SILICONEXSILICONEX-WAFERS" therefore matches as well, as does any other capitalisation of the second part.

In contrast, the pattern below matches the whole string in any mix of upper and lower case, not only the exact string "SILICONEXsiliconex-wafers":

my $pattern = "(?i)SILICONEXsiliconex-wafers";
if ($cust_code =~ /$pattern/) {
    print "$&\n";
}


The value "siliconexSILICONEX-WAFERS" also matches this pattern, because the leading (?i) makes the whole of $pattern, "SILICONEXsiliconex-wafers", case-insensitive.
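The difference in scope between the two patterns can be checked side by side. This is a small sketch using the same strings as the text; the names $partial, $whole and %seen are illustrative only.

```perl
use strict;
use warnings;

# (?i) inside the group: only the second half ignores case.
my $partial = qr/SILICONEX((?i)siliconex-wafers)/;
# (?i) at the start: the whole pattern ignores case.
my $whole   = qr/(?i)SILICONEXsiliconex-wafers/;

my %seen;
for my $code ('SILICONEXSiliconex-wafers',
              'SILICONEXSILICONEX-WAFERS',
              'siliconexSILICONEX-WAFERS') {
    my $p = $code =~ $partial ? 1 : 0;
    my $w = $code =~ $whole   ? 1 : 0;
    $seen{$code} = "$p$w";
    printf "%-28s partial=%d whole=%d\n", $code, $p, $w;
}
```

Only the last value, whose first half is in lower case, is rejected by the partially case-sensitive pattern.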

(?:pattern)

This construct clusters a sub-pattern into a group without capturing it.

(?:pattern) is purely a grouping syntax: unlike plain parentheses, it does not create a backreference such as $1.
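To see the difference between grouping and capturing, here is a short sketch; the sample text 'colour=red' and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = 'colour=red';    # hypothetical sample input

# Plain parentheses: both groups capture, so two values come back.
my @captured = $text =~ /(colou?r)=(\w+)/;

# (?:...) groups "colou?r" without capturing, so only the value comes back.
my @clustered = $text =~ /(?:colou?r)=(\w+)/;

print "captured:  @captured\n";
print "clustered: @clustered\n";
```

With the cluster, the value lands in $1 and no slot is spent on a group that was only needed for grouping.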

Practical use:

Text entered in a form's address field can be any free text. Likewise, when you migrate from Informix to Oracle, free-text columns such as an address or a manager's note can contain Control-M (^M) characters, which translate to carriage returns and split the value across several lines.

It can also happen the other way around: a free-text column value containing carriage returns comes out with ^M characters when it is extracted to a Unix ASCII file that has to be loaded into an Oracle schema.

For such complex strings, matching can be difficult without extended patterns, which give you the flexibility to

• apply a flag to one part of the pattern while the rest of the pattern has no flags
• sub-group the pattern and use the flags inside the group: (?m) as /m for multi-line strings, (?s) as /s for single-line strings, (?x) as /x for extended patterns with comments and whitespace, and (?i) as /i for case-insensitive matching, or any combination of these flags


(?:pattern) can be combined with the previous (?imsx-imsx) construct using the syntax

(?imsx-imsx:pattern)

to create a more flexible extended pattern.
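As a sketch of the combined form, the cust_code pattern from earlier can be written with (?i:...) so that the case-insensitive part is grouped but not captured. The counts below rely on $#+, which perlvar describes as the number of subgroups in the last successful match; the variable names are illustrative.

```perl
use strict;
use warnings;

my $cust_code = 'SILICONEXSiliconex-wafers';

# Capturing group with an inline flag.
my $capturing_ok     = $cust_code =~ /SILICONEX((?i)siliconex-wafers)/ ? 1 : 0;
my $capturing_groups = $#+;

# Same match with the flag and the cluster combined: no group is created.
my $cluster_ok     = $cust_code =~ /SILICONEX(?i:siliconex-wafers)/ ? 1 : 0;
my $cluster_groups = $#+;

print "capturing: matched=$capturing_ok groups=$capturing_groups\n";
print "cluster:   matched=$cluster_ok groups=$cluster_groups\n";
```

Both patterns accept the same strings; the second simply leaves $1 unused.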

Take this string from a schema's free-text column:

my $sub_contract = "SILICONEXsiliconex-wafers ^M NOTE1-- a shipment in ^M backlog calc note for the shipping dates the contract is delivered ";
my $pattern = "SILICONEX((?i)siliconex-wafers)";
if ($sub_contract =~ /$pattern/) {
    $sub_contract =~ s/(?m:(\^M|--))/ /g;
    my ($shipment_note, $sub_contract_keynote) = reverse split(/(?i)NOTE/, $sub_contract);
    print "[ $sub_contract_keynote ] [ $shipment_note ]\n";
}
__END__
[ 1  a shipment in   backlog calc  ] [  for the shipping dates the contract is delivered  ]


The substitution

s/(?m:(\^M|--))/ /g

replaces every literal ^M or -- with a space. The (?m:...) cluster applies the /m modifier only inside the group (here it changes nothing, since the group contains no ^ or $ anchor). The string is then split on a case-insensitive "NOTE":

split(/(?i)NOTE/, $sub_contract);
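The /m modifier only changes what ^ and $ mean, so a (?m:...) cluster shows its effect when an anchor sits inside it. A small sketch with a made-up three-line note:

```perl
use strict;
use warnings;

my $note = "first line\nsecond line\nthird line";   # hypothetical sample

# Without /m, ^ matches only at the very start of the string.
my @plain = $note =~ /^(\w+)/g;

# Inside (?m:...), ^ also matches just after every newline.
my @cluster = $note =~ /(?m:^)(\w+)/g;

print "plain:   @plain\n";
print "cluster: @cluster\n";
```

The cluster keeps the multi-line behaviour local, so any ^ or $ elsewhere in a longer pattern would keep its default meaning.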

(?=pattern) (?<=pattern) (?!pattern) (?<!pattern)

These constructs are the forward (look_ahead) and backward (look_behind) assertions. Each comes in a positive and a negative form: a positive assertion succeeds when its pattern matches at that point, and a negative assertion succeeds when it does not. Assertions are zero-width: they test the surrounding text without making it part of the match.

(?= indicates a positive forward (look_ahead) assertion
(?! indicates a negative forward (look_ahead) assertion
(?<= indicates a positive backward (look_behind) assertion
(?<! indicates a negative backward (look_behind) assertion
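The four assertions can be compared on one string. This is a sketch with a hypothetical sample text; it uses the /r modifier of s///, which needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $text = 'foobar foobaz xbar';    # hypothetical sample text

# Each substitution returns a changed copy (/r) and leaves $text alone.
my %result = (
    'positive look_ahead'  => $text =~ s/foo(?=bar)/FOO/gr,
    'negative look_ahead'  => $text =~ s/foo(?!bar)/FOO/gr,
    'positive look_behind' => $text =~ s/(?<=foo)ba\w/BA_/gr,
    'negative look_behind' => $text =~ s/(?<!foo)ba\w/BA_/gr,
);

for my $name ('positive look_ahead', 'negative look_ahead',
              'positive look_behind', 'negative look_behind') {
    printf "%-21s %s\n", $name, $result{$name};
}
```

In every case only the text outside the assertion is replaced; the text tested by the assertion stays as it was.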


Assertions can be fixed width or variable width.

• The positive and negative forward (look_ahead) assertions allow variable-width patterns
• The positive and negative backward (look_behind) assertions allow fixed-width patterns only

Practical Use:

Data extracted from the schemas always ends up as a delimited file on the operating system.

If the extracted file is to be loaded into another database's schema where the varchar2 fields are NOT NULL, a space has to be added wherever a field is empty. In our case the delimiter was x|x, because the data itself also contained pipes.

A first attempt at a regular expression for the x|x delimiter looks like this:

$product = "MOTOROLAx|xx|xx|x01-01-1900x|xYx|x";
$product =~ s/x\|xx\|x/x\|x x\|x/g;
print "[ $product ]\n";
__END__
[ MOTOROLAx|x x|xx|x01-01-1900x|xYx|x ]


What really happens, though, is that when two empty fields occur side by side, this substitution does not fix both of them. The actual result is



[ MOTOROLAx|x x|xx|x01-01-1900x|xYx|x ]


The substitution puts a space into the first xx, making it x x, but the next xx, which sits right beside the first, gets no space.

This is because the first x|xx|x is matched and replaced with x|x x|x. The second empty field shares its leading x|x delimiter with that match. Since the regular expression engine resumes searching after the end of the previous match, the shared delimiter has already been consumed, so the second occurrence does not match and is not spaced.
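Counting matches makes the skipped occurrence visible. In this sketch the first count uses the consuming pattern from the substitution, and the second wraps the same pattern in a zero-width look_ahead so that overlapping occurrences are found as well; the variable names are illustrative.

```perl
use strict;
use warnings;

my $product = 'MOTOROLAx|xx|xx|x01-01-1900x|xYx|x';

# Each match consumes its text, so the next search starts after it.
my $consumed = () = $product =~ /x\|xx\|x/g;

# A look_ahead consumes nothing, so overlapping occurrences are counted too.
my $overlapping = () = $product =~ /(?=x\|xx\|x)/g;

print "consuming matches:   $consumed\n";
print "overlapping matches: $overlapping\n";
```

The two empty fields are both present in the data, but the consuming pattern can only ever see one of them.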

This is handled by look_ahead and look_behind assertions:

$product = "MOTOROLAx|xx|xx|x01-01-1900x|xYx|x";
$product =~ s/(?<=\|)xx(?=\|)/x x/g;
print "[ $product ]\n";
__END__
[ MOTOROLAx|x x|x x|x01-01-1900x|xYx|x ]


The look_behind and look_ahead assertions look for an xx that follows a "|" and is followed by a "|". Only the xx itself ends up in $&, and it is replaced with x x. Because the delimiters are not consumed, adjacent empty fields are all matched.
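The extent of the match can be inspected through the @- and @+ arrays, which hold the start and end offsets of the last successful match. This sketch uses them in place of $& to show that the surrounding pipes are not part of the match.

```perl
use strict;
use warnings;

my $product = 'MOTOROLAx|xx|xx|x01-01-1900x|xYx|x';

my ($start, $end, $matched);
if ($product =~ /(?<=\|)xx(?=\|)/) {
    # $-[0] and $+[0] are the offsets where the whole match starts and ends.
    ($start, $end) = ($-[0], $+[0]);
    $matched = substr($product, $start, $end - $start);
    print "matched '$matched' at offsets $start..$end\n";
}
```

The match is two characters long even though the pattern also tests the character on each side of it.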

Look_ahead assertions accept variable-width patterns, so

(?=(\||$)) or (?=\|*)

is allowed. Take a string that also ends in xx:

$product = "MOTOROLAx|xx|xx|x01-01-1900x|xYx|xx";
$product =~ s/(?<=\|)xx(?=(\||$))/x x/g;
print "[ $product ]\n";
__END__
[ MOTOROLAx|x x|x x|x01-01-1900x|xYx|x x ]


Great monks such as tachyon and BrowserUk helped me with this extended pattern in an earlier node.

The fixed-width look_ahead in the first version does not handle the edge case at the end of the string, hence the variable-width look_ahead assertion.

Variable-width patterns are not allowed in look_behind assertions, which means a look_behind must be given a fixed-width pattern. Patterns such as (?<=x*) or (?<!(x|^)) cannot be used.
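When a variable-width condition is needed on the left-hand side, there are ways around the restriction. The sketch below shows two that are not part of the original discussion: \K (available from Perl 5.10), which keeps what was matched before it out of the replaced text, and an alternation between an anchor and a fixed-width look_behind. The sample strings are hypothetical.

```perl
use strict;
use warnings;

# \K: "foo" and any amount of whitespace must precede "bar",
# but only "bar" is replaced.
(my $kept = 'foo   bar') =~ s/foo\s*\Kbar/BAZ/;

# Alternation: xx must sit at the start of the string or after a pipe.
my $row = 'xx|xx|ab';    # hypothetical row with two empty leading fields
(my $spaced = $row) =~ s/(?:^|(?<=\|))xx(?=\||\z)/x x/g;

print "$kept\n";
print "$spaced\n";
```

Each alternative inside the cluster is zero-width on its own, so the look_behind itself stays fixed width.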

The experimental extended patterns of regular expressions include

(?{ code }) which embeds a block of Perl code in the pattern, in the place of "code"
(??{ code }) which evaluates "code" at run time and uses the result as a pattern, as if it had been there before run time
(?(condition)yes-pattern|no-pattern) which is a conditional pattern match
(?>pattern) which is called an independent subexpression: it matches the substring that a stand-alone pattern would match if anchored at the given position

These are the extended patterns which might be removed.
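For orientation only, here is a minimal sketch of the two constructs in this group that do not run embedded code: the independent subexpression (?>pattern) and the conditional (?(condition)yes-pattern|no-pattern). The sample strings are made up, and the code-embedding forms are deliberately left out.

```perl
use strict;
use warnings;

# (?>a+) grabs every "a" and never gives one back, so the "ab" that
# follows can no longer match; the ordinary a+ backtracks and succeeds.
my $plain  = 'aaab' =~ /a+ab/     ? 1 : 0;
my $atomic = 'aaab' =~ /(?>a+)ab/ ? 1 : 0;

# (?(1)\)) demands a closing parenthesis only if group 1 matched an
# opening one.
my $paren = qr/^(\()?\d+(?(1)\))$/;
my %balanced;
for my $candidate ('42', '(42)', '(42', '42)') {
    $balanced{$candidate} = $candidate =~ $paren ? 1 : 0;
}

print "plain=$plain atomic=$atomic\n";
print "$_ => $balanced{$_}\n" for sort keys %balanced;
```

The conditional accepts a number with or without parentheses but rejects a parenthesis that has no partner.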

I have left out the explanation of these and refer instead to perlre, because of their experimental nature; fellow monks have nodes in which they use these patterns. I consider this document complete only with the comments and explanations that others add to the outlined tutorial content.

update (broquaint): title change (was EXTENDED PATTERNS OF REGULAR EXPRESSIONS)

Edit by tye, add READMORE

Re: Extended Patterns of Regular Expressions
by Anonymous Monk on Feb 20, 2003 at 00:58 UTC
Re: Extended Patterns of Regular Expressions
by OM_Zen (Scribe) on Feb 20, 2003 at 02:51 UTC
    Hi, the code for the variable width look_ahead assertion is

    $product = "MOTOROLAx|xx|xx|x01-01-1900x|xYx|xx";
    $product =~ s/(?<=\|)xx(?=(\||$))/x x/g;
    print "[ $product ]\n";
    __END__
    [ MOTOROLAx|x x|x x|x01-01-1900x|xYx|x x ]

    The parenthesis was not there; thanks, Enlil, for your help in finding this.
