Help for regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help for regex by davido (Cardinal) on Apr 01, 2012 at 05:34 UTC
So it's impossible that a newline, whitespace, commas (or other significant delimiters), quotes, escape sequences, or other tags could be embedded in the ID? That being the case, this seems simple enough: `if( $string =~ m/<ID>([^<]+)<\/ID>/ ) { print "$1\n"; }` It gets a lot more complicated if the input turns out to be more complex. If you haven't done so already, please spend an hour with perlretut. After that you'll wonder why you needed to ask. Update: Added a backslash. ;) Dave
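A minimal, self-contained sketch of the match above, for anyone who wants to run it. The sample string is borrowed from FloydATC's reply further down the thread; the helper name `extract_id` is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between <ID> and </ID>, or undef.
sub extract_id {
    my ($string) = @_;
    # [^<]+ cannot run past the next '<', so the capture stops at </ID>.
    return $string =~ m/<ID>([^<]+)<\/ID>/ ? $1 : undef;
}

my $id = extract_id("-<ID>A8W11200031</ID>");
print defined $id ? "$id\n" : "no ID found\n";
```

Note that an empty element such as `<ID></ID>` does not match, because `+` demands at least one character.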
Re^2: Help for regex by Anonymous Monk on Apr 01, 2012 at 05:37 UTC
Can you please explain "(^<+)"?
Re^3: Help for regex by davido (Cardinal) on Apr 01, 2012 at 05:50 UTC
Certainly.
`[^...]` is a negated character class. If `[...]` allows you to enumerate what characters WILL match at a given position, `[^...]` allows you to say 'match any character except for these characters, at this position'. Negated character classes are discussed in perlretut under the heading Using character classes.
`+` is a quantifier. Quantifiers are discussed in perlretut. It says to match one or more characters that meet the criteria of the preceding character class.
And the `(...)` are capturing parentheses. Capturing parens are discussed in perlretut. They say to capture whatever happens to match the pattern within. Since this is the first capture, it will be placed in `$1`.
Putting it all together: Match anything that is not '<', as many characters as possible, and capture them into `$1`. `$1` and other capture variables are discussed in perlretut.
Now would be a good time to follow my suggestion to read perlretut. ...you are looking to learn about regexes, right? It should take about an hour or two to get the basics.
Dave
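As a rough illustration of the breakdown above, the same pattern can be spread out with the `/x` modifier so each piece carries its own comment. This is only a sketch; the variable names `$re` and `$captured` are made up for the example, and the sample string is the one from FloydATC's reply.

```perl
use strict;
use warnings;

my $string = "-<ID>A8W11200031</ID>";

# Same pattern as in the thread, written with /x so whitespace and
# comments inside the regex are ignored.
my $re = qr{
    <ID>        # literal opening tag
    (           # open capture group 1
      [^<]+     # negated class: one or more characters other than '<'
    )           # close capture group 1
    </ID>       # literal closing tag; no backslash needed with these delimiters
}x;

# In list context a successful match returns the captured groups.
my ($captured) = $string =~ $re;
print "$captured\n" if defined $captured;
```

One thing to watch with `/x`: variables are still interpolated inside the comments, so keep `$` and `@` out of them.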
Re^3: Help for regex by Anonymous Monk on Apr 01, 2012 at 05:41 UTC
The delimiters matter, so use YAPE::Regex::Explain:
`use YAPE::Regex::Explain;`
`print YAPE::Regex::Explain->new( qr{<ID>([^<]+)</ID>} )->explain;`
`__END__`
The regular expression:
(?-imsx:<ID>([^<]+)</ID>)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <ID>                     '<ID>'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^<]+                    any character except: '<' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  </ID>                    '</ID>'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Re: Help for regex by bitingduck (Chaplain) on Apr 01, 2012 at 05:41 UTC
If you just need to do it once, and have data that is absolutely guaranteed to be well formed (so that the regex will be reliable), you can use something like this: `/<ID>(.*?)<\/ID>/` But if you have to do it regularly, or on files where you can't guarantee that they'll be well formed, use an XML or HTML parser (e.g. XML::Parser or HTML::TokeParser). If it's not well-formed XML, then an HTML parser is likely to be more forgiving.
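A small sketch of the one-off regex route (not the parser route), assuming well-formed input. The two-element input line is invented for the example. In list context, `/g` returns every capture, and the non-greedy `.*?` stops at the first `</ID>` it can reach.

```perl
use strict;
use warnings;

# Hypothetical input with two ID elements on one line.
my $line = "<ID>A8W11200031</ID><ID>B9X22300042</ID>";

# List context plus /g collects the capture from each successive match.
my @ids = $line =~ /<ID>(.*?)<\/ID>/g;
print "$_\n" for @ids;
```

This still breaks on attributes, comments, CDATA and similar, which is the reason the parser advice stands for anything beyond a quick one-off.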
Re^2: Help for regex by Anonymous Monk on Apr 01, 2012 at 18:00 UTC
For this kind of thing I find XML::Simple to be, well, SIMPLE! ;) Using regexes to parse markup is risky because so much is permissible in XML/SGML/HTML. You have to be concerned with character sets, entities, etc. That said, I have often done it. But do take a look at XML::Simple, which answers most "trivial" cases quite well (and is built on the more robust XML libraries, so you can move to those if you need to).
Re: Help for regex by FloydATC (Deacon) on Apr 01, 2012 at 18:41 UTC
If all you really want to do is remove the tags, this should be enough: `my $string = "-<ID>A8W11200031</ID>"; $string =~ s/<.+?>//g; print $string . "\n";` Of course, if your actual input contains tags other than what your sample input shows, you'll get funny results. -- Time flies when you don't know what you're doing
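To make the "funny results" concrete, here is a sketch with an invented input that carries a second element (`NAME` is hypothetical, not from the original question). Stripping every tag leaves the text of both elements run together.

```perl
use strict;
use warnings;

# Hypothetical input containing a second element besides ID.
my $string = "<ID>A8W11200031</ID><NAME>widget</NAME>";

# Copy first, then strip every <...> tag from the copy.
(my $stripped = $string) =~ s/<.+?>//g;
print "$stripped\n";   # ID text and NAME text are no longer separated
```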
Re: Help for regex by cursion (Pilgrim) on Apr 02, 2012 at 13:47 UTC
You can do something like this to avoid getting too escape happy. `if ( $string =~ m#<ID>(.*)</ID># ) { ... }`
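One caveat worth sketching: the alternate delimiters remove the need for backslashes, but `.*` is greedy. With a made-up line holding two ID elements, it captures everything up to the last `</ID>`, whereas the `[^<]+` class from earlier in the thread stops at the first `<`.

```perl
use strict;
use warnings;

# Hypothetical input with two ID elements on one line.
my $line = "<ID>A8W11200031</ID><ID>B9X22300042</ID>";

# Greedy: .* backtracks only as far as the LAST </ID>.
my ($greedy) = $line =~ m#<ID>(.*)</ID>#;

# Bounded: [^<]+ cannot cross a '<', so it ends at the first </ID>.
my ($bounded) = $line =~ m#<ID>([^<]+)</ID>#;

print "greedy:  $greedy\n";
print "bounded: $bounded\n";
```

With a single ID per string, as in the sample data, both patterns give the same result.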

