PerlMonks  

Help for regex

by Anonymous Monk
on Apr 01, 2012 at 05:19 UTC ( #962834=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need some help writing a regex for the following: I need to remove the tags and get just the ID. Sample input and output are below. Please help.

INPUT:-<ID>A8W11200031</ID>
OUTPUT:-A8W11200031

Re: Help for regex
by davido (Archbishop) on Apr 01, 2012 at 05:34 UTC

    So it's impossible that a newline, whitespace, commas (or other significant delimiters), quotes, escape sequences, or other tags could be embedded in the ID? That being the case, this seems simple enough:

    if( $string =~ m/<ID>([^<]+)<\/ID>/ ) { print "$1\n"; }

    It gets a lot more complicated if the input turns out to be more complex.
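
    As a minimal sketch of how that one-liner might be wrapped up, assuming the ID never contains a '<': the helper name extract_id and the extra test strings are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Return the text between <ID> and </ID>, or undef when the pattern
# does not match (missing tags, or nothing between the tags).
sub extract_id {
    my ($string) = @_;
    return $string =~ m/<ID>([^<]+)<\/ID>/ ? $1 : undef;
}

# The first string is the sample from the question; the other two are
# hypothetical inputs that should not match.
for my $input ( 'INPUT:-<ID>A8W11200031</ID>', '<ID></ID>', 'no tags here' ) {
    my $id = extract_id($input);
    print defined $id ? "$id\n" : "no match\n";
}
```

    Because [^<]+ requires at least one character, an empty <ID></ID> is reported as no match rather than as an empty ID.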

    If you haven't done so already, please spend an hour with perlretut. After that you'll wonder why you needed to ask.

    Update: Added a backslash. ;)


    Dave

      Can you please explain "([^<]+)"?

        The delimiters matter, so

        use YAPE::Regex::Explain;
        print YAPE::Regex::Explain->new( qr{<ID>([^<]+)</ID>} )->explain;
        __END__
        The regular expression:

        (?-imsx:<ID>([^<]+)</ID>)

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?-imsx:                 group, but do not capture (case-sensitive)
                                 (with ^ and $ matching normally) (with . not
                                 matching \n) (matching whitespace and #
                                 normally):
        ----------------------------------------------------------------------
          <ID>                     '<ID>'
        ----------------------------------------------------------------------
          (                        group and capture to \1:
        ----------------------------------------------------------------------
            [^<]+                    any character except: '<' (1 or more
                                     times (matching the most amount
                                     possible))
        ----------------------------------------------------------------------
          )                        end of \1
        ----------------------------------------------------------------------
          </ID>                    '</ID>'
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------

        Certainly. [^...] is a negated character class. If [...] allows you to enumerate what characters WILL match at a given position, [^...] allows you to say 'match any character except for these characters, at this position'.

        Negated character classes are discussed in perlretut under the heading Using character classes.

        + is a quantifier. Quantifiers are discussed in perlretut. It says to match one or more characters that meet the criteria of the preceding character class. And the (...) are capturing parentheses. Capturing parens are discussed in perlretut. They say to capture whatever happens to match the pattern within. Since this is the first capture, it will be placed in $1.

        Putting it all together: Match anything that is not '<', as many characters as possible, and capture them into $1. $1 and other capture variables are discussed in perlretut.
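
        The same pieces can be seen side by side if the pattern is written with the /x modifier, which allows whitespace and comments inside the regex. This is only a sketch: the helper name all_ids and the second ID value B7X00000002 are invented for illustration.

```perl
use strict;
use warnings;

# The pattern from the reply above, spread out with /x so that each
# piece can carry its own comment.
my $id_re = qr{
    <ID>        # literal opening tag
    (           # start of capture group 1
      [^<]+     # one or more characters other than '<'
    )           # end of capture group 1
    </ID>       # literal closing tag
}x;

# In list context with /g, a match returns every captured value
# instead of setting only $1.
sub all_ids {
    my ($text) = @_;
    my @ids = $text =~ /$id_re/g;
    return @ids;
}

# The second ID is a hypothetical value, not from the original question.
print join( ',', all_ids('<ID>A8W11200031</ID> <ID>B7X00000002</ID>') ), "\n";
```

        With a single match and no /g, the captured text is available in $1 exactly as described above.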

        Now would be a good time to follow my suggestion to read perlretut. You are looking to learn about regexes, right? It should take about an hour or two to get the basics.


        Dave

Re: Help for regex
by bitingduck (Friar) on Apr 01, 2012 at 05:41 UTC

    If you just need to do it once, and have data that is absolutely guaranteed to be well formed (so that the regex will be reliable) you can use something like this:

    /<ID>(.*?)<\/ID>/

    But if you have to do it regularly, or on files where you can't guarantee that they'll be well formed, use an XML or HTML parser (e.g. XML::Parser or HTML::TokeParser). If it's not well-formed XML, then an HTML parser is likely to be more forgiving.

      For this kind of thing I find XML::Simple to be, well, SIMPLE! ;) Using REs for ML parsing is risky because so much is permissible in XML/SGML/HTML. You have to be concerned with character sets, entities, etc. That said, I have often done it. But do take a look at XML::Simple, which answers most "trivial" cases quite well (and is built on the more robust XML libraries, so you can move to those if you need to).
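
      For comparison, here is roughly what the XML::Simple route might look like. It is a sketch under several assumptions: XML::Simple has to be installed from CPAN (it is not a core module), the helper name id_from_xml is made up, and the <record> wrapper element is hypothetical, added only so that the document has a root element around <ID>.

```perl
use strict;
use warnings;

# Parse a small XML document and return the text of its <ID> child.
# XML::Simple is loaded at call time, so a missing module shows up as
# an exception from this sub rather than as a compile error.
sub id_from_xml {
    my ($xml) = @_;
    require XML::Simple;
    # KeyAttr => [] and ForceArray => 0 keep the result a plain hash
    # of element name => text for a flat document like this one.
    my $xs   = XML::Simple->new( KeyAttr => [], ForceArray => 0 );
    my $data = $xs->XMLin($xml);
    return $data->{ID};
}

# XMLin dies on malformed XML, so the call is wrapped in eval.
# <record> is a hypothetical wrapper element.
my $id = eval { id_from_xml('<record><ID>A8W11200031</ID></record>') };
print defined $id ? "$id\n" : "could not extract ID: $@";
```

      Unlike the regex, this fails loudly on input that is not well-formed XML, which is usually what you want when the data comes from somewhere you do not control.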
Re: Help for regex
by FloydATC (Chaplain) on Apr 01, 2012 at 18:41 UTC
    If all you really want to do is remove the tags, this should be enough:
    my $string = "-<ID>A8W11200031</ID>";
    $string =~ s/<.+?>//g;
    print $string . "\n";
    Of course, if your actual input contains tags other than those in your sample input, you'll get funny results.
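
    To make the "funny results" concrete, here is a small sketch. The helper name strip_tags and the <NAME> tag are hypothetical; only the substitution itself comes from the post above.

```perl
use strict;
use warnings;

# Remove anything that looks like a tag: '<', at least one character
# (as few as possible), then '>'.
sub strip_tags {
    my ($string) = @_;
    $string =~ s/<.+?>//g;
    return $string;
}

# The sample from the question: only the ID is left.
print strip_tags('<ID>A8W11200031</ID>'), "\n";   # A8W11200031

# With a second (hypothetical) tag, its text is kept too and runs
# straight into the ID.
print strip_tags('<ID>A8W11200031</ID><NAME>widget</NAME>'), "\n";   # A8W11200031widget
```

    The substitution removes every tag, not just <ID>, so it only gives the ID when the ID is the sole text in the input.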

    -- Time flies when you don't know what you're doing
Re: Help for regex
by cursion (Monk) on Apr 02, 2012 at 13:47 UTC

    You can do something like this to avoid getting too escape happy.

    if ( $string =~ m#<ID>(.*)</ID># ) { ... }
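
    One thing to watch with (.*) is that it is greedy, which matters as soon as a line holds more than one ID. The sketch below compares it with the non-greedy (.*?) from an earlier reply; the sub names and the second ID value are made up for illustration.

```perl
use strict;
use warnings;

# Two IDs on one line; the second value is invented for this example.
my $line = '<ID>A8W11200031</ID><ID>B7X00000002</ID>';

# Greedy: .* runs to the end of the line, then backs up to the LAST </ID>.
sub first_id_greedy {
    my ($s) = @_;
    return $s =~ m#<ID>(.*)</ID># ? $1 : undef;
}

# Non-greedy: .*? stops at the FIRST </ID> it can reach.
sub first_id_lazy {
    my ($s) = @_;
    return $s =~ m#<ID>(.*?)</ID># ? $1 : undef;
}

print first_id_greedy($line), "\n";   # A8W11200031</ID><ID>B7X00000002
print first_id_lazy($line),   "\n";   # A8W11200031
```

    On the single-ID sample from the question both versions give the same answer, so the difference only shows up on richer input.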
