PerlMonks  

RegEx Doubt

by mecrazycoder (Sexton)
on Sep 30, 2009 at 11:34 UTC

mecrazycoder has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a line like
<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>
Now I want to write a regex to fetch only a, a3, a2 and a1. How can I do that? The regex I wrote was something like
$text =~ /<xyz>(.*)<\/xyz>/
This doesn't seem to work. Please guide me.
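
For readers following along: one likely reason the pattern above does not do what was hoped is that .* is greedy, so it runs on to the last </xyz> on the line and captures nearly everything in one go. The sketch below is only an illustration of that point, not anything posted in the thread; the variable names are made up for the example, and it assumes the intended operator was =~.

```perl
use strict;
use warnings;

# Sample line from the question.
my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# Greedy .* stretches to the LAST </xyz>, so the single capture
# spans almost the whole line.
my ($greedy) = $text =~ /<xyz>(.*)<\/xyz>/;

# Non-greedy .*? stops at the nearest </xyz>; with /g in list
# context the match returns one capture per element.
my @values = $text =~ /<xyz>(.*?)<\/xyz>/g;

print "greedy: $greedy\n";
print "values: @values\n";
```

This only holds up for flat, well-behaved input like the sample line; for real XML or HTML a parser is the safer route.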

Replies are listed 'Best First'.
Re: RegEx Doubt
by moritz (Cardinal) on Sep 30, 2009 at 11:45 UTC
    Use an XML or HTML parser (like XML::Twig or HTML::TreeBuilder) - parsing HTML with regexes can be very painful and time consuming.
Re: RegEx Doubt
by ccn (Vicar) on Sep 30, 2009 at 11:37 UTC
      Thanks, man.
Re: RegEx Doubt
by Bloodnok (Vicar) on Sep 30, 2009 at 11:45 UTC
    TIMTOWTDI - why not use split ...
    $ perl -e '@a = split /<\/?xyz>/, q(<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>); print qq/@a\n/'
    a a3 a2 a1
    Update:

    Ahhh, maybe I see one reason, using Data::Dumper to print the output gives:

    $ perl -MData::Dumper -e '@a = split /\<\/?xyz\>/, q(<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>); print Dumper \@a'
    $VAR1 = [
              '',
              'a',
              '',
              'a3',
              '',
              'a2',
              '',
              'a1'
            ];
    The question, for me at least, is: why doesn't split swallow the substrings on which the string is split? I'm obviously missing something, but can't see it. Any enlightenment appreciated.

    TIA

    A user level that continues to overstate my experience :-))

      Hm, it's non-zero-width, so it's still nice and easy: you have multiple 'split-points' in sequence in the source. Either run grep /./ on the split results, or use split /(?:...)+/ to 'combine' them into one 'split-point'.
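
A minimal sketch of the two suggestions, assuming the ... in /(?:...)+/ stands for the <\/?xyz> pattern used earlier in the thread (that substitution is an assumption, as are the variable names):

```perl
use strict;
use warnings;

my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# Option 1: split on every tag, then keep only the non-empty fields.
my @filtered = grep { /./ } split /<\/?xyz>/, $text;

# Option 2: treat a run of adjacent tags as a single separator.
# The leading <xyz> still produces one empty first field.
my @combined = split /(?:<\/?xyz>)+/, $text;

# Bracket each field so empty strings are visible.
print join(',', map { "[$_]" } @filtered), "\n";
print join(',', map { "[$_]" } @combined), "\n";
```

The first line printed shows four fields and the second shows five, because the combined pattern still matches at the very start of the string.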

      Aren't they cute, those little regexes? Remembering apocalypse5 fondly :).

        "...use split /(?:...)+/ to 'combine' them into one 'split-point'."
        ...and then filter out the empty elements with grep /./, ..., since the first element is still empty. So you might as well use grep /./, ... on the lot to start with.

        I tried the non-capturing approach, but a) transposed the '?' and the ':' and b) didn't use '+' ... doh !!!

        Update:

        To reduce any confusion: the transposition I referred to above was entirely down to the paucity of my typing, i.e. I typed ':?' instead of '?:' and didn't notice .oO(Maybe I ought to use a larger font...) ;-)
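
To see why that transposition matters, here is a small sketch. It assumes the mistyped pattern was (:?<\/?xyz>) without the +, which is a guess based on the description above. With ':?' the parentheses become an ordinary capturing group that first matches an optional colon, and split returns whatever a capturing group matched as extra fields.

```perl
use strict;
use warnings;

my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# Intended form: (?:...) groups without capturing.
my @intended = split /(?:<\/?xyz>)/, $text;

# Typo form: (:?...) captures, so each separator is returned as a field too.
my @typo = split /(:?<\/?xyz>)/, $text;

printf "intended: %d fields\n", scalar @intended;
printf "typo:     %d fields\n", scalar @typo;
print  "typo output contains the tags themselves\n" if grep { /xyz/ } @typo;
```

The typo version returns twice as many fields, with the tags interleaved among the values.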

      It puzzled me for a minute, but I found the explanation. You have two separator substrings every time, separated by nothing: </xyz>_nothing_<xyz>. And this nothing is what you find in your output.
      It does swallow the substrings on which the string is split. I don't see any <\/?xyz>s in the Dumper output.

      The blank entries being returned are the zero-length substrings in the middle of </xyz><xyz> - that combination is two matches of the split pattern with nothing in between.
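
A quick way to see those zero-length fields is to bracket each element of the split result. This is just a sketch with made-up variable names; the second call uses split's negative LIMIT, which keeps trailing empty fields that split normally drops.

```perl
use strict;
use warnings;

my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# Default split: the tags are consumed, the empty fields between
# adjacent tags (and before the first tag) are kept.
my @fields = split /<\/?xyz>/, $text;
print join(',', map { "[$_]" } @fields), "\n";

# A negative limit also keeps the empty field after the final </xyz>.
my @all = split /<\/?xyz>/, $text, -1;
printf "%d fields by default, %d with a limit of -1\n",
    scalar @fields, scalar @all;
```

None of the printed fields contains a tag, which matches the point that split does swallow the separators; only the nothing between them is left over.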
