RegEx Doubt

mecrazycoder has asked for the wisdom of the Perl Monks concerning the following question:

Re: RegEx Doubt by moritz (Cardinal) on Sep 30, 2009 at 11:45 UTC
Use an XML or HTML parser (like XML::Twig or HTML::TreeBuilder) - parsing HTML with regexes can be very painful and time consuming.
Perl 6 - links to (nearly) everything that is Perl 6.
Re: RegEx Doubt by ccn (Vicar) on Sep 30, 2009 at 11:37 UTC
`my @a = $text =~ m{<xyz>([^<]*)</xyz>}g;`
See more: `perldoc perlrequick`, `perldoc perlretut`, `perldoc perlop`
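For anyone who wants to try the one-liner above, here is a minimal sketch. The sample string is borrowed from the split example elsewhere in this thread, since the original input is not shown; treat it as an assumption.

```perl
use strict;
use warnings;

# Assumed sample input (taken from the split example in this thread).
my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# In list context, m//g returns the capture group of every match,
# so @a receives the text found between each <xyz>...</xyz> pair.
my @a = $text =~ m{<xyz>([^<]*)</xyz>}g;

print join(',', @a), "\n";   # prints: a,a3,a2,a1
```

Note that `[^<]*` stops at the first `<`, so this only works when the element contains no nested tags.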
Re^2: RegEx Doubt by mecrazycoder (Sexton) on Sep 30, 2009 at 12:10 UTC
Thanks, man.
Re: RegEx Doubt by Bloodnok (Vicar) on Sep 30, 2009 at 11:45 UTC
TIMTOWTDI - why not use split ...
`$ perl -e '@a = split /<\/?xyz>/, q(<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>); print qq/@a\n/'`
`a a3 a2 a1`
Update: Ahhh, maybe I see one reason. Using Data::Dumper to print the output gives:
`$ perl -MData::Dumper -e '@a = split /\<\/?xyz\>/, q(<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>); print Dumper \@a'`
`$VAR1 = [ '', 'a', '', 'a3', '', 'a2', '', 'a1' ];`
The question, for me at least, is: why doesn't split swallow the sub-strings on which the string is split? I'm obviously missing something, but can't see it - any enlightenment appreciated. TIA
A user level that continues to overstate my experience :-))
Re^2: RegEx Doubt by jakobi (Pilgrim) on Sep 30, 2009 at 12:22 UTC
Hm, it's non-zero-width, so it's still nice and easy: you have multiple 'split-points' in sequence in the source. Try either `grep /./` on the split results, or use `split /(?:...)+/` to 'combine' them into one 'split-point'. Aren't they cute, those little regexes? Remembering apocalypse5 fondly :).
Re^3: RegEx Doubt by Bloodnok (Vicar) on Sep 30, 2009 at 12:28 UTC
"...use split /(?:...)+/ to 'combine' them into one 'split-point'."
... and then grep for empty elements, i.e. `grep /./, ...`, since the first element is still empty. So you might as well use `grep /./, ...` on the lot to start with. I tried the non-capturing group approach, but a) transposed the '?' and the ':' and b) didn't use '+' ... doh !!!
Update: To reduce any confusion, the transposition to which I referred above was entirely down to the paucity of my typing, i.e. I typed ':?' instead of '?:' and didn't notice .oO(Maybe I ought to use a larger font...) ;-)
A user level that continues to overstate my experience :-))
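A minimal sketch of the two suggestions discussed above, assuming the same sample string as the split one-liner earlier in the thread. The variable names `@filtered` and `@combined` are made up for illustration.

```perl
use strict;
use warnings;

my $text = '<xyz>a</xyz><xyz>a3</xyz><xyz>a2</xyz><xyz>a1</xyz>';

# Option 1: split on each single tag, then drop the empty fields.
my @filtered = grep /./, split m{</?xyz>}, $text;

# Option 2: treat a run of adjacent tags as one separator.
# The string starts with a tag, so the first field is still empty.
my @combined = split m{(?:</?xyz>)+}, $text;

print scalar(@filtered), " fields after grep\n";       # 4
print scalar(@combined), " fields with (?:...)+\n";    # 5, first one is ''
```

Because option 2 still leaves the leading empty field, the grep is needed either way, which is the point made in the reply above.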
Re^2: RegEx Doubt by grizzley (Chaplain) on Oct 01, 2009 at 07:39 UTC
It puzzled me for a minute, but I found the explanation. You have two separators in a row every time, separated by nothing: `</xyz>_nothing_<xyz>`. And this nothing is what you find in your output.
Re^2: RegEx Doubt by dsheroh (Monsignor) on Oct 01, 2009 at 07:41 UTC
It does swallow the substrings on which the string is split. I don't see any `<\/?xyz>`s in the Dumper output. The blank entries being returned are the zero-length substrings in the middle of `</xyz><xyz>` - that combination is two matches of the split pattern with nothing in between.
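To make the empty fields visible, here is a small sketch using a shortened version of the sample string (an assumption, chosen to keep the output short). It also shows, as documented in `perldoc -f split`, that trailing empty fields are dropped unless a negative limit is given, while a leading empty field is kept.

```perl
use strict;
use warnings;

my $text = '<xyz>a</xyz><xyz>a3</xyz>';

# Default split: the leading empty field and the empty field between
# </xyz> and <xyz> are kept; the trailing empty field is removed.
my @fields = split m{</?xyz>}, $text;
my $shown  = join '|', map { '[' . $_ . ']' } @fields;
print "$shown\n";            # prints: []|[a]|[]|[a3]

# A negative limit keeps the trailing empty field as well.
my @all = split m{</?xyz>}, $text, -1;
print scalar(@all), "\n";    # prints: 5
```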

