PerlMonks  

Don't understand behavior of this split

by cormanaz (Chaplain)
on Aug 26, 2010 at 14:19 UTC (#857456)
cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Bros. I am trying to split some html paragraphs into a list and keep only the text. So say the html is
<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>
I try to split it with the regex (<\/p>)?\s*<p> and what I get is a list with elements:
(null) Paragraph 1 </p> Paragraph 2 </p> Paragraph 3</p>
I understand why the last element still has the close-paragraph tag in it, but why are there separate elements between the paragraph ones that contain only a close-paragraph tag? I thought split was supposed to split on whatever it matches, and I know it matches the </p><p> sequence.

TIA

Steve

Re: Don't understand behavior of this split
by moritz (Cardinal) on Aug 26, 2010 at 14:25 UTC

    If the regex has a capturing group (and yours has one, the (<\/p>)), then the result of that capture (here $1) is interleaved with the split chunks.

    To avoid that, use the regex (?:<\/p>)?\s*<p> instead.

    See split and perlre for more details.
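
The difference can be checked with a short script. This is only a sketch: the sample string is the HTML from the question, and the helper name show and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $html = '<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>';

# Capturing group: whatever $1 matched (or undef when the optional
# group did not take part in the match) is returned between the fields.
my @captured = split /(<\/p>)?\s*<p>/, $html;

# Non-capturing group: only the fields themselves come back.
my @plain = split /(?:<\/p>)?\s*<p>/, $html;

# Mark undef and empty strings explicitly so they are visible when printed.
sub show { join ' | ', map { defined $_ ? "[$_]" : '(undef)' } @_ }

print "capturing:     ", show(@captured), "\n";
print "non-capturing: ", show(@plain), "\n";
```

Note that even the non-capturing version keeps an empty first element (the string starts with a delimiter) and leaves the final </p> attached to the last element.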

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Don't understand behavior of this split
by kennethk (Monsignor) on Aug 26, 2010 at 14:26 UTC
    As documented in split:

    If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter.

    You can get your expected behavior by changing to a non-capturing group:

    (?:<\/p>)?\s*<p>
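
Since the goal in the question is to keep only the text, the split result still needs a little cleanup. The following is one possible sketch, not the only way to do it; the sample data is the HTML from the question and the variable names are assumptions.

```perl
use strict;
use warnings;

my $html = '<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>';

# Split on an optional closing tag, optional whitespace, and an opening tag.
my @parts = split /(?:<\/p>)?\s*<p>/, $html;

# The string starts with a delimiter, so the first field is empty: drop it.
shift @parts if @parts && $parts[0] eq '';

# The last field still ends with </p>; strip a trailing close tag if present.
s{</p>\s*\z}{} for @parts;

print "$_\n" for @parts;
```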

Re: Don't understand behavior of this split
by Corion (Pope) on Aug 26, 2010 at 14:26 UTC

    See split. You don't show us the code you're using, but split returns captured items in the regex as well. Instead of splitting, maybe matching what you want to keep works better for you?

Re: Don't understand behavior of this split
by TomDLux (Vicar) on Aug 26, 2010 at 14:28 UTC

    From perldoc -f split:

    If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter. split(/([,-])/, "1-10,20", 3); produces the list value (1, '-', 10, ',', 20)

    Try using non-capturing parentheses, (?:...), instead.
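
The perldoc example can be run side by side with a non-capturing variant. This is a minimal sketch; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Capturing parentheses: the delimiters are returned along with the fields.
my @with_delims = split /([,-])/, "1-10,20", 3;

# Non-capturing parentheses: the delimiters are discarded.
my @fields_only = split /(?:[,-])/, "1-10,20", 3;

print join(' ', @with_delims), "\n";   # 1 - 10 , 20
print join(' ', @fields_only), "\n";   # 1 10 20
```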

    update: Boy you gotta type fast around here!

    update: changed from 'pre' to 'code', because the square brackets were being misinterpreted

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: Don't understand behavior of this split
by psini (Deacon) on Aug 26, 2010 at 14:28 UTC

    From perlfunc, for the split operator:

    "If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter."

    So what you get is the expected behaviour.

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Don't understand behavior of this split
by ikegami (Pope) on Aug 26, 2010 at 14:31 UTC
    You're trying to 1) separate the paragraphs, then 2) extract the data from those paragraphs. split could be used for the former, but it won't be sufficient for the latter. This will do both:
    my @matches = m{<p>\s*(.*?)\s*</p>}sg;
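
As written, that line matches against $_. A self-contained sketch that binds the match to a variable instead, assuming the sample HTML from the question is held in a variable named $html:

```perl
use strict;
use warnings;

my $html = '<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>';

# In list context with /g, the match returns every capture from every match.
# /s lets . match newlines, so paragraphs spanning several lines also work.
my @matches = $html =~ m{<p>\s*(.*?)\s*</p>}sg;

print "$_\n" for @matches;
```

The non-greedy (.*?) matters here: a greedy .* would run from the first <p> to the last </p> and return a single element.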

Node Type: perlquestion [id://857456]
Approved by moritz
Front-paged by moritz