PerlMonks  

Re: HTML::TokeParser Frustration

by graff (Chancellor)
on Oct 17, 2008 at 05:18 UTC ( [id://717659]=note )


in reply to HTML::TokeParser Frustration

I'm not sure what your ultimate goal is, but if you want to print out just the content of a "cdata" tag, with all other tags retained within that section of data, maybe something like this will do:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $sample_HTML = <<EOD;
<HTML>
blah.
<CDATA>
Just some random whatever. It might have some <b>real</b> HTML like a table
or CSS styling or even some <H1>IMPORTANT</H1> words.
Maybe even a form <form method=post>...</form>
</CDATA>
</HTML>
EOD

my $p = HTML::TokeParser->new( \$sample_HTML );
my $in_cdata = 0;
while ( my $token = $p->get_token ) {
    my ( $tkn_type, $tkn_content, @rest ) = @$token;
    my $is_tag = ( $tkn_type eq 'S' or $tkn_type eq 'E' );
    # for tags, $tkn_content still holds the tag name at this point
    my $is_cdata_tag = ( $is_tag and lc($tkn_content) eq 'cdata' );
    if ( $is_tag ) {
        $tkn_content = pop @rest;  # last array element is full tag string
    }
    print $tkn_content if ( $in_cdata and not $is_cdata_tag );
    if ( $is_cdata_tag ) {
        $in_cdata += ( $tkn_type eq 'S' ) ? 1 : -1;
    }
}
That doesn't print the CDATA tags themselves, but it prints everything inside the CDATA tags, including other tags. To do that, the main loop has to process all "tokens" (all tags and all intervening text in the whole document) one token at a time, and a state variable has to keep track of whether you're currently inside a cdata section.
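If it helps to see the state-variable idea apart from the parser, here is a rough sketch that works on a hand-built token list instead of a live HTML::TokeParser object. The helper name inside_tag and the sample tokens are made up for illustration; the only thing assumed about the tokens is that they have the same array shape that get_token returns. It uses a depth counter rather than a plain flag, so a nested pair of the wanted tag would not end the section early.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the original text of every token found
# between <$wanted> and </$wanted>, leaving out those two tags themselves.
# Tokens are array refs shaped like HTML::TokeParser's get_token results:
#   [ 'S', $tag, $attr, $attrseq, $text ]   start tag
#   [ 'E', $tag, $text ]                    end tag
#   [ 'T', $text, $is_data ]                plain text
sub inside_tag {
    my ( $wanted, @tokens ) = @_;
    my $depth = 0;     # state variable: how many <$wanted> tags are open
    my $out   = '';
    for my $token (@tokens) {
        my ( $type, $content, @rest ) = @$token;
        my $is_tag    = ( $type eq 'S' or $type eq 'E' );
        my $is_wanted = ( $is_tag and lc($content) eq lc($wanted) );
        # For tags, the last element is the tag exactly as written.
        $content = $rest[-1] if $is_tag;
        $out .= $content if ( $depth > 0 and not $is_wanted );
        if ($is_wanted) {
            $depth += ( $type eq 'S' ) ? 1 : -1;
            $depth = 0 if $depth < 0;    # ignore a stray closing tag
        }
    }
    return $out;
}

# Made-up sample tokens, roughly what "blah. <CDATA>some <b>real</b> HTML</CDATA> ignored"
# would produce.
my @tokens = (
    [ 'T', 'blah. ', '' ],
    [ 'S', 'cdata', {}, [], '<CDATA>' ],
    [ 'T', 'some ', '' ],
    [ 'S', 'b', {}, [], '<b>' ],
    [ 'T', 'real', '' ],
    [ 'E', 'b', '</b>' ],
    [ 'T', ' HTML', '' ],
    [ 'E', 'cdata', '</CDATA>' ],
    [ 'T', ' ignored', '' ],
);

my $result = inside_tag( 'cdata', @tokens );
print "$result\n";
```

Comparing against the tag name (the second array element) rather than the tag's full text means ordinary text that happens to contain the word "cdata" can't flip the state by accident.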
