PerlMonks  

Extract inline script from an XHTML with XML::Twig

by ambrus (Abbot)
on Oct 30, 2013 at 12:15 UTC (#1060350, CUFP)

In the chatterbox, SagaraSouske asked for help extracting text from script elements in an XHTML document. He wrote code using HTML::TreeBuilder. This node shows how to do the equivalent with XML::Twig. The XHTML example comes directly from SagaraSouske.

#!perl
use 5.014;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(xmlinput());
for my $tr_elt ($twig->findnodes(q(//tr[@class='Odd']))) {
    if (my($script_elt) = $tr_elt->findnodes(q(td[1]/script))) {
        say "Script: ", $script_elt->text;
    }
    if (my($td2_elt) = $tr_elt->findnodes(q(td[2]))) {
        say "Other: ", $td2_elt->text;
    }
}

sub xmlinput { q{
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<body>
<table border="0" cellpadding="4" cellspacing="0" class="DataGrid" width="1000px">
<tr class="Odd"><td><script type="text/javascript">Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%31")</script></td><td>Other Data</td></tr>
<tr class="Even"><td><script type="text/javascript">Decode("%32%30%33%2e%31%35%36%2e%32%30%37%2e%32%34%39")</script></td><td>Other Data</td></tr>
<tr class="Odd"><td><script type="text/javascript">Decode("%32%32%32%2e%36%32%2e%32%30%37%2e%37%30")</script></td><td>Other Data</td></tr>
<tr class="Even"><td><script type="text/javascript">Decode("%32%30%32%2e%31%31%32%2e%31%31%37%2e%39%34")</script></td><td>Other Data</td></tr>
<tr class="Odd"><td><script type="text/javascript">Decode("%35%38%2e%32%30%2e%32%32%38%2e%32%32")</script></td><td>Other Data</td></tr>
<tr class="Even"><td><script type="text/javascript">Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%30")</script></td><td>Other Data</td></tr>
<tr class="Odd"><td><script type="text/javascript">Decode("%32%32%33%2e%38%37%2e%31%39%2e%35")</script></td><td>Other Data</td></tr>
</table>
</body>
</html>
}; }
__END__

Update: here's the output:

Script: Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%31")
Other: Other Data
Script: Decode("%32%32%32%2e%36%32%2e%32%30%37%2e%37%30")
Other: Other Data
Script: Decode("%35%38%2e%32%30%2e%32%32%38%2e%32%32")
Other: Other Data
Script: Decode("%32%32%33%2e%38%37%2e%31%39%2e%35")
Other: Other Data
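
The extracted script text is still a JavaScript call with a percent-encoded argument. As a possible follow-up step, the sketch below pulls the quoted argument out and unescapes it. It assumes that the page's Decode() function (whose definition is not shown in this thread) does plain URL-style percent-decoding; the helper name decode_script_text is hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original node.
# Assumption: Decode("...") on the page is simple percent-decoding.
sub decode_script_text {
    my ($text) = @_;
    # Capture the double-quoted argument of the Decode(...) call.
    my ($arg) = $text =~ /Decode\("([^"]*)"\)/ or return;
    # Replace every %XX escape with the character it encodes.
    (my $decoded = $arg) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $decoded;
}

# First script text from the output above.
print decode_script_text('Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%31")'), "\n";
# prints 119.253.61.121
```

In the XML::Twig loop, the same helper could be applied to $script_elt->text before printing.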

Replies are listed 'Best First'.
Re: Extract inline script from an XHTML with XML::Twig
by choroba (Canon) on Oct 30, 2013 at 14:09 UTC
    For completeness, the equivalent script in XML::XSH2, a wrapper around XML::LibXML:

    register-namespace xh http://www.w3.org/1999/xhtml ;
    open 1.xhtml ;
    for //xh:tr[@class="Odd"] {
        my $s = xh:td/xh:script ;
        if $s echo Script: $s ;
        my $o = xh:td[2] ;
        if $o echo Other: $o ;
    }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
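
The XSH2 script has to register a prefix for the XHTML namespace because XML::LibXML matches elements by namespace, whereas the XML::Twig version above can use plain names such as tr. A minimal sketch of the same query written directly against XML::LibXML follows. The helper name odd_rows and the shortened sample document are illustrative assumptions, not code from the thread.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: returns one [script text, second-cell text] pair
# for each <tr class="Odd"> row of an XHTML string.
sub odd_rows {
    my ($xhtml) = @_;
    # Do not try to fetch the external DTD named in a DOCTYPE.
    my $doc = XML::LibXML->load_xml(
        string       => $xhtml,
        load_ext_dtd => 0,
        no_network   => 1,
    );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    # The prefix "xh" is arbitrary; only the namespace URI has to match.
    $xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');

    my @rows;
    for my $tr ($xpc->findnodes(q{//xh:tr[@class='Odd']})) {
        my ($script) = $xpc->findnodes('xh:td[1]/xh:script', $tr);
        my ($td2)    = $xpc->findnodes('xh:td[2]', $tr);
        push @rows, [
            $script ? $script->textContent : undef,
            $td2    ? $td2->textContent    : undef,
        ];
    }
    return @rows;
}

# Shortened stand-in for the XHTML sample used in this thread.
my $sample = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr class="Odd"><td><script type="text/javascript">Decode("%35%38")</script></td><td>Other Data</td></tr>
<tr class="Even"><td><script type="text/javascript">Decode("%32%30")</script></td><td>Other Data</td></tr>
</table></body></html>
XHTML

for my $row (odd_rows($sample)) {
    print "Script: $row->[0]\n" if defined $row->[0];
    print "Other: $row->[1]\n"  if defined $row->[1];
}
```

Returning pairs instead of printing inside the loop keeps the query separate from the output format, so the same helper can feed either style of report shown in this thread.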
Re: Extract inline script from an XHTML with XML::Twig
by SagaraSouske (Acolyte) on Oct 30, 2013 at 12:38 UTC
    For everyone who has the same problem as me and uses HTML::TreeBuilder, here is the code that worked for me, thanks to ambrus and others:

    #!perl
    use strict;
    use HTML::TreeBuilder::XPath;
    use Data::Dumper;

    my $tree = HTML::TreeBuilder::XPath->new();
    $tree->store_comments(1);
    my $html = do { local $/; <DATA> };
    $tree->parse( $html );
    my @nodes = $tree->findnodes( qw( //tr[@class='Odd'] ) );
    for my $subtree ( @nodes ) {
        my($value) = $subtree->findnodes( qw( td[1]/script ) );
        my $script = join "", $value->content_list;
        my $other_data = $subtree->findvalue( qw( td[2] ) );
        printf "Value: %s\n", $script;
        printf "Other Data: %s\n", $other_data;
    }
    __DATA__
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <table border="0" cellpadding="4" cellspacing="0" class="DataGrid" width="1000px">
    <tr class="Odd"><td><script type="text/javascript">Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%31")</script></td><td>Other Data</td></tr>
    <tr class="Even"><td><script type="text/javascript">Decode("%32%30%33%2e%31%35%36%2e%32%30%37%2e%32%34%39")</script></td><td>Other Data</td></tr>
    <tr class="Odd"><td><script type="text/javascript">Decode("%32%32%32%2e%36%32%2e%32%30%37%2e%37%30")</script></td><td>Other Data</td></tr>
    <tr class="Even"><td><script type="text/javascript">Decode("%32%30%32%2e%31%31%32%2e%31%31%37%2e%39%34")</script></td><td>Other Data</td></tr>
    <tr class="Odd"><td><script type="text/javascript">Decode("%35%38%2e%32%30%2e%32%32%38%2e%32%32")</script></td><td>Other Data</td></tr>
    <tr class="Even"><td><script type="text/javascript">Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%30")</script></td><td>Other Data</td></tr>
    <tr class="Odd"><td><script type="text/javascript">Decode("%32%32%33%2e%38%37%2e%31%39%2e%35")</script></td><td>Other Data</td></tr>
    </table>
    </body>
    </html>

    Output:

    Value: Decode("%31%31%39%2e%32%35%33%2e%36%31%2e%31%32%31")
    Other Data: Other Data
    Value: Decode("%32%32%32%2e%36%32%2e%32%30%37%2e%37%30")
    Other Data: Other Data
    Value: Decode("%35%38%2e%32%30%2e%32%32%38%2e%32%32")
    Other Data: Other Data
    Value: Decode("%32%32%33%2e%38%37%2e%31%39%2e%35")
    Other Data: Other Data

Node Type: CUFP [id://1060350]
Approved by Athanasius
Front-paged by Arunbear