PerlMonks  

HTML::TreeBuilder::XPath returns things I don't need?

by szabgab (Priest)
on Oct 06, 2014 at 15:23 UTC ( [id://1102984]=perlquestion )

szabgab has asked for the wisdom of the Perl Monks concerning the following question:

use 5.010;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
my $html = <<'HTML';
<html>
<title>four</title>
<head>
<title>one</title>
</head>
<body>
<title>two</title>
</body>
<title>three</title>
</html>
HTML
$tree->parse($html);
say $tree->findvalue('/html/head/title');

I was expecting this to print 'one', but instead it printed 'fouronetwothree'. Am I misunderstanding what XPath is supposed to do?

Should I use some other module?

Replies are listed 'Best First'.
Re: HTML::TreeBuilder::XPath returns things I don't need?
by choroba (Cardinal) on Oct 06, 2014 at 15:38 UTC
    In HTML, the <title> tag is valid only inside <head>. The parser tries to fix the misplaced titles for you; try dumping $tree to see how.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      I see. I tried it inside the body with elements in different locations and it worked as I expected. Thanks.
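A minimal sketch of the "dump $tree" suggestion, reusing the HTML from the question. It prints the repaired tree and then lists the parent tag of every <title> element. The variable names (@titles, @parents) are illustrative only, and exactly how the parser relocates misplaced elements may vary between HTML::TreeBuilder versions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = <<'HTML';
<html>
<title>four</title>
<head>
<title>one</title>
</head>
<body>
<title>two</title>
</body>
<title>three</title>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete

# Print the tree as the parser rebuilt it, one element per line.
$tree->dump;

# For every <title> anywhere in the tree, record its text and parent tag.
my @titles  = map { $_->as_text }     $tree->findnodes('//title');
my @parents = map { $_->parent->tag } $tree->findnodes('//title');

print "$titles[$_] is a child of <$parents[$_]>\n" for 0 .. $#titles;

$tree->delete;    # free the tree explicitly
```

If every title reports the same parent, the XPath /html/head/title matches all of them, which would explain the concatenated output in the question.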
Re: HTML::TreeBuilder::XPath returns things I don't need?
by toolic (Bishop) on Oct 06, 2014 at 15:46 UTC
    XML::Twig (by the same module author) can give you what you want:
    use warnings;
    use strict;
    use XML::Twig;

    my $xml = <<XML;
    <html>
    <title>four</title>
    <head>
    <title>one</title>
    </head>
    <body>
    <title>two</title>
    </body>
    <title>three</title>
    </html>
    XML

    my $twig = XML::Twig->new(
        twig_handlers => {
            'html/head/title' => sub { print $_->text(), "\n" }
        },
    );
    $twig->parse($xml);

    __END__
    one
      Thanks
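As a variation on the XML::Twig approach above, here is a sketch that parses the whole document first and then navigates to the title, instead of using a handler. It assumes the input is well-formed XML, as it is in this thread; the variable names are illustrative.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<html>
<title>four</title>
<head>
<title>one</title>
</head>
<body>
<title>two</title>
</body>
<title>three</title>
</html>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);

# An XML parser keeps the elements where they were written,
# so <head> still contains only its own <title>.
my $head  = $twig->root->first_child('head');
my $title = $head->first_child_text('title');

print "$title\n";
```

Because XML::Twig does not apply HTML repair rules, the tree matches the source text, but it will reject real-world HTML that is not well-formed XML.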
Re: HTML::TreeBuilder::XPath returns things I don't need?
by Anonymous Monk on Oct 06, 2014 at 23:20 UTC

    Am I misunderstanding what XPath is supposed to do?

    Trees are trees :)

    htmltreexpather.pl

    $ perl htmltreexpather.pl junktitle.html _tag title
    HTML::Element=HASH(0xcded54)
    0.0.0
    four
    /html/head/title
    /html/head/title
    /html/head/title
    ------------------------------------------------------------------
    HTML::Element=HASH(0xcdea04)
    0.0.1
    one
    /html/head/title[2]
    /html/head/title[2]
    /html/head/title[2]
    ------------------------------------------------------------------
    HTML::Element=HASH(0xcde954)
    0.0.2
    two
    /html/head/title[3]
    /html/head/title[3]
    /html/head/title[3]
    ------------------------------------------------------------------
    HTML::Element=HASH(0xcde8e4)
    0.0.3
    three
    /html/head/title[4]
    /html/head/title[4]
    /html/head/title[4]
    ------------------------------------------------------------------
    ##################################################################
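The output above places 'one' at /html/head/title[2]. This is not the htmltreexpather.pl script, which is not shown in the thread. It is a rough sketch of how that positional path could be used with HTML::TreeBuilder::XPath, assuming the parser arranges the titles as in the output above. The variable name $second is illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = <<'HTML';
<html>
<title>four</title>
<head>
<title>one</title>
</head>
<body>
<title>two</title>
</body>
<title>three</title>
</html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # finish parsing before querying

# Positional predicate: only the second <title> under <head>.
my $second = $tree->findvalue('/html/head/title[2]');
print "$second\n";

# Show the tree address and text of each <title> under <head>.
for my $node ( $tree->findnodes('/html/head/title') ) {
    printf "%s %s\n", $node->address, $node->as_text;
}

$tree->delete;
```

Relying on a position is fragile, since it depends on how the parser reorders invalid markup. With a valid document that has a single <title> in <head>, the plain /html/head/title path is enough.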
