PerlMonks  

Namespace error while parsing html document.

by mr_p (Scribe)
on Jun 30, 2010 at 21:04 UTC ( #847404=perlquestion )
mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone,

I am still working on the RSS feed and have run into another problem. I get the error message below; the exact code follows it.

This is the error:

/tmp/file.html:2: namespace error : Namespace prefix xmlns of attribute fb is not defined
book.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema" xml:lang="en"
/tmp/file.html:2: namespace error : Namespace prefix xmlns of attribute og is not defined..........................

Please help. Thanks again, as always. You guys/gals are always there for me.

#!/usr/bin/perl -w
#use strict;
use warnings;
use warnings;
use XML::LibXML;
use LWP::UserAgent;

my $parser = XML::LibXML->new;
my $html_link = "http://www.marketwatch.com/story/brazil-mexico-stocks-face-quarterly-slides-2010-06-30?siteid=rss&rss=1";

$client = LWP::UserAgent->new();
my $capture = $client->get("$html_link", ":content_file" => "/tmp/file.html") || die "$!\n";

my $doc = $parser->parse_html_file("/tmp/file.html") || die "Error parsing: $!";

if (defined $doc ) { print "HTML Doc OK.\n";}
else { print "HTML Doc BAD.\n";}

Re: Namespace error while parsing html document.
by ikegami (Pope) on Jun 30, 2010 at 21:19 UTC

    You seem to have forgotten to post the HTML.

      I put code in to download the HTML: $client->get.

      Also, I was not able to pass $content->{_content} to parse_html_string(). Does anyone know why? Isn't $content->{_content} supposed to hold the downloaded file? I ended up reading the file instead.

        Just as a note - please don't expect us to fetch HTML on your behalf. If you include the first two or three lines of the HTML as test data within your script instead of requiring us to write it to a temporary file in a location that might not even exist on our systems, you make things much more self-contained and easier for us to replicate. And if it is easier for us to replicate, that makes it easier for us to help you.

        No, there's nothing like that in the docs.
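
A likely explanation for the empty content, assuming $content is the response object returned by get: when the :content_file option is given, LWP::UserAgent saves the body to that file instead of keeping it in the response. The sketch below shows the difference. It uses a made-up local file and a file:// URL so that it runs offline, and it reads the body through the public content and decoded_content accessors rather than the internal {_content} slot.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use LWP::UserAgent ();

# Hypothetical stand-in for the real page, so the sketch runs without a network.
my $dir    = tempdir(CLEANUP => 1);
my $source = "$dir/source.html";
open my $fh, '>', $source or die "Can't write $source: $!\n";
print {$fh} "<html><body><p>hello</p></body></html>\n";
close $fh or die "Can't close $source: $!\n";

my $url = "file://$source";
my $ua  = LWP::UserAgent->new;

# With :content_file the body goes to disk and the response body stays empty.
my $saved = $ua->get($url, ':content_file' => "$dir/copy.html");
$saved->is_success or die "Request failed: " . $saved->status_line . "\n";

# Without it, the body is kept in the response object.
my $kept = $ua->get($url);
$kept->is_success or die "Request failed: " . $kept->status_line . "\n";

printf "with :content_file:    %d bytes in response, %d bytes on disk\n",
    length($saved->content), (-s "$dir/copy.html");
printf "without :content_file: %d bytes in response\n",
    length($kept->decoded_content);
```

So either drop :content_file and pass decoded_content to the string parser, or keep it and parse the saved file.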
Re: Namespace error while parsing html document.
by ikegami (Pope) on Jun 30, 2010 at 21:44 UTC

    It's not HTML. HTML doesn't even have namespaces. It claims to be XHTML, so you should be using parse_file.
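
As a minimal sketch of that point, the following parses a small, made-up, well-formed XHTML fragment (not the MarketWatch page) with the XML parser. The og prefix and its URI are taken from the error message; everything else is an assumption for illustration. The XML parser treats xmlns:og as a namespace declaration instead of complaining about it.

```perl
use strict;
use warnings;
use XML::LibXML ();

# Made-up sample document; only the og namespace URI comes from the thread.
my $xhtml = <<'END';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:og="http://opengraphprotocol.org/schema"
      xml:lang="en">
  <head><title>Sample</title></head>
  <body><p>Hello</p></body>
</html>
END

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_string($xhtml);   # XML parser, not parse_html_string

# The prefix is now a real namespace binding on the root element.
my $root   = $doc->documentElement;
my $og_uri = $root->lookupNamespaceURI('og');
print "og => $og_uri\n";
```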

    Unfortunately, it's not well-formed XHTML. Lots of unescaped ampersands and less-than symbols, for starters.

    The following fixes the escaping errors, but there's still a bunch of duplicate ids and unmatched tags:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    use XML::LibXML    qw( );
    use LWP::UserAgent qw( );

    # Escape the characters that are special in XML text.
    sub text_to_xml {
        my ($text) = @_;
        $text =~ s/&/&amp;/g;
        $text =~ s/</&lt;/g;
        return $text;
    }

    my $url = "http://www.marketwatch.com/story/brazil-mexico-stocks-face-quarterly-slides-2010-06-30?siteid=rss&rss=1";

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get($url);
    $response->is_success()
        or die "Request failed: ". $response->status_line() ."\n";

    my $xhtml = $response->decoded_content(charset => "none");

    # Escape the contents of <script> elements.
    $xhtml =~ s{(<script[^>]*>)(.*?)(</script>)}{"$1".text_to_xml("$2")."$3"}sieg;

    # Escape ampersands that don't start an entity or character reference.
    $xhtml =~ s/&(?!#|[a-zA-Z0-9]{1,8};)/&amp;/g;

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_string($xhtml);


    There's a parser option that directs libxml to recover from errors if possible. It might help. If not, HTML::Parser is designed to handle garbage.
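
A minimal sketch of the recovery option, assuming a reasonably recent XML::LibXML where load_xml accepts recover. The input is a made-up fragment with an unescaped ampersand, not the real page. A recover value of 1 reports the errors as warnings, and 2 recovers silently. How much of a badly broken document survives depends on libxml2, so check the resulting tree before relying on it.

```perl
use strict;
use warnings;
use XML::LibXML ();

# Made-up input: the bare ampersand makes this not well-formed XML.
my $broken = '<root><p>Q&A</p></root>';

# Strict parsing (the default) throws an exception on this input.
my $strict_ok = eval { XML::LibXML->load_xml(string => $broken); 1 };
print "strict parse: ", ($strict_ok ? "succeeded" : "failed"), "\n";

# recover => 2 asks libxml2 to carry on past errors without printing them.
my $doc = XML::LibXML->load_xml(string => $broken, recover => 2);
print "recovered root element: ", $doc->documentElement->nodeName, "\n";
print $doc->toString, "\n";
```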

      Thanks for your help.

      I'm surprised the HTML is this bad. The browser renders it pretty well compared to LibXML.

        Browsers are very forgiving. LibXML expects XML or HTML. I suggested two options.

      Do you know why I was not able to pass $content->{_content} to parse_html_string()? It was blank. Isn't $content->{_content} supposed to hold the downloaded file?

      Also, the only reason I need to load the HTML page is that I want to parse the stylesheet links out of it. For instance, can I use a regexp to parse the string below? I would need to collect all three hrefs into an array.

      <?xml-stylesheet href="perl1.css" type="text/css"?>
      <link href="//www.perl.org/css/perl1.css" rel="stylesheet">
      <link href="/css/perl.css" rel="stylesheet">
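
For input as regular as the three lines above, a regexp can work. The sketch below is one possible pattern, not a general HTML parser: it assumes the href value is quoted, that the tag contains no ">" before the href, and it does not check rel="stylesheet". For arbitrary pages, a real parser such as the HTML::Parser suggested earlier in the thread is the safer route.

```perl
use strict;
use warnings;

# The sample markup from the question.
my $markup = <<'END';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//www.perl.org/css/perl1.css" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
END

# Match an xml-stylesheet processing instruction or a <link> tag,
# then capture the quoted value of its href attribute.
my @hrefs;
while ($markup =~ m{<(?:\?xml-stylesheet|link)\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1}gis) {
    push @hrefs, $2;
}

print "$_\n" for @hrefs;
```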
