
Namespace error while parsing html document.

by mr_p (Scribe)
on Jun 30, 2010 at 21:04 UTC ( #847404=perlquestion )
mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Everyone,

I am still working on an RSS feed and have come across another problem. I get the error message below, followed by the exact code.

This is the error:

    /tmp/file.html:2: namespace error : Namespace prefix xmlns of attribute fb is not defined
    " xmlns:og="" xml:lang="en"
    /tmp/file.html:2: namespace error : Namespace prefix xmlns of attribute og is not defined
    ..........................

Please help. Thanks again, as always. You guys/gals are always there for me.

    #!/usr/bin/perl -w
    #use strict;
    use warnings;
    use warnings;
    use XML::LibXML;
    use LWP::UserAgent;

    my $parser = XML::LibXML->new;
    my $html_link = " +-face-quarterly-slides-2010-06-30?siteid=rss&rss=1";
    $client = LWP::UserAgent->new();
    my $capture = $client->get("$html_link", ":content_file" => "/tmp/file.html") || die "$!\n";
    my $doc = $parser->parse_html_file("/tmp/file.html") || die "Error parsing: $!";
    if (defined $doc ) {
        print "HTML Doc OK.\n";
    }
    else {
        print "HTML Doc BAD.\n";
    }

Replies are listed 'Best First'.
Re: Namespace error while parsing html document.
by ikegami (Pope) on Jun 30, 2010 at 21:44 UTC

    It's not HTML. HTML doesn't even have namespaces. It claims to be XHTML, so you should be using parse_file.

    Unfortunately, it's not well-formed XHTML. Lots of unescaped ampersands and less-than symbols, for starters.

    The following fixes the escaping errors, but there's still a bunch of duplicate ids and unmatched tags:

        #!/usr/bin/perl -w
        use strict;
        use warnings;
        use XML::LibXML qw( );
        use LWP::UserAgent qw( );

        sub text_to_xml {
            my ($text) = @_;
            $text =~ s/&/&amp;/g;
            $text =~ s/</&lt;/g;
            return $text;
        }

        my $url = " +quarterly-slides-2010-06-30?siteid=rss&rss=1";
        my $ua = LWP::UserAgent->new();
        my $response = $ua->get($url);
        $response->is_success()
            or die "Request failed: ". $response->status_line() ."\n";
        my $xhtml = $response->decoded_content(charset => "none");
        $xhtml =~ s{(<script[^>]*>)(.*?)(</script>)}{"$1".text_to_xml("$2")."$3"}sieg;
        $xhtml =~ s/&(?!#|[a-zA-Z0-9]{1,8};)/&amp;/g;
        my $parser = XML::LibXML->new;
        my $doc = $parser->parse_string($xhtml);

    Remaining errors

    There's a parser option that directs libxml to recover from errors if possible. It might help. If not, HTML::Parser is designed to handle garbage.
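
The recovery option mentioned above can be sketched as follows. This is a minimal illustration, not the code from the thread: the markup in `$html` is invented for the example (it imitates the namespace attributes from the error message and adds an unescaped ampersand and an unclosed tag), and it uses the HTML parser via `load_html` rather than the XML parser.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup, invented for this sketch: namespace attributes the
# HTML parser complains about, an unescaped ampersand, an unclosed <p>.
my $html = <<'END_HTML';
<html xmlns:fb="http://example.com/fb" xmlns:og="" xml:lang="en">
<head><title>Quarterly slides</title></head>
<body><p>Profit & loss<div>unclosed paragraph</div></body>
</html>
END_HTML

# recover => 1 keeps parsing after errors; recover => 2 does the same and
# also suppresses the error messages.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# The recovered tree can be queried like any other document.
my $title = $doc->findvalue('//title');
print "Title: $title\n";
```

Recovery gives a best-effort tree, so it is worth checking that the nodes you care about survived before relying on the result.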

      Thanks for your help.

      I'm surprised the HTML is this bad. The browser renders it pretty well compared to LibXML.

        Browsers are very forgiving. LibXML expects XML or HTML. I suggested two options.

      Do you know why I was not able to pass $content->{_content} to parse_html_string()? It was blank. Isn't $content->{_content} supposed to hold the downloaded file?


      The only reason I need to load the HTML page is that I want to parse the stylesheet links out of it. For instance:

      Can I use a regexp to parse the string below? I would need to collect all three hrefs into an array.

      <?xml-stylesheet href="perl1.css" type="text/css"?> <link href="//" rel="stylesheet"> <link href="/css/perl.css" rel="stylesheet">
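
For a fixed, well-behaved string like the one in the question, a regex can collect the hrefs. The sketch below assumes double-quoted `href` values and only looks inside `xml-stylesheet` processing instructions and `link` tags; the variable names are illustrative. For arbitrary pages a real parser is the safer choice.

```perl
use strict;
use warnings;

# Sample input copied from the question, one tag per line for readability.
my $markup = <<'END_MARKUP';
<?xml-stylesheet href="perl1.css" type="text/css"?>
<link href="//" rel="stylesheet">
<link href="/css/perl.css" rel="stylesheet">
END_MARKUP

# In list context with /g, the match returns every captured href value.
my @hrefs = $markup =~ m{
    <(?:\?xml-stylesheet|link)\b    # start of the PI or of a link tag
    [^>]*?                          # other attributes, without leaving the tag
    \bhref\s*=\s*"([^"]*)"          # double-quoted href value
}gxi;

print "$_\n" for @hrefs;
```

Single-quoted or unquoted attribute values are not matched by this pattern and would need an extra alternative.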
Re: Namespace error while parsing html document.
by ikegami (Pope) on Jun 30, 2010 at 21:19 UTC

    You seem to have forgotten to post the HTML.

      I put code in to download the HTML: $client->get.


      I was not able to pass $content->{_content} to parse_html_string(). Does anyone know why? Isn't $content->{_content} supposed to hold the downloaded file? I ended up reading the file.
        No, there's nothing like that in the docs.
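
The `_content` key is an internal detail of the response object; the documented accessors are `content` and `decoded_content`. The blank value is also likely explained by the `:content_file` option used in the question, which saves the body to the named file instead of keeping it in the response object. A minimal sketch of the documented accessors follows; the response is built by hand (a hypothetical stand-in for what `$ua->get($url)` returns) so that it runs without a network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response, constructed locally for illustration only.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><head><title>Example</title></head><body></body></html>',
);

# content() returns the raw body; decoded_content() additionally undoes any
# Content-Encoding and decodes the declared charset.
my $raw  = $response->content;
my $text = $response->decoded_content;

print "Got ", length($text), " characters\n" if $response->is_success;
```

The string in `$text` can then be handed to parse_html_string() directly, with no temporary file involved.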

        Just as a note - please don't expect us to fetch HTML on your behalf. If you include the first two or three lines of the HTML as test data within your script instead of requiring us to write it to a temporary file in a location that might not even exist on our systems, you make things much more self-contained and easier for us to replicate. And if it is easier for us to replicate, that makes it easier for us to help you.
