PerlMonks  

Extracting stylesheet links or url from HTML Page

by mr_p (Scribe)
on Jun 23, 2010 at 21:18 UTC ( [id://846178]=perlquestion )

mr_p has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have been trying to extract only the 'stylesheet' links from an HTML page and have not been able to do so.

Example:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Header</title>
      <link href="//www.perl.org/perl.css" rel="stylesheet">
      <link href="/css/perl1.css" rel="stylesheet">
      <link href="/perl/cpan/index.html" rel="index_page">
    </head>
    <body></body>
    </html>

I would like to extract only the href from links with rel=stylesheet. The match is case-insensitive: I can have Rel=StyleSheet too.

I have tried SimpleLinkExtor, TreeBuilder, Text::Balanced and LinkExtor. I do not wish to use TreeBuilder because it is too expensive.

Please help. I have spent 2 days on this.

Replies are listed 'Best First'.
Re: Extracting stylesheet links or url from HTML Page
by Your Mother (Archbishop) on Jun 23, 2010 at 22:11 UTC

    This may do. Lightly tested. XML::LibXML seems to normalize attribute names to lowercase. (update: cleaned up loop a little.)

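    use warnings;
    use strict;
    use XML::LibXML;

    my $parser = XML::LibXML->new;
    $parser->recover_silently(1);    # tolerate malformed HTML without warnings

    # The HTML file name comes from the command line
    my $doc = $parser->parse_html_file( shift || die "give an HTML file\n" );

    # Visit every <link> that has a rel attribute and compare
    # its value case-insensitively
    for my $link ( $doc->findnodes('//link[@rel]') ) {
        next unless lc( $link->getAttribute("rel") ) eq "stylesheet";
        print $link->getAttribute("href"), $/;
    }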
      The 'stylesheet' constraint could have been put straight into the XPath, though this form matches only the exact lowercase value:
      $doc->findnodes( q(//link[@rel='stylesheet']) )
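      Since the question asks for a case-insensitive match, one possible way to keep the test inside the XPath is the XPath 1.0 translate() function, which can fold the attribute value to lowercase before comparing. The sketch below is only an illustration: the inline sample document (with a mixed-case Rel=StyleSheet added) and the variable names are assumptions, not part of the replies above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document (assumed for illustration); the second link uses mixed case.
my $html = <<'HTML';
<!DOCTYPE html>
<html><head>
<title>Header</title>
<link href="//www.perl.org/perl.css" rel="stylesheet">
<link href="/css/perl1.css" Rel="StyleSheet">
<link href="/perl/cpan/index.html" rel="index_page">
</head><body></body></html>
HTML

# recover => 2 parses lenient HTML without printing warnings
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# translate() maps A-Z to a-z, so the comparison ignores case
my $xpath = q{//link[translate(@rel,}
          . q{ 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',}
          . q{ 'abcdefghijklmnopqrstuvwxyz') = 'stylesheet']/@href};

my @hrefs = map { $_->value } $doc->findnodes($xpath);
print "$_\n" for @hrefs;
```

      This keeps the filtering in one expression, at the cost of a longer XPath; the lc() loop above does the same job and may be easier to read.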
Re: Extracting stylesheet links or url from HTML Page
by ww (Archbishop) on Jun 23, 2010 at 22:39 UTC
    You do have an answer (above) which may serve... but for best results, post the code you've written and the error messages it produces (or describe the failure with great specificity).

    And, while it's NOT customarily advisable, this is one that could be tackled with a simple regex: search from the start of the .html file to </head> for rel="stylesheet" or even for .css and, when found, use a capturing regex to get the (relative) link. Then, of course, for that to be much use, you'll need to concat the site address...

    And just BTW, that's not guaranteed to produce a valid link to the stylesheet (identifying the reason is left as an exercise for the pupil). OTOH, since I'm hard pressed to find a practical use for your desired code, (unless you collect style sheets the way NodeReaper collects toe tags) perhaps this is homework... which should also be disclosed when posting here.

    Update: s/and way/the way/ in last para.
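    The regex approach described in this reply might look roughly like the sketch below. It is an illustration under assumptions, not ww's code: the helper name stylesheet_hrefs and the sample input are hypothetical, it only handles quoted href values, and it returns the links exactly as written in the page without resolving them against the site address.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect the href values of
# <link> tags before </head> whose rel value is "stylesheet".
sub stylesheet_hrefs {
    my ($html) = @_;

    # Restrict the search to everything before </head>, if present.
    my ($head) = $html =~ m{\A(.*?)</head\s*>}is;
    $head = $html unless defined $head;

    my @hrefs;
    while ( $head =~ m{<link\b([^>]*)>}gi ) {
        my $attrs = $1;

        # rel value, quoted or bare, compared case-insensitively
        next unless $attrs =~ m{(?<![\w-])rel\s*=\s*(["']?)stylesheet\1(?![\w-])}i;

        # href value; only quoted values are handled in this sketch
        if ( $attrs =~ m{(?<![\w-])href\s*=\s*(["'])(.*?)\1}is ) {
            push @hrefs, $2;
        }
    }
    return @hrefs;
}

# Sample input (assumed for illustration)
my $html = <<'HTML';
<html><head>
<link href="//www.perl.org/perl.css" rel="stylesheet">
<LINK Rel=StyleSheet HREF='/css/perl1.css'>
<link href="/perl/cpan/index.html" rel="index_page">
</head><body></body></html>
HTML

print "$_\n" for stylesheet_hrefs($html);
```

    A regex like this can be fooled by comments, scripts, or unusual attribute syntax, which is one reason a real parser is usually the safer choice.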

Re: Extracting stylesheet links or url from HTML Page
by ambrus (Abbot) on Jun 24, 2010 at 10:29 UTC

    I recommend using the XML::Twig module. This uses HTML::Tree under the hood to parse the HTML, and you say that's expensive, but I don't know what you mean by that: is it slow, does it use too much memory, or is it hard to install?

    Here's an example script. Run it with the filename of the html as an argument.

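    use warnings;
    use strict;
    use XML::Twig;

    my $tw = XML::Twig->new;
    $tw->parsefile_html($ARGV[0]);    # parse the HTML file named on the command line

    for my $t ($tw->findnodes("//link")) {
        my $rel = $t->att("rel");
        # Skip <link> elements without a rel attribute (avoids an
        # "uninitialized value" warning); compare case-insensitively
        if (defined $rel && "stylesheet" eq lc($rel)) {
            warn "found stylesheet: ", $t->att("href");
        }
    }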

    Note that HTML::Tree already lowercases the element and attribute names for you (because this is HTML, not XML), but it does not lowercase the attribute value, so you have to do that yourself.

      I am concerned about speed.

      I was also able to parse the page using HTML::TokeParser::Simple, so now I have three ways of doing this. Can you please guide me to the best one for me, based on speed, reliability (updates), etc.?

      1. use HTML::TokeParser::Simple;
      2. use XML::Twig;
      3. use XML::LibXML;

      Thanks.

        Not really!

        The specific criteria present enough of a challenge; the "etc." is the skunk in the woodpile. Unless we know which elements of et cetera you really care about, we can't choose for you.

        That said, another valid answer is "Pick one, and if it works for you, it's probably right for you... at least for now."

        You can't profile, test, or (uh-oh; here it comes: ) etc. until you've written the code.
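        Once the candidate extractors are written, the core Benchmark module is one way to compare their speed. The harness below is only a sketch under assumptions: the two candidates are hypothetical regex stand-ins so that the example runs without extra modules, and in practice they would be replaced by wrappers around the HTML::TokeParser::Simple, XML::Twig and XML::LibXML versions. The iteration count is arbitrary.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Sample input (assumed for illustration)
my $html = <<'HTML';
<html><head>
<link href="//www.perl.org/perl.css" rel="stylesheet">
<link href="/css/perl1.css" rel="stylesheet">
<link href="/perl/cpan/index.html" rel="index_page">
</head><body></body></html>
HTML

# Hypothetical stand-ins; swap in the real parser-based extractors here.
my %candidates = (
    # one global regex over the whole document
    one_regex => sub {
        my @hrefs = $html =~ m{<link\b[^>]*?\bhref\s*=\s*"([^"]*)"}gi;
        return scalar @hrefs;
    },
    # find each <link ...> tag first, then pick out its href
    two_step => sub {
        my @hrefs;
        for my $tag ( $html =~ m{(<link\b[^>]*>)}gi ) {
            push @hrefs, $1 if $tag =~ m{\bhref\s*=\s*"([^"]*)"}i;
        }
        return scalar @hrefs;
    },
);

# Run each candidate 20,000 times without per-test output,
# then print a comparison table of the rates.
my $results = timethese( 20_000, \%candidates, 'none' );
cmpthese($results);
```

        Timings on a tiny sample say little about real pages, so it is worth benchmarking against the kind of documents you actually expect to process.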

      I am getting the following error using XML::Twig:

      Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/XML/Twig.pm line 731
