PerlMonks  

Best way to parse/evaluate HTML page contents for apparent image size

by Anonymous Monk
on Dec 14, 2012 at 00:49 UTC (#1008751)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

TL;DR:

What libraries do the PMs suggest for determining the size at which a graphical web browser would actually render an image? I'm aware of Selenium and WWW::Selenium, but that is the only option I know of. Are there better, pure-Perl ways to do this, or alternative libraries?

More Background:

I'm trying to extract images from HTML pages to build a "zeitgeist" of what's happening on a subset of very specific web pages that are not built by me. For the most part, it's working well enough just using the CPAN module HTML::TokeParser and determining image sizes either from the actual image dimensions or from html attributes.

However, my approach suffers from a couple of drawbacks: some webmasters use huge images and scale them down in the browser, and others either don't specify width/height attributes or provide only one of them, which makes it hard to know what the rendered size actually is.
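
For the "only one attribute" case, a minimal sketch of the arithmetic, assuming plain pixel attribute values and that the browser preserves the aspect ratio of the image file. The helper name estimate_rendered_size is hypothetical, and CSS and percentage values are ignored entirely, so this is only an approximation of what a real browser does:

```perl
use strict;
use warnings;

# Hypothetical helper: estimate the rendered size of an <img> in pixels from
# its intrinsic (file) dimensions and whatever width/height attributes exist.
# Assumption: attributes are plain pixel counts; CSS and percentages are ignored.
sub estimate_rendered_size {
    my ($intrinsic_w, $intrinsic_h, $attr_w, $attr_h) = @_;

    # Both attributes given: take them as-is.
    return ($attr_w, $attr_h) if defined $attr_w && defined $attr_h;

    # Only one given: scale the other to keep the aspect ratio (rounded).
    if (defined $attr_w) {
        return ($attr_w, int($attr_w * $intrinsic_h / $intrinsic_w + 0.5));
    }
    if (defined $attr_h) {
        return (int($attr_h * $intrinsic_w / $intrinsic_h + 0.5), $attr_h);
    }

    # Neither given: fall back to the image's own dimensions.
    return ($intrinsic_w, $intrinsic_h);
}

# A 1600x1200 file with only width="400" specified.
my ($w, $h) = estimate_rendered_size(1600, 1200, 400, undef);
print "${w}x${h}\n";    # prints 400x300
```

The attribute values could come from the HTML::TokeParser pass already in use; anything set through stylesheets would still need a real rendering engine.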

If I can avoid having to call an external graphical browser and stay "pure perl" that would be ideal. Lacking that, I could consider WWW::Selenium, but am hoping for the broader wisdom of the PerlMonks to see if there are any alternatives to it that are recommended.

TIA

Re: Best way to parse/evaluate HTML page contents for apparent image size
by GrandFather (Cardinal) on Dec 14, 2012 at 01:07 UTC

    What about those "web masters" who specify image size in terms of percent (most likely just providing width) so the pages resize nicely without nasty scroll bars appearing?

    And assuming you manage to work out how to handle that, does it matter that Internet Explorer ignores suggested image size and generates the nasty scroll bars in any case?

    True laziness is hard work
Re: Best way to parse/evaluate HTML page contents for apparent image size
by sundialsvc4 (Monsignor) on Dec 14, 2012 at 04:22 UTC

    Unfortunately, I find your question to be ... just too vague.   “The size of an image that would actually render” might be specified absolutely in the HTML, e.g. width="720px", or it might be relative, "50%", or it might not be specified at all, in which case it would depend on whatever the size of the browser window happened to be at the time.   That is to say, the actual image size is very much contextual, and in light of that, could you be a little bit more explicit as to what you need here?   (Sure, there are terrific HTML parsers available which can, with great precision, tell you precisely what the HTML says.   But that’s only part of the question.)
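
    The three cases above (absolute, relative, unspecified) can be sketched roughly as follows. This is only an illustration: resolve_width is a hypothetical helper, the container width is an assumed nominal value rather than anything a layout engine computed, and treating an unspecified width as the image's intrinsic width is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a width attribute value into pixels.
# $container_w is an ASSUMED available width (e.g. a nominal viewport);
# a real browser derives it from layout, which this sketch does not model.
sub resolve_width {
    my ($attr, $intrinsic_w, $container_w) = @_;

    # Not specified: assume the image's own width.
    return $intrinsic_w unless defined $attr && length $attr;

    # Relative, e.g. "50%": a fraction of the assumed container width.
    if ($attr =~ /^\s*(\d+(?:\.\d+)?)\s*%\s*$/) {
        return int($container_w * $1 / 100 + 0.5);
    }

    # Absolute, e.g. "720" or "720px": take the leading digits.
    return $1 if $attr =~ /^\s*(\d+)/;

    # Anything else is not handled here; fall back to the intrinsic width.
    return $intrinsic_w;
}

# A 1600-pixel-wide file inside an assumed 1024-pixel-wide container.
for my $attr ('720px', '50%', undef) {
    my $shown = defined $attr ? $attr : '(none)';
    printf "%-6s -> %d\n", $shown, resolve_width($attr, 1600, 1024);
}
```

    The result is only as good as the assumed container width, which is exactly the contextual part that a parser alone cannot supply.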

      With all due respect, the question isn't about various HTML development practices. The question is: could someone who has used Perl to automate graphical web browsers (or their underlying rendering engines) recommend a library for working with rendered image sizes in a given browser context? Searching CPAN for rendering engines turns up, for example, Mozilla::Mechanize, WWW::Mechanize, and WWW::WebKit. I'm guessing that if you have never used one of these, or something like them that I'm unaware of, you'll be of limited help in making a recommendation.

        And my sincere opinion here is that many factors influence the effective size of an image on the browser screen. (And yes, I hadn’t even considered CSS.) Sure, all these tools exist for getting to the HTML, etc., but that’s only the tip of the question being posed, as I read it. There are so many potentially influencing variables here, apart from the HTML itself, that the question is at best difficult to answer. I found the question as stated to be vague, in that it was difficult to ascertain which factors were intended to be in bounds. Getting to the HTML and/or CSS elements in Perl, fortunately, is a technical matter that has been very well solved by CPAN.

Re: Best way to parse/evaluate HTML page contents for apparent image size
by Anonymous Monk on Dec 14, 2012 at 07:31 UTC

    But you forgot about CSS!

    There is only one way, just do it!
