BMaximus has asked for the wisdom of the Perl Monks concerning the following question:

I need to write a module that fetches a specified web page or site while posing as a random browser client, so that if someone tries to be sneaky and use cloaking, the server still coughs up the cloaked page. The module will then parse the HTML, find any JavaScript, and parse that to see whether the page produces any popups. I thought it would be best to use WWW::Mechanize to grab the pages. HTML::Parse is marked as deprecated, so I'm not sure whether there is something better for parsing the HTML. Is there a module that parses JavaScript, or would I have to create one with Parse::RecDescent? I've looked all through CPAN and found nothing that parses JavaScript. Any suggestions on how to go about this would be great.


Re: PopUp Detection
by tachyon (Chancellor) on Aug 04, 2003 at 13:54 UTC

    In short, you can't get Perl to pretend to be a 'real' web browser, i.e. IE/Mozilla/NS/Opera. You can fake all the behaviour except the JavaScript/DOM redirection part. You can fake some JavaScript support, but to do it all you need the DOM.

    Here is a list of some of the issues you will need to deal with to get the 'real' pages.

    1. Use LWP::UserAgent to get the pages. It works in vanilla form for > 90% of pages.
    2. Add a random agent string so LWP pretends to be IE 5/5.5/6. The easiest way to get them is to grep your Apache access logs. There are also plenty of lists on the net.
    3. Add in support for meta-refresh redirects (there are about 6 different 'valid' syntaxes - where valid means that browsers accept them)
    4. Add in frames support (vital)
    5. Add in cookie support as this is often tested for.
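
    A minimal sketch of steps 1, 2, 3 and 5, not a complete implementation. The agent strings, the function names make_browser and meta_refresh_url, and the 30 second timeout are illustrative assumptions. The refresh parser handles only the quoted content="N; url=..." form, which is one of the several syntaxes mentioned above, and frames (step 4) are not covered.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Illustrative agent strings (assumed); in practice harvest real ones
# from your own access logs.
my @AGENTS = (
    'Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)',
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);

# Build a user agent with a randomly chosen agent string and an
# in-memory cookie jar, so cookies set by the server are sent back.
sub make_browser {
    return LWP::UserAgent->new(
        agent      => $AGENTS[ rand @AGENTS ],
        cookie_jar => HTTP::Cookies->new,
        timeout    => 30,
    );
}

# Return the target URL of the first meta-refresh tag, or nothing.
# Only the quoted content="N; url=..." form is recognised here.
sub meta_refresh_url {
    my ($html) = @_;
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        next unless $attrs =~ /http-equiv\s*=\s*["']?refresh\b/i;
        next unless $attrs =~ /content\s*=\s*(["'])(.*?)\1/is;
        my $content = $2;
        return $1 if $content =~ /url\s*=\s*['"]?\s*([^'"\s;]+)/i;
    }
    return;
}

# Usage (performs a network request, so it is left commented out):
# my $ua  = make_browser();
# my $res = $ua->get('http://example.com/');
# if ( $res->is_success ) {
#     my $next = meta_refresh_url( $res->decoded_content );
#     print "refresh to $next\n" if defined $next;
# }
```

    A relative refresh target would still need resolving against the page URL before it is fetched.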

    Once you have done all that, the only 'rejects/cloaking' you will get will involve JavaScript redirects. There are numerous variations: window.location = blah, window.location(blah), location.href = blah, location.replace(blah), etc.

    Some of these you can parse and follow. Some you can't, as they concatenate bits of the DOM into the redirect string.
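
    As a rough illustration of the parseable cases, the sketch below pulls out redirect targets that are plain string literals and skips anything built by concatenation. The function name js_redirect_targets and the exact set of spellings matched are assumptions chosen for the example, not an exhaustive list.

```perl
use strict;
use warnings;

# Return redirect targets that appear as plain string literals in a
# chunk of JavaScript. Targets built with "+" are deliberately skipped,
# since resolving them would need the DOM.
sub js_redirect_targets {
    my ($js) = @_;
    my @targets;

    # window.location, top.location, location.href and similar spellings
    my $loc = qr/(?:(?:window|document|self|top|parent)\s*\.\s*)?location(?:\s*\.\s*href)?/;

    # Assignment form: location = "url";
    # The literal must be the whole right-hand side.
    while ( $js =~ /\b$loc\s*=\s*(["'])([^"']*)\1\s*(?:;|\n|$)/g ) {
        push @targets, $2;
    }

    # Call form: location.replace("url"), location.assign("url"),
    # and the IE-style window.location("url")
    while ( $js =~ /\b$loc(?:\s*\.\s*(?:replace|assign))?\s*\(\s*(["'])([^"']*)\1\s*\)/g ) {
        push @targets, $2;
    }

    return @targets;
}
```

    Assignment-form targets are returned first, then call-form targets, so the order does not necessarily match the order in the script.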

    When it comes to parsing the HTML, HTML::Parser will cough up the JavaScript either in the comments or in the text (depending on how it is wrapped), so it is suboptimal. If you are only interested in popups, you are basically looking for a few specific strings. You can parse these out reasonably reliably with REs.
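
    A hedged sketch of the regex approach. The post does not name the strings to search for, so window.open, showModalDialog and showModelessDialog are assumptions here, as are the function names script_chunks and has_popup. Script bodies and inline event handlers are extracted with plain regexes rather than a real HTML parser.

```perl
use strict;
use warnings;

# Assumed popup-producing calls; extend the list as needed.
# JavaScript is case sensitive, so no /i flag.
my @POPUP_PATTERNS = (
    qr/\bwindow\s*\.\s*open\s*\(/,
    qr/\bshowModalDialog\s*\(/,
    qr/\bshowModelessDialog\s*\(/,
);

# Collect the places where JavaScript can live in a page:
# <script> bodies and inline on* event handler attributes.
sub script_chunks {
    my ($html) = @_;
    my @chunks;
    push @chunks, $1 while $html =~ m{<script\b[^>]*>(.*?)</script\s*>}gis;
    push @chunks, $2 while $html =~ m{\bon\w+\s*=\s*(["'])(.*?)\1}gis;
    return @chunks;
}

# True if any script chunk contains one of the popup calls.
sub has_popup {
    my ($html) = @_;
    for my $chunk ( script_chunks($html) ) {
        for my $pattern (@POPUP_PATTERNS) {
            return 1 if $chunk =~ $pattern;
        }
    }
    return 0;
}
```

    This only sees calls spelled out literally in the page. External script files and calls assembled at runtime are not detected.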

    We implemented all of the above on a current project, but eventually ended up hacking IE so that it is a headless, windowless slave that goes and does our bidding. The nice part of that solution is that it really is IE doing the fetching, so no one can tell it isn't IE. IE parses the HTML, sets up the DOM, runs the JavaScript, etc. We just gather up the HTML data from the parent and any child windows. You can hack Mozilla in a similar fashion.




Re: PopUp Detection
by valdez (Monsignor) on Aug 04, 2003 at 12:52 UTC
Re: PopUp Detection
by jeffa (Bishop) on Aug 04, 2003 at 14:01 UTC