
"Web Automation" -- your input is greatly desired!

by Dice (Beadle)
on May 05, 2003 at 20:16 UTC ( #255739=perlquestion )

Dice has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks!

I'm working on a book project now where the topic is web automation -- that is, how to write programs that act as web clients, and what you can do with them. It seems to me like there are a few major categories along these lines:

  • Spiders
  • automated testing of web applications ("regression testing")
  • Screen scraping (to and from the web)

I would love the input of people on the list both in terms of either coming up with new top-level categories and also in "filling in the details" of these ones: descriptions of sub topics of the above, "case studies" or examples of this kind of thing that you've come across in your work, etc.

- Richard


Replies are listed 'Best First'.
Re: "Web Automation" -- your input is greatly desired!
by Aristotle (Chancellor) on May 05, 2003 at 20:26 UTC
    I've written a screen scraper to download and then remove private messages from an ikonBoard account, and one each for the message archive and the file archive of Yahoo Groups. The former was done with plain LWP and judicious use of HTML::LinkExtor; the latter are WWW::Mechanize apps. Even so, I don't really have much to say about the issue: they were all completely cookie-cutter, entirely uneventful jobs with the aid of Perl and CPAN. If WWW::Mechanize::Shell had worked for my Yahoo tasks, the respective scripts wouldn't even have required any manual "reverse engineering" of pages, and so would have been downright boringly straightforward work.
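    For reference, the plain-LWP-plus-HTML::LinkExtor pattern mentioned here looks roughly like this. A minimal sketch: the page content below is invented, standing in for something the real scripts would first fetch with LWP.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Stand-in for a page fetched with LWP::UserAgent or LWP::Simple.
my $html = <<'HTML';
<html><body>
<a href="/inbox?msg=1">Message 1</a>
<a href="/inbox?msg=2">Message 2</a>
<img src="/logo.png">
</body></html>
HTML

# HTML::LinkExtor calls the callback once per link-bearing tag.
my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, $attr{href} if $tag eq 'a';   # keep anchor hrefs only
});
$extor->parse($html);
$extor->eof;

print "$_\n" for @links;
```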

    Makeshifts last the longest.

Re: "Web Automation" -- your input is greatly desired!
by newrisedesigns (Curate) on May 05, 2003 at 20:28 UTC

    Under Spiders, don't forget those that use LWP and such to read newsfeeds or straight HTML to keep themselves informed. jcwren has a lot of tools for checking your XP here on PerlMonks.

    This might be more of a research topic, but I've been finding that a lot of websites seem to fake the referrer as some sort of secret ad to the webmaster: I keep getting one or two hits from different sites, but I never find anything that links to my site.

      Just a thought:
    • Spiders
      • News gatherers
      • Broken link finders
      • Bad Spiders that Overindex/Look for Holes
    • Automation
      • Testing CGI scripts
      • Checking for updated content
      • others...
    • Scraping
      • Using Perl and the HTML:: Modules
      • Making Clean HTML (easy to parse/scrape)
      • Using data-oriented methods (XML/RSS)

    This list is nowhere near complete. If I'm off on something, post a reply and set me straight.
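    The "data-oriented methods" branch of that outline is arguably the easiest case: if a site offers RSS, there is no scraping at all. A small sketch using XML::RSS, with an invented feed standing in for one fetched over HTTP:

```perl
use strict;
use warnings;
use XML::RSS;

# An invented RSS 2.0 feed; a real script would fetch this with LWP.
my $feed = <<'RSS';
<?xml version="1.0"?>
<rss version="2.0">
<channel>
  <title>Example Feed</title>
  <item><title>First post</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel>
</rss>
RSS

my $rss = XML::RSS->new;
$rss->parse($feed);

# Each item is a hashref with title, link, description, etc.
print "$_->{title}: $_->{link}\n" for @{ $rss->{items} };
```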

    John J Reiser

Re: "Web Automation" -- your input is greatly desired!
by kvale (Monsignor) on May 05, 2003 at 20:21 UTC
Re: "Web Automation" -- your input is greatly desired!
by jgallagher (Pilgrim) on May 05, 2003 at 20:22 UTC
Re: "Web Automation" -- your input is greatly desired!
by Cody Pendant (Prior) on May 05, 2003 at 23:05 UTC
    One project I undertook (and found interesting) in this area was a script using LWP that rendered my friends' blogs in a Palm-compatible format.

    It was rather clunky, but essentially it hit their websites and fetched their blog front pages, then used regexes to reduce the complexity of the HTML, removing tables and so on, re-writing their blogs into pages on my website.

    Of course there are Palm browsers (I use AvantGo) which will attempt to reformat and render HTML for you, but I found it easier to run the reformatting script, getting any error messages along the way, just before a sync. And of course I could tweak the reformatting by adjusting a list of allowed and disallowed tags.
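    The allowed/disallowed-tag reformatting described above can be sketched with regexes, with the usual caveat that regexes on arbitrary HTML are fragile (as the post itself admits). The tag list and input here are invented:

```perl
use strict;
use warnings;

# Crude reduction of a page for a small screen: drop table markup,
# then strip any tag outside a small allow-list.
my %allowed = map { $_ => 1 } qw(a p br b i);

my $html = '<table><tr><td><p>Hello <b>world</b></p></td></tr></table>';

# Remove table structure entirely.
$html =~ s{</?(table|tr|td|th)[^>]*>}{}gi;

# Keep a tag only if its name is on the allow-list.
$html =~ s{</?(\w+)[^>]*>}{ $allowed{lc $1} ? $& : '' }gie;

print $html, "\n";
```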

    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
Re: "Web Automation" -- your input is greatly desired!
by BrowserUk (Pope) on May 05, 2003 at 21:10 UTC

    When my brother was looking for work early last year, we wrote a small application (using Java :( , I hadn't discovered Perl back then) that would trawl a list of company websites (supplied from a file), look for their "Positions vacant", "Jobs", "Personnel required", "Work for us" pages, and download any new ones it found for later consideration. It was pretty crude, but it saved him some time going through them manually.

    One thing I can tell you, it would have been a whole heap easier to write in Perl.
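    For the record, the Perl version would indeed be short. A sketch with WWW::Mechanize, untested against real sites; the site list and search phrases are placeholders:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical list of company front pages (the real one came from a file).
my @sites = qw(http://example.com http://example.org);

# Phrases that suggest a vacancies page, per the description above.
my $jobs_re = qr/jobs|vacan|careers|work for us|personnel/i;

my $mech = WWW::Mechanize->new(autocheck => 0);
for my $site (@sites) {
    $mech->get($site);
    next unless $mech->success;
    # find_all_links matches each link's visible text against the regex.
    for my $link ($mech->find_all_links(text_regex => $jobs_re)) {
        printf "%s -> %s\n", $site, $link->url_abs;
    }
}
```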

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: "Web Automation" -- your input is greatly desired!
by LameNerd (Hermit) on May 05, 2003 at 20:22 UTC
    Well, where I work we use LWP scripts to upload and download flat files.
    If you're writing a book, are you going to give credit to the PerlMonks in it?
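    The flat-file shuttling described above is one of the most common LWP jobs; a sketch with invented endpoints and filenames:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new(timeout => 30);

# Download a flat file straight to disk.
my $res = $ua->get('http://example.com/export/prices.csv',
                   ':content_file' => 'prices.csv');
die "download failed: ", $res->status_line unless $res->is_success;

# Upload one back as a form-style file upload.
$res = $ua->request(POST 'http://example.com/import',
                    Content_Type => 'form-data',
                    Content      => [ file => ['prices.csv'] ]);
die "upload failed: ", $res->status_line unless $res->is_success;
```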
Re: "Web Automation" -- your input is greatly desired!
by Abigail-II (Bishop) on May 05, 2003 at 23:38 UTC
    I'm a bit confused. You post this in "Seekers of Perl wisdom", but I fail to see anything vaguely Perl related to your question. Or do you honestly think Python, Java and C programmers will use totally different categories of web clients?



      I'm sorry, I should have specified my question further. The book is meant to be a Perl programming book, and I'm particularly interested in solutions that other people have obtained in Perl. This means that, among other things, I'm looking for suggestions of (e.g.) CPAN modules that might help out along these lines. A few are pretty obvious -- LWP and WWW::Mechanize come to mind. But there are no doubt goodies lurking in CPAN that could be helpful but that I would not know about. Also, there might be creative or novel uses of other modules out there that could apply to this problem domain (e.g. HTML::* modules for the construction of a tree that can be traversed, depending on node contents).
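      As an example of the HTML::* tree idea: HTML::TreeBuilder builds exactly such a traversable tree, and look_down selects nodes by tag and attributes. The snippet below is invented for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '<html><body><p class="price">42</p><p>other</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down walks the tree for the first node matching the criteria.
my $node = $tree->look_down(_tag => 'p', class => 'price');
print $node->as_text, "\n";

$tree->delete;   # break the tree's circular references
```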

      You are correct that programmers of other languages aren't likely to have different classes of problems in this domain than Perl programmers. But I am interested in getting a wider view of the domain, especially as Perl programmers have approached it, including specific examples of Perl technology applied to those problems.


Re: "Web Automation" -- your input is greatly desired!
by tomhukins (Curate) on May 06, 2003 at 19:42 UTC

    I wrote a script last year to check a database of around a thousand external links: simple stuff using DBI and LWP. Each week, the script looks for problems with these sites and mails the database maintainers with any problems it encounters.

    We decided to implement a simple check initially, but we discussed possible future ideas and we've also come up with more based on our experience:

    • Differentiate between different types of error (failed DNS lookup, server error, page not found or removed, permanent redirection). Maybe re-test links with temporary failures after a few hours.
    • Record in the database when the link last worked.
    • Allow maintainers to flag links as not working, and instead of reporting failure for such links, report when they succeed. Users searching the database should not see such links in response to their queries.
    • Use Net::Whois to detect changes in domain ownership and notify us in advance if a domain is about to expire. Certain unethical business people like to register newly expired domains and replace the content with things we don't want to link to.
    • Just because a site returns an HTTP success code, that doesn't mean everything works fine. At present, maintainers check the links manually every now and again. We don't want to alert the maintainers every time a page changes, especially for dynamic content, but we might come up with a useful heuristic that searches to see if certain key phrases still exist (or don't exist, for phrases like "page removed").
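    The error differentiation in the first item mostly falls out of the HTTP status code. A sketch of the mapping, using the HTTP::Status helpers that ship with LWP (a real checker would also have to catch DNS failures, which LWP reports as synthetic 500 responses, and would need redirects disabled to see a 301):

```perl
use strict;
use warnings;
use HTTP::Status qw(is_success is_server_error);

# Map an HTTP status code to the report categories listed above.
sub classify {
    my ($code) = @_;
    return 'page not found or removed' if $code == 404 || $code == 410;
    return 'permanent redirection'     if $code == 301;
    return 'server error'              if is_server_error($code);
    return 'ok'                        if is_success($code);
    return 'other failure';
}

print "$_ => ", classify($_), "\n" for 200, 301, 404, 500;
```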

    On a separate project, I found XML::LibXML more convenient than HTML::Parser for screen scraping by using its XPath querying method, which even works with badly formed XML and HTML. I find XPath really useful for this kind of thing.
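    The XPath approach mentioned here, sketched against an invented snippet; load_html's recover flag is what lets libxml tolerate badly formed markup:

```perl
use strict;
use warnings;
use XML::LibXML;

my $html = '<html><body><a href="/a">A</a><p><a href="/b">B</a></body></html>';

# recover => 1 makes libxml cope with tag-soup HTML.
my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# One XPath query picks out every href in the document.
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');
print "$_\n" for @hrefs;
```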

Re: "Web Automation" -- your input is greatly desired!
by Jenda (Abbot) on May 06, 2003 at 18:27 UTC

    Not in Perl but ... we do web automation to fill in forms on other sites. That is, if the site won't cooperate and accept the data (job offers) in an XML/CSV/plaintext/... file, we just fake a user clicking buttons, filling in fields, selecting pulldowns and radios, clicking links, ...

    We do this by creating an Internet Explorer object and controlling it. It's a big can of worms, but it seems to be working fine most of the time. I agree WWW::Mechanize would be easier most of the time; the thing is, the sites that do not accept the files are usually the same ones that use crazy JavaScript (if not something worse) on their pages, so we do need the browser object to allow the JavaScript to run. This way is slow, but works with almost any site.
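    In Perl, the same browser-driving approach usually goes through Win32::OLE. A Windows-only sketch; the URL and element ids are invented, and polling ReadyState is the customary way to wait for a page:

```perl
use strict;
use warnings;
use Win32::OLE;

# Drive a real IE instance so the site's JavaScript actually runs.
my $ie = Win32::OLE->new('InternetExplorer.Application')
    or die "cannot start IE: ", Win32::OLE->LastError;
$ie->{Visible} = 1;

$ie->Navigate('http://example.com/post-job');    # hypothetical form page
sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;   # 4 = READYSTATE_COMPLETE

# Fill the form through the DOM, then click the submit button.
my $doc = $ie->{Document};
$doc->getElementById('title')->{value} = 'Perl programmer wanted';
$doc->getElementById('send')->click;
```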

    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    The sites do know we are doing this (I believe). We (or our clients) pay for the job ads so they have no reason to complain.

    Edit by castaway: Closed small tag in signature
