PerlMonks  

HTTP Scripting

by marinersk (Priest)
on Nov 29, 2002 at 02:21 UTC ( [id://216406]=perlquestion )

marinersk has asked for the wisdom of the Perl Monks concerning the following question:

I've probably used the wrong terminology, which only highlights my lack of experience with network programming.

I'm trying to learn how to write a web parser or scripter. This would be something that can go to a web site, scan (or "scrape", as we used to call it in the old days) the data that comes back for keywords and so on, and parse out the HREF links needed to script tasks that otherwise have to be done by hand.

Problem is, I haven't a clue where to begin, and searching sites for the keywords HTML, HTTP and SOCKET doesn't seem to help, because waaay too much stuff matches such general terms.

Can someone point me at a good tutorial or basic info on how this is done, or provide a code snippet demonstrating a Perl module that can accomplish this?

I have no problems with digging through the documentation to learn how to do this myself, but I can't even seem to find the right library, much less the right reference manual.

Thanks for taking the time to read this.

_______________
Steven K. Mariner
marinersk@earthlink.net
http://home.earthlink.net/~marinersk/

Replies are listed 'Best First'.
Re: HTTP Scripting
by cjf-II (Monk) on Nov 29, 2002 at 02:34 UTC
      Thanks, cjf-II, also excellent references. You guys rock!
(bbfu) Re: HTTP Scripting
by bbfu (Curate) on Nov 29, 2002 at 02:38 UTC

    And to do it in Perl, check out LWP, LWP::RobotUA, WWW::Robot, and such.

    Update: You can certainly do it using LWP and one of the HTML parsing modules (such as HTML::TokeParser, mentioned above), but for the easiest and quickest route I definitely recommend one of the Robot modules I listed: they already implement most of the web retrieval for you, and will probably catch a lot you might otherwise miss (such as honoring robots.txt).
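
    A minimal sketch of the LWP route described above, assuming LWP, HTML::TokeParser and URI are installed from CPAN. It is not code from this thread: the agent name, the contact address and the helper names fetch_page and extract_links are made up for illustration. LWP::RobotUA fetches the page (consulting robots.txt first) and HTML::TokeParser walks the <a> tags.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;       # LWP::UserAgent subclass that honors robots.txt
use HTML::TokeParser;   # token-at-a-time HTML parser
use URI;                # resolves relative links against a base URL

# Return [absolute_url, link_text] pairs for every <a href="..."> in $html.
sub extract_links {
    my ($html, $base_url) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};          # $tag->[1] is the attribute hash
        next unless defined $href;           # skip <a name="..."> anchors
        my $text = $parser->get_trimmed_text('/a');
        push @links, [ URI->new_abs($href, $base_url)->as_string, $text ];
    }
    return @links;
}

# Fetch one page politely. Agent name and address are placeholders.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::RobotUA->new('example-scraper/0.1', 'you@example.com');
    $ua->delay(10 / 60);                     # delay is in minutes: 10 seconds
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $url = shift @ARGV;
    for my $link (extract_links(fetch_page($url), $url)) {
        printf "%-50s %s\n", @$link;
    }
}
```

    Saved under a name of your choosing and run with a URL as its only argument, it prints one line per link. Keeping extract_links separate from fetch_page means the parsing can be tried on a saved HTML file without touching the network.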

    bbfu
    Black flowers blossom
    Fearless on my breath

      It just keeps getting better. Thanks, bbfu!
Re: HTTP Scripting
by pg (Canon) on Nov 29, 2002 at 02:33 UTC
    For HTTP, read RFC 2068, the HTTP/1.1 spec (since superseded by RFC 2616).

    For HTML, visit the W3C site and read the sections on HTML and XHTML.

    For sockets, just pick up a Perl book, for example the Black Book, which has some good material. The latest edition has sections covering sockets, HTTP, HTML and so on. It is a good solution-driven book for newbies.
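
    To show what the HTTP and socket layers look like underneath, here is a rough sketch using only the core IO::Socket::INET module. The helper names build_request, raw_get and split_reply are hypothetical, and this is a learning aid under simplifying assumptions (port 80, no redirects, no chunked-body decoding), not a replacement for LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use IO::Socket::INET;   # core module: TCP client sockets

# Build the text of a minimal HTTP/1.1 GET request.
# Lines end in CRLF and a blank line ends the header block.
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.1",
        "Host: $host",
        "Connection: close",   # ask the server to close when done
        "", "";
}

# Send the request over a plain TCP socket and return the raw reply.
sub raw_get {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 15,
    ) or die "Cannot connect to $host:80: $@\n";
    print {$sock} build_request($host, $path);
    local $/;                  # slurp mode: read until the server closes
    my $reply = <$sock>;
    close $sock;
    return defined $reply ? $reply : '';
}

# Split a raw reply into status line, header lines and body.
# Note: the body may still be chunk-encoded; this sketch does not decode it.
sub split_reply {
    my ($reply) = @_;
    my ($head, $body) = split /\r\n\r\n/, $reply, 2;
    $head = '' unless defined $head;
    my ($status, @headers) = split /\r\n/, $head;
    return ($status, \@headers, $body);
}

# Only touch the network when given a host and a path.
if (@ARGV == 2) {
    my ($status, $headers, $body) = split_reply(raw_get(@ARGV));
    print "$status\n";
    print "  $_\n" for @$headers;
}
```

    Given a host and a path as its two arguments, it prints the status line and the response headers, which makes the request/response format from the RFC easy to see.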
      Thanks, pg. Excellent material.

      I was going to ask what the black book was, but a Google search on +perl +"black book" made it quite clear I'd be hailing myself as an ignoramus had I done so. :-)

Re: HTTP Scripting
by adrianh (Chancellor) on Nov 29, 2002 at 07:43 UTC

    You might be able to save yourself some effort by taking a look at WWW::Search::Scraper.

    I've not used it myself, but it looks like a nice customisable solution to this sort of problem.

      Wow, you guys are a true fountain of knowledge. It will be difficult to resist the temptation to ask every little Perl question here in the future... Thanks, adrianh!
Re: HTTP Scripting
by ajt (Prior) on Nov 29, 2002 at 12:30 UTC

    marinersk,

    Perl is good at this kind of thing. Perl has modules to connect to web servers (LWP), work with cookies and passwords, and parse HTML (the HTML:: family).
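
    As a hedged illustration of the cookie and password handling mentioned above, the sketch below configures an LWP::UserAgent with an in-memory cookie jar and one set of HTTP Basic credentials. The helper name make_agent, the agent string, the host, the realm and the login details are all placeholders, not values from any real site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;

# Build a user agent that remembers cookies between requests and knows
# one set of HTTP Basic credentials. All values are caller-supplied.
sub make_agent {
    my (%opt) = @_;
    my $ua = LWP::UserAgent->new(
        agent      => 'example-scraper/0.1',  # placeholder agent name
        cookie_jar => {},                     # in-memory cookie jar
        timeout    => 30,
    );
    # netloc is "host:port"; realm is the name the server sends
    # in its 401 challenge.
    $ua->credentials($opt{netloc}, $opt{realm}, $opt{user}, $opt{password})
        if defined $opt{netloc};
    return $ua;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV == 1) {
    my $ua = make_agent(
        netloc   => 'www.example.com:80',     # placeholder host:port
        realm    => 'Members',                # placeholder realm
        user     => 'guest',
        password => 'secret',
    );
    my $response = $ua->get($ARGV[0]);
    print $response->status_line, "\n";
}
```

    Any cookies the server sets are sent back automatically on later requests made through the same $ua object, which is usually all a scripted login session needs.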

    Perl has several HTML/XML parsers: some are general-purpose, and some are dedicated, e.g. link extractors and header parsers.
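
    For the dedicated parsers, two that ship with the HTML::Parser distribution on CPAN are HTML::LinkExtor and HTML::HeadParser. The sketch below is only an illustration of how they might be used; the helper names page_title and page_links are hypothetical, and it assumes the page is already in a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::LinkExtor;    # dedicated link extractor
use HTML::HeadParser;   # dedicated <head> section parser

# Return the contents of the <title> element, if any.
sub page_title {
    my ($html) = @_;
    my $head = HTML::HeadParser->new;
    $head->parse($html);
    return $head->header('Title');
}

# Return "tag attribute url" strings for every link-bearing element
# (<a href>, <img src>, ...), made absolute against $base_url.
sub page_links {
    my ($html, $base_url) = @_;
    my $extor = HTML::LinkExtor->new(undef, $base_url);
    $extor->parse($html);
    $extor->eof;
    my @found;
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;      # [ tag, attr => url, ... ]
        push @found, map { "$tag $_ $attr{$_}" } sort keys %attr;
    }
    return @found;
}

# Read an HTML file named on the command line; no network access needed.
if (@ARGV == 2) {
    my ($file, $base_url) = @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my $html = do { local $/; <$fh> };
    close $fh;
    my $title = page_title($html);
    print "Title: ", (defined $title ? $title : '(none)'), "\n";
    print "$_\n" for page_links($html, $base_url);
}
```

    Compared with a general-purpose parser, these do one job each, so there is less code to write when links or header information are all you need.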

    You could argue that your choice is so wide that it becomes daunting!

    I would suggest the following books. Perl and LWP is all about connecting to, collecting, and parsing web data. I would also suggest Data Munging with Perl; it's a little older and more generic (it covers more than just web automation), but it's a fine book with good examples of web data mining. Web Client Programming with Perl is old and out of print, but it's freely available as an OpenBook from O'Reilly, and quite useful.

    I would also check out merlyn's columns as I think there are some good examples in there with good descriptions. There may also be something in Perl.com's article archive.


    --
    ajt
      Thanks, ajt, excellent book references, and things which will gladly join my growing library.
Re: HTTP Scripting
by dingus (Friar) on Nov 29, 2002 at 08:59 UTC
    Unless you are doing this for educational reasons, why reinvent the wheel? As well as the wonderful Perl modules mentioned by people above, there is also a most excellent standalone tool: wget.

    Dingus


    Enter any 47-digit prime number to continue.
      Thanks, dingus, I'll check it out. Part of the motivation is educational, part of it is business related, but mostly I was unaware there was such a utility out there. Very cool.
