Scrapping web site - printing - save to file

by locust (Sexton)
on Jul 30, 2012 at 14:14 UTC ( [id://984458] )

locust has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I'm wondering how to go about the following:

I have a website where I need to print out a few press releases each month (up to 20 at most) and also save them all to one file. The website lists the press releases by month on one page with links to each press release.

Before I start this project, I wonder what the best approach would be. Any suggestions? For example, which modules should I use? I don't expect you to write my code for me, but I'm hoping someone with experience could point me in the right direction so I don't waste a whole lot of time.

Thanks! :)

Replies are listed 'Best First'.
Re: Scrapping web site - printing - save to file
by davido (Cardinal) on Jul 30, 2012 at 15:51 UTC

    Use a modern web framework such as Mojolicious or Dancer. Use a database such as SQLite to save the press releases so that your user doesn't experience the Schlemiel the Painter effect if they want to get to the last press release in the ever-growing file. Manage your DB connection with Mojolicious::Plugin::Database. Use Mojolicious::Plugin::Authentication along with a trusted CPAN digest module to deal with authenticating your administrative user (I use Class::User::DBI, but I'm highly biased, and it may be bigger than you need. Minimally, Authen::Passphrase is a nice starting point).
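
    For example, hashing and later verifying an administrative password with Authen::Passphrase might look something like this (the cost and the passphrase are placeholder values, not anything from the actual site):

        use Modern::Perl;
        use Authen::Passphrase::BlowfishCrypt;

        # Hash the admin password with a random salt.
        my $ppr = Authen::Passphrase::BlowfishCrypt->new(
            cost        => 10,             # placeholder work factor
            salt_random => 1,
            passphrase  => 'admin secret', # placeholder passphrase
        );
        my $stored = $ppr->as_crypt;       # store this string in the database

        # Later, at login time, rebuild from the stored string and check.
        my $check = Authen::Passphrase::BlowfishCrypt->from_crypt($stored);
        say 'authenticated' if $check->match('admin secret');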

    The whole thing would probably fit nicely into a Mojolicious::Lite style framework, but if it does grow to the point that you need a little stronger separation of concerns you can easily inflate a Mojolicious::Lite application into a full app where you separate the templates into their own files, the controllers into their own classes, and the router as the bulk of the application class: Mojolicious::Guides::Growing.
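
    As a rough, untested sketch, a Mojolicious::Lite app serving press releases out of SQLite could start out like this (the database file, table, and column names are all made up for illustration):

        use Mojolicious::Lite;
        use DBI;

        # Hypothetical schema: releases(title, body, posted_on)
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=releases.db', '', '',
            { RaiseError => 1 } );

        # List all press releases, newest first, as JSON.
        get '/releases' => sub {
            my $self = shift;
            my $rows = $dbh->selectall_arrayref(
                'SELECT title, body FROM releases ORDER BY posted_on DESC',
                { Slice => {} },
            );
            $self->render( json => $rows );
        };

        app->start;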

    Update: I failed to mention earlier... If your application needs to do some scraping as well (did you mean scraping instead of scrapping?), then Mojolicious really is a good choice as a web framework, because it comes bundled with Mojo::UserAgent: A "Non-blocking I/O HTTP and WebSocket user agent", as well as Mojo::DOM, a "Minimalistic HTML5/XML DOM parser with CSS3 selectors", and Mojo::JSON, a "Minimalistic JSON" parser and generator. ...many of the important tools used in effective scraping.
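
    For instance, a bare-bones scrape of a monthly index page with Mojo::UserAgent and Mojo::DOM might look like this (the URL and the CSS selector are guesses about the target site's markup, so adjust them to the real pages):

        use Modern::Perl;
        use Mojo::UserAgent;

        my $ua = Mojo::UserAgent->new;

        # Hypothetical index page listing one month's press releases.
        my $dom = $ua->get('http://example.com/press/2012-07')->res->dom;

        # Assumes each release is an <a> inside a list item; adjust the
        # selector to match the real markup.
        $dom->find('li a')->each(sub {
            my $link = shift;
            say $link->text, ' => ', $link->{href};
        });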


    Dave

Re: Scrapping web site - printing - save to file
by aaron_baugher (Curate) on Jul 30, 2012 at 16:50 UTC

    A couple common tools for scraping would be LWP::UserAgent (or LWP::Simple) and WWW::Mechanize. WWW::Mechanize can automate the process of walking through the links for you, while with LWP you'll have to fetch the first page, parse out the links you want somehow, and then fetch those pages. Parsing can be done with a wide array of HTML/XML/DOM parsers/structure-builders. (I don't have one to recommend because there are so many popping up all the time that it seems like there's a new favorite every time I do such a task.) Parsing can also be done with regexes, but that's usually not recommended except for quick-and-dirty tasks. (Some will probably say that it's never recommended, but I recently had a task where I needed to parse a single value out of a page, and the overhead of loading the entire page into HTML::TreeBuilder caused a 10-fold increase in time used, compared to a single regex, so you have to decide each case for yourself.)

    In either case, printing a series of pages to a single file is no big deal; just open the file at the beginning, print each one to it, and then close it. However, it's unlikely that you really want to print all of each page to the file, since in most cases that will include a lot of cruft like the HTML <head> section and the page's headers and footers and menus and so on. So again, you're probably going to want a parser/tree-builder of some sort to help you pluck out the section of the page that you actually want to save.
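
    A rough sketch of that workflow with WWW::Mechanize, where the URL, the link-text pattern, and the output file name are placeholders:

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new;
        $mech->get('http://example.com/press-releases');

        # Assumes the index links to each release with recognizable text.
        my @links = $mech->find_all_links( text_regex => qr/press release/i );

        open my $out, '>', 'releases.html'
            or die "Can't write releases.html: $!";
        for my $link (@links) {
            $mech->get( $link->url_abs );
            print $out $mech->content;   # dumps the whole page; in practice,
                                         # pluck out just the body as above
            $mech->back;
        }
        close $out;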

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Scrapping web site - printing - save to file
by Kenosis (Priest) on Jul 30, 2012 at 16:44 UTC

    At the risk of being Schlemiel, and although davido provided excellent module suggestions, perhaps the following will also help:

    use Modern::Perl;
    use DateTime;

    my @pressReleasesLinks;

    # my @pressReleases = <*.pdf>;    # read press release dir
    my @pressReleases = <DATA>;

    for my $i ( 1 .. 12 ) {
        for ( sort grep { /-(\d{2})-/; $1 == $i } @pressReleases ) {
            chomp;
            my ( $year, $month, $day ) = split '-', (/([^.]+)/)[0];
            $day = $day + 0;
            my $monthName =
              DateTime->new( year => $year, month => $month )->month_name;
            $pressReleasesLinks[ $i - 1 ] .=
              qq|<a href="$_">$monthName $day, $year</a>\n|;
        }
    }

    do { say $pressReleasesLinks[$_] if defined $pressReleasesLinks[$_] }
      for 0 .. 11;    # Updated: in case a month is skipped

    __DATA__
    2012-03-15.pdf
    2012-03-05.pdf
    2012-05-20.pdf
    2012-05-01.pdf
    2012-05-15.pdf
    2012-01-01.pdf
    2012-01-15.pdf
    2012-02-01.pdf
    2012-02-15.pdf

    Output:

    <a href="2012-01-01.pdf">January 1, 2012</a>
    <a href="2012-01-15.pdf">January 15, 2012</a>
    <a href="2012-02-01.pdf">February 1, 2012</a>
    <a href="2012-02-15.pdf">February 15, 2012</a>
    <a href="2012-03-05.pdf">March 5, 2012</a>
    <a href="2012-03-15.pdf">March 15, 2012</a>
    <a href="2012-05-01.pdf">May 1, 2012</a>
    <a href="2012-05-15.pdf">May 15, 2012</a>
    <a href="2012-05-20.pdf">May 20, 2012</a>

    This assumes your press releases are PDFs stored in a directory and named with the YYYY-MM-DD scheme shown above, which lets the script build a set of month-clustered links to those documents.

    Update: I apologize for the above noise if I'm not correctly understanding the issue.
