This is what I have so far. It kind of works, but it lacks finesse and is seriously flawed. Still, I'm only one day into my Perl adventure, so I think I've done OK so far.
use strict;
use warnings;
use LWP::Simple qw(get);
use File::Slurp ;
# pid = Persona ID, one of a player's three identities.
# sid = Sortie ID, identifier for a mission taken by the persona.
# We want to crawl all sortie list pages, collect the new sids we
# haven't seen before, and then move on to the next persona.
my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $pcnt = 1;
my $pidfile = 'c:/scr/pidlist.txt';
# Open the list of pids and set the first one as the current pid.
open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
my $pid = <PIDLIST>;
chomp $pid;
print $pid;
# Grab and store sortie list pages for persona.
while (1) {
    my $page = get "$pbase?page=$pcnt&pid=$pid";
    # Append the grabbed web page to the persona's file.
    append_file( "c:/scr/$pid.txt", $page );
    # Update the page number and grab the next page.
    $pcnt += 1;
}
# Close files
close PIDLIST or die $!;
print "\nDone!\n";
The main flaw is that the server will quite happily keep serving empty sortie list pages, so just incrementing the page count and hoping for a failure to end the loop doesn't work (the result is a huge file).
I want the loop to exit under either of two conditions: the string "No more sorties" is found on the page (end of list), or a sid equal to the stored value for the last one processed is reached. (Sids are six-digit strings that I need to extract from the collected pages.)
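A minimal sketch of those two exit conditions, working on the text of one page at a time. It assumes that sids show up in the page as `sid=` followed by six digits and that the list runs newest first; neither is confirmed by anything above, and `scan_page` and the sample fragment are made-up names and data for illustration.

```perl
use strict;
use warnings;

# Scan one sortie list page. Returns (\@new_sids, $stop), where $stop is
# true when the last processed sid or the end-of-list marker was reached.
# ASSUMPTION: sids appear as "sid=" followed by exactly six digits.
sub scan_page {
    my ($page, $last_sid) = @_;
    my (@new, %seen);
    for my $sid ($page =~ /sid=(\d{6})\b/g) {
        # Everything from the last processed sid onwards is already known.
        return (\@new, 1) if defined $last_sid && $sid eq $last_sid;
        push @new, $sid unless $seen{$sid}++;    # skip repeats on the page
    }
    return (\@new, $page =~ /No more sorties/ ? 1 : 0);
}

# Hypothetical page fragment, for illustration only.
my $sample = '<a href="sortie.jsp?sid=123458">a</a> '
           . '<a href="sortie.jsp?sid=123457">b</a> '
           . '<a href="sortie.jsp?sid=123456">c</a>';
my ($sids, $stop) = scan_page($sample, '123456');
print "new sids: @$sids, stop: $stop\n";
```

The crawl loop would then call `scan_page` on each fetched page and `last` out of the loop as soon as `$stop` is true.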
This code uses LWP, but the suggestion was to use Mechanize, so I need to rewrite it to use that instead.
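A rough sketch of what the fetch loop might look like with WWW::Mechanize, not a finished crawler. The fetcher is passed in as a code reference so the loop can be tried out without touching the server. `page_url`, `make_fetcher`, `fetch_sortie_pages` and the `$max_pages` cap are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';

# Build the URL for one sortie list page of a persona.
sub page_url {
    my ($pid, $page) = @_;
    return "$pbase?page=$page&pid=$pid";
}

# Return a closure that fetches a URL with WWW::Mechanize and gives back
# the page text, or undef on an HTTP error. With autocheck => 0, failures
# are reported through $mech->success instead of dying.
sub make_fetcher {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    return sub {
        my ($url) = @_;
        $mech->get($url);
        return $mech->success ? $mech->content : undef;
    };
}

# Collect pages for one persona until the end marker, a failed fetch, or
# the page cap. ASSUMPTION: $max_pages is a safety limit added here.
sub fetch_sortie_pages {
    my ($fetch, $pid, $max_pages) = @_;
    my @pages;
    for my $n (1 .. $max_pages) {
        my $html = $fetch->( page_url($pid, $n) );
        last unless defined $html;
        push @pages, $html;
        last if $html =~ /No more sorties/;
    }
    return @pages;
}
```

A real run would be something like `my @pages = fetch_sortie_pages(make_fetcher(), $pid, 200);`. The cap of 200 is arbitrary; it only guards against the endless empty pages if the end marker is ever missed.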
I also need to redo the pid-loading part so that it actually works its way through the list of pids. Eventually it will also have to read two values as a pair: the pid and the last processed sid.
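One possible way to read such pairs, sketched with a lexical filehandle and a `while` loop. The file layout (one persona per line, pid first, then optionally the last processed sid, separated by whitespace) is purely an assumption, as are the name `read_pid_list` and the sample numbers.

```perl
use strict;
use warnings;

# Read "pid last_sid" pairs from an open filehandle into a list of hashes.
# ASSUMPTION: one persona per line, whitespace-separated, sid optional.
sub read_pid_list {
    my ($fh) = @_;
    my @personas;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip blank lines and comments
        my ($pid, $last_sid) = split ' ', $line;
        push @personas, { pid => $pid, last_sid => $last_sid };
    }
    return @personas;
}

# Demo with an in-memory file; a real run would open c:/scr/pidlist.txt.
my $sample = "173384 536912\n173385\n";    # made-up values
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
for my $p ( read_pid_list($fh) ) {
    my $last = defined $p->{last_sid} ? $p->{last_sid} : 'none';
    print "pid $p->{pid}, last sid $last\n";
}
close $fh or die $!;
```

Reading line by line in a `while` loop visits every pid in the file, not just the first line.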
I tried using File::Slurp to open and read the pidlist file, but that didn't work out as planned.
For some reason $pid is no longer printed as expected.
Once that's done comes the tricky part: collecting the actual sortie pages and extracting the data I need from them.
Any suggestions on good coding practices and habits to pick up are appreciated; I might as well learn to do it right from the start.