PerlMonks  

This is what I have so far. It kind of works, but it lacks finesse and is pretty seriously flawed. Still, I'm only one day into my Perl adventure, so I think I've done OK so far.

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use File::Slurp;

    # pid = Persona ID, one of a player's 3 identities.
    # sid = Sortie ID, identifier for a mission taken by the persona.
    # We want to crawl all sortie list pages and collect new sids we
    # haven't seen before, and then move on to the next persona.

    my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
    my $pcnt    = 1;
    my $pidfile = 'c:/scr/pidlist.txt';

    # Open list of pids and set the first one as current pid.
    open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
    my $pid = <PIDLIST>;
    chomp $pid;
    print $pid;

    # Grab and store sortie list pages for persona.
    while (1) {
        my $page = get "$pbase?page=$pcnt&pid=$pid";
        # Store grabbed webpage into the file
        append_file( "c:/scr/$pid.txt", $page );
        # Update page number and grab next.
        $pcnt += 1;
    }

    # Close files
    close PIDLIST or die $!;
    print "\nDone!\n";

The main flaw is that the server will quite happily keep serving empty sortie list pages, so just incrementing the page count and hoping for a failure to end the loop doesn't work (the result is a huge file).
I want the loop to exit under either of two conditions: the string "No more sorties" is found on the page (end of list), or a sid equal to the stored value of the last one processed is reached. (Sids are six-digit strings that I need to collect from the fetched pages.)
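
A minimal sketch of a loop with those two exit conditions, under a few assumptions that are not confirmed anywhere above: that sids show up in the page source as `sid=` followed by six digits, and that the list is ordered newest first. The helper name `collect_new_sids`, the `$fetch` code ref and the `$max_pages` safety cap are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module.
#   $fetch     - code ref: takes a page number, returns the page HTML or undef
#   $last_sid  - last sid already processed for this persona (undef if none)
#   $max_pages - safety cap so a misbehaving server cannot loop us forever
sub collect_new_sids {
    my ( $fetch, $last_sid, $max_pages ) = @_;
    $max_pages = 500 unless defined $max_pages;    # arbitrary default
    my @new_sids;

    PAGE:
    for my $page_no ( 1 .. $max_pages ) {
        my $html = $fetch->($page_no);
        last PAGE unless defined $html;            # fetch failed, give up

        # ASSUMPTION: each sortie link contains "sid=" plus six digits.
        for my $sid ( $html =~ /sid=(\d{6})\b/g ) {
            # Exit condition 2: we reached a sortie we already processed.
            last PAGE if defined $last_sid && $sid eq $last_sid;
            push @new_sids, $sid;
        }

        # Exit condition 1: the server says the list has ended.
        last PAGE if $html =~ /No more sorties/;
    }
    return @new_sids;
}
```

With LWP::Simple, the fetcher could be something like `sub { get("$pbase?page=$_[0]&pid=$pid") }`. Passing it in as a code ref keeps the loop logic separate from whichever module does the downloading.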

This code uses LWP, but the suggestion was to use Mechanize, so I need to rewrite it to use that instead.
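
As a rough idea of what the Mechanize version of the fetch might look like: the sketch below wraps an existing WWW::Mechanize object in a closure that fetches one page by number. The helper name `make_page_fetcher` and its arguments are made up for illustration; only `get`, `content` and the response's `is_success` are real methods.

```perl
use strict;
use warnings;

# Hypothetical helper (name and arguments are illustrative only).
# $mech is expected to be a WWW::Mechanize object created by the caller, e.g.
#   my $mech = WWW::Mechanize->new( autocheck => 0 );
# With autocheck off, a failed request returns normally instead of dying.
sub make_page_fetcher {
    my ( $mech, $base_url, $pid ) = @_;
    return sub {
        my ($page_no) = @_;
        # get() returns an HTTP::Response object for the request.
        my $response = $mech->get("$base_url?page=$page_no&pid=$pid");
        return $response->is_success ? $mech->content : undef;
    };
}
```

The returned code ref yields the page content on success and undef on failure, so the caller can decide when to stop.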
I also need to redo the pid-loading bit so it actually works its way through the whole list of pids. Eventually it will also have to read two values as a pair (the pid plus the last processed sid).
I tried using File::Slurp to open and read the pidlist file, but that didn't work out as planned.
For some reason $pid isn't printed as expected any more.
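
For working through the pid list in pid/sid pairs, something along these lines might do. The file format is purely an assumption on my part: one persona per line, the pid followed by an optional last-processed sid, separated by whitespace. The function name `read_pid_list` is hypothetical.

```perl
use strict;
use warnings;

# ASSUMED file format (not settled yet): one persona per line,
#   <pid> <last_sid>
# separated by whitespace; last_sid may be missing for a new persona.
# Takes an already opened filehandle and returns a list of hash refs.
sub read_pid_list {
    my ($fh) = @_;
    my @personas;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip blank lines and comments
        my ( $pid, $last_sid ) = split ' ', $line;
        push @personas, { pid => $pid, last_sid => $last_sid };
    }
    return @personas;
}
```

The caller would open the file with a lexical handle, for example `open my $fh, '<', $pidfile or die "Could not open $pidfile: $!";`, and then loop with `for my $persona ( read_pid_list($fh) ) { ... }`, using `$persona->{pid}` and `$persona->{last_sid}` inside the loop.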

Once that's done comes the tricky part: collecting the actual sortie pages and extracting the data I need from them.

Any suggestions on good coding practices and habits to pick up are appreciated; I might as well learn to do it right from the start.


In reply to Re^2: collect data from web pages and insert into mysql by SteinerKD
in thread collect data from web pages and insert into mysql by SteinerKD
