PerlMonks  


Ok, now that I've got your attention... what I'm wondering is what the best way would be to capture a large chunk of PM for reading "offline", where there is no net connectivity (think WAY offline, as in... no power for hundreds of miles), potentially for months at a time.

I'm going to start small, just to see if it's even feasible, and then expand from there. I've done some similar projects in this area over the last few years, and they have been quite successful.

I also looked around the monastery here, and found these somewhat-relevant nodes:

There are quite a few useful replies in there, and some reference ThePen (which is down as I type this). Some talk about spidering the site, others about converting from XML to HTML, and others about just pulling a database dump and reusing that.

Ideally, the best approach would be to dump the node tables and replies to some form of XML, like Wikimedia projects do. They have a tool called mwdumper (written in Java) that will take the XML export and pump it back into MySQL. (I just did this for the latest Wikipedia database this weekend; it was over 4.5 million separate rows and took 20 hours to import, whew!)
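To make the dump-and-reimport idea concrete, here is a minimal sketch of streaming an XML dump into a local database. It is not mwdumper and not any real PerlMonks export: the `<nodes>`/`<node>` layout, the `id`/`parent`/`title` attributes, and the `node` table are all made up for illustration, and it uses SQLite from Python rather than MySQL only so that the example runs on its own.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical dump layout (an assumption, not a real PerlMonks format):
# <nodes><node id=".." parent=".." title="..">body text</node>...</nodes>
SAMPLE_DUMP = b"""<nodes>
  <node id="1" title="Offline reading">Root question</node>
  <node id="2" parent="1" title="Re: Offline reading">First reply</node>
  <node id="3" parent="2" title="Re^2: Offline reading">Nested reply</node>
</nodes>"""


def import_dump(stream, conn):
    """Stream <node> elements into a table without loading the whole file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, title TEXT, body TEXT)"
    )
    count = 0
    # iterparse yields each element as soon as its closing tag is read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "node":
            continue
        parent = elem.get("parent")
        conn.execute(
            "INSERT OR REPLACE INTO node VALUES (?, ?, ?, ?)",
            (
                int(elem.get("id")),
                int(parent) if parent else None,
                elem.get("title"),
                elem.text or "",
            ),
        )
        elem.clear()  # drop the stored element's contents to save memory
        count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
imported = import_dump(io.BytesIO(SAMPLE_DUMP), conn)
```

Parsing incrementally matters for a multi-million-row dump, since the whole file never has to fit in memory at once.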

But it doesn't have to be that complex... even just the XML dumps with some sort of linking to each of the replies, would be perfect.
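The "linking to each of the replies" part could be as simple as a parent id on every node. A rough sketch, assuming each dumped node carries its own id and the id of the node it replies to (the row layout and sample titles below are hypothetical):

```python
from collections import defaultdict

# Hypothetical rows: (node_id, parent_id or None for a root node, title)
ROWS = [
    (1, None, "Offline reading"),
    (2, 1, "Re: Offline reading"),
    (3, 2, "Re^2: Offline reading"),
    (4, 1, "Re: Offline reading (spidering)"),
]


def thread_outline(rows):
    """Return an indented, plain-text outline of a threaded discussion."""
    children = defaultdict(list)  # parent_id -> list of child node ids
    titles = {}
    for node_id, parent_id, title in rows:
        titles[node_id] = title
        children[parent_id].append(node_id)

    lines = []

    def walk(node_id, depth):
        lines.append("  " * depth + f"[{node_id}] {titles[node_id]}")
        for child in sorted(children[node_id]):
            walk(child, depth + 1)

    for root in sorted(children[None]):
        walk(root, 0)
    return lines


outline = thread_outline(ROWS)
print("\n".join(outline))
```

With only the parent ids, a whole thread can be rebuilt offline without any database at all.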

Now, I could also spider ThePen during off-hours (when it comes back online) and store the plain HTML that way, but that introduces load, latency, and bandwidth issues. I'd rather avoid putting that strain on someone else's server, because I know what it's like when someone does it to my public servers.
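If spidering does turn out to be the only option, the load can at least be kept down by fetching one page at a time with a fixed pause in between. A minimal sketch of that idea follows; the function name, the 5-second delay, and the example URLs are all assumptions, and `fetch` is passed in so the demo below runs without touching the network.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between requests to limit load."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


# Demo with a fake fetcher and a recording "sleep", so nothing is downloaded.
recorded_sleeps = []
pages = polite_fetch_all(
    ["https://example.invalid/a", "https://example.invalid/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=5.0,
    sleep=recorded_sleeps.append,
)
```

For a real run, `fetch` could wrap `urllib.request.urlopen`, and the site's robots.txt should be checked before crawling anything.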

Has there been any movement on implementing "nodeballs" in PM yet? The Everything Engine powering PerlMonks supports them, so I guess it's just a matter of consensus, a vote, and enabling it?

What say ye?


In reply to Some light PerlMonks reading by the campfire by hacker
