
Ok, now that I've got your attention... what I'm wondering is: what's the best way to capture the bulk of PM for reading "offline", where there is no net connectivity (think WAY offline, as in... no power for hundreds of miles), potentially for months at a time?

I'm going to start small, just to see if it's even feasible, and then expand it. I've done some similar projects in this area over the last few years, which have been quite successful.

I also looked around the monastery here, and found these somewhat-relevant nodes:

There are quite a few useful replies in there, and some reference ThePen (which is down as I type this). Some talk about spidering the site, others about converting from XML to HTML, and others about just pulling a database dump and reusing that.

Ideally, the best approach would be to dump the node tables and replies to some form of XML, like Wikimedia projects do. They have a tool called mwdumper (written in Java) that will take the XML export and pump it back into MySQL (I just did this for the latest Wikipedia database this weekend; it was over 4.5 million separate rows and took 20 hours to import, whew!).

But it doesn't have to be that complex... even just the XML dumps, with some sort of linking from each node to its replies, would be perfect.
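As a rough illustration of that "XML plus reply links" idea, here is a minimal Python sketch. The dump format is entirely made up for the example (a flat list of `<node>` elements, where a reply carries a `parent` attribute); PerlMonks' real XML views may look nothing like this, and `build_threads` and `render_thread` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical dump format, invented for this sketch: a flat list of nodes,
# where each reply names the node it answers via a "parent" attribute.
SAMPLE_DUMP = """\
<nodes>
  <node id="1" title="Some light reading">Root post</node>
  <node id="2" parent="1" title="Re: Some light reading">First reply</node>
  <node id="3" parent="2" title="Re^2: Some light reading">Nested reply</node>
  <node id="4" parent="1" title="Re: Some light reading">Second reply</node>
</nodes>
"""


def build_threads(xml_text):
    """Parse a dump string; return (nodes_by_id, child_ids_by_parent_id)."""
    root = ET.fromstring(xml_text)
    nodes = {}
    children = defaultdict(list)
    for elem in root.iter("node"):
        node_id = elem.get("id")
        nodes[node_id] = {
            "title": elem.get("title"),
            "text": (elem.text or "").strip(),
        }
        parent = elem.get("parent")
        if parent is not None:
            children[parent].append(node_id)
    return nodes, children


def render_thread(node_id, nodes, children, depth=0):
    """Return the thread rooted at node_id as a list of indented titles."""
    lines = ["  " * depth + nodes[node_id]["title"]]
    for child_id in children.get(node_id, []):
        lines.extend(render_thread(child_id, nodes, children, depth + 1))
    return lines


if __name__ == "__main__":
    nodes, children = build_threads(SAMPLE_DUMP)
    print("\n".join(render_thread("1", nodes, children)))
```

With the parent links kept in the dump, an offline reader only needs a lookup table like `children` to rebuild the threaded view; no database is required.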

Now I can also spider ThePen during off-hours (when it comes back online) and store the plain HTML that way, but that introduces load, latency, and bandwidth issues. I'd rather avoid putting that strain on someone else's server, because I know what it's like when someone does it to my public servers.
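If spidering does turn out to be the only option, the strain can at least be kept low by fetching one page at a time with a pause between requests. The following Python sketch shows the shape of that; the function names, the user-agent string, and the five-second default delay are all assumptions made for illustration, and it does not check robots.txt or any site-specific crawling policy, which a real run should.

```python
import time
import urllib.request


def fetch(url, user_agent="offline-reader-sketch/0.1"):
    """Fetch one page over HTTP and return its body as bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def polite_crawl(urls, delay_seconds=5.0, fetcher=fetch, sleeper=time.sleep):
    """Fetch urls sequentially, pausing delay_seconds between requests.

    Returns a dict mapping each URL to its body. URLs whose fetch raises
    an OSError (urllib's URLError is a subclass) are skipped.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleeper(delay_seconds)  # fixed pause before every request but the first
        try:
            pages[url] = fetcher(url)
        except OSError:
            continue  # skip failures rather than hammering the server with retries
    return pages
```

The `fetcher` and `sleeper` parameters exist only so the loop can be exercised without touching the network; in normal use the defaults would be left alone.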

Has there been any movement on the implementation of "nodeballs" in PM yet? The Everything Engine powering PerlMonks supports them, so I guess it's just a matter of consensus, a vote, and enabling it?

What say ye?

In reply to Some light PerlMonks reading by the campfire by hacker
