PerlMonks  

Public export of Perl Monks database

by zby (Vicar)
on Feb 21, 2003 at 15:46 UTC (#237458=monkdiscuss)

I wonder what Perl Monks think about this subject. I am well aware of the scary copyright problem, but I believe it could be something really useful. You could use it to create some statistics, to develop a new search, to make a Perl Monk's Bible, or just to have your own fast search.

Of course it could be a restricted export.

Re: Public export of Perl Monks database
by VSarkiss (Monsignor) on Feb 21, 2003 at 16:04 UTC

    I'm not sure exactly what you're proposing. Are you saying an export of the entire database, including code, home nodes, passwords, etc? I don't think that would be a good idea.

    The Everything engine can import and export what are called "nodeballs". If you have a certain set of node_id's you want, gods have expressed willingness (in the context of pmdev) to create nodeballs of them. I can't speak for them, but they may be willing to do the same for you if you ask nicely. Of course, it would have to be a reasonable-sized set, with all the usual caveats about security, available time and resources, and so on.

    Or am I misinterpreting your question entirely?

      I did say some restricted export - so no, I don't mean to publish passwords etc. I was thinking of something like, let's say, a weekly automatic dump in a publicly available directory.
Re: Public export of Perl Monks database
by zby (Vicar) on Feb 21, 2003 at 17:07 UTC
    As to the copyright issue - there could be a page where everybody could explicitly sign up for a copyleft. Then only nodes created by those who signed would be published.

    And the licence for the published material could be the GNU Free Documentation License.

(jeffa) Re: Public export of Perl Monks database
by jeffa (Chancellor) on Feb 21, 2003 at 17:22 UTC
    You don't have to have direct access to the database to make useful things such as statistics, new searches, bible, etc. All you need is a script to fetch nodes (i recommend fetching XML versions).
    use strict;
    use warnings;
    use XML::Simple;
    use LWP::Simple;

    our $URL  = 'http://www.perlmonks.org/index.pl';
    our $PATH = '/path/to/perlmonks/nodes';

    for (0 .. 666666) {
        # fetch the XML version of the node; skip ids where the fetch failed
        my $node = get("$URL?node_id=$_&displaytype=xml") or next;
        my $xml  = XMLin($node);
        next if $xml->{title} =~ /Permission\s+Denied/i;
        next if $xml->{title} =~ /Not\s+found/i;

        # save the raw XML under the node id
        open my $fh, '>', "$PATH/$_.xml" or die "can't write: $!";
        print $fh $node;
        close $fh;
        sleep 5; # play nice ;)
    }
    Very simple, could use some more work, but this will get the job done. Just be sure and run it during the weekend or other 'less busy' times. ;) I also have some code over at Node XML to HTML that transforms the XML into HTML ... it's not perfect either, but it's a start.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Yes - I know that technically I can do that. I did not know about that XML interface, but you can always use HTML::Parser. What I am asking is whether this is allowed. And besides that, downloading the whole database your way would generate quite some load on the server.

      I believe that when it is done my way - it would encourage people to think up new ways to use it.

        "What I am asking is if this is allowed"

        Well ... it's not not allowed.

        "...this would generate quite some load on the server..."

        Damn spiffy it will. See up there in my post where i said "run it during the weekend or other 'less busy' times"? However, due to the fact that the code only fetches each node as XML, it's not quite as much of a load as you might think. The server does not have to generate nodelets and such.

        "I believe that when it is done my way..."

        And that's why i posted. You might be waiting a loooong time for your idea to be implemented here, unless you want to become a god and do it yourself. :)

        For the record, i would love to have access to the database. From time to time i like to do a little history/research and that would make my life much easier. Until then, i just run a script similar to the one i cranked out above when there are very few users on the site.

        jeffa

Re: Public export of Perl Monks database
by valdez (Monsignor) on Feb 21, 2003 at 17:28 UTC

    Nice idea, zby++. Does someone know the rough size of such a backup?

    Ciao, Valerio

    update: using data provided by jeffa, I made the following guess: given that tilly's nodes average ~1035 bytes, ~238000 nodes will be ~235 MB (uncompressed).

      I don't, but for what it is worth, i grabbed all of tilly's nodes a while after he announced his departure. His 2994 writeups total up to about 13 megabytes and he is only number three over at Our Best Users ...

      UPDATE:
      valdez tells me that the total megs on tilly's posts is only about 3. I ran du -h originally, but after thinking about this, 3 megs sounds more correct than 13. Thanks valdez. :)
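A hedged aside on measuring this: du reports allocated disk space rather than content length, so a directory of a few thousand small files can look several times larger than the data it holds. Whether that fully explains the 13 versus 3 here is an assumption, not something confirmed in this thread. A minimal sketch that sums the actual byte sizes instead, assuming the one-file-per-node `*.xml` layout written by the fetch script above (`total_bytes` is a hypothetical helper name):

```perl
use strict;
use warnings;

# Sum the real content size (in bytes) of every .xml file in a directory.
# total_bytes() is a hypothetical helper, not part of any PerlMonks tool.
sub total_bytes {
    my ($dir) = @_;
    my $total = 0;
    for my $file (glob "$dir/*.xml") {
        # -s gives the file size in bytes (false for an empty file)
        $total += (-s $file) || 0;
    }
    return $total;
}
```

It could be called as `printf "%.1f MB\n", total_bytes('/path/to/perlmonks/nodes') / (1024 * 1024);`. Note that the simple `glob` pattern assumes a directory path without spaces.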

      jeffa

Re: Public export of Perl Monks database
by pfaut (Priest) on Feb 21, 2003 at 18:53 UTC

    I'm not sure exactly what information you want to get out of the system but quite a bit is available through the XML generators. I'm currently using these to create my own newest nodes interface (login version, no login version). Part of this project is to keep a local cache of node header information in a PostgreSQL database. You should be able to get at most of the information you want this way. Just don't beat on the server by asking for all 237,000 nodes at once and try to grab information during off peak hours.
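As a rough illustration of the "local cache" idea, here is a minimal sketch that works out which node ids still need fetching, so that a rerun never asks the server for something already stored. It assumes a flat-file cache of `<id>.xml` files rather than the PostgreSQL database mentioned above, and `uncached_ids` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Return the node ids that are not yet cached as "$dir/<id>.xml".
# uncached_ids() is a hypothetical helper; the file layout is an assumption.
sub uncached_ids {
    my ($dir, @ids) = @_;
    return grep { !-e "$dir/$_.xml" } @ids;
}
```

A fetch loop could then iterate over `uncached_ids($dir, 1 .. 1000)` and sleep between requests, which keeps both the batch size and the request rate small.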

    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
Re: Public export of Perl Monks database
by blm (Hermit) on Feb 22, 2003 at 02:33 UTC

    How big would the information be?

    Consider that as of 2003-01-28 16:14:30 there were 21341 registered users of which 6232 have actually created write-ups. From this page we can calculate the total number of writeups as 190817.

    Now tilly has left (last login Mar 31, 2002 at 09:27 GMT-10), and jeffa has already noted that downloading tilly's nodes took up 3 MB on his hard drive. This was for 2986 posts, so the average node size was about 1053 bytes.

    Assuming this average post size is representative of the entire PerlMonks database, one would estimate the size of the database containing writeups to be about 190817 x 1053 = 200930301 bytes, or about 191 megabytes.
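The arithmetic can be checked with a few lines of Perl. This is only a sketch of the estimate, using the figures quoted in this post; `estimate_bytes` is a hypothetical helper name, and the result inherits the assumption that tilly's average node size is representative:

```perl
use strict;
use warnings;

# Estimate total size as (number of writeups) x (average bytes per writeup).
# estimate_bytes() is a hypothetical helper used only for this illustration.
sub estimate_bytes {
    my ($writeups, $avg_bytes) = @_;
    return $writeups * $avg_bytes;
}

my $bytes = estimate_bytes(190817, 1053);   # figures quoted in this post
printf "%d bytes, about %.1f MB\n", $bytes, $bytes / (1024 * 1024);
```

With these inputs it prints "200930301 bytes, about 191.6 MB". Swapping in a different average, such as the ~1035 bytes used elsewhere in the thread, shows how sensitive the total is to that one guess.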

    Before anyone flames me, know this: I know I have made some big assumptions. It should be noted that tilly had a lot to offer, so his writeups are probably larger than a lot of others.

    Most of my data came from the perlmonks stats site. A big thanks to jcwren! The total size of tilly's writeups was from (jeffa) 2Re: Public export of Perl Monks database

    Could anyone imagine a PerlMonks Compendium that could be sold to raise money to fund the perlmonks web site? Would people be interested in that?

    UPDATE: In the time that it took to write this several others have already posted this information

Re: Public export of Perl Monks database
by castaway (Parson) on Feb 23, 2003 at 08:28 UTC
    It's an interesting idea..
    I'm just wondering how much actual work it would take to make it useful for anyone who's looking for answers to certain problems.. i.e. someone will have to do quite a bit of sorting and categorising to make it suitable for any sort of publication. Most of the people that keep up Perl Monks (and hooray for them) seem to have enough to do already :)
    (There's a whole lot of redundant stuff, posts that say the same thing, posts that are inaccurate because of misunderstanding the question, etc. And who's to judge what's 'good' and 'useful' and what's not?)

    Having said that, maybe the exported bundle of actual node data will be useful to someone.. Apart from setting up a mirror, I can't think of a real use at the moment. If anything, it'd be nice to put on a CD to make sure it doesn't get lost.. Though I hope that PM does backups anyway...

    C.

      Perhaps you don't see a use - but I do. I won't post the ideas here - they are still very vague, and I would like to test them before that, but I am sure others will find other ideas. The important thing is to open the database so that everyone can test their own ideas. And I believe there are many ways to distill interesting information from it.

Node Type: monkdiscuss [id://237458]
Approved by gmax