
More PM Stats

by Limbic~Region (Chancellor)
on Feb 15, 2004 at 17:23 UTC ( [id://329141]=monkdiscuss )

All:
Judging by the number of posts in Newest Nodes a few days ago, it seemed like a slow day. I asked in the CB if anyone had done statistical analysis on the number and type of posts on any given day. The general response was no, but all the tools are there to do it if you want. I really didn't want to. I wanted to be lazy and enjoy someone else's work.

I was provided with a 24MB XML file (thanks James; it has since been replaced by a 51MB one) that sat in my inbox until last night. Seeing an opportunity to improve my non-existent SQL and CGI skills, I decided to load the information into a database, create a few nifty queries, and generate some HTML reports. It took me a lot longer than I expected to even get started, because to call my SQL-fu non-existent was far too kind.

Using the "eyes were bigger than my stomach" analogy, I quickly found myself in over my head. The ideas I had, while all possible, seemed like they would take far more work than the benefit I would get from them. Besides, judging from the Christmas Report, the 24MB XML file didn't contain all the records anyway.

This is where you come in. Take a look at this and tell me what you think. tye and diotalevi helped me get this far. If you feel that continuing on would be a worthwhile endeavour, then I will. If you have any ideas for the type of reports you would like to see, please let me know.

Each record contains the following fields:

  • Year
  • Month
  • Day
  • Hour
  • Type (PMDiscussion, Poetry, etc)
  • Root (to determine type and if sub-node)
  • Day of Week
  • Holiday (I used US Federal Holidays + Valentine's Day)
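As a rough illustration of how the fields above could be derived from a single node timestamp, here is a minimal Python sketch. It is not the code used for the reports: the `make_record` helper, the timestamp format, the `root_id` convention (`None` meaning "this node is itself a root") and the short fixed-date holiday table are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical, incomplete holiday table: fixed-date holidays only.
# The actual list used for the reports is not shown in the post.
FIXED_HOLIDAYS = {
    (1, 1): "New Year's Day",
    (2, 14): "Valentine's Day",
    (7, 4): "Independence Day",
    (12, 25): "Christmas Day",
}

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def make_record(timestamp, node_type, root_id=None):
    """Split one node timestamp into the per-record fields listed above.

    timestamp: string in the assumed format 'YYYY-MM-DD HH:MM:SS'.
    root_id:   assumed convention, None when the node is itself a root.
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "type": node_type,
        "root": root_id,
        "day_of_week": DAY_NAMES[ts.weekday()],
        "holiday": FIXED_HOLIDAYS.get((ts.month, ts.day)),
    }


if __name__ == "__main__":
    print(make_record("2004-02-15 17:23:00", "monkdiscuss"))
```

Holidays that float (for example, ones defined as "the third Monday of the month") would need a rule per year rather than a fixed lookup table.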

    Cheers - L~R

    Update: It turned out that I originally received an old file, but have since rebuilt the database and populated current information.

    Update 2: Thanks to diotalevi it now takes 1 minute instead of 1 hour to rebuild the database from scratch (yeah COPY). He already pointed out that the hours should be adjusted to GMT and James mentioned that this may require using two different zones since the Monastery has moved since its inception. Rest assured when I get time I will be doing this. I also removed "system" nodes from the statistics as tachyon feels that the average number of root nodes is twice the actual average.

    Re: More PM Stats
    by BazB (Priest) on Feb 15, 2004 at 17:42 UTC

      How about some graphics?

      mojotoad's chatterbox stats uses PISG and seems to work nicely.
      Of course, that particular package might not be right for this job.

      I'm not sure how much work is involved in representing the data graphically rather than as lists of numbers, but I find this kind of data much easier to understand when it is graphed.


      If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
      That way everyone learns.

        BazB,
        How about some graphics?

        That's one of the things that I envisioned but found myself lacking in the skills department. If this sounds like an interesting project and someone wants to collaborate - I am all for it.

        Cheers - L~R

    Re: More PM Stats
    by diotalevi (Canon) on Feb 15, 2004 at 17:42 UTC
      How did you handle timezones? PM is in -05 and I maintain a standard skew between my own system and perlmonks.org. How about sharing your SQL and data somewhere?
        diotalevi,
        I didn't handle timezones. I used the information in the XML intact. My rationale was that even if the hours were "off", they would be universally off. If you feel that adjusting to GMT will be worth it, I can certainly do that. As far as the code and data - see below, but don't laugh:
          • Code to build database
          • Code to generate HTML (SQL)
          • Sample from the 51MB XML provided by James

        Cheers - L~R

    Re: More PM Stats
    by CountZero (Bishop) on Feb 15, 2004 at 17:47 UTC
      Interesting, but I would not concentrate on statistics "by the hour": we have Monks all over the world, so one Monk's night is another Monk's day, and I assume that will tend to throw the statistics off (unless the hour reported is the local time).

      Perhaps some nice graphs to add to the data? And some more statistical figures (not only average, but also median, standard deviation, ...)

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        CountZero,
        While I understand where you are coming from, try to think of it in a different light. While we do have monks from all over the world, the timestamp is a standard offset from GMT. That allows you to see heavy "usage" regardless of where monks are in the world. It is also not that difficult to adjust to local time.
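To make the "standard offset from GMT" point concrete, here is a small Python sketch of shifting a server timestamp to GMT or to a reader's own offset. The -05 offset is the figure mentioned earlier in this thread; the helper names `to_utc` and `to_offset` are made up for the example, and a single fixed offset is an assumption that ignores daylight saving and any change of server location.

```python
from datetime import datetime, timedelta, timezone

# Assumption: every stored timestamp is at a fixed -05:00 offset,
# as mentioned in this thread. Real data may need more than one zone.
SERVER_TZ = timezone(timedelta(hours=-5))


def to_utc(naive_server_time):
    """Attach the assumed server offset and convert to GMT/UTC."""
    return naive_server_time.replace(tzinfo=SERVER_TZ).astimezone(timezone.utc)


def to_offset(naive_server_time, offset_hours):
    """Convert a server timestamp to a reader's fixed offset from GMT."""
    reader_tz = timezone(timedelta(hours=offset_hours))
    return to_utc(naive_server_time).astimezone(reader_tz)


if __name__ == "__main__":
    posted = datetime(2004, 2, 15, 20, 0)   # 20:00 server time
    print(to_utc(posted))                   # same instant in GMT
    print(to_offset(posted, 1))             # same instant at GMT+1
```

Because the shift is the same for every record, the shape of the hourly load curve is unchanged; only the labels on the hour axis move.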

        I would love to add graphics, but that is a bit beyond my capabilities at this point. I had no idea how useful/wanted this type of thing would be and that is why I threw it out here for discussion. If someone would like to collaborate, I am more than happy to have them on board.

        I am not sure what benefit knowing things like the standard deviation would add, but I am willing to have a go at it if you think it would help. I was expecting to hear people make requests like "Can you create a report where I can get all the data for a specific day, i.e. my birthday?".

        Cheers - L~R
          Indeed, I was of course looking to see a statistic on when the Monks are most active (are we living/working at night or during the day?) in their own timezone.

          Silly me, I overlooked the fact that server stats deal with load on the server. Mea culpa.

          For "easy" graphics have a look at Apache::GD::Graph. I used it with great success and I'm no graphic artist either.

          All these statistical numbers help you to better understand the data as a whole. You can have different data with the same average value but a different standard deviation and this will tell you something about the spread of the values through the data. So -IMHO- it does add something.
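A tiny Python sketch of that point, using made-up daily post counts (not taken from the actual data): both series have the same average, but the standard deviation shows that one is steady and the other swings widely. The `summarize` helper is a name invented for the example.

```python
from statistics import mean, median, pstdev


def summarize(counts):
    """Return mean, median and population standard deviation of a series."""
    return {
        "mean": mean(counts),
        "median": median(counts),
        "stdev": pstdev(counts),
    }


# Made-up numbers purely for illustration.
steady = [30, 30, 30, 30, 30]
bursty = [10, 50, 20, 40, 30]

if __name__ == "__main__":
    print("steady:", summarize(steady))
    print("bursty:", summarize(bursty))
```

`pstdev` treats the series as the whole population; `statistics.stdev` would be the choice if the days were regarded as a sample of a longer period.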

          CountZero

          "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

    Re: More PM Stats
    by tachyon (Chancellor) on Feb 15, 2004 at 21:00 UTC

      Eyeballing your root node figures, they are double the reality, i.e. you find an average of 50 new root nodes during the week and 25 on the weekend. Actually root nodes (not including new user root nodes) are more like 25 during the week and 12 on the weekend. We only get that many new nodes if you are including the new user nodes.

      cheers

      tachyon

        tachyon,
        I am throwing out "user nodes" so that isn't skewing the numbers. The code can be found here. I can tell you that I initially received an old dump of the database, but that I have subsequently rebuilt and regenerated with little to no change in the daily averages. My back-of-the-envelope math shows that this covers about 4.2 years' worth of data. I freely admit I may have made a mistake, but I don't see it.

        Cheers - L~R

        Update: After removing "system" nodes and adding explicit code to drop "user" nodes, the daily average is down to 35 for a weekday and 16 for a weekend.
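For anyone who wants to check figures like these independently, here is a hedged Python sketch of one way to compute a weekday versus weekend daily average from a list of root node dates. It is not the SQL used for the reports; the `daily_averages` helper and the sample dates are invented for the example.

```python
from collections import defaultdict
from datetime import date


def daily_averages(root_node_dates):
    """Average number of root nodes per day, split weekday vs weekend.

    root_node_dates: iterable of datetime.date, one entry per root node.
    Days with no root nodes at all do not appear in the input and are
    therefore not counted, which can push the averages up slightly.
    """
    per_day = defaultdict(int)
    for d in root_node_dates:
        per_day[d] += 1

    buckets = {"weekday": [], "weekend": []}
    for d, count in per_day.items():
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        key = "weekend" if d.weekday() >= 5 else "weekday"
        buckets[key].append(count)

    return {k: (sum(v) / len(v) if v else 0.0) for k, v in buckets.items()}


if __name__ == "__main__":
    # Made-up sample: Fri 13th x3, Sat 14th x1, Sun 15th x2, Mon 16th x1.
    sample = ([date(2004, 2, 13)] * 3 + [date(2004, 2, 14)]
              + [date(2004, 2, 15)] * 2 + [date(2004, 2, 16)])
    print(daily_averages(sample))
```

Whether "system" and "user" nodes are filtered out before this step makes a large difference, as the numbers in this thread show.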

    Re: More PM Stats
    by valdez (Monsignor) on Feb 15, 2004 at 20:29 UTC

      Well done :) I'd like to see stats by node type, especially for user node type. I'd like also to have that XML file, where can we get it? Thank you very much!

      Ciao, Valerio

        valdez,
        I'd like to see stats by node type

        If there was interest, I was going to do reports by node type (and will when I have time). Ultimately what I wanted to do was generate graphs using different colors for each of the different node types and what not. Since this is a "spare time" project and I already have another unfinished PM project, I will likely be slow in adding graphics.

        especially for user node type

        I am not sure what you mean by "user node" type unless you are talking about monks' home nodes. I actually threw that data away when I built the database, but it should not be hard to add it back in again.

        I'd like also to have that XML file, where can we get it?

        I got it from one of the gods. I would direct your request there.

        Cheers - L~R

    Re: More PM Stats
    by Anonymous Monk on Feb 16, 2004 at 07:21 UTC
        That is amazing! That site looks like a Swiss Army knife!

        I worked for some years in ratings (statistics broken down for commercial purposes).

        People were mostly interested in graphs that added some sort of 'adjective' to the facts presented. Those 'adjectives' mainly consisted of cross-referencing one thing against another.

        It takes less maths than imagination.

        So, given all the available numbers, it should be possible to graph any cross-reference of the data.

        I imagine a table where one can choose any row and column of data, and narrow down the data shown even further if required.

        In some cases GMT would be required (e.g. rating the average success of authors' nodes), and in others local time might matter (e.g. users' average hours of main participation, or hours of participation by country and by month).

    Node Type: monkdiscuss [id://329141]
    Approved by kutsu
    Front-paged by kutsu