
Analysing five years of blogging

by BioGeek (Hermit)
on Nov 02, 2004 at 15:07 UTC (#404621=perlnews)

Hey Monks, the weblog of Tom Coates has now existed for five years, and to celebrate he has made available a full data dump of all his writing and the links hidden within it, so that others can analyze, visualize, or datamine it.

So far the response hasn't been overwhelming, so I thought I should spread the meme here. I'm sure that with the combined powers available in this community, we can come up with some clever ways to rip the data apart and find some hidden gems in it.

A reader of the site suggested using Perl to graph the readability score of the blog over time, but Monks who know all those Lingua modules better than I do must be able to come up with something funkier.

Let's give Tom a nice birthday present for his blog, and show Tom (and the rest of the blogosphere) what Perl can do.

Re: Analysing five years of blogging
by biosysadmin (Deacon) on Nov 03, 2004 at 01:33 UTC
    Being a BioGeek, you must surely know about Markov chains. Perhaps you could use that volume of writing to generate word-to-word transition statistics and see how well the resulting model performs at writing articles similar to Tom's.
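A minimal sketch of that idea, written in Python for brevity (the same structure maps directly onto a Perl hash of arrays). It builds a first-order chain, one word of context, from whitespace-separated text. The function names `build_chain` and `generate` and the sample sentence are illustrative assumptions, not anything taken from Tom's data dump.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=None):
    """Walk the chain from `start`, choosing each next word at random,
    weighted by how often it followed the current word in the source."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Hypothetical sample text standing in for the blog archive.
sample = "the cat sat on the mat and the cat ran"
chain = build_chain(sample)
print(generate(chain, "the", 8, seed=1))
```

Because repeated followers are stored as duplicates in each list, a uniform random choice already reflects the observed transition frequencies. A longer context (pairs of words as keys) usually produces more convincing text at the cost of needing more data.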

    Other options include simple word frequency counts, and possibly analysis of the domains to which he links (just a simple frequency count based on the second-level domain name might be interesting).
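Both counts fit in a few lines. The sketch below is in Python, and the helper names `word_counts` and `domain_counts` are made up for illustration. It assumes the links are available as a plain list of URLs, and it approximates the second-level domain by keeping the last two labels of the host name, which is wrong for suffixes such as `.co.uk`.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def word_counts(text):
    """Count lower-cased words (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def domain_counts(urls):
    """Count links per domain, using the last two host labels.
    Assumption: this naive rule is good enough; it mishandles
    multi-part suffixes like .co.uk."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        labels = host.split(".")
        if len(labels) >= 2:
            counts[".".join(labels[-2:])] += 1
    return counts

# Hypothetical inputs standing in for the data dump.
print(word_counts("The cat and the hat").most_common(1))
print(domain_counts([
    "http://www.example.com/a",
    "http://blog.example.com/b",
    "http://example.org/",
]).most_common())
```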

    Unfortunately I'm working on my thesis, so I don't have spare time to play with more data. Best of luck with the project. :)

      I went to Wikipedia to find out about Markov chains. Being neither a mathematician nor a bioinformaticist, I could not see any obvious application of Markov chains to readability. It looked to me as though they are more useful for modeling than for analysis.

      I'm not asking you to write up an example of how they could be used in this context, but just to elaborate (for us non-scholars).

      I did find the Gunning Fog Index, which looks as though it might be a reasonable method of establishing a readability index. That article mentions the Flesch algorithms for scoring the reading level of material. Are there any Perl Monks with a background in educational psychology?
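For reference, the Fog index is 0.4 × (average words per sentence + percentage of words with three or more syllables). A rough Python sketch follows. The syllable counter is a crude vowel-group heuristic, and the usual exclusions (proper nouns, compound words, common suffixes) are skipped, so treat the scores as approximate. The function names are illustrative.

```python
import re

def count_syllables(word):
    """Rough estimate: count runs of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Simplified Gunning Fog index; returns 0.0 for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words: three or more estimated syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(gunning_fog("The cat sat. The dog ran."))
```

Scoring each post separately and plotting the result against its date would give the readability-over-time graph suggested in the original post.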

      P.S. This is too interesting a topic to lose in a flame war.


      These latter index methods are available in the CPAN module Lingua::EN::Fathom. I haven't had a chance to read the whole POD yet, but it mentions the Fog, Flesch and Kincaid indices.
Re: Analysing five years of blogging
by TedPride (Priest) on Nov 02, 2004 at 15:23 UTC
    Blogs are primarily a form of self-expression, and the only people who actually read them are those who know or want to know the author. I suppose you could dig up the longest string of words that's repeated at least once, or write something to rank associations between a specific word and all other words (preferably between a rare specific word and all other words), or something else along those lines, but I hope he isn't seriously expecting a lot of interest in his blog.
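The longest-repeated-phrase idea can be sketched briefly in Python. This version grows the phrase length one word at a time and stops at the first length with no repeat, which is valid because any repeated phrase also contains a shorter repeated phrase. It splits on whitespace only, lets the two occurrences overlap, and is quadratic in the worst case, so it is a starting point rather than something tuned for five years of posts. The function name is an assumption.

```python
def longest_repeated_phrase(text):
    """Return the longest run of words that occurs at least twice
    (occurrences may overlap); empty string if nothing repeats."""
    words = text.lower().split()
    best = []
    n = 1
    while n < len(words):
        seen = set()
        found = None
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if gram in seen:   # second sighting of this n-word phrase
                found = gram
                break
            seen.add(gram)
        if found is None:      # no repeat at this length, so none longer
            break
        best = list(found)
        n += 1
    return " ".join(best)

print(longest_repeated_phrase("to be or not to be that is the question"))
```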
Re: Analysing five years of blogging
by Anonymous Monk on Nov 02, 2004 at 18:26 UTC
    Try asking on