http://www.perlmonks.org?node_id=404808


in reply to Analysing five years of blogging

Being a BioGeek, you must surely know about Markov Chains. Perhaps you could use that volume of writing to compute word-to-word transition statistics, then see how well the resulting chain does at writing articles similar to Tom's.
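
For anyone who wants to try the idea, here is a minimal sketch in Python of a first-order word-level Markov chain. The function names (`build_transitions`, `generate`) and the whitespace tokenisation are my own assumptions for illustration, not something taken from the original analysis; a real attempt would probably want better tokenising and a longer context than one word.

```python
import random
from collections import defaultdict

def build_transitions(text):
    """Map each word to the list of words observed directly after it.

    Repeated followers are kept, so picking uniformly from the list
    reproduces the observed transition frequencies.
    """
    words = text.split()  # assumption: plain whitespace tokenisation
    transitions = defaultdict(list)
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return transitions

def generate(transitions, start, length=20, seed=None):
    """Walk the chain from `start`, emitting at most `length` words."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = transitions.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)
```

Feed `build_transitions` the concatenated posts, then call `generate` with any word that occurs in them. The output is usually locally plausible and globally nonsensical, which is about what a one-word memory can be expected to give.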

Other options include simple word frequency counts, and possibly an analysis of the domains he links to (even a simple frequency count based on the second-level domain name might be interesting).
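
Both counts are only a few lines each. The sketch below is again Python, with hypothetical helper names (`word_counts`, `domain_counts`). It assumes the links have already been extracted from the posts as absolute URLs, and it naively takes the last two labels of the host name, which gives the wrong answer for suffixes such as `.co.uk`.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def word_counts(text):
    """Count lower-cased words; apostrophes are kept so "don't" stays whole."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def domain_counts(urls):
    """Count links per second-level domain, e.g. www.perlmonks.org -> perlmonks.org.

    Assumption: the last two host labels identify the site. That is not
    true for multi-part public suffixes such as .co.uk.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:  # skip strings that have no host part
            counts[".".join(host.split(".")[-2:])] += 1
    return counts
```

`word_counts(text).most_common(20)` and `domain_counts(urls).most_common(20)` would then give the top of each list.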

Unfortunately I'm working on my thesis, so I don't have spare time to play with more data. Best of luck with the project. :)

Re^2: Analysing five years of blogging
by perlcapt (Pilgrim) on Nov 03, 2004 at 13:12 UTC
    I went to Wikipedia to find out about Markov Chains. Being neither a mathematician nor a bioinformaticist, I could not see any obvious application of Markov Chains to readability. It looked to me as though they are more useful for modeling than for analysis.

    I'm not asking you to write up an example of how they could be used in this context, but just to elaborate (for us non-scholars).

    I did find the Gunning Fog Index, which looks as though it might be a reasonable method of establishing a readability index. That article mentions Flesch algorithms for scoring the reading level of material. Are there any Perl Monks with a background in educational psychology?
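
    For reference, the Fog index is 0.4 * (average words per sentence + percentage of words with three or more syllables). Below is a rough Python sketch of that formula. The vowel-run syllable counter and the function names are assumptions of mine, and the sketch skips the usual refinements (excluding proper nouns, compound words and common suffixes), so treat the scores as approximate.

```python
import re

def count_syllables(word):
    """Crude estimate: one syllable per run of vowels (heuristic, not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Approximate Gunning Fog index of `text`; returns 0.0 for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those estimated at three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

    Running `gunning_fog` over each post and plotting the score by date would show whether the writing has become more or less dense over the five years.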

    P.S. This is too interesting a topic to lose in a flame war.

    Update:

    These latter index methods are available in the CPAN module Lingua::EN::Fathom. I haven't had a chance to read the whole POD yet, but it mentions the Fog, Flesch and Kincaid indices.
    perlcapt
    -ben