Re^2: Dirtiest Data

by xdg (Monsignor)
in reply to Re: Dirtiest Data
in thread Dirtiest Data

The one tool I tend to use most often, as a first resort in the widest range of tasks, simply prints out a byte-value histogram, either as a 256-line list or as a nice 8-column x 32-row table, with an optional summary that counts up character categories like "printable ascii", "non-printable ascii", "8-bit", "iso-printable 8-bit", "digits", "whitespace", etc.

Wow. Too bad I can only ++ once.

Any chance there's a CPAN module to do that? If not, you should definitely write it! Moreover, this sounds like a great talk to give at a seminar or conference. Or a great article for The Perl Review.


Re^3: Dirtiest Data
by graff (Chancellor) on Jun 23, 2006 at 20:51 UTC
    Any chance there's a CPAN module to do that? If not, you should definitely write it!

    um... well, <confession> the tool I referred to there is one that I actually wrote in C (so long ago, it was before I learned Perl) </confession>. "It ain't broke", so I've had no need to rewrite it. I sincerely apologize if it was inappropriate to discuss it here.

    Obviously a good Perl version to do the same thing would be a lot fewer lines of code than my C version, and most likely would not be significantly slower. But for the time being, I'm sorry that I must "leave it as an exercise for the reader..."

    (Update: I'm happy to share the C code with anyone who might want to try it out -- you can download it here: -- again, please forgive me for straying off-topic to non-Perl tools, and accept it in the spirit of PerlMonks, as an opportunity to adapt and enhance it in Perl.)
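
For anyone who wants to take up that exercise, here is a minimal sketch of how a Perl version might look. It is not a port of the C tool, whose source is not shown in this thread. The sub names (`byte_histogram`, `summarize`, `format_table`), the category names, and the exact category boundaries are all assumptions made for illustration. The "iso-printable 8-bit" category is left out because its definition is not given here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each byte value (0..255) occurs in a string of raw bytes.
sub byte_histogram {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $data;
    return \@count;
}

# Roll the 256 counts up into coarse categories.
# The category names and boundaries here are illustrative assumptions.
sub summarize {
    my ($count) = @_;
    my %sum = map { $_ => 0 }
        qw(printable_ascii nonprintable_ascii eight_bit digits whitespace);
    for my $byte (0 .. 255) {
        my $n = $count->[$byte] or next;
        if    ($byte >= 0x80)                 { $sum{eight_bit}          += $n }
        elsif ($byte >= 0x20 && $byte < 0x7f) { $sum{printable_ascii}    += $n }
        else                                  { $sum{nonprintable_ascii} += $n }
        $sum{digits}     += $n if $byte >= 0x30 && $byte <= 0x39;
        $sum{whitespace} += $n
            if $byte == 0x20 || ($byte >= 0x09 && $byte <= 0x0d);
    }
    return \%sum;
}

# Format the counts as an 8-column x 32-row table of "hex:count" cells.
# Column c, row r holds byte value r + 32 * c.
sub format_table {
    my ($count) = @_;
    my $out = '';
    for my $row (0 .. 31) {
        my @cells = map { sprintf '%02x:%8d', $_, $count->[$_] }
                    map { $row + 32 * $_ } 0 .. 7;
        $out .= join('  ', @cells) . "\n";
    }
    return $out;
}

# Report on each file named on the command line, read in binary mode.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
    my $data = do { local $/; <$fh> };
    close $fh;
    my $count = byte_histogram(defined $data ? $data : '');
    print "== $file ==\n", format_table($count);
    my $sum = summarize($count);
    printf "%-20s %d\n", $_, $sum->{$_} for sort keys %$sum;
}
```

Saved under a hypothetical name such as `bytehist.pl`, it would be run as `perl bytehist.pl somefile.dat`. The sketch slurps each file into memory to keep the code short. For very large inputs, reading fixed-size chunks with `read` and adding each chunk's bytes to the same counts would be a better fit.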

