Programmatically reparagraphinating text
by hacker (Priest)
on Feb 16, 2007 at 01:24 UTC
hacker has asked for the
wisdom of the Perl Monks concerning the following question:
Boy is that a mouthful...
What I'm trying to do is take a series of old "e-zines" (phrack, t@p and such; I have 288 of them, for a total of 9,899 issues), which are stored in plain old 7-bit ASCII text (think BBS era), and reflow them so I can then wrap some XML around the elements and convert them to HTML (yes, XML... then HTML).
Here's the catch.. unless I go through them manually after they've been reformatted, with my human eyes, I'll never know whether sections that should NOT have been touched were touched.
For example, some of them have ASCII diagrams of pinouts, ASCII representations of block diagrams and other things, which I'd like to keep intact.. but the paragraphs of text before and after them should be reflowed. Here's an example:
And one more...
So some rudimentary rules should be set... lines that end in say... \w\s+\w$\w, are probably the end of sentences.. and not part of a diagram.
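To make the idea of "rudimentary rules" concrete, here is a rough sketch of what I have in mind, not working production code. It splits the text into blank-line-separated blocks, leaves any block alone if one of its lines looks like a diagram, and reflows the rest with Text::Wrap. The function names `looks_like_diagram` and `reflow` and the three heuristics are made up for illustration; real e-zines would certainly need more (and better) rules.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical heuristic (an assumption, not a tested rule set):
# guess whether a single line belongs to an ASCII diagram.
sub looks_like_diagram {
    my ($line) = @_;
    return 1 if $line =~ /[|+_=\/\\#*-]{3,}/;   # run of "drawing" characters
    return 1 if $line =~ /\S\s{3,}\S/;          # wide gap inside the line (columns)
    return 1 if $line =~ /^\s*[|+]/;            # line starts with a box edge
    return 0;
}

# Reflow only the blocks in which no line looks like a diagram.
sub reflow {
    my ($text, $columns) = @_;
    local $Text::Wrap::columns = $columns || 72;

    my @out;
    for my $block (split /\n\s*\n/, $text) {
        next unless $block =~ /\S/;             # skip empty blocks
        $block =~ s/\s+\z//;                    # drop trailing whitespace only
        my @lines = split /\n/, $block;

        if (grep { looks_like_diagram($_) } @lines) {
            push @out, $block;                  # diagram: keep byte-for-byte
        }
        else {
            my $para = join ' ', @lines;        # prose: join, squeeze, rewrap
            $para =~ s/\s+/ /g;
            $para =~ s/^ | $//g;
            push @out, wrap('', '', $para);
        }
    }
    return join "\n\n", @out;
}
```

The sketch errs on the side of caution: a prose paragraph containing something like `---` is treated as a diagram and left untouched, which seems like the safer failure mode when nobody is going to proofread 9,899 issues by eye.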
I'm not really asking for the actual code, and I know this'll be a huge pile of regexes and unit tests, but what I AM asking for, is a list of the proper modules that I can use at my disposal to do this. Things like Text::Wrap, XML::LibXML, Text::Autoformat, and others. Thanks in advance, my fellow brethren...