|Just another Perl shrine|
"Biting" text files into managable sectionsby hacker (Priest)
|on Nov 11, 2002 at 20:41 UTC||Need Help??|
hacker has asked for the
wisdom of the Perl Monks concerning the following question:
I'm working on some scripts that can digest Project Gutenberg etext files into managable sections, which are then converted into Plucker format for use/reading on a Palm handheld device. What I'm trying to figure out, is how to "bite" the etext into sections I can manage in pieces, and I'm running into some trouble.
Basically, most of the Project Gutenberg etexts have this format:
Some of this is going to rely on an array of possible opening Gutenberg strings and delimiters, based on the trasnlator and year of translation, but what I'd like to do is stuff those sections into an array I can manipulate, then reassemble it back into a "sectional" document, so it ends up like this (pseudocode):
With the relevant sections in a series of arrays I can manipulate, I can then reassemble it into a document which can be turned into a clickable XHTML 1.0-compliant document, where the copyright, document body, etc. is all clickable from the "contents" page of the document when viewed on the Palm device. Right now, it would be a huge flat text file, where the first 10 or so pages are the copyright info, which doesn't scale well on a 160x160 screen. Having each "chapter" of the etext, including the Gutenberg Copyright in their own "page", clickable (tappable, via stylus/finger) is more preferred.
I'm also rewrapping the text, using Text::Autoformat to 55 columns wide, with full-justify, which looks very good, compared to the "chainsaw" effect of the original etext packed in natively. The code for that is very basic, and looks like:
The part I'm confused about, is how do I walk/seek through the text file/stream (assuming the file body itself is in an object, via Net::FTP or HTTP::Request), and bite those sections of into arrays? I'm able to pull the whole file into an array, but not specific delimited sections.
The next phase, if I can get this sorted out, is to try to build a table of heuristics where I can detect actual chapter breaks in the texts, and separate those into their own pages, but being able to separate header, etext, footer, into their own bits is important for this first pass.
TIA for any help and suggestions.