
"Biting" text files into manageable sections

by hacker (Priest)
on Nov 11, 2002 at 20:41 UTC (#212079=perlquestion)

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on some scripts that digest Project Gutenberg etext files into manageable sections, which are then converted into Plucker format for reading on a Palm handheld device. What I'm trying to figure out is how to "bite" the etext into sections I can manage in pieces, and I'm running into some trouble.

Basically, most of the Project Gutenberg etexts have this format:

    The Project Gutenberg Etext of <etext title> by <etext author>

    Copyright (c) <copyright holder>

    **This is a COPYRIGHTED Project Gutenberg Etext, Details Below**

    <Project Gutenberg Header>
    ...   # Gutenberg Copyright information

    <known range of delimiting text>

    <actual body of etext>
    ...   # etext body/document/book is here

    End of the Project Gutenberg Etext of <etext title> by <etext author>

    Copyright (c) <copyright holder>

Some of this is going to rely on an array of possible opening Gutenberg strings and delimiters, based on the translator and year of translation, but what I'd like to do is stuff those sections into arrays I can manipulate, then reassemble them into a "sectional" document, so it ends up like this (pseudocode):

    my $pg_author  = "<etext author>";
    my $pg_title   = "<etext title>";
    my @pg_header  = "<Project Gutenberg Header>";
    my @etext_body = "<actual body of etext>";
    my @pg_footer  = "<Project Gutenberg Footer>";
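One way the split described above could be sketched, assuming the whole etext has already been slurped into a single string. The two delimiter patterns and the name `split_etext` are hypothetical placeholders, not something taken from any particular etext; the real patterns would come from the table of known Gutenberg openers and delimiters.

```perl
use strict;
use warnings;

# Hypothetical delimiters: real etexts vary by year and edition, so
# these two patterns are placeholders to be swapped for the real ones.
my $BODY_START = qr/^\*END\*.*\*END\*[ \t]*$/m;                   # last line of the header
my $BODY_END   = qr/^End of (?:the )?Project Gutenberg Etext/mi;  # first line of the footer

# Split one slurped etext into (header, body, footer) strings.
# Returns an empty list if either delimiter is missing.
sub split_etext {
    my ($text) = @_;
    return unless $text =~ $BODY_START;
    my $body_from = $+[0];                 # offset just past the start delimiter
    pos($text) = $body_from;               # resume searching after the header
    return unless $text =~ /$BODY_END/g;
    my $body_to = $-[0];                   # offset where the footer begins
    return (
        substr($text, 0, $body_from),
        substr($text, $body_from, $body_to - $body_from),
        substr($text, $body_to),
    );
}

# Usage: my ($header, $body, $footer) = split_etext($data);
```

The three pieces concatenate back to the original text, so nothing is lost if a delimiter guess turns out to be slightly off; each piece can then be split on newlines into its own array.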

With the relevant sections in a series of arrays I can manipulate, I can then reassemble them into a clickable XHTML 1.0-compliant document, where the copyright, document body, etc. are all reachable from the "contents" page of the document when viewed on the Palm device. Right now, it would be a huge flat text file where the first 10 or so pages are the copyright info, which doesn't work well on a 160x160 screen. Having each "chapter" of the etext, including the Gutenberg copyright, on its own "page", clickable (tappable, via stylus/finger), is preferable.
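A minimal sketch of that reassembly step, assuming the sections are already available as heading/text pairs. The names `build_xhtml` and `escape_xhtml`, the `secN` anchor ids, and the choice of a single page with in-page links (rather than one file per section) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Escape the characters that are special in XHTML text and attributes.
sub escape_xhtml {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Build one XHTML page from a title and a list of [heading, text] pairs:
# a contents list of links first, then each section under its own anchor.
sub build_xhtml {
    my ($title, @sections) = @_;
    my $t = escape_xhtml($title);
    my $out = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
            . qq{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n}
            . qq{  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n}
            . qq{<html xmlns="http://www.w3.org/1999/xhtml">\n}
            . qq{<head><title>$t</title></head>\n<body>\n<h1>$t</h1>\n<ul>\n};
    for my $i (0 .. $#sections) {
        my $heading = escape_xhtml($sections[$i][0]);
        $out .= qq{  <li><a href="#sec$i">$heading</a></li>\n};
    }
    $out .= "</ul>\n";
    for my $i (0 .. $#sections) {
        my ($heading, $text) = map { escape_xhtml($_) } @{ $sections[$i] };
        $out .= qq{<h2 id="sec$i">$heading</h2>\n<pre>$text</pre>\n};
    }
    return $out . "</body>\n</html>\n";
}

# Usage:
#   print build_xhtml($pg_title,
#       [ 'Copyright', join('', @pg_header)  ],
#       [ 'Text',      join('', @etext_body) ]);
```

Wrapping each section in `<pre>` keeps the rewrapped line breaks intact; swapping it for paragraphs would be a separate formatting decision.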

I'm also rewrapping the text to 55 columns wide with full justification, using Text::Autoformat, which looks very good compared to the "chainsaw" effect of the original etext packed in natively. The code for that is very basic, and looks like:

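    use strict;
    use warnings;
    use CGI qw(:standard);
    use Text::Autoformat;

    my $file = "pg_exext.txt";
    open(my $pgw, '<', $file) or die "Cannot open $file: $!";
    local $/ = undef;            # slurp mode: read the whole file at once
    my $data = <$pgw>;
    close($pgw);

    my $formatted = autoformat $data,
        { justify => 'full', left => 4, right => 55, all => 1 };
    print pre($formatted);

    # No "my" here: redeclaring $data would match against an empty variable
    $data =~ m/End of Project Gutenberg Etext (.*)/s;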

The part I'm confused about is how to walk/seek through the text file/stream (assuming the file body itself is in an object, via Net::FTP or HTTP::Request) and bite those sections off into arrays. I'm able to pull the whole file into an array, but not specific delimited sections.
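A sketch of one way to walk the text line by line, assuming the content is already in memory as a string (for example, a downloaded response body). The sub name `bite_sections` and both delimiter patterns passed to it are hypothetical; the idea is a small state machine that moves from header to body to footer as the delimiters go by.

```perl
use strict;
use warnings;

# Walk the text line by line and push each line into the array for the
# section it belongs to. The caller supplies both delimiter patterns.
sub bite_sections {
    my ($text, $start_re, $end_re) = @_;
    my %section = (header => [], body => [], footer => []);
    my $state = 'header';

    # Opening a filehandle on a scalar reference lets in-memory content
    # be read exactly like a file on disk.
    open(my $fh, '<', \$text) or die "Cannot open in-memory text: $!";
    local $/ = "\n";             # read by line even if the caller set slurp mode
    while (my $line = <$fh>) {
        # The footer starts on the line that matches $end_re.
        $state = 'footer' if $state eq 'body' && $line =~ $end_re;
        push @{ $section{$state} }, $line;
        # The body starts on the line after the one matching $start_re.
        $state = 'body' if $state eq 'header' && $line =~ $start_re;
    }
    close($fh);
    return \%section;
}

# Usage (patterns are placeholders):
#   my $s = bite_sections($data, qr/^\*END\*.*\*END\*/,
#                         qr/^End of (?:the )?Project Gutenberg Etext/i);
#   my @pg_header  = @{ $s->{header} };
#   my @etext_body = @{ $s->{body} };
#   my @pg_footer  = @{ $s->{footer} };
```

The same loop works unchanged on a real file by replacing the in-memory `open` with an ordinary one.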

The next phase, if I can get this sorted out, is to try to build a table of heuristics where I can detect actual chapter breaks in the texts, and separate those into their own pages, but being able to separate the header, etext, and footer into their own bits is important for this first pass.
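A rough sketch of what such a heuristics table might look like. The three patterns, the "heading must follow a blank line" rule, and the name `split_chapters` are illustrative assumptions only; real etexts will certainly need more entries.

```perl
use strict;
use warnings;

# Hypothetical heuristics table: each entry pairs a label with a pattern
# that a chapter heading line might match.
my @CHAPTER_HEURISTICS = (
    [ 'chapter + number' => qr/^\s*CHAPTER\s+(?:\d+|[IVXLC]+)\b/i ],
    [ 'book/part'        => qr/^\s*(?:BOOK|PART)\s+(?:\d+|[IVXLC]+)\b/i ],
    [ 'roman numeral'    => qr/^\s*[IVXLC]+\.\s*$/ ],
);

# Split a list of body lines into chapters. Returns a list of
# { title => ..., lines => [...] } hashes; text before the first
# detected heading goes into a chapter titled 'Front matter'.
sub split_chapters {
    my @lines = @_;
    my @chapters = ({ title => 'Front matter', lines => [] });
    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        # Only treat a line as a heading if the previous line is blank,
        # which filters out "chapter 3" mentioned in running prose.
        my $after_blank = $i == 0 || $lines[$i - 1] !~ /\S/;
        if ($after_blank && grep { $line =~ $_->[1] } @CHAPTER_HEURISTICS) {
            (my $title = $line) =~ s/^\s+|\s+$//g;
            push @chapters, { title => $title, lines => [] };
            next;
        }
        push @{ $chapters[-1]{lines} }, $line;
    }
    shift @chapters unless @{ $chapters[0]{lines} };   # drop empty front matter
    return @chapters;
}

# Usage: my @chapters = split_chapters(@etext_body);
```

Keeping the patterns in a labelled table makes it easy to log which heuristic fired for a given text and to add or reorder rules later.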

TIA for any help and suggestions.

Replies are listed 'Best First'.
Re: "Biting" text files into manageable sections
by dingus (Friar) on Nov 12, 2002 at 08:00 UTC
    The part I'm confused about, is how do I walk/seek through the text file/stream (assuming the file body itself is in an object...

    Why not save the downloaded text file locally first and then munge it as a separate action? This would seem to be easier, as you can easily delimit lines, etc. It also has the benefit of flexibility with regard to time and net access, because you can download it any time before you munge it.
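A sketch of that download-once, munge-later approach. It uses the core HTTP::Tiny module rather than the Net::FTP or HTTP::Request mentioned in the question, purely to keep the example self-contained; the names `fetch_once` and `read_lines` are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Download $url to $file only if $file is not already on disk, so the
# munging step can be rerun offline. Returns the local file name.
sub fetch_once {
    my ($url, $file) = @_;
    return $file if -s $file;                      # already cached locally
    my $res = HTTP::Tiny->new->mirror($url, $file);
    die "Download failed: $res->{status} $res->{reason}\n" unless $res->{success};
    return $file;
}

# Read a local file into an array, one element per line.
sub read_lines {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    local $/ = "\n";
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# Usage: my @lines = read_lines(fetch_once($url, 'pg_exext.txt'));
```

Once the text is a local file, the section splitting can be developed and rerun without touching the network at all.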


    Enter any 47-digit prime number to continue.

Node Type: perlquestion [id://212079]
Approved by Mr. Muskrat