PerlMonks  

Re: Random email parsing

by ZZamboni (Curate)
on Jun 08, 2001 at 18:54 UTC


in reply to Random email parsing

Parsing "free form" documents is always tricky, and you may end up having to hard-code some special cases. The trick is to find regularities. As a first step, here's my take:

The news items themselves include the titles, so I would say you can just skip everything up to the line of equal signs. Then, you can read each news item as a paragraph, and consider the first line to be the title.

The following snippet of code stores the news items in %db, with the title as the key and the "body" as the value (they could just as well be stored in an array, if you want to preserve the order).

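    use strict;
    my $f=0;
    my %db;
    $/="";    # paragraph mode: read one blank-line-separated block at a time
    while (<>) {
      # Skip everything up to and including the line of equal signs.
      $f=1,next if /^==========/;
      next unless $f;
      # First line is the title, the rest is the body.
      my @item=split /\n/, $_, 2;
      $db{$item[0]}=$item[1];
    }
    foreach (keys %db) {
      print "Title: $_\nBody: $db{$_}";
    }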
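
If the order of the items matters, the array variant mentioned above could look roughly like the sketch below. The sub name read_items and the [title, body] pair layout are my own assumptions for illustration, not something from the original question; it also assumes the same input layout as the snippet above (a paragraph starting with ten or more equal signs, then one item per paragraph).

```perl
use strict;
use warnings;

# Hypothetical helper: reads a message from a filehandle and returns the
# news items, in their original order, as [title, body] pairs.
sub read_items {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: each read returns one paragraph
    my $seen_separator = 0;
    my @items;
    while (my $para = <$fh>) {
        # Skip everything up to and including the line of equal signs.
        if (!$seen_separator) {
            $seen_separator = 1 if $para =~ /^==========/;
            next;
        }
        chomp $para;    # in paragraph mode this strips all trailing newlines
        # First line is the title, the rest (if any) is the body.
        my ($title, $body) = split /\n/, $para, 2;
        push @items, [ $title, defined $body ? $body : '' ];
    }
    return @items;
}
```

Calling read_items(\*STDIN) and looping over the result gives each item as $item->[0] (title) and $item->[1] (body). Unlike the hash version, two items with the same title do not overwrite each other.
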
A further step would be to parse the body. There again, the trick is to find any regularities. In the example data you gave, there are 3 lines of "headers" followed by the text. If this is always the case, something like this could do the trick:
    my @body=split /\n/, $body, 4;
And you would end up with the three headers in @body[0,1,2] and the text in $body[3]. If the "three header lines" rule does not always apply, you could use some other heuristic. For example, are header lines always 40 characters or shorter? Then you could use something like this (untested):
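    my @lines=split /\n/, $body;
    my @hdr;
    my $l;
    # Collect the leading lines of 40 characters or fewer as headers.
    while (defined($l=shift(@lines))) {
      last if length($l)>40;
      push @hdr, $l;
    }
    # Put the first long line (if there was one) back in front of the rest.
    $body=join("\n", defined $l ? $l : (), @lines);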
This would leave all the leading lines of 40 characters or fewer in @hdr, and the rest re-joined with newlines in $body.
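
The same heuristic could be wrapped in a small sub so the cutoff is easy to tune. This is only a sketch: the name split_headers and the optional $max_len parameter are made up for illustration, and it misfires if the text itself starts with a short line.

```perl
use strict;
use warnings;

# Hypothetical helper: splits a body into its leading "header" lines and
# the remaining text. A line counts as a header while it is at most
# $max_len characters long (40 by default, an assumed cutoff).
sub split_headers {
    my ($body, $max_len) = @_;
    $max_len = 40 unless defined $max_len;
    my @lines = split /\n/, $body;
    my @hdr;
    # Move lines into @hdr until the first line longer than the cutoff.
    while (@lines && length($lines[0]) <= $max_len) {
        push @hdr, shift @lines;
    }
    # Returns a reference to the header lines and the re-joined text.
    return (\@hdr, join("\n", @lines));
}
```

It would be used as my ($hdr, $text) = split_headers($body); and the header lines are then in @$hdr. Checking $lines[0] before shifting avoids having to put the first long line back afterwards.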

--ZZamboni

