Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Converting a fixed-width report to an Excel-friendly format

by loris (Hermit)
on Oct 25, 2006 at 15:11 UTC ( #580596=perlquestion: print w/replies, xml ) Need Help??
loris has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have been given what seems to me a slightly hairy task. I have a report with fixed-width fields similar to this:

Line containing uninteresting information Another line containing uninteresting information More uninteresting information Transaction Employee --- Date Name Amount ------------ ------------------ 111/222/ABC 23-JAN-06 Baker, Abel 111.11 24-JAN-06 Baker, Abel 222.22 25-JAN-06 Baker, Abel 333.33 26-JAN-06 Baker, Abel 444.44 27-JAN-06 Baker, Abel 555.55 Line containing uninteresting information Another line containing uninteresting information Transaction Employee --- Date Name Amount ------------ ------------------ 333/444/DEF 23-JAN-06 Dog, 111.11 Charlie 24-JAN-06 Dog, 222.22 Charlie 25-JAN-06 Dog, 333.33 Charlie

Note the spurious ^L characters, which appear as boxes in my browser, at the beginning of some lines and the over-wide fields running into the following line.

This needs to be converted to Excel-readable format, e.g. CSV, containing the information from the table-like parts with information from the line after the headers in additional colums, like this:

Code1|Code2|Transaction Date|Employee Name|Amount 111|ABC|23-JAN-06|Baker, Abel|111.11 111|ABC|24-JAN-06|Baker, Abel|222.22 111|ABC|25-JAN-06|Baker, Abel|333.33 111|ABC|26-JAN-06|Baker, Abel|444.44 111|ABC|27-JAN-06|Baker, Abel|555.55 333|DEF|23-JAN-06|Dog, Charlie|111.11 333|DEF|24-JAN-06|Dog, Charlie|222.22 333|DEF|25-JAN-06|Dog, Charlie|333.33

I would approach this in the following way

  • slurp the file
  • for each line, remove any spurious characters
  • then, check whether any of a number of regexps match
  • then, do the appropriate thing for the identified line type

Does any one have any less naive ideas? I wondered whether there is there something like format for reading files?



"It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ."

Replies are listed 'Best First'.
Re: Converting a fixed-width report to an Excel-friendly format
by Corion (Pope) on Oct 25, 2006 at 16:00 UTC

    I've used the practice you outline to process human-list-reports into nice, machine-treatable reports to great success, except without the slurping part:

    package Parser::ReportFoo; use strict; sub new { ... $self->columns = [qw(date customer amount)]; }; my @handlers = [ # Line containing uninteresting information [qr/^Line containing uninteresting information/ => \&discard], [qr/^------------/ => \&discard], [qr/^(\d\d-[A-Z]{3}-\d\d) (..........) (\d+\.\d+)/ => \&captur +e_transaction], [qr/^\s*$/ => \&flush ], [qr/^ ) (..........)\s+$/ => \&capture_name2], ]; sub discard {}; sub flush { my ($self) = @_; my @row = map { $self->$_ } (@{ $self->columns }); print join "\t", @row; }; sub capture_transaction { my ($self,$date,$customer,$amount) = @_; $self->date($date); $self->customer($customer); $self->amount($amount); }; sub capture_name2 { my ($self,$customer) = @_; $self->customer($self->customer . " " . $customer); $self->flush(); }; sub parse { my ($self,$file) = @_; my $fh = open "<", $file or die "$file: $!"; while (defined my $line = <$fh>) { # First, check which regex my $handled; for (@handlers) { my ($re,$code) = @$_; if (my @match = ($line =~ /$re/)) { $code->($self, @match); $handled = 1; last; }; }; warn "Unhandled line >>$line<<"; }; };

    The separation of flush() and capture_foo is because I have sometimes reports where one logical row spans several lines or where the transaction date is noted at the top of the "page", so some information has to persist before a whole line can be printed to the results.

    I use tab separated files with a file extension of .xls which is the most Excel friendly thing you can get without involving Win32::OLE or SpreadSheet::WriteExcel.

    Update: monarch spotted a typo/error in sub parse, I was using @_ where it should have been @$_.

      My excel here recognizes files with extention of .tab by default as text with tab seperated values (it's also my fav delimited format) :)
Re: Converting a fixed-width report to an Excel-friendly format
by jZed (Prior) on Oct 25, 2006 at 16:04 UTC
    Given the variety of non-standard things (stray characters, lines you aren't interested in, two-line column headings, etc.), I don't think you can avoid running regexen to clean up the file. My advice would be to do this in two steps 1) do your slurping and cleaning such that you come out with a valid fixed-width flat file 2) use AnyData or DBD::AnyData to make the CSV file. These modules handle both fixed-width and CSV formats so you'd simply loop through the cleaned up file, letting AnyData convert fields and records from one format to fields and records in the other.
Re: Converting a fixed-width report to an Excel-friendly format
by djp (Hermit) on Oct 26, 2006 at 04:18 UTC
    No-one seems to have commented that you're solving the wrong problem here, if at all possible you should be using the same source that produced the report to create your spreadsheet.
Re: Converting a fixed-width report to an Excel-friendly format
by john_oshea (Priest) on Oct 25, 2006 at 18:14 UTC
    I wondered whether there is there something like format for reading files?

    Template::Extract is along those lines. That said, I think the solutions that Corion and jZed have already given will work better for this particular case.

Re: Converting a fixed-width report to an Excel-friendly format
by Anonymous Monk on Oct 25, 2006 at 20:06 UTC
    I would use unpack(). as a for-example, something like:
    $LINEFORMAT = "A10 A20 A15 A12"; # adjust to your field widths my @fields; # presuming the input is stdin. otherwise, chop it into lines however +you like while (<>) { my @line = unpack $LINEFORMAT, $_; if ($line[0]) { # new record if ($fields[0]) { # already have a record read in, print + it print join ("|", @fields), "\n"; @fields = (); } @fields = @line; $fields[0] =~ s/^\r//; # strip linefeeds. } else { $fields[$_] .= " $line[$_]" for 0..$#line; } } # print the last record print join ("|", @fields), "\n";
      Forgot to add at the top of the while loop:
      next if /$uninteresting/; # or next unless /$interesting_line/;
Re: Converting a fixed-width report to an Excel-friendly format
by graff (Chancellor) on Oct 26, 2006 at 03:42 UTC
    Corion's solution looks very interesting, and now that I've seen it, I might try that first. Had I not seen that, my first instinct would have been to rely on the apparently consistent "syntactic" cues in your example, and process the data line-by-line, using known expectations:

    1. When I see /Transaction\s+Employee\s+-{3}\s/ I know the next two lines will complete the display of column headings.

    2. Having reached the end of the headings, the next line contains your "Code1/.../Code2" string; that's easy enough.

    3. After that, lines matching /^\s*\d{1,2}-[A-Z]{3}-\d{2}\s+.*?\s\d+\.\d{2}\s*$/ contain column data, and lines matching /^\s{10,}(.*)/ (if any) contain continuation data for the middle column;¹ these latter need to be contiguous with (directly after) the former, and the continuation is terminated by a blank line.

    4. Once a blank line is followed by something that doesn't match the 3-column data pattern, that's the end of the table.

    So it's just a fairly well-constrained state machine: watch for start of headings, then get "code" fields, then iterate through sets of row data (3-column line, 0-or-more continuation lines, blank line); when that iteration is done, go back to watching for start of headings.

    ¹ update: Counting initial whitespace characters on the continuation lines might be tricky, if the input uses tabs and spaces in funny or inconsistent ways.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://580596]
Approved by ChemBoy
Front-paged by diotalevi
[shmem]: karlgoethebier: of course
[shmem]: has there been any single day where nothing went wrong? not in my memory
[shmem]: discovered a fuel tank leakage at my second hand motorbike. Some bloke drove a screw too long through the fairing
[shmem]: hard to fix, and plans wrecked. And more... won't tell.
[shmem]: the leakage has been "fixed" with a piece of bycicle tube. Bleargh.
[karlgoethebier]: shmem: yes. my first ride tomorrow is questionable because of lumbago. shit

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2017-06-25 19:56 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (569 votes). Check out past polls.