Working with fixed length files

by vendion (Scribe)
on Apr 27, 2011 at 20:29 UTC ([id://901649])

vendion has asked for the wisdom of the Perl Monks concerning the following question:

I need some advice on working with a fixed-length file that uses two different formats; the format alternates every other line in the file. Here is an example of the file contents:

03002068454210482                            000000004204.572011-04-14 19:53:41INTERNET  C  750467375                   ^M
0214833                                                                                              G02042954           ^M
03002068703214833                            000000002558.662011-04-15 08:17:19INTERNET  C  761212737                   ^M
0211561                                                                                              05601207284         ^M
03002068802911561                            000000001463.702011-04-15 08:40:52INTERNET  C  719807216                   ^M
029911                                                                                               00100275296         ^M
The lines that have "03" as the first two characters match the following field-width pattern: 02:10:33:15:19:10:3:18:6:4, while the other lines match this pattern: 02:98:11:9. Is using read and unpack the best way to approach this? I don't think the way I am currently working will carry over here, because it loads a template file whose pattern matches the currently loaded data file; in all I am working with four files, and this is the only one that alternates like this.
#!/usr/bin/perl
use strict;
use warnings;

my $filename = 'fixedfile.txt';
my $datname  = $filename;
$datname =~ s/\.txt//g;

my $fc = 0;
my @fla;    # field lengths
my @fna;    # field names

# Load the name:length field template from the companion .dat file
open my $dat, '<', "$datname.dat" or die "Can't open $datname.dat: $!";
while (<$dat>) {
    chomp;
    my @fields = split /\|/;
    foreach (@fields) {
        my ( $field, $length ) = split /:/, $_;
        $length = 1 if $length < 1;
        $fla[$fc] = $length;
        $fna[$fc] = $field;
        $fc++;
    }
}
close $dat;

open my $fixedfile, '<', $filename or die "Can't open $filename: $!";
while (<$fixedfile>) {
    chomp;
    s/\r|\n//g;    # strip any stray CR/LF (the ^M characters)
    s/^\s*//;
    s/\s*$//;
    my $line  = $_;
    my $dc    = 0;
    my $start = 0;
    foreach (@fna) {
        my $garbage = substr( $line, $start, $fla[$dc] );
        $garbage =~ s/['"\\()]//g;    # drop quotes, backslashes, parens
        $garbage =~ s/^\s*//;
        $garbage =~ s/\s*$//;
        $start += $fla[$dc];
        $dc++;
    }
}
close $fixedfile;

Replies are listed 'Best First'.
Re: Working with fixed length files
by ikegami (Patriarch) on Apr 27, 2011 at 20:53 UTC

    Besides looking at it as pairs of 122-byte records, you could look at it as single records of 244 bytes.

    local $/ = \(2*122);    # read fixed-size 244-byte records
    binmode($fixedfile);
    while (<$fixedfile>) {
        my @fields = unpack(
              "A2 A10 A33 A15 A19 A10 A3 A18 A6 A4 x2"
            . "A2 A98 A11 A9 x2",
            $_
        );
        ...
        my $total     = $fields[3];
        my $timestamp = $fields[4];
        ...
    }

    Replace "A" for "x" if you want to ignore a field.

    As for how to store the fields, you could use a hash or scalars.

    my %fields;
    @fields{qw( ... total timestamp ... )} = unpack(...);
    my ( ... $total, $timestamp, ... ) = unpack(...);

    Update: Added mention of "x".
    Update: Added alternative storage strategies.
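
    Putting the pieces together, a minimal runnable sketch of this approach (it assumes the one-to-one record pairing discussed elsewhere in the thread, and the hash-slice field names are illustrative guesses, not confirmed by the OP):

        use strict;
        use warnings;

        open my $fixedfile, '<', 'fixedfile.txt' or die "open: $!";
        binmode($fixedfile);
        local $/ = \(2 * 122);    # one read returns a 244-byte record pair

        while (<$fixedfile>) {
            my %fields;
            @fields{qw(
                rec1_type acct ref total timestamp channel flag id fill1 fill2
                rec2_type key code fill3
            )} = unpack(
                  "A2 A10 A33 A15 A19 A10 A3 A18 A6 A4 x2"
                . "A2 A98 A11 A9 x2",
                $_
            );
            printf "%s  %s  %s\n", @fields{qw( total timestamp code )};
        }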

Re: Working with fixed length files
by BrowserUk (Patriarch) on Apr 27, 2011 at 22:26 UTC

    Here is a different strategy for tackling the problem that can have some serious performance advantages:

    #! perl -slw
    use strict;

    my $rec = chr(0) x 123;    # pre-sized working buffer

    # Widths, offsets, and lvalue-substr refs for the type-03 records
    my @type3l = split ':', '02:10:33:15:19:10:3:18:6:4';
    my $n = 0;
    my @type3o = map { $n += $_; $n - $_; } @type3l;
    my @type3 = map \substr( $rec, $type3o[ $_ ], $type3l[ $_ ] ), 0 .. $#type3o;

    # Ditto for the type-02 records
    my @typeOl = split ':', '02:98:11:9';
    $n = 0;
    my @typeOo = map { $n += $_; $n - $_; } @typeOl;
    my @typeO = map \substr( $rec, $typeOo[ $_ ], $typeOl[ $_ ] ), 0 .. $#typeOo;

    while( <DATA> ) {
        # One assignment into the buffer populates every field ref
        substr( $rec, 0 ) = $_;
        if( /^03/ ) {
            print join '/', map $$_, @type3;
        }
        else {
            print join '|', map $$_, @typeO;
        }
    }

    __DATA__
    03002068454210482                            000000004204.572011-04-14 19:53:41INTERNET  C  750467375                   ^M
    0214833                                                                                              G02042954           ^M
    03002068703214833                            000000002558.662011-04-15 08:17:19INTERNET  C  761212737                   ^M
    0211561                                                                                              05601207284         ^M
    03002068802911561                            000000001463.702011-04-15 08:40:52INTERNET  C  719807216                   ^M
    029911                                                                                               00100275296         ^M

    Produces:

    c:\test>junk92
    03/0020684542/10482                            /000000004204.57/2011-04-14 19:53:41/INTERNET  /C  /750467375         /      /
    02|14833                                                                                              |G02042954  |
    03/0020687032/14833                            /000000002558.66/2011-04-15 08:17:19/INTERNET  /C  /761212737         /      /
    02|11561                                                                                              |05601207284|
    03/0020688029/11561                            /000000001463.70/2011-04-15 08:40:52/INTERNET  /C  /719807216         /      /
    02|9911                                                                                               |00100275296|

    [23:23:38.05] c:\test>

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      In theory ikegami's unpack approach should be much faster than the substr approach, as unpack is a single op. This reference approach should be somewhere in between. I'm curious how a Benchmark would compare the three on files of the original size, and whether disk I/O actually swamps the parsing speed difference.
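
      A minimal cmpthese sketch along those lines, in memory only, so it sidesteps the disk-I/O question (the record layout is from the thread; the data and the -3 second budget are illustrative):

          use strict;
          use warnings;
          use Benchmark qw( cmpthese );

          my $line = '03' . '0' x 118;    # one fabricated 120-char type-03 record

          my @widths   = ( 2, 10, 33, 15, 19, 10, 3, 18, 6, 4 );
          my $template = 'A2 A10 A33 A15 A19 A10 A3 A18 A6 A4';

          # Precompute the offsets the substr variant needs
          my @offsets;
          my $pos = 0;
          for my $w (@widths) { push @offsets, $pos; $pos += $w; }

          cmpthese( -3, {
              unpack => sub { my @f = unpack $template, $line },
              substr => sub {
                  my @f = map { substr $line, $offsets[$_], $widths[$_] }
                          0 .. $#widths;
              },
          } );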


      Enjoy, Have FUN! H.Merijn

        1. Ike's code assumes a one-to-one correspondence between the two record types.

          Well founded based on the OP's sample, but these types of mainframe 'carded' records often have multiple secondary records per primary record.

        2. If the OP confirmed that they are one-to-one, then you could do a single read for both record types and pre-partition as well.
        3. The problem with unpack is that the template must be re-parsed for every record.

          And the fairly extensive recent additions to the format specifiers have taken some toll on performance.

          With these short, simply structured records that doesn't exact too much of a penalty, but with longer, more complex records it can.

        4. The idea of pre-partitioning the input buffer with an array of substr refs is that simply assigning each record into the pre-partitioned buffer effectively does the parsing and splitting.

          I think the technique is worth a mention for its own sake; a distilled sketch follows this list.
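
        A distilled sketch of the mechanism (buffer size and field widths here are made up for brevity): taking references to lvalue substr() slices pre-partitions the buffer, and each whole-record assignment then updates every slice at once.

            use strict;
            use warnings;

            # Pre-partition a 10-byte buffer into fields of widths 2, 5 and 3
            my $buf    = ' ' x 10;
            my @fields = map { \substr( $buf, $_->[0], $_->[1] ) }
                         [ 0, 2 ], [ 2, 5 ], [ 7, 3 ];

            for my $record ( '03ABCDE123', '02FGHIJ456' ) {
                substr( $buf, 0 ) = $record;    # one assignment fills every field
                print join( '|', map { $$_ } @fields ), "\n";
            }

            # Output:
            #   03|ABCDE|123
            #   02|FGHIJ|456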

        A quick run of the two posted programs over the same file shows mine to be a tad quicker, but insignificantly. If I adjust mine to the same assumptions as Ike's (or Ike's to the same assumptions as mine), then mine comes in ~20% quicker. Only a couple of seconds on 1e6 lines, but it could be worth having for 100e6.

        c:\test>901649-buk 901649.dat >nul
        Took 9.283 for 1000000 lines

        c:\test>901649-ike 901649.dat >nul
        Took 11.305 for 1000000 lines

        Code tested:


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      Your code looks really nice, and I think I may be able to use it, or at least an approach similar to it. The only question it raises is: how would this handle data from a file that uses one pattern throughout? It seems this line in my OP was overlooked: "in all I am working with four files and this is the only one that differs like this."

      One of the four is semicolon-delimited, so I am just doing a split and removing the extra whitespace. That leaves the file my sample output came from and two other files, each with its own pattern.

      A short breakdown of the files:

      1. File 1: semicolon-delimited
      2. File 2: fixed 02:10:33:15:19:10:3:18:6:4 & 02:98:11:9
      3. File 3: fixed 2:35:14:14:14:19:25:11:16
      4. File 4: fixed 2:20:20:2:11:8:10:10:03:3:4
      If it helps, here is some sample data from file 3 and file 4. File 3:
      028088                               00000005402.6000000000000.0000000000000.002011-04-19 12:00:00ALICIA MARIA LOPEZ BAZZOC00101893559
      0213262                              00000000000.0000000000000.0000000000000.002011-04-19 12:00:00INDEGOLF S.A.            00101893559
      029052                               00000002927.4000000000000.0000000000000.002011-04-19 12:00:00INDEGOLF (ALICIA LOPEZ)  02800898617
      027550                               00000000000.0000000000000.0000000000000.002011-04-19 12:00:00ALICIA LOPEZ (INDEGOLF)  02855262166
      029051                               00000000000.0000000000000.0000000000000.002011-04-19 12:00:00ALICIA MARIA LOPEZ BAZZOC02800898617
      028085                               00000000000.0000000000000.0000000000000.002010-10-20 12:00:00INDEGOLF, S. A.          00101893559
      File 4:
      02CAFETERIA, ,MARI    0000000000000009822507+0009403.2020110415003201874313748210172100005
      02RAMON, BRITO        0000000000000009817407+0108815.9220110415003201874413748210172100005
      02EAST COAST CHART    0000000000000009851407+0002838.6020110415003221149915931210382100005
      02INMOBILIARIA PAL    0000000000000009770507+0001345.1820110415002915670515250210202100005
      02IGLESIA ESPIRITU    0000000000000009755607+0001031.7420110415003201860213748210172100005
      This is why I have it reading in the first file that way: it loads the correct template for the data being parsed. I regret not giving output from the other files when I originally posted; I wasn't even sure my post would make it through, as the area where I live was affected by the storms that went through the southeastern U.S.
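
      One way to drive all four fixed-width files from those patterns is a small dispatch table keyed by file; a sketch (the file names, the /^03/ type test, and the processing stub are illustrative assumptions):

          use strict;
          use warnings;

          my %layout = (
              'file2.txt' => [ '02:10:33:15:19:10:3:18:6:4', '02:98:11:9' ],  # alternating
              'file3.txt' => [ '2:35:14:14:14:19:25:11:16' ],
              'file4.txt' => [ '2:20:20:2:11:8:10:10:03:3:4' ],
          );

          # Turn a colon-separated width list into an unpack template
          sub template_for {
              my ($pattern) = @_;
              return join ' ', map { 'A' . ( 0 + $_ ) } split /:/, $pattern;
          }

          for my $file ( sort keys %layout ) {
              my @templates = map { template_for($_) } @{ $layout{$file} };
              open my $fh, '<', $file or die "Can't open $file: $!";
              while ( my $line = <$fh> ) {
                  $line =~ s/\r?\n\z//;
                  # Alternating file: pick the template by record type;
                  # single-format files always use their only template.
                  my $t = ( @templates > 1 && $line !~ /^03/ )
                        ? $templates[1]
                        : $templates[0];
                  my @fields = unpack $t, $line;
                  # ... process @fields ...
              }
              close $fh;
          }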

Re: Working with fixed length files
by Anonymous Monk on Apr 27, 2011 at 20:41 UTC
    The lines that have "03" as the first two characters match the following patterns: 02:10:33:15:19:10:3:18:6:4

    What does that mean?

      Field widths, not counting the trailing CR LF
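
      Each number is a column width, so such a pattern maps mechanically onto an unpack template; a tiny sketch:

          my $pattern  = '02:10:33:15:19:10:3:18:6:4';
          my $template = join ' ', map { 'A' . ( 0 + $_ ) } split /:/, $pattern;
          # $template is now "A2 A10 A33 A15 A19 A10 A3 A18 A6 A4"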
