Re: How to process variable length fields in delimited file.

by liverpole (Monsignor)
on Oct 06, 2016 at 02:03 UTC (#1173376)

in reply to How to process variable length fields in delimited file.

Hi dbach355,

My first approach would be to define, programmatically (i.e., with a data structure), what the input file contains on each line. Once that's in a script, you can run it and prove to yourself that your data does in fact behave as expected.

Since each line is made up of space-delimited items, but some of them are count-prefixed, you could define your line format with an array containing an array reference for each item. Each array reference would hold the LABEL of the item (e.g., 'ssn' for social-security, 'emp_num' for employee number, etc.), and a compiled regular expression (that's the qr/.../ syntax) used to parse the item.

In cases where the item is prefixed with a count, specifying the length of the item, you could use a string like 'COUNT' instead of a regex.
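To make the count prefix concrete, here is a minimal standalone sketch of consuming one count-prefixed item. The helper name take_counted is my own invention for illustration, and it assumes exactly one space separates the count from the item:

```perl
use strict;
use warnings;

# Hypothetical helper: removes one count-prefixed item from the front of
# the string referenced by $text_ref and returns the item (undef on failure).
sub take_counted {
    my ($text_ref) = @_;

    # Leading digits give the item length; one space follows the count.
    $$text_ref =~ s/^\s*(\d+) // or return;
    my $count = $1;

    # The remaining text must hold at least that many characters.
    return if length($$text_ref) < $count;

    # Four-argument substr removes the item from the buffer and returns it.
    return substr($$text_ref, 0, $count, '');
}

my $rest = "11 Steve Smith 11012015";
my $name = take_counted(\$rest);
print "name='$name' rest='$rest'\n";   # name='Steve Smith' rest=' 11012015'
```

The point is that the item may contain spaces, so it can't be split on whitespace; only the count tells you where it ends.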

Here's an example for what you've defined:

    my @line_format = (
        [ 'ssn',       qr/(\d{9})/     ],
        [ 'emp_num',   qr/(\d+)/       ],
        [ 'emp_name',  'COUNT'         ],
        [ 'hire_date', qr/(\d{8})/     ],
        [ 'address',   'COUNT'         ],
        [ 'state',     qr/([A-Z]{2})/  ],
        [ 'city',      'COUNT'         ],
        [ 'zip',       qr/(\d{5})/     ],
    );

Then you write a subroutine parse_line that you call for each line of your input file. (I would also pass in the line number, so that if a line doesn't match your format you can die with an error saying which line was invalid.)

For each array ref in @line_format you either parse the COUNT and pull off that number of characters, or you pull off the next word and apply the regex to it. If the data validates, you assign it into a hash local to the subroutine, with the label as the key. When the subroutine completes successfully, you pass back a reference to that hash.

Here's how you might write the parse_line subroutine:

    sub parse_line {
        my ($line, $linenum) = @_;
        my %parsed = ( );
        foreach my $format (@line_format) {
            my ($label, $expected) = @$format;
            if ($expected eq 'COUNT') {
                # Pull the COUNT off the beginning of the line and apply it
                if ($line !~ s/^\s*(\d+) //) {
                    die "Error #1 parsing item '$label' (line #$linenum)\n";
                }
                my $count = $1;
                if ($line !~ s/^(.{$count})//) {
                    die "Error #2 parsing item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $1;
            } else {
                # Pull off the next non-space word, and test it with the regex
                if ($line !~ s/^\s*(\S+)//) {
                    die "Error #3 parsing item '$label' (line #$linenum)\n";
                }
                my $word = $1;
                # The whole word must match, not just part of it
                if ($word !~ /\A$expected\z/) {
                    die "Error #4 parsing item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $1;
            }
        }
        return \%parsed;
    }

When I call that subroutine with the data you defined for a single line:

    use Data::Dumper::Concise;

    my $line = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553";
    my $result = parse_line($line, 1);
    print Dumper $result;

This simple program dumps as its result:

    {
      address => "1001 Main Street",
      city => "Atlanta",
      emp_name => "Steve Smith",
      emp_num => 45612,
      hire_date => 11012015,
      ssn => 123445678,
      state => "GA",
      zip => 30553
    }

So I know I'm on the right track.

The next steps would be something like:

  1. Read all the lines in the file
  2. Call the subroutine parse_line on each line (and line number), getting back a hash ref
  3. Add that hash ref to an array (or do whatever you want with it)
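Those steps might look something like the sketch below. The name parse_records is hypothetical, and the parser is passed in as a code reference so the loop can be tried on its own; in the real script you would pass \&parse_line. Skipping blank lines is an assumption on my part, so drop that line if blank lines should be treated as errors.

```perl
use strict;
use warnings;

# Hypothetical driver: read every line from $fh, hand each one to $parser
# together with its line number, and collect the returned hash refs.
sub parse_records {
    my ($fh, $parser) = @_;
    my @records;
    my $linenum = 0;
    while (my $line = <$fh>) {
        $linenum++;
        chomp $line;
        next if $line =~ /^\s*$/;    # assumption: blank lines are skipped
        push @records, $parser->($line, $linenum);
    }
    return \@records;
}

# Demo with an in-memory file and a stand-in parser.
my $data = "first line\nsecond line\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $records = parse_records($fh, sub {
    my ($line, $linenum) = @_;
    return { text => $line, linenum => $linenum };
});
close $fh;
printf "%d records parsed\n", scalar @$records;
```

Since parse_line dies on the first bad line, you could wrap the call in eval { ... } if you would rather collect the errors and keep going.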

Does that help?

Edit: fixed whom I'm responding to (thanks choroba)

