Re: How to process variable length fields in delimited file.

Hi dbach355,

My goal is to output a delmited file with a unique delimiter such as \f.
I think that you will find that a CSV (Comma Separated Value) line using the "pipe" character, "|" as the delimiter will work out well. CSV is a generic term, you can use something other than a comma. I work with a few "|" separated DB's, some with a million+ records. If you use \f, "Form Feed", you will wind up with something that cannot be printed easily (one page per column is not too friendly!). This also has the problem of being "invisible". Using a tab character (\t) has the same visibility problem.

The real problem with your format are the embedded spaces. These first 10 columns can be handled in a number of ways. What do the other columns look like? Do they contain embedded spaces, like "John Smith"? Do they have a constant field width perhaps? Your goal is achievable. I just need a bit more info.

Update:
Once you have the data in "|" delimited form, Perl can process a line like that easily. An example is shown below. There are modules, like Txt::CSV that can be used. However, if the "|" does not appear anywhere in the data, there is no need for that. You are new to Perl and I don't want to overly complicate things if it is not necessary.

#!/usr/bin/perl
use strict;
use warnings;

my $line = "ssn|empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncC
+ity|city|zip";

my @columns = split /\|/, $line;

print "@columns[-1,-4,4,3]\n"; # "zip state hireDate empName"
[download]

Update with code:
I thought some more about this problem. If you have fixed width fields interspersed with space separated fields, you have a big mess.

One way to describe the fields and implement this is shown below.
A 'v' field contains no embedded spaces and is variable in length, an "f" field, fixed field is a certain number of characters. This code builds a Regex (Regular Expression) and then executes that regex on the input. Anybody's brain would go crazy to write a regex with 100 terms, hence the program does that from the input table.

I do suspect that your problem can be solved "easier" than this, but without more info about the other ~90 columns, I am unsure.

#!/usr/bin/perl
use strict;
use warnings;

#empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncCity|city|zip";
my $line2 = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Stre
+et GA 7 Atlanta 30553 x y z";

# Note: Looks like ncEmpName is "45612 11", a fixed width field

my @format_spec = qw(
   v    empNo
   f8   ncEmpName
   f11  enpName
   v    hireDate
   v    ncAddr
   f16  addr
   v    state
   v    ncCity
   f7   city
   v    zip
   v    x
   v    y
   v    z
);

my $regex = "^";
while (@format_spec)
{
   my $format = shift @format_spec; # pair wise in List::Util possible
   my $name   = shift @format_spec; # here keep it simple
   
   if ($format =~ /v/)  #variable length (no embedded spaces)
   {
       $regex .= '\s*(\S+)';
   }
   
   elsif ( (my $width) = $format =~ /\s*f(\d+)/) # fixed length,means
                                                 # embedded spaces
   {
       $regex .= '\s*(.{' . "$width})";  # \s cannot be within "" 
   }
   print "$regex\n";  #for debug, comment this out later  
}


my (@tokens) = $line2 =~ /$regex/;
print join ("|", @tokens), "\n";

__END__
The regex is built like this:
^\s*(\S+)
^\s*(\S+)\s*(.{8})
^\s*(\S+)\s*(.{8})\s*(.{11})
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
+\s*(.{7})
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
+\s*(.{7})\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
+\s*(.{7})\s*(\S+)\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
+\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+)
^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+)
+\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)


The "|" separated line is like this:
123445678|45612 11|Steve Smith|11012015|16|1001 Main Street|GA|7|Atlan
+ta|30553|x|y|z
[download]

Of course the fixed length fields can have trailing spaces, but that is easy to get rid of:

@tokens = map{s/\s*$//; $_;}@tokens;  #delete trailing spaces
[download]

or some such similar formulation. Also, a very long but simple (no back-tracking) regex can execute quite quickly. I doubt that a regex approach will be a performance problem even if the regex is so long that it is incomprehensible to a human.

Comment on Re: How to process variable length fields in delimited file. Select or Download Code


"be consistent"
	PerlMonks