Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: How to process variable length fields in delimited file.

by Marshall (Canon)
on Oct 06, 2016 at 01:29 UTC ( [id://1173374]=note: print w/replies, xml ) Need Help??


in reply to How to process variable length fields in delimited file.

Hi dbach355,

My goal is to output a delmited file with a unique delimiter such as \f.
I think that you will find that a CSV (Comma Separated Value) line using the "pipe" character, "|" as the delimiter will work out well. CSV is a generic term, you can use something other than a comma. I work with a few "|" separated DB's, some with a million+ records. If you use \f, "Form Feed", you will wind up with something that cannot be printed easily (one page per column is not too friendly!). This also has the problem of being "invisible". Using a tab character (\t) has the same visibility problem.

The real problem with your format are the embedded spaces. These first 10 columns can be handled in a number of ways. What do the other columns look like? Do they contain embedded spaces, like "John Smith"? Do they have a constant field width perhaps? Your goal is achievable. I just need a bit more info.

Update:
Once you have the data in "|" delimited form, Perl can process a line like that easily. An example is shown below. There are modules, like Txt::CSV that can be used. However, if the "|" does not appear anywhere in the data, there is no need for that. You are new to Perl and I don't want to overly complicate things if it is not necessary.

#!/usr/bin/perl use strict; use warnings; my $line = "ssn|empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncC +ity|city|zip"; my @columns = split /\|/, $line; print "@columns[-1,-4,4,3]\n"; # "zip state hireDate empName"

Update with code:
I thought some more about this problem. If you have fixed width fields interspersed with space separated fields, you have a big mess.

One way to describe the fields and implement this is shown below.
A 'v' field contains no embedded spaces and is variable in length, an "f" field, fixed field is a certain number of characters. This code builds a Regex (Regular Expression) and then executes that regex on the input. Anybody's brain would go crazy to write a regex with 100 terms, hence the program does that from the input table.

I do suspect that your problem can be solved "easier" than this, but without more info about the other ~90 columns, I am unsure.

#!/usr/bin/perl use strict; use warnings; #empNo|ncEmpName|empName|hireDate|ncAddr|addr|state|ncCity|city|zip"; my $line2 = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Stre +et GA 7 Atlanta 30553 x y z"; # Note: Looks like ncEmpName is "45612 11", a fixed width field my @format_spec = qw( v empNo f8 ncEmpName f11 enpName v hireDate v ncAddr f16 addr v state v ncCity f7 city v zip v x v y v z ); my $regex = "^"; while (@format_spec) { my $format = shift @format_spec; # pair wise in List::Util possible my $name = shift @format_spec; # here keep it simple if ($format =~ /v/) #variable length (no embedded spaces) { $regex .= '\s*(\S+)'; } elsif ( (my $width) = $format =~ /\s*f(\d+)/) # fixed length,means # embedded spaces { $regex .= '\s*(.{' . "$width})"; # \s cannot be within "" } print "$regex\n"; #for debug, comment this out later } my (@tokens) = $line2 =~ /$regex/; print join ("|", @tokens), "\n"; __END__ The regex is built like this: ^\s*(\S+) ^\s*(\S+)\s*(.{8}) ^\s*(\S+)\s*(.{8})\s*(.{11}) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16}) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) +\s*(.{7}) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) +\s*(.{7})\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) +\s*(.{7})\s*(\S+)\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) +\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+) ^\s*(\S+)\s*(.{8})\s*(.{11})\s*(\S+)\s*(\S+)\s*(.{16})\s*(\S+)\s*(\S+) +\s*(.{7})\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+) The "|" separated line is like this: 123445678|45612 11|Steve Smith|11012015|16|1001 Main Street|GA|7|Atlan +ta|30553|x|y|z
Of course the fixed length fields can have trailing spaces, but that is easy to get rid of:
@tokens = map{s/\s*$//; $_;}@tokens; #delete trailing spaces
or some such similar formulation. Also, a very long but simple (no back-tracking) regex can execute quite quickly. I doubt that a regex approach will be a performance problem even if the regex is so long that it is incomprehensible to a human.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1173374]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (2)
As of 2024-04-25 20:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found