http://www.perlmonks.org?node_id=1055650

Jalcock501 has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys

I have an interesting one today, I am trying to read a delimited file and create an output file with the information in neat columns. Whats so interesting I hear you say...?

Well I only need to convert between the line starting with 99HEADER to 99TERMIN. The delimitation is a pipe ( | ) but I was told by a colleague that pipes cause perl to do weird things (funny if true), so I created a tiny Shell script to convert this into a .csv file. So here goes...
#! /usr/bin/perl -w #Read between lines 40 - 112 from *.in, cut the lines up using pipe de +limitation. use strict; my @files = <*.in>; for my $file (@files) { open(my $read, '<', $file) or die "Couldn't open: $!"; open(my $write, '>', "$file.read") or die "Couldn't open: $!"; my $lines; my($record, $code, $date, $calpercent$, $vehprn, $vehreg, $insrate +num, $insratetxt) print $write "Record\tCode\tDate\tCalculated Percentage\tVehicle P +RN\tVehicle Registration\tInsurer Rating Number\tInsurer Rating Text\ +n"; while(<$read>) { #@lines = split('', $_); if(/99HEADER/../99TERMIN/) { chomp; ($record, $code, $date, $calpercent$, $vehprn, $vehreg, $i +nsratenum, $insratetxt) = split(','); print $write "$record\t$code\t$date\t$calpercent$\t$vehprn +\t$vehreg\t$insratenum\t$insratetxt" } } }
And here is some sample data
0000000000000000000000000000000000000000000000000000000000000000000000 +000000000000000000000000000000000000000000000000000000000000000000000 +000000000000000000000000000000000000000000000000000000000000000000000 +00000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000000000 +000000000000000000000000000000000000000000000000000000000000000000000 +000000000000000000000000000000000000000000000000000000000000000000000 +000000000000000000000000000000000000000000000000000000000000000000000 +00000000000000000000000 |0010001401N01 YY0101000000YYYYYY010100 01201309192013091920130919N12011 KY2 6AB 0250 + Y 02PN 999 20001010106 0200 04 00 0000 +0 0811 00 GB N 0300030201309195204120102200309052003090520130918 05R005000 K +Y2 6 3 001200M 1 1 N N N + KY2 6AB + 062013091901NNNNN 062013091902NNNNN 062013091920NNNNN 101001002013091919531118MMP20010101FM 19531118 010000000000000001N +05NNN NNN 00 N + 1120130919E04049E Y 15201309193820060101N10010101 99HEADER|001|001| 99INSSCH|AVP0| 99POLCOM|||||PIP735628020||||| 99INSFAC|F1_0|| 99INSFAC|F2_N|| 99INSFAC|F3_N|| 99INSFAC|F4_0|| 99INSFAC|F5_0|| 99INSFAC|F6_IM|| 99INSFAC|F8_0|| 99INSFAC|F9_B|| 99INSFAC|F10_0|| 99INSFAC|F11_3|| 99INSFAC|F12_0|| 99INSFAC|F13_Y|| 99INSFAC|F14_N|| 99INSFAC|F15_5|| 99INSFAC|F16_30/11/2011|20111130| 99INSFAC|F17_31|| 99INSFAC|F18_NO|| 99INSFAC|F19_246|| 99INSFAC|F20_B|| 99INSFAC|F21_0|| 99INSFAC|F22_5H|| 99INSFAC|F23_P|| 99INSFAC|F24_M|| 99INSFAC|F25_1598|| 99INSFAC|F26_13|| 99INSFAC|F27_9|| 99INSFAC|F28_17|| 99INSSCH|248| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRA01_1||||||"LNI10708"| 99INSSCH|391| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSGAL1||||||"W=P1|X=335|AB=0|BB=0|JF=569|HB=0|IB=0|AD=1|GC=I +1|KD=0|YE=335|BF=B|GF=0|KF=401,0,0,0|GSB=15|HSB=99.2468394|KDB=2377.3 +7934|LDB=A|UDB=0|ETB=155|HTB=51|URB=7"| 99INSSCH|116| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRZ01a||||||"00I10587"| 99INSFAC2|MSRZ01b||||||"335000 B"| 99INSSCH|216| 99POLCOM|3||CAP01|66|3301R7435459||||| 99INSFAC2|MSRZ01a||||||"00I10587"| 99INSFAC2|MSRZ01b||||||"335000 B"| 99HEADER|006|001 99INSSCH|091| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00156||||||I1P82240,CCCN0000,INNA0000,FAAA0570,YANZ1000| 99INSSCH|084| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00050||||||I1001569| 99INSSCH|052| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00124||||||XAAX0800,YPAX8400,ZAAZ0401,VAAA0000,WZZA0000| 99INSSCH|222| 99POLCOM|1||IIL|62|22593465033322||||| 99INSFAC2|C00243||||||XAAX0800,YPAX8400,ZAAZ0401,VAAA0000,WZZA0000| 99TERMIN| ABC14480120070920 ABC24490120070920 AIL01620000000000 ALC06040120130902 ALC16050120130902 ALC26060120130902 ALC36070120130902 ALC46080120130902 ANS00720000000000 AVP01110119990101 AXT06750000000000 AXX00300120030430 AXX11680120030430 AXX35940120100509 AXX75190120080914 AXX95660120090819 B-Z43150000000000 B-Z53410000000000 B-Z64350000000000
Thanks Jim

Replies are listed 'Best First'.
Re: Reading File and Seperating into columns
by hdb (Monsignor) on Sep 25, 2013 at 11:00 UTC

    Pipes are perfectly ok as field separators in Perl. In fact, they are preferable to commas as they are less likely to be part of the field values than commas. In any case, you should be looking at Text::CSV and the associated documentation for your purposes.

Re: Reading File and Seperating into columns
by aaron_baugher (Curate) on Sep 25, 2013 at 13:27 UTC

    Not to be rude, but your friend sounds like he just discovered regular expressions last week and he's still unsure about what all the characters do, so he's trying to avoid using the tricky ones. The pipe character, like many other punctuation characters, is a special character in regex patterns, so you must escape it. For instance:

    my @fields = split /,/; # split on commas my @fields = split /\|/; # split on pipes

    That's all, no big deal. However, as hbd said, unless this is a one-time task with data you're very familiar with, you should use Text::CSV and tell it that your delimiter is the pipe character. That way, when you inevitably run into data containing a quoted delimiter in one of the fields even though you were sure that would never happen, the module can handle it correctly.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Reading File and Seperating into columns
by ww (Archbishop) on Sep 25, 2013 at 11:43 UTC
    Futurely, you may want to take your colleague's knowledge of Perl with a grain pipeful of salt. ;-)

    Sure, pipes mis-used can create .... ah, let's call them 'interesting challenges' but that's true of almost anything in any language that's abused or mis-used through iggerance.

Re: Reading File and Seperating into columns
by Marshall (Canon) on Sep 25, 2013 at 12:33 UTC
    Your buddy has told you some absolute NONSENSE!

    The use of "|" is a very normal thing to separate data base fields.

    This is actually the most common way to do it.

    This "|" character does not appear in any names or addresses. And that is why it is done that way.

    The Comma Separated, quoted idea is extremely complex and requires some very complex code.

Re: Reading File and Seperating into columns
by marinersk (Priest) on Sep 25, 2013 at 16:10 UTC
    Your friend is horribly misinformed. Without handling any special CSV rules, a simple conversion from pipe to comma looks like this:

    my @inpelt = split /\|/, $inpbuf; my $outbuf = join ',', @inpelt;

    or this

    $trnbuf =~ s/\|/,/g;

    That said, if Text::CSV has grown up enough to be fully functional (and I suspect it has), you are very, very well advised to use it.

    The proposed, and essentially de facto, CSV standard has a fair number of special cases which a proper CPAN module should be handling for you. Rolling your own, such as using the shortcuts shown above, would only work if you were absolutely certain the data crossed none of the CSV special handling cases.

    A programmer writes code that works in a year; a software engineer writes code that works in ten years.

    Failing to use a fully functional CPAN module to handle the CSV special cases would be a fine example of programming, not engineering, in this context.

    Cheers!

Re: Reading File and Seperating into columns
by Tux (Canon) on Sep 25, 2013 at 18:12 UTC

    Using Text::CSV_XS or its slower Text::CSV, that would look like

    #!/usr/bin/perl use 5.14.2; use warnings; use Text::CSV_XS; #Read between lines 40 - 112 from *.in, cut the lines up using pipe de +limitation. for my $file (glob "*.in") { open my $read, "<", $file or die "Couldn't open $file: $!" +; open my $write, ">", "$file.read" or die "Couldn't open $file.read +: $!"; my $csv_in = Text::CSV_XS->new ({ binary => 1, sep_char => "|", +auto_diag => 1 }); my $csv_out = Text::CSV_XS->new ({ binary => 1, sep_char => "\t", +eol => "\n", auto_diag => 1 }); $csv_out->print ($write, [ qw( Record Code Date Calculated_Percent +age Vehicle_PRN Vehicle_Registration Insurer_Rating_Number Insurer_Ra +ting_Text )]); my $print = 0; while (my $row = $csv_in->getline ($read) { $row->[0] =~ m/^99HEADER/ and $print = 1; $print and $csv_out->print ($write, $row); $row->[0] =~ m/^99TERMIN/ and $print = 0; } close $_ for $read, $write; }

    Enjoy, Have FUN! H.Merijn
    \n, eol =\t
Re: Reading File and Seperating into columns
by keszler (Priest) on Sep 25, 2013 at 12:12 UTC
    I believe your colleague is confusing his own "pipes cause ... weird things" experiences with Perl's.
Re: Reading File and Seperating into columns
by boftx (Deacon) on Sep 25, 2013 at 22:58 UTC

    For starters, I noticed that 99HEADER appears twice in the data before 99TERMIN does (the second time is 13 lines before the 99TERMIN.)

    This is one of the strangest data files I have seen, but that aside, I suspect there might be a problem lurking in the wings by simply treating every delimited record as having the same data structure when clearly they don't. It seems to me that the first field is a record type code and each type has its own definition that should be honored. This is quite common in data transfer files, and I think it would be obvious if this was in XML instead of a bastardized fixed-length record structure. (leave it to the insurance industry to screw this up.)

    Thankfully you apparently don't need to break apart the data before the 99HEADER, but those records you are concerned with are in fact easy to work with as pipe (|) delimited and I wouldn't bother converting to CSV for all the reasons given above.

    What I would be concerned with are those type codes in the first field, especially since there are repeating field types, including the 99HEADER type.

    If I had to guess it looks like you are dealing with a single account record that has coverage information on multiple vehicles (with 99HEADER indicating the start of a new vehicle.)

    On a side note, Perl is ideal for processing this kind of data if you have the full API spec for the data file handy.

    On time, cheap, compliant with final specs. Pick two.
      Haha, you've almost nailed it, but its not different vehicles but different types of cover.

      I'm just stuck and don't know how I am going to separate the fields into a readable format. Thanks for pointing out that 99HEADER appeared twice that slipped by me.

      So this is what I have so far:
      #! /usr/bin/perl -w use strict; my @files = <*.in>; for my $file (@files) { open my $handle, '<', $file; chomp(my @lines = <$handle>); close $handle; open my $write, '>', "$file.sep"; my @enr_data = grep {/^99/} @lines; s/99/\n99/g for (@enr_data); close($handle); }
      This separates the lines I need from the file after more data analysis I realised that there are more areas with 99 Factors appear.I basically just need to cut fields up so that they can be read by your standard user.

        This is just a crude outline, but given the large number of different record types (judging by the values in field 1) I would use a hash with keys consisting of the various types you are interested in and the values being a hashref that includes formatting strings for sprint. Something like this:

        # This is NOT real code, but just a concept my %record_types = ( 99HEADER => { format => "%s %s", code => undef, }, 99INSFAC => { format => "%s %s %s %04.2f", code = >\&process_99insfac, }, }; for my $line ( @input_lines ) { my ($rec_type,@rec_data) = split(/|/,$line); next unless exists $record_types{$rec_type); # Call a pre-processor if present, maybe skip empty records. # Note, the syntax for a proper dispatch table might be wrong here +. You # will probably need to play with this a bit, but it is nifty when + it works. next if defined( $record_types{$rec_type}{code} ) && !&{$record_types{$rec_type}{code}}( data => \@rec_data ); say sprint($record_types{$rec_type}{format},@rec_data); } exit; sub process_99insfac { my %args = @_; my @rec_data = @{$args{data}}; return unless $rec_data[2]; # no date? nothing to do $rec_data[3] = some calculation; # do something nifty here return 1; }

        This is a very crude presentation, but I think you can get the idea and can see that you can take advantage of the type hash by adding more info such as code references to sub-routines to do any special processing if needed. You would need to track entering and leaving each new record structure, but I doubt you would have much trouble with that logic. This approach should give you a lot of flexibility for layout and dealing with the different sub-record types.

        Update: added example for a preprocessor code ref.

        On time, cheap, compliant with final specs. Pick two.