Parsing text file Help!

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing text file Help! by roboticus (Chancellor) on Nov 22, 2010 at 16:31 UTC
For files like this, I tend to use `unpack` or `substr` to pull the fields apart. For example:

    my $t=<<EOTEXT;
    2009122233388675647 9988230 JOE DOE JR
    20091222333886756 99882308K JOE DOE
    EOTEXT
    my ($ID, $fld2, $name) = unpack "A34A14A20", $t;
    print "ID: $ID\nFLD2: $fld2\nName: $name\n";

To do so, you just count the field widths, and use the widths to unpack the fields. You can do the same with substr, but generally I find unpack to be handier.

NOTE: Untested, ...

...roboticus
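A minimal sketch of the `substr` alternative mentioned above, for comparison with `unpack`. The widths 34, 14, and 20 come from the template in the post; the sample record is padded by hand to those widths for illustration and is not the OP's actual data, and the helper name `fields_by_substr` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: slice a fixed-width line with substr.
sub fields_by_substr {
    my ($line, @widths) = @_;
    my ($pos, @fields) = (0);
    for my $w (@widths) {
        # Guard against short lines so substr never starts past the end.
        my $f = $pos < length $line ? substr($line, $pos, $w) : '';
        $f =~ s/\s+\z//;    # like unpack's "A": drop trailing whitespace
        push @fields, $f;
        $pos += $w;
    }
    return @fields;
}

# Sample record padded by hand to the assumed widths (illustrative only).
my $line = sprintf "%-34s%-14s%-20s",
    '2009122233388675647', '9988230', 'JOE DOE JR';

my ($ID, $fld2, $name) = fields_by_substr($line, 34, 14, 20);
print "ID: $ID\nFLD2: $fld2\nName: $name\n";
```

The loop has to track the running offset by hand, which is the bookkeeping that an `unpack` template does for you.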
Re: Parsing text file Help! by kennethk (Abbot) on Nov 22, 2010 at 16:37 UTC
The issue is that this is fixed-width data rather than delimited data. Rather than using split, you should be using unpack. For a rough intro to the utility, see perlpacktut.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        my ($a, $b, $c, $d, $e, $f, $g, $h) = unpack "A34A14A15A27A57A28A48", $_;
        # print data to get it ready to be inserted into db
        print "$a - $b - $c - $d - $e - $f - $g - $h\n";
    }
    __DATA__
    2009122233388675647 9988230 2009-01-01 JOE DOE JR JOEDOEJRWX@EMAIL.COM COMPANY JOES C. LTD., CORP. 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
    20091222333886756 99882308K 2010-01-01 JOE DOE JOEDOEJRWX@TEST.COM COMP INS / CORP. LTDA 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000

Note that the above code issues warnings about an uninitialized value in a concatenation. This is because you expect 8 fields, but my unpack only has 7 fields encoded (A\d+). I suspect there is an additional field between e-mail and company, but have no way of knowing without a record that contains that information.

Note as well you can drop the $_ from unpack's argument list with perl >= 5.10.
Re^2: Parsing text file Help! by Anonymous Monk on Nov 22, 2010 at 16:59 UTC
Thank you, here is an "unpack" that worked! `unpack "A33A14A15A31A61A31A24A23"` I liked it!!!
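One way to sanity-check a template like this is to turn it into offset and width pairs and compare the total against the length of a real record. The sketch below assumes the template contains only plain `A<width>` items; the helper name `template_layout` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a template made only of "A<width>" items
# into a list of [offset, width] pairs (offsets are 0-based).
sub template_layout {
    my ($template) = @_;
    my @widths = $template =~ /A(\d+)/g;
    my ($offset, @layout) = (0);
    for my $w (@widths) {
        push @layout, [ $offset, $w ];
        $offset += $w;
    }
    return @layout;
}

my @layout = template_layout("A33A14A15A31A61A31A24A23");

# Report 1-based start columns, as a text editor would show them.
printf "field %d: starts at column %d, width %d\n",
    $_ + 1, $layout[$_][0] + 1, $layout[$_][1]
    for 0 .. $#layout;

printf "%d fields, %d columns in total\n",
    scalar @layout, $layout[-1][0] + $layout[-1][1];
```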
Re: Parsing text file Help! by fisher (Priest) on Nov 22, 2010 at 16:28 UTC
I suggest reading perldoc perlretut - it's small and useful. If you explain in English what patterns you see in these lines, I think we could provide you with a turnkey solution.

UPDATE. If the starting positions of the fields within each line are fixed, you can use unpack, as roboticus said.
Re: Parsing text file Help! by Khen1950fx (Canon) on Nov 22, 2010 at 18:35 UTC
I used split, an old script by johngg, and YAML::Dumper. It's easier for me to read.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use YAML;
    use YAML::Dumper;

    my @lines;
    while (<DATA>) {
        chomp;
        my @fields = split m{\s+}, $_, 8;
        my $rest = pop @fields;
        push @fields, reverse map { $_ = reverse } split m{\s+}, reverse($rest), 8;
        push @lines, \@fields;
    }

    my $dumper = YAML::Dumper->new;
    $dumper->indent_width(4);
    print $dumper->dump( {dump => \@lines} );
    __DATA__
    2009122233388675647 9988230 2009-01-01 JOE DOE JR JOEDOEJRWX@EMAIL.COM COMPANY JOES C. LTD., CORP. 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
    20091222333886756 99882308K 2010-01-01 JOE DOE JOEDOEJRWX@TEST.COM COMP INS / CORP. LTDA 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
Re: Parsing text file Help! by raybies (Chaplain) on Nov 22, 2010 at 16:43 UTC
Are we sure the text isn't just tab delimited? Sure looks that way. Pretty trivial if it is.
Re^2: Parsing text file Help! by kennethk (Abbot) on Nov 22, 2010 at 16:47 UTC
Note the stops do not occur at regular spacing. Even if the OP's machine were mangling the tabs to spaces, all field widths would have to have a non-unity greatest common divisor.
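The divisor argument can be checked numerically. If tabs had been expanded to spaces, every field would start on a tab stop, so all the widths would share a common divisor greater than 1. The sketch below assumes the widths from the template the OP reported as working reflect the real column layout; the helper names `gcd` and `gcd_all` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Euclid's algorithm for two non-negative integers.
sub gcd {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x % $y) while $y;
    return $x;
}

# GCD of a whole list of widths.
sub gcd_all {
    my $g = shift;
    $g = gcd($g, $_) for @_;
    return $g;
}

# Widths assumed from the template "A33A14A15A31A61A31A24A23".
my @widths = (33, 14, 15, 31, 61, 31, 24, 23);
my $g = gcd_all(@widths);

print $g > 1
    ? "widths share the divisor $g: consistent with expanded tabs\n"
    : "widths share no divisor above 1: not consistent with expanded tabs\n";
```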
Re: Parsing text file Help! by sundialsvc4 (Abbot) on Nov 22, 2010 at 19:15 UTC
In my experience, text-parsing of a simple file boils down to four concerns:

1. Identifying each line (and ignoring the uninteresting lines).
2. Splitting apart each line, usually using a different set of rules for each line-type.
3. Recognizing when you have accumulated enough information to completely act-upon.
4. Not forgetting about the (expletive!) last record! `:-D`

Deal with each one in turn. Your “first line” is probably “one that begins with so-many consecutive digits starting at column #1.” The “next line(s)” could simply be, in a regularly-structured file, “the next n lines following the latest first-line.”

State-machine logic can help. You simply initialize a local scalar variable (say, `$state`) to some value like `INITIAL_STATE`, and from time to time examine that variable as you decide what to do next, and update its value to reflect what you’ve done (or seen...) recently.

I always code file-grokkers very defensively. I figure that the program’s purpose is not only “to get information out of the file,” but also “to demonstrate that the format of the file has not changed, and that the file was built correctly by whoever built it.”

Not too many engagements ago, I was dealing with a large system that didn’t always produce the right answers and no one was quite sure why. This system relied upon a data-feed from an external source, which, I was assured, “hadn’t changed in years.” But it had, because at some time in the past a bug had crept into the program that the data-supplier was using to build the feed. My “data importer from Missouri (the ‘Show Me’ `$state`™)” flushed it out.
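The four concerns and the `$state` idea above can be sketched roughly as follows. The record layout is invented for illustration (a first line that begins with at least 10 digits in column 1, followed by any number of continuation lines), and the state name `IN_RECORD` and the helper `parse_records` are hypothetical; the OP's real format may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser for an invented multi-line record layout.
sub parse_records {
    my (@lines) = @_;
    my $state = 'INITIAL_STATE';
    my (@records, $current);

    for my $line (@lines) {
        next if $line =~ /^\s*$/;                  # ignore uninteresting lines
        if ($line =~ /^(\d{10,})\b/) {             # identify a first line
            push @records, $current if $current;   # previous record is complete
            $current = { id => $1, extra => [] };
            $state   = 'IN_RECORD';
        }
        elsif ($state eq 'IN_RECORD') {
            push @{ $current->{extra} }, $line;    # continuation of the record
        }
        else {
            # Defensive: data before any first line means the format changed.
            die "unexpected line before first record: $line\n";
        }
    }
    push @records, $current if $current;           # do not forget the last record
    return @records;
}

my @records = parse_records(
    "2009122233388675647 9988230",
    "JOE DOE JR",
    "",
    "20091222333886756 99882308K",
    "JOE DOE",
);
printf "%s: %d continuation line(s)\n", $_->{id}, scalar @{ $_->{extra} }
    for @records;
```

Dying on an unexpected line, rather than skipping it, follows the defensive approach described in the post: the parser doubles as a check that the file still has the expected shape.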
Re^2: Parsing text file Help! by Anonymous Monk on Nov 22, 2010 at 19:41 UTC
How does this answer the OP? Do you ever post solutions, rather than platitudes?
Re^3: Parsing text file Help! by TomDLux (Vicar) on Nov 23, 2010 at 03:44 UTC
I don't like telling people how to solve their problem; it feels like doing their homework even if it isn't homework. Far better to point them to some resources so they can solve their own problem, today, tomorrow and next week.

As Occam said: Entia non sunt multiplicanda praeter necessitatem.
Re^4: Parsing text file Help! by Anonymous Monk on Dec 24, 2010 at 21:43 UTC

