PerlMonks  

Parsing text file Help!

by Anonymous Monk
on Nov 22, 2010 at 16:23 UTC ( [id://872996]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I am trying to parse a text file that sometimes has two thousand lines or more, and I can't figure out the best way of doing this, since the spacing between the fields can differ from row to row. Does anyone have any suggestions? I need to insert this data into a database after it is all done. I have included a two-line test file to show what these lines look like.
    #!/usr/bin/perl
    use strict;
    use warnings;

    open (FILE, 'my_file.txt');
    while (<FILE>) {
        chomp;
        my ($a, $b, $c, $d, $e, $f, $g, $h) = (split)[0,1,2,3,4,5,6,7,8];
        # it doesn't work here cause of the spaces in the names
        # print data to get it ready to be inserted into db
        print "$a - $b - $c - $d - $e - $f - $g - $h\n";
    }
    close (FILE);
    exit;

Test file:

    2009122233388675647 9988230 2009-01-01 JOE DOE JR JOEDOEJRWX@EMAIL.COM COMPANY JOES C. LTD., CORP. 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
    20091222333886756 99882308K 2010-01-01 JOE DOE JOEDOEJRWX@TEST.COM COMP INS / CORP. LTDA 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
Thanks for looking!

Replies are listed 'Best First'.
Re: Parsing text file Help!
by roboticus (Chancellor) on Nov 22, 2010 at 16:31 UTC

    For files like this, I tend to use unpack or substr to pull the fields apart. For example:

    my $t=<<EOTEXT;
    2009122233388675647 9988230 JOE DOE JR
    20091222333886756 99882308K JOE DOE
    EOTEXT
    my ($ID, $fld2, $name) = unpack "A34A14A20", $t;
    print "ID: $ID\nFLD2: $fld2\nName: $name\n";

    To do so, you just count the field widths and use those widths to unpack the fields. You can do the same with substr, but generally I find unpack handier.

    NOTE: Untested, ...

    ...roboticus
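
    For comparison, here is a minimal sketch of the unpack and substr approaches side by side. It assumes the widths 34, 14 and 20 from the template above; the sample record is padded with sprintf for illustration and is not the OP's real data, and the sub names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout: widths 34, 14 and 20, as in the unpack template above.
my @widths = (34, 14, 20);

# Build a padded sample record so the columns are guaranteed to line up.
my $line = sprintf '%-34s%-14s%-20s',
    '2009122233388675647', '9988230', 'JOE DOE JR';

# unpack: the "A" format strips trailing spaces from each field.
sub fields_unpack {
    my ($rec) = @_;
    return unpack 'A34 A14 A20', $rec;
}

# substr: walk the offsets by hand, then trim the padding ourselves.
sub fields_substr {
    my ($rec) = @_;
    my $pos = 0;
    my @out;
    for my $w (@widths) {
        my $f = substr $rec, $pos, $w;
        $f =~ s/\s+\z//;
        push @out, $f;
        $pos += $w;
    }
    return @out;
}

print join(' | ', fields_unpack($line)), "\n";
print join(' | ', fields_substr($line)), "\n";
```

    Both subs return the same three fields; the unpack version simply does the offset arithmetic and the trimming for you.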

Re: Parsing text file Help!
by kennethk (Abbot) on Nov 22, 2010 at 16:37 UTC
    The issue is that this is fixed-width data rather than delimited data. Rather than using split, you should be using unpack. For a rough intro to the utility, see perlpacktut.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        my ($a, $b, $c, $d, $e, $f, $g, $h) = unpack "A34A14A15A27A57A28A48", $_;
        # print data to get it ready to be inserted into db
        print "$a - $b - $c - $d - $e - $f - $g - $h\n";
    }

    __DATA__
    2009122233388675647 9988230 2009-01-01 JOE DOE JR JOEDOEJRWX@EMAIL.COM COMPANY JOES C. LTD., CORP. 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
    20091222333886756 99882308K 2010-01-01 JOE DOE JOEDOEJRWX@TEST.COM COMP INS / CORP. LTDA 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000

    Note that the above code issues warnings about an uninitialized value in a concatenation. This is because you expect 8 fields, but my unpack only has 7 fields encoded (A\d+). I suspect there is an additional field between e-mail and company, but have no way of knowing without a record that contains that information. Note as well you can drop the $_ from unpack's argument list with perl >= 5.10.
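
    One way to catch that kind of mismatch before it turns into warnings is to compare the number of fields unpack returns with the number you expect. This is only a sketch: the three-field template, the sample record and the parse_record name are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-field layout, much shorter than the real file.
my $template = 'A10 A5 A8';
my $record   = sprintf '%-10s%-5s%-8s', 'ID001', 'X', 'JOE DOE';

# Unpack a record and die if the field count is not what the caller expects.
sub parse_record {
    my ($tmpl, $rec, $expected) = @_;
    my @fields = unpack $tmpl, $rec;
    die sprintf("template yields %d fields, expected %d\n",
                scalar @fields, $expected)
        unless @fields == $expected;
    return @fields;
}

my @ok = parse_record($template, $record, 3);
print join(' - ', @ok), "\n";

# Asking for four fields from a three-field template is caught up front.
my $err = eval { parse_record($template, $record, 4); 1 } ? '' : $@;
print "caught: $err";
```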
      Thank you, here is an "unpack" that worked!
      unpack "A33A14A15A31A61A31A24A23"
      I liked it!!!
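
      A possible next step toward the database insert is to unpack each line into a hash keyed by column name. This sketch uses the template from the post above; the column names are guesses rather than the OP's schema, and the sample line is padded with pack so that it matches the template widths exactly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Template from the post above; the column names are hypothetical.
my $template = 'A33 A14 A15 A31 A61 A31 A24 A23';
my @names    = qw(id code date name email company start_ts end_ts);

# Unpack one fixed-width line into a hash reference keyed by column name.
sub parse_line {
    my ($line) = @_;
    my %rec;
    @rec{@names} = unpack $template, $line;
    return \%rec;
}

# Sample record, space-padded by pack to the widths in the template.
my $sample = pack $template,
    '2009122233388675647', '9988230', '2009-01-01', 'JOE DOE JR',
    'JOEDOEJRWX@EMAIL.COM', 'COMPANY JOES C. LTD., CORP.',
    '1900-01-01 00:00:00.000', '1900-01-01 00:00:00.000';

my $rec = parse_line($sample);
print "$_: $rec->{$_}\n" for @names;
```

      The values in the hash could then be bound to placeholders in a DBI insert statement, which avoids quoting problems with names such as "LTD., CORP.".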
Re: Parsing text file Help!
by fisher (Priest) on Nov 22, 2010 at 16:28 UTC
    I suggest reading perldoc perlretut; it's short and useful. If you explain in English what patterns you see in these lines, I think we could provide you with a turnkey solution.
    UPDATE: If the starting positions of the fields in each line do not vary, you can use unpack, as roboticus said.
Re: Parsing text file Help!
by Khen1950fx (Canon) on Nov 22, 2010 at 18:35 UTC
    I used split, an old script by johngg, and YAML::Dumper. It's easier for me to read.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use YAML;
    use YAML::Dumper;

    my @lines;
    while (<DATA>) {
        chomp;
        my @fields = split m{\s+}, $_, 8;
        my $rest = pop @fields;
        push @fields, reverse map { $_ = reverse } split m{\s+}, reverse($rest), 8;
        push @lines, \@fields;
    }

    my $dumper = YAML::Dumper->new;
    $dumper->indent_width(4);
    print $dumper->dump( {dump => \@lines} );

    __DATA__
    2009122233388675647 9988230 2009-01-01 JOE DOE JR JOEDOEJRWX@EMAIL.COM COMPANY JOES C. LTD., CORP. 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
    20091222333886756 99882308K 2010-01-01 JOE DOE JOEDOEJRWX@TEST.COM COMP INS / CORP. LTDA 1900-01-01 00:00:00.000 1900-01-01 00:00:00.000
Re: Parsing text file Help!
by raybies (Chaplain) on Nov 22, 2010 at 16:43 UTC
    Are we sure the text isn't just tab delimited? Sure looks that way. Pretty trivial if it is.
      Note that the stops do not occur at regular intervals. Even if the OP's machine were converting the tabs to spaces, all the field widths would have to share a greatest common divisor larger than one.
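
      A quick way to settle the question on the real file is to look for literal tab characters. The sketch below is illustrative only: the sub names and both sample lines are made up, and it assumes a line with no tabs at all is fixed-width.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 'tab' when the line contains a tab character, 'fixed' otherwise.
sub guess_layout {
    my ($line) = @_;
    return $line =~ /\t/ ? 'tab' : 'fixed';
}

# Split a tab-delimited line; the -1 limit keeps trailing empty fields.
sub split_tabs {
    my ($line) = @_;
    return split /\t/, $line, -1;
}

my $tabbed = join "\t", '2009122233388675647', '9988230', 'JOE DOE JR';
my $padded = sprintf '%-34s%-14s%s',
    '2009122233388675647', '9988230', 'JOE DOE JR';

print guess_layout($tabbed), "\n";
print guess_layout($padded), "\n";
print join(' | ', split_tabs($tabbed)), "\n";
```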
Re: Parsing text file Help!
by sundialsvc4 (Abbot) on Nov 22, 2010 at 19:15 UTC

    In my experience, text-parsing of a simple file boils down to four concerns:

    1. Identifying each line (and ignoring the uninteresting lines).
    2. Splitting apart each line, usually using a different set of rules for each line-type.
    3. Recognizing when you have accumulated enough information to act upon.
    4. Not forgetting about the (expletive!) last record!   :-D

    Deal with each one in turn.   Your “first line” is probably “one that begins with so-many consecutive digits starting at column #1.”   The “next line(s)” could simply be, in a regularly-structured file, “the next n lines following the latest first-line.”

    State-machine logic can help.   You simply initialize a local scalar variable (say, $state) to some value like INITIAL_STATE, and from time to time examine that variable as you decide what to do next, and update its value to reflect what you’ve done (or seen...) recently.
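
    A minimal sketch of that idea follows. The format is hypothetical: a record is assumed to start on a line beginning with ten or more digits, and any non-blank lines after it are treated as belonging to that record. The constant IN_RECORD and the parse_records name are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use constant { INITIAL_STATE => 0, IN_RECORD => 1 };

# Group lines into records using a two-state machine.
sub parse_records {
    my (@lines) = @_;
    my $state = INITIAL_STATE;
    my (@records, @current);
    for my $line (@lines) {
        if ($line =~ /^\d{10,}/) {              # a new "first line"
            push @records, [@current] if $state == IN_RECORD;
            @current = ($line);
            $state   = IN_RECORD;
        }
        elsif ($state == IN_RECORD && $line =~ /\S/) {
            push @current, $line;               # continuation line
        }
        # anything else is an uninteresting line and is skipped
    }
    # Do not forget the last record.
    push @records, [@current] if $state == IN_RECORD;
    return @records;
}

my @recs = parse_records(
    '# header',
    '2009122233388675647 9988230',
    '  JOE DOE JR',
    '20091222333886756 99882308K',
    '  JOE DOE',
);
printf "%d records, last has %d lines\n",
    scalar @recs, scalar @{ $recs[-1] };
```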

    I always code file-grokkers very defensively.   I figure that the program’s purpose is not only “to get information out of the file,” but also “to demonstrate that the format of the file has not changed, and that the file was built correctly by whoever built it.”   Not too many engagements ago, I was dealing with a large system that didn’t always produce the right answers and no one was quite sure why.   This system relied upon a data-feed from an external source, which, I was assured, “hadn’t changed in years.”   But it had, because at some time in the past a bug had crept into the program that the data-supplier was using to build the feed.   My “data importer from Missouri (the ‘Show Me’ $state™)” flushed it out.
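
    In code, that defensive stance might look like the sketch below, which checks each line against the expected layout before accepting it. The three-field layout, the total width of 49 and the check_line name are assumptions made for the example, not the OP's real format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout: 19-wide numeric ID, 10-wide ISO date, 20-wide name.
my $template = 'A19 A10 A20';
my $width    = 49;

# Returns a list of complaints; an empty list means the line looks sane.
sub check_line {
    my ($line) = @_;
    my @problems;
    push @problems,
        sprintf('length %d, expected %d', length($line), $width)
        unless length($line) == $width;
    my ($id, $date, $name) = unpack $template, $line;
    push @problems, "bad id '$id'"     unless $id   =~ /\A\d+\z/;
    push @problems, "bad date '$date'" unless $date =~ /\A\d{4}-\d{2}-\d{2}\z/;
    push @problems, 'empty name'       unless length $name;
    return @problems;
}

# One well-formed and one malformed sample, both padded by pack.
my $good = pack $template, '2009122233388675647', '2009-01-01', 'JOE DOE JR';
my $bad  = pack $template, '20091222333886756K', '2009/01/01', '';

my @good_problems = check_line($good);
print 'good: ', scalar(@good_problems), " problems\n";
print "bad:  $_\n" for check_line($bad);
```

    Reporting every problem on a line, rather than stopping at the first, makes it easier to tell a one-off typo from a changed file format.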

      How does this answer the OP? Do you ever post solutions, rather than platitudes?

        I don't like telling people how to solve their problem; it feels like doing their homework even if it isn't homework. Far better to teach them some resources so they can solve their own problem, today, tomorrow and next week.

        As Occam said: Entia non sunt multiplicanda praeter necessitatem ("Entities must not be multiplied beyond necessity").
