PerlMonks  

Re^3: Design hints for a file processor

by moritz (Cardinal)
on Jul 07, 2008 at 12:32 UTC (note id://695965)


in reply to Re^2: Design hints for a file processor
in thread Design hints for a file processor

If you want structure, use a real parser. Here is one, albeit a bit hacked up:
#!/usr/bin/perl
use strict;
use warnings;
use Carp qw(confess);
use Data::Dumper;

my @lines = <DATA>;
for (@lines) {
    $_ =~ s/^\s+//;
    chomp;
}
close DATA;

sub match_block {
    my $first = shift @lines;
    $first =~ m/BEGIN\s+(\S+)/
        or confess "Invalid data format in line <<$first>>, expected a BEGIN";
    my $block_name = $1;
    my @contents;
    while (@lines && $lines[0] !~ m/^END/){
        if ($lines[0] =~ m/^BEGIN/){
            push @contents, match_block();    # recurse here
        } else {
            my $current = shift @lines;
            push @contents, ['LINE', split(m/\s+/, $current, 2)];
        }
    }
    if (@lines){
        my $terminator = shift @lines;
        if ($terminator !~ m/^END\s*\Q$block_name\E/){
            die "Syntax error: expected 'END $block_name', got '$terminator'";
        }
    }
    return ['BLOCK', $block_name, @contents];
}

print Dumper match_block();

__DATA__
BEGIN DSRECORD
   Identifier "ROOT"
   DateModified "1899-12-30"
   TimeModified "00.00.01"
   OLEType "CJobDefn"
   Readonly "0"
   Name "AP_CDBS_Vendor_Summary"
   Description "Collates all of the data for the customer master migration."
   NextID "194"
   Container "V0"
   FullDescription "The first part of the routine gathers data from the ABAP which extracts the necessary data from the SAP tables KNA1 and KNB1 (NB the key s of the link between KNA1 and KNB1 will form the basis of all the ABAP queries )."
   JobVersion "50.0.0"
   ControlAfterSubr "0"
   Parameters "CParameters"
   BEGIN DSSUBRECORD
      Name "ROOT"
      Prompt "/home/migration/Dev root "
      Default "/home/migration/Dev"
      ParamType "0"
      ParamLength "0"
      ParamScale "0"
   END DSSUBRECORD
   BEGIN DSSUBRECORD
      Name "SITE"
      Prompt "Business Unit, ie WHUB"
      Default "CDBS"
      ParamType "0"
      ParamLength "0"
      ParamScale "0"
   END DSSUBRECORD
   BEGIN DSSUBRECORD
      Name "AOW"
      Prompt "Area of Work ie AP"
      Default "AP"
      ParamType "0"
      ParamLength "0"
      ParamScale "0"
   END DSSUBRECORD
   BEGIN DSSUBRECORD
      Name "DMR"
      Prompt "DMR/Spec ie Vendors"
      Default "Vendors"
      ParamType "0"
      ParamLength "0"
      ParamScale "0"
   END DSSUBRECORD
   MetaBag "CMetaProperty"
   BEGIN DSSUBRECORD
      Owner "APT"
      Name "AdvancedRuntimeOptions"
      Value "#DSProjectARTOptions#"
   END DSSUBRECORD
   NULLIndicatorPosition "0"
   IsTemplate "0"
   NLSLocale ",,,,"
   JobType "0"
   Category "1.FSS\\2.AP\\6.CDBS\\1.Vendors\\3.Reports"
   CenturyBreakYear "30"
END DSRECORD

It returns a sort of parse tree with an array ref for each block or line, where blocks look like ['BLOCK', $name_of_block, @lines_in_this_block] and lines look like ['LINE', $key, $value].

Depending on your exact data format and what you want to extract, hashes might be more suitable.
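
To illustrate the hash idea, here is a rough sketch that converts the ['BLOCK', ...] tree into nested hashes. The helper name tree_to_hash, the hash keys name/fields/blocks, and the stripping of surrounding double quotes are all my own assumptions for the example, not part of the parser above; the small tree is hand-built sample data in the same shape.

```perl
use strict;
use warnings;

# tree_to_hash is a hypothetical helper: it turns
# ['BLOCK', $name, @contents] into { name, fields, blocks }.
sub tree_to_hash {
    my ($node) = @_;
    my ($type, $name, @contents) = @$node;
    die "expected a BLOCK node, got '$type'" unless $type eq 'BLOCK';

    my %block = (name => $name, fields => {}, blocks => []);
    for my $item (@contents) {
        if ($item->[0] eq 'LINE') {
            my (undef, $key, $value) = @$item;
            next unless defined $key;            # skip empty lines
            $value = '' unless defined $value;
            $value =~ s/^"(.*)"$/$1/s;           # assumption: drop surrounding quotes
            $block{fields}{$key} = $value;       # a repeated key overwrites the earlier one
        }
        else {
            # nested block: recurse and keep the order of appearance
            push @{ $block{blocks} }, tree_to_hash($item);
        }
    }
    return \%block;
}

# Hand-built sample tree in the shape the parser returns.
my $tree = ['BLOCK', 'DSRECORD',
    ['LINE', 'Identifier', '"ROOT"'],
    ['LINE', 'Parameters', '"CParameters"'],
    ['BLOCK', 'DSSUBRECORD',
        ['LINE', 'Name',    '"SITE"'],
        ['LINE', 'Default', '"CDBS"'],
    ],
];

my $record = tree_to_hash($tree);
print "$record->{fields}{Identifier}\n";
print "$record->{blocks}[0]{fields}{Default}\n";
```

Nested blocks stay in an array because several DSSUBRECORD blocks share the same name; if your keys can repeat within a block, you would want arrays for the fields as well.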

Replies are listed 'Best First'.
Re^4: Design hints for a file processor
by PhilHibbs (Hermit) on Jul 07, 2008 at 13:13 UTC
    The file is up to half a gigabyte, I'm not keeping all that in memory. I could split it up by job, I suppose.
      You don't have to keep it all in memory. My parser uses just one line of lookahead, so you can easily refactor shift @lines and $lines[0] into subs that work on a file handle.
      {
          my $line = <DATA>;
          if (defined $line) {
              chomp $line;
              $line =~ s/^\s+//;
          }

          # handles '$lines[0]'
          sub peek { return $line; }

          # handles 'shift @lines'
          sub next_line {
              my $tmp = $line;
              $line = <DATA>;
              # at end of file $line is undef, so only clean up real lines
              if (defined $line) {
                  chomp $line;
                  $line =~ s/^\s+//;
              }
              return $tmp;
          }

          # handles boolean check for @lines
          sub is_exhausted { return !defined $line }
      }

      I didn't test it, but it should work along these lines.
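
For what it's worth, here is a fuller sketch of the same idea: the recursive parser reading from a file handle with one line of lookahead, handling one top-level block at a time so that only the current block is held in memory. The helper make_reader and its peek/next_line/exhausted closures are names I made up for this example, and the sample data is invented; an in-memory file handle stands in for the real file.

```perl
use strict;
use warnings;

# make_reader is a hypothetical helper: it wraps a file handle in
# closures that keep exactly one line of lookahead.
sub make_reader {
    my ($fh) = @_;
    my $read = sub {
        my $l = <$fh>;
        return undef unless defined $l;   # end of file
        chomp $l;
        $l =~ s/^\s+//;
        return $l;
    };
    my $line = $read->();
    return {
        peek      => sub { $line },
        next_line => sub { my $tmp = $line; $line = $read->(); $tmp },
        exhausted => sub { !defined $line },
    };
}

# Same logic as the array-based match_block, but pulling lines from the reader.
sub match_block {
    my ($r) = @_;
    my $first = $r->{next_line}->();
    defined $first && $first =~ m/^BEGIN\s+(\S+)/
        or die "expected a BEGIN, got <<" . (defined $first ? $first : 'EOF') . ">>";
    my $block_name = $1;
    my @contents;
    while (!$r->{exhausted}->() && $r->{peek}->() !~ m/^END/) {
        if ($r->{peek}->() =~ m/^BEGIN/) {
            push @contents, match_block($r);    # recurse for nested blocks
        }
        else {
            push @contents, ['LINE', split(m/\s+/, $r->{next_line}->(), 2)];
        }
    }
    unless ($r->{exhausted}->()) {
        my $terminator = $r->{next_line}->();
        $terminator =~ m/^END\s*\Q$block_name\E/
            or die "Syntax error: expected 'END $block_name', got '$terminator'";
    }
    return ['BLOCK', $block_name, @contents];
}

# Invented sample data; a real run would open the export file instead.
my $text = <<'END_TEXT';
BEGIN DSRECORD
   Identifier "ROOT"
   BEGIN DSSUBRECORD
      Name "SITE"
   END DSSUBRECORD
END DSRECORD
BEGIN DSRECORD
   Identifier "V0"
END DSRECORD
END_TEXT

open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my $reader = make_reader($fh);
my @ids;
until ($reader->{exhausted}->()) {
    my $block = match_block($reader);   # only this block is held in memory
    push @ids, $block->[2][2];          # value of the first LINE in the block
}
close $fh;
print join(', ', @ids), "\n";
```

Passing the reader around instead of using file-scoped subs means the look-ahead state is not global, so you could parse several files in one program. A single top-level block still has to fit in memory, which should be fine if each job is a small part of the half-gigabyte file.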

      Instead of nitpicking details, think on the overall architecture and fix small issues for yourself.

        Instead of nitpicking details, think on the overall architecture and fix small issues for yourself.
        Sorry, didn't mean to nitpick, was just thinking out aloud.
