Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)

idnopheq has asked for the wisdom of the Perl Monks concerning the following question:

Hi, All!

It's been a while. Elated to be back!

I have files (Juniper/NetScreen's NSM *.nml files, similar to CheckPoint management station *.W or object*.C files) I need parsed to retrieve specific information.

I'm looking to create hashes of hashes, or at least I think that is the way to go. I need a nudge (and please just a nudge as I'm trying to get my perl legs under me again) toward how to slurp such a file into something useful.

The tabs/parentheses are messing me up big time. Another hurdle is creating arbitrary hashes (of hashes etc.) by a variable and properly handling relative depth. Lastly, this has to be maintainable (by someone with equal or lesser perl skills to my atrophied ones).

Basically, I don't think I'm approaching my source data properly.

I'd post what I have now (reading the whole file into one big ordered hash and positional searching a la Recipe 6.14) but it's broken beyond repair, at best fails to execute with taint checking, -w, and 'use strict;' enabled, and exhibits shameful coding practices beyond that.

UPDATE 3: Reordered and added a timely readmore tag as I'm on the Gates now.

Here is a small piece of the data (with leading 8 character tabs, as indentation is perhaps crucial):

				:rip (
					:bind_interface (false)
					:enable (false)
					:poison_split_horizon (enable
						:enable (
							:poison (false)
						)
						:disable (null)
					)
					:metric (1)
					:passive_mode (false)
					:authentication (no-authentication
						:multiple-md5 (
							:md5-key-values ()
						)
					)
				)

UPDATE 2

Here is another snippet of the source file, by request. Please see idnopheq's scratchpad for a full source file.

        :members ()
        :global-pro (
                :report-manager (
                        :primary ("&0.server.1")
                        :alarm-attack (true)
                        :alarm-other (true)
                        :alarm-traffic (true)
                        :alarm-di (true)
                        :attack-stat (true)
                        :ethernet-stat (true)
                        :flow-stat (true)
                        :log-config (true)
                        :log-info (true)
                        :log-self (true)
                        :log-traffic (true)
                        :policy-stat (true)
                        :proto-dist (true)
                        :server-port (7800)
                        :user-service ()
                )
        )

For those who might wish to bring up the fw1rules tool, I've tried "liberating" portions thereof to achieve this end. However, it seems these types of files scream out for hashes of hashes (of hashes, etc.). Traversing every line and manually keeping track of tab depth seems a waste of resources (although I'm getting desperate enough to consider it again).

For those REALLY interested in the beast I need to tame, PM me and I will send you the full source data file.

UPDATE 0: a source data file is in idnopheq's scratchpad.

UPDATE 1: I will turn the solution into a module for general consumption once I'm able to get my mind around this. Maybe it'll even be flexible enough for CheckPoint and Juniper/NetScreen files!

TIA
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.

~ Shunryu Suzuki

Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by chromatic (Archbishop) on Oct 16, 2004 at 19:31 UTC
Unless there's a CPAN module to handle this type of file, I'd consider building a finite state machine in Perl to handle the parsing. That way, you can read in a line, handle it -- starting or ending a state as necessary -- and move on. One trick that might help is to keep a global or semi-global stack of the current position. If you're familiar with references, this would be a reference to the current position in the master hash. When you enter a new state, push a new hash reference on the stack. When you leave a state, pop the hash reference off. (Don't forget to store it in the master structure though.) Does that make sense?
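To make the stack-of-references idea concrete, here is a minimal sketch. It is not chromatic's code: the sub name `parse_nested` and the `_value` key used for bare words are made up for illustration, and it assumes the input contains only `:name (`, bare words, and `)` (no quoted strings).

```perl
use strict;
use warnings;

# Hypothetical helper: build nested hashes from ":key (value)" style lines.
# The "_value" key for bare words is an assumption of this sketch.
sub parse_nested {
    my @lines = @_;
    my %master;
    my $current = \%master;    # reference to the current position
    my @stack;                 # references to the enclosing positions

    for my $line (@lines) {
        # Tokens: ":name (" opens a level, ")" closes one, a bare word is a value.
        while ($line =~ /\G\s*(?: :([-\w.]+)\s*\( | (\)) | ([-\w.]+) )/gcx) {
            my ($name, $close, $word) = ($1, $2, $3);
            if (defined $name) {
                push @stack, $current;                  # remember where we were
                $current = $current->{$name} = {};      # store it in the master, then descend
            }
            elsif (defined $close) {
                die "unbalanced ')'\n" unless @stack;
                $current = pop @stack;                  # back to the enclosing level
            }
            else {
                $current->{_value} = $word;
            }
        }
        # Whatever the tokenizer could not consume must be whitespace only.
        my $rest = substr($line, pos($line) || 0);
        die "unrecognized input: $rest\n" if $rest =~ /\S/;
    }
    die "unclosed '('\n" if @stack;
    return \%master;
}

my $config = parse_nested(":rip (", "    :enable (false)", "    :metric (1)", ")");
print $config->{rip}{metric}{_value}, "\n";
```

The new hash is stored in its parent at the moment it is created, so popping the stack only has to restore the previous position.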
Re^2: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by idnopheq (Chaplain) on Oct 16, 2004 at 19:44 UTC
No. I am totally baffled by your answer. But perhaps it's the "nudge" I need, and I'll read the link you provided and knock some more dust off. I cannot thus far find a CPAN module to help me (beyond using Net::Telnet::Netscreen; visiting each box insecurely is no fun whatsoever). THX!!!
--
idnopheq
Re^3: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by SpanishInquisition (Pilgrim) on Oct 18, 2004 at 19:18 UTC
He's speaking of building a SAX-style parser for your file format. Recognize tokens, call functions when you find them, and keep track of context with package variables... Another alternative might be Parse::RecDescent or the Parse::Yapp module...
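A rough sketch of that event-driven style, for illustration only: the sub name `tokenize` and the handler names `open`, `close`, and `value` are invented here, and the token rules are guessed from the samples in this thread. Input the regex does not recognize simply ends the scan.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: recognizes a token, then calls the matching handler.
sub tokenize {
    my ($text, %on) = @_;
    while ($text =~ /\G\s*(?: :([-\w.]*)\s*\( | (\)) | "([^"]*)" | ([-\w.]+) )/gcx) {
        # Copy the captures before calling out, in case a handler runs its own regex.
        my ($name, $close, $string, $word) = ($1, $2, $3, $4);
        if    (defined $name)  { $on{open}->($name) }
        elsif (defined $close) { $on{close}->() }
        else                   { $on{value}->(defined $string ? $string : $word) }
    }
    return;
}

# Usage: print an indented outline; the context is just a depth counter.
my $depth = 0;
tokenize(
    ':rip ( :enable (false) :metric (1) )',
    open  => sub { print '  ' x $depth++, "$_[0]\n" },
    close => sub { $depth-- },
    value => sub { print '  ' x $depth, "= $_[0]\n" },
);
```

The handlers decide what to build (hashes, XML, a report), so the tokenizer itself stays unchanged when the output format does.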
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by ambrus (Abbot) on Oct 16, 2004 at 20:31 UTC
This is a tricky question, especially because some of the parenthesized entries in the input contain only a word, some both a word and :foo(...) pairs, some only :foo(...) pairs, so it's not obvious what data structure to use. Here's my guess for interpreting it (you might want to tidy it a bit of course, like changing what's allowed in words and what's not, or adding `my` vars).

    use Data::Dumper;
    $s = \%p;
    @s = ();
    while (<>) {
        while (/\G\s*(?:([-\w.]+)|:([-\w.]+)\s*\(|(\)))/gc) {
            if (defined($1)) {
                defined($$s{""}) and die "parse error: two";
                $$s{""} = $1;
            } elsif (defined($2)) {
                push @s, $s;
                $s = $$s{$2} = {};
            } elsif (defined($3)) {
                @s or die "parse error: close";
                $s = pop @s;
            }
        }
        /(\S.*)/g and die "parse error: junk: $1";
    }
    $! and die "read error";
    $s == \%p or die "parse error: open";
    print Dumper(\%p);

Update 2006 jun 2: this works for the examples in the node only, not the full example, it appears.
Re^2: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by idnopheq (Chaplain) on Oct 16, 2004 at 21:32 UTC
THX, ambrus! I'm running with chromatic's idea for the moment. I like yours as well at first glance, since it only requires Data::Dumper, which every perl-enabled machine has by default. THX
--
idnopheq
Re^2: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by ambrus (Abbot) on Jul 30, 2006 at 22:36 UTC
Here's a corrected version that can parse the long sample on your scratchpad (which I also copy below so that it wouldn't disappear unexpectedly). The original script (that in the parent thread) couldn't parse the longer sample because of features it had that I couldn't have guessed from the small samples on the node, and at the time I wrote that you hadn't given us the long sample. There are three differences: firstly, this script expects that the data starts with an opening parenthesis, secondly, it accepts a lone colon instead of a colon with a keyword after it, thirdly, it accepts double-quoted strings.

    perl -we 'use Data::Dumper; $s = \%p; @s = ();
    while (<>) { our $f++ or $_ = ": " . $_;
      while (/\G\s*(?:([-\w.]+|"[^"]*")|:([-\w.]*)\s*\(|(\)))/gc) {
        if (defined($1)) { defined($$s{""}) and die "parse error: two"; $$s{"@"} = $1; }
        elsif (defined($2)) { push @s, $s; $s = $$s{$2} = {}; }
        elsif (defined($3)) { @s or die "parse error: close"; $s = pop @s; } }
      /(\S.*)/g and die "parse error: junk: $1"; }
    $! and die "read error"; $s == \%p or die "parse error: open"; print Dumper(\%p);'

A historical note. I did the correction because someone has asked on an irc channel how to parse a file of this exact format.

Here's the long sample: Read more... (58 kB)

Update: a version of the above converted to a real script (not a one-liner using global variables) is here. This one also removes double-quotes from double-quoted strings and accepts multi-line strings. The file format has backslash-escaped double quotes in double-quoted strings it seems, and possibly other things this can't parse.

    use warnings; use strict;
    use Data::Dumper;

    sub parse {
        my($f) = @_;
        my($s, %p, @s, $b);
        $s = \%p;
        while (<$f>) {
            $b++ or $_ = ": " . $_;
            while (/\G\s*(?:([-\w.]+)|"([^"]*)"|("[^"]*$)|:([-\w.]*)\s*\(|(\)))/gc) {
                if (defined($1) || defined($2)) {
                    defined($$s{""}) and die "parse error: two";
                    $$s{"@"} = defined($1) ? $1 : $2;
                } elsif (defined($3)) {
                    $_ = $+ . <$f>;
                } elsif (defined($4)) {
                    push @s, $s;
                    $s = $$s{$+} = {};
                } elsif (defined($5)) {
                    @s or die "parse error: close";
                    $s = pop @s;
                }
            }
            /(\S.*)/g and die "parse error: junk: $1";
        }
        $! and die "read error";
        $s == \%p or die "parse error: open";
        \%p;
    }

    my $p = parse(*ARGV);
    print Dumper($p);
    __END__

Update: `defined($$s{""}) and die "parse error: two";` should be changed to `defined($$s{"@"}) and die "parse error: two";` in both scripts, I believe.
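Since the original goal was to retrieve specific information, here is one way the parsed structure might be queried afterwards. This is only a sketch: the sub name `flatten` and the `/`-separated path format are invented for illustration, and it assumes leaf values sit under the `"@"` key as in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn nested hashes into "path => value" pairs.
# Assumes leaf values are stored under the "@" key.
sub flatten {
    my ($node, $prefix, $out) = @_;
    $prefix = '' unless defined $prefix;
    $out ||= {};
    for my $key (sort keys %$node) {
        if ($key eq '@') {
            $out->{$prefix} = $node->{$key};    # value belonging to the current path
        }
        else {
            my $path = length($prefix) ? "$prefix/$key" : $key;
            flatten($node->{$key}, $path, $out);    # descend one level
        }
    }
    return $out;
}

my $tree = { rip => { enable => { '@' => 'false' }, metric => { '@' => 1 } } };
my $flat = flatten($tree);
print "$_ = $flat->{$_}\n" for sort keys %$flat;
```

A flat path-to-value hash is easier to grep or compare between devices than the nested form, at the cost of losing the tree shape.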
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by CountZero (Bishop) on Oct 16, 2004 at 20:02 UTC
It looks a bit like a tree, with all nodes not starting with '`:`' being terminal nodes and the others (those with a '`:`') being non-terminal branches. I don't think the number of tabs is significant as you have the same information in the '`(`' and '`)`'. I would have a look into CPAN modules which deal with trees.

Another solution might be to transform this structure into XML and then deal with it through all XML-related modules and tools. A hand crafted XML of (part of) the above structure might read:

    <rip>
        <bind_interface>false</bind_interface>
        <enable>false</enable>
        <poison_split_horizon>enable
            <enable>
                <poison>false</poison>
            </enable>
            <disable>null</disable>
        </poison_split_horizon>
        <metric>1</metric>
        ....
    </rip>

The trick will be to maintain a stack with the name of the tags you need to close.

CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
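The tag-stack trick could look roughly like this. It is a sketch under assumptions, not CountZero's code: the sub name `nml_to_xml` is hypothetical, only `:name (`, bare words, and `)` are recognized, and quoted strings or values needing XML escaping are not handled.

```perl
use strict;
use warnings;

# Hypothetical converter: ":name (" becomes <name>, ")" closes the most recent tag.
sub nml_to_xml {
    my ($text) = @_;
    my @open_tags;    # names of the tags still waiting to be closed
    my $xml = '';
    while ($text =~ /\G\s*(?: :([-\w.]+)\s*\( | (\)) | ([-\w.]+) )/gcx) {
        my ($tag, $close, $word) = ($1, $2, $3);
        if (defined $tag) {
            push @open_tags, $tag;
            $xml .= "<$tag>";
        }
        elsif (defined $close) {
            die "unbalanced ')'\n" unless @open_tags;
            $xml .= '</' . pop(@open_tags) . '>';
        }
        else {
            $xml .= $word;    # bare words contain only [-\w.], so no escaping is needed
        }
    }
    die "unclosed '('\n" if @open_tags;
    return $xml;
}

print nml_to_xml(":rip (\n\t:enable (false)\n\t:metric (1)\n)"), "\n";
```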
Re^2: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by idnopheq (Chaplain) on Oct 16, 2004 at 20:34 UTC
XML was another approach I tried (sequentially with LDAP). It seems to come back to my inability to properly parse the nested data. Anticipating I ignore the tabs and focus on the related parentheses, I could handle it. But how do I count the relative number of nested open/close parenthetical pairs while within such a pair? Perhaps there is a math-related module that can help me! THX!!!
--
idnopheq
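On counting nested pairs: no math module should be needed, since a single running counter does it. A minimal sketch, assuming parentheses never occur inside quoted strings (the sub name `paren_depths` is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: return the nesting depth after each line.
sub paren_depths {
    my @lines = @_;
    my $depth = 0;
    my @depths;
    for my $line (@lines) {
        my $opens  = ($line =~ tr/(//);    # tr with an empty replacement only counts
        my $closes = ($line =~ tr/)//);
        $depth += $opens - $closes;
        die "more ')' than '(' seen so far\n" if $depth < 0;
        push @depths, $depth;
    }
    return @depths;
}

print join(' ', paren_depths(":rip (", "    :metric (1)", ")")), "\n";
```

A line such as `:metric (1)` opens and closes on the same line, so the depth is unchanged after it; only lines with an unmatched `(` or `)` move the counter.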
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by idnopheq (Chaplain) on Oct 18, 2004 at 16:49 UTC
Hi, All!

Thanks everyone for your tips. Taking bits from each, I created my own parser for the file.

    #!/usr/bin/perl -wT
    #-*-cperl-*-
    use strict;

    my ( $laststate, $state, $nextstate );

    open(SOURCE, "< $ARGV[0]")
      or die "Couldn't open $ARGV[0] for reading: $!\n";

    # There are three possible line beginnings (ignoring whitespace): '(', ':', and ')'
    # There are three possible line endings (ignoring whitespace): '(', ')', and neither one
    while (<SOURCE>) {
      SWITCH: {
        # the first line in the file always starts with '('
        # no other line will match this
        /^\(/ && do {
          $state = $laststate = 0;
          $nextstate = 1;               # elevate the state
          last SWITCH;
        };

        # lines beginning with ':' always have whitespace before it
        # these lines either maintain or elevate the state, never lower the state
        /^\s+:/ && do {
          $laststate = $state;
          $state = $nextstate;

          # here is where we analyse the line endings
          STATE: {
            # if the line ends with '(', elevate the state
            /\($/ && do {
              $nextstate++;
              last STATE;
            };
            # if the line ends with ')', maintain the state?
            /\)$/ && do {
              last STATE;
            };
            # other line endings elevate the state
            $nextstate++;
          }
          last SWITCH;
        };

        # if the line contains only whitespace and ')', lower the state
        /^\s+\)$/ && do {
          $laststate = $state;
          $state--;
          $nextstate--;
          last SWITCH;
        };
      }

      # analyse our state
      if ( $state != $laststate ) {     # we changed state
        print "State Change!\t\t";
      }
      print "$laststate\t$state\t$nextstate\n";
    }
    close SOURCE;

I love the idea of DFA::Simple. The documentation for it sux, however. So I rolled my own based upon a review of my source file (which is still in idnopheq's scratchpad). It's just a shell so far. I'll move onto playing with real data now.

THX everyone!
--
idnopheq
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.) by idnopheq (Chaplain) on Nov 01, 2004 at 17:21 UTC
Hi, All!

Here is the working code. It needs a lot of love, sure. But here it is for the interested. When I'm done, this might be a module.

    #!/usr/bin/perl -wT
    #-*-cperl-*-
    use strict;
    use Data::Dumper;

    my ( $laststate, $state, $nextstate );    # some states
    my $key;                                  # some key
    my $value;                                # some value
    my %NML;                                  # the master hash
    my @hohref = ();                          # the hash reference array
    my $href = \%NML;                         # the master hash reference

    open ( SOURCE, "< $ARGV[0]" )
      or die "Couldn't open $ARGV[0] for reading: $!\n";

    # There are three possible line beginnings (ignoring whitespace):
    # '(', ':', and ')'
    # There are three possible line endings (ignoring whitespace):
    # '(', ')', and neither one
    while ( <SOURCE> ) {
      SWITCH: {
        # the first line in the file always starts with '('
        # no other line will match this
        /^\(/ && do {
          $state = $laststate = 0;                # set the state
          $nextstate = 1;                         # elevate the state
          ( $key, $value ) = ParseLine ( $_ );    # parse the line
          $href->{"filename"} = $key;             # slap it in the master hash reference
          last SWITCH;                            # move on to the next line
        };

        # lines beginning with ':' always have whitespace before it
        # these lines either maintain or elevate the state, never lower the state
        /^\s+:/ && do {
          $laststate = $state;                    # set our old state
          $state = $nextstate;                    # set our new state

          # here is where we analyse the line endings
          STATE: {
            # if the line ends with '(', elevate the state
            /\($/ && do {
              $nextstate++;                           # elevate the state
              ( $key, $value ) = ParseLine ( $_ );    # parse the line
              push @hohref, $href;                    # track our master hash reference
              if ( exists $$href{$key} ) {            # if we already have an anon hash ref
                $href = $$href{$key};                 # then reuse it
              } elsif ( ! exists $$href{$key} ) {     # if we don't have an anon hash ref
                $href = $$href{$key} = {};            # so make a new one
              }
              last STATE;                             # move on
            };
            # if the line ends with ')', maintain the state
            /\)$/ && do {
              ( $key, $value ) = ParseLine ( $_ );    # parse the line
              $$href{$key} = array_ref()              # get an anon array ref
                unless ( ref ( $$href{$key} ) eq 'ARRAY' );   # unless we already have an array ref
              push @{ $$href{$key} }, $value;         # add a value to the array ref
              last STATE;                             # move on
            };
            # other line endings elevate the state
            $nextstate++;                             # elevate the state
            ( $key, $value ) = ParseLine ( $_ );      # parse the line
            push @hohref, $href;                      # track our master hash reference
            $href = $$href{$key} = {}                 # make a new anon hash ref tied to the master
              unless defined $value;                  # unless we have a value
            $href = $$href{$value} = {}               # make a new anon hash ref tied to the master
              if defined $value;                      # with the value as the key
          }
          last SWITCH;                                # next line
        };

        # if the line contains only whitespace and ')', lower the state
        /^\s+\)$/ && do {
          $laststate = $state;                    # set our old state
          $state--;                               # decrement our state
          $nextstate--;                           # decrement our next state
          $href = pop @hohref;                    # remove our old master hash reference
          last SWITCH;                            # next line
        };
      }

      # analyse our state
      if ( $state != $laststate ) {               # we changed state
        print "State Change!\t\t" if $ARGV[1];
      }
      print "$laststate\t$state\t$nextstate\n" if $ARGV[1];
    }
    close SOURCE;

    # we only care about device files
    # other file types are small, so I don't mind reading them in and then
    # discarding them
    die "$ARGV[0] is not a device file!\n"
      unless $NML{type}[0] eq "device";

    print Dumper \%NML;

    sub ParseLine {
      # this could use some severe help
      # strip the leading colon
      $_[0] =~ s/://;
      # strip any quotes
      $_[0] =~ s/\"//g;
      # strip the open parenthesis
      $_[0] =~ s/\(//;
      # strip the closing parenthesis
      $_[0] =~ s/\)$//;
      # strip any whitespace
      $_[0] =~ s/^\s+//;
      $_[0] =~ s/\s+$//;
      # assign both values to variables
      if ( $_[0] =~ / /) {
        return ( split / /, $_[0], 2 );   # split on the white space and return two elements
      } else {
        return ( $_[0], undef );          # or return one and undef
      }
    }

    sub array_ref {
      # return an anonymous array reference
      my @array;
      return \@array;
    }

HTH
--
idnopheq

