Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)
by chromatic (Archbishop) on Oct 16, 2004 at 19:31 UTC
|
Unless there's a CPAN module to handle this type of file, I'd consider building a finite state machine in Perl to handle the parsing. That way, you can read in a line, handle it -- starting or ending a state as necessary -- and move on.
One trick that might help is to keep a global or semi-global stack of the current position. If you're familiar with references, this would be a reference to the current position in the master hash. When you enter a new state, push a new hash reference on the stack. When you leave a state, pop the hash reference off. (Don't forget to store it in the master structure though.)
Does that make sense?
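A minimal sketch of the stack-of-references idea, assuming a simplified line-oriented input (the sample lines and the name parse_lines are made up for illustration, not taken from the original file). It stores each new hash in its parent straight away and pushes the parent on the stack, which is a small variation on the description above:

```perl
use strict;
use warnings;

# Hypothetical sample, loosely modelled on the format discussed in this thread.
my @sample = (
    ':rip (',
    '    :enable (false)',
    '    :metric (1)',
    ')',
);

sub parse_lines {
    my @lines = @_;
    my %master;
    my @stack;                 # saved references to the enclosing hashes
    my $current = \%master;    # reference to the current position

    for my $line (@lines) {
        if ($line =~ /^\s*:([-\w.]+)\s*\(\s*$/) {
            # entering a state: store the new hash in the structure, then descend
            push @stack, $current;
            $current = $current->{$1} = {};
        }
        elsif ($line =~ /^\s*:([-\w.]+)\s*\(([^()]*)\)\s*$/) {
            # a complete key/value pair on a single line
            $current->{$1} = $2;
        }
        elsif ($line =~ /^\s*\)\s*$/) {
            # leaving a state: go back to the enclosing hash
            @stack or die "unbalanced ')'\n";
            $current = pop @stack;
        }
        else {
            die "unrecognised line: $line\n";
        }
    }
    @stack and die "unclosed '('\n";
    return \%master;
}

my $tree = parse_lines(@sample);
print "$tree->{rip}{metric}\n";    # prints 1
```

Because every hash is attached to its parent when it is created, nothing has to be copied back into the master structure when a state ends.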
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)
by ambrus (Abbot) on Oct 16, 2004 at 20:31 UTC
|
This is a tricky question, especially because some of the parenthesized entries in the input contain only a word, some both a word and :foo(...) pairs, and some only :foo(...) pairs, so it's not obvious what data structure to use.
Here's my guess for interpreting it (you might want to tidy it up a bit, of course, for example by changing what is allowed in words or by adding my declarations).
use Data::Dumper;
$s = \%p;
@s = ();
while (<>) {
    while (/\G\s*(?:([-\w.]+)|:([-\w.]+)\s*\(|(\)))/gc) {
        if (defined($1)) {
            defined($$s{""}) and die "parse error: two";
            $$s{""} = $1;
        } elsif (defined($2)) {
            push @s, $s;
            $s = $$s{$2} = {};
        } elsif (defined($3)) {
            @s or die "parse error: close";
            $s = pop @s;
        }
    }
    /(\S.*)/g and die "parse error: junk: $1";
}
$! and die "read error";
$s == \%p or die "parse error: open";
print Dumper(\%p);
Update 2006 Jun 2: it appears that this works only for the examples in the node, not for the full example.
|
THX, ambrus!
I'm running with chromatic's idea for the moment.
I like yours as well at first glance, since it only requires Data::Dumper, which every perl-enabled machine has by default.
THX
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.
|
Here's a corrected version that can parse the long sample on your scratchpad (which I also copy below so that it doesn't disappear unexpectedly). The original script (the one in the parent node) couldn't parse the longer sample because the sample has features that I couldn't have guessed from the small samples in the node, and you hadn't given us the long sample when I wrote it. There are three differences: first, this script expects the data to start with an opening parenthesis; second, it accepts a lone colon instead of a colon with a keyword after it; third, it accepts double-quoted strings.
perl -we 'use Data::Dumper;
$s = \%p;
@s = ();
while (<>) {
    our $f++ or $_ = ": " . $_;
    while (/\G\s*(?:([-\w.]+|"[^"]*")|:([-\w.]*)\s*\(|(\)))/gc) {
        if (defined($1)) {
            defined($$s{""}) and die "parse error: two";
            $$s{"@"} = $1;
        } elsif (defined($2)) {
            push @s, $s;
            $s = $$s{$2} = {};
        } elsif (defined($3)) {
            @s or die "parse error: close";
            $s = pop @s;
        }
    }
    /(\S.*)/g and die "parse error: junk: $1";
}
$! and die "read error";
$s == \%p or die "parse error: open";
print Dumper(\%p);'
A historical note: I made the correction because someone asked on an IRC channel how to parse a file in exactly this format.
Here's the long sample
Update:
a version of the above converted to a real script (not a one-liner using global variables) is here. This one also removes double-quotes from double-quoted strings and accepts multi-line strings. The file format has backslash-escaped double quotes in double-quoted strings it seems, and possibly other things this can't parse.
use warnings; use strict;
use Data::Dumper;
sub parse {
    my($f) = @_;
    my($s, %p, @s, $b);
    $s = \%p;
    while (<$f>) {
        $b++ or $_ = ": " . $_;
        while (/\G\s*(?:([-\w.]+)|"([^"]*)"|("[^"]*$)|:([-\w.]*)\s*\(|(\)))/gc) {
            if (defined($1) || defined($2)) {
                defined($$s{""}) and die "parse error: two";
                $$s{"@"} = defined($1) ? $1 : $2;
            } elsif (defined($3)) {
                $_ = $+ . <$f>;
            } elsif (defined($4)) {
                push @s, $s;
                $s = $$s{$+} = {};
            } elsif (defined($5)) {
                @s or die "parse error: close";
                $s = pop @s;
            }
        }
        /(\S.*)/g and die "parse error: junk: $1";
    }
    $! and die "read error";
    $s == \%p or die "parse error: open";
    \%p;
}
my $p = parse(*ARGV);
print Dumper($p);
__END__
Update:
defined($$s{""}) and die "parse error: two"; should be changed to defined($$s{"@"}) and die "parse error: two"; in both scripts, I believe.
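For the backslash-escaped double quotes mentioned above, one possible pattern is sketched here. It assumes that a backslash inside a string simply protects the character after it; the real file format may define other escapes, and the helper name unquote is made up for this example:

```perl
use strict;
use warnings;

# Match a double-quoted string that may contain backslash escapes and
# return its contents with the escapes resolved.
# Assumption: "\x" always stands for the literal character x.
sub unquote {
    my ($text) = @_;
    $text =~ /^\s*"((?:[^"\\]|\\.)*)"/s or return;
    my $body = $1;
    $body =~ s/\\(.)/$1/gs;    # \" becomes ", \\ becomes \
    return $body;
}

print unquote('"say \"hi\" now"'), "\n";    # prints: say "hi" now
```

The alternation (?:[^"\\]|\\.)* could stand in for [^"]* in the tokenizer's string branch, so that an escaped quote no longer ends the string early.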
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)
by CountZero (Bishop) on Oct 16, 2004 at 20:02 UTC
|
It looks a bit like a tree, with all nodes not starting with ':' being terminal nodes and the others (those with a ':') being non-terminal branches. I don't think the number of tabs is significant, as you have the same information in the '(' and ')'. I would have a look at CPAN modules which deal with trees. Another solution might be to transform this structure into XML and then deal with it through the XML-related modules and tools. A hand-crafted XML version of (part of) the above structure might read:
<rip>
  <bind_interface>false</bind_interface>
  <enable>false</enable>
  <poison_split_horizon>enable
    <enable>
      <poison>false</poison>
    </enable>
    <disable>null</disable>
  </poison_split_horizon>
  <metric>1</metric>
  ....
</rip>
The trick will be to maintain a stack with the name of the tags you need to close.
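A rough sketch of that tag stack, assuming one entry per line and values that need no XML escaping (the function name to_xml and the sample lines are invented for illustration):

```perl
use strict;
use warnings;

# Convert the line-oriented format to XML, keeping a stack of the
# tag names that still have to be closed.
sub to_xml {
    my @lines = @_;
    my (@open, @out);
    for my $line (@lines) {
        my $indent = '  ' x @open;
        if ($line =~ /^\s*:([-\w.]+)\s*\(([^()]*)\)\s*$/) {
            # leaf: opened and closed on the same line
            push @out, "$indent<$1>$2</$1>";
        }
        elsif ($line =~ /^\s*:([-\w.]+)\s*\(\s*(\S*)\s*$/) {
            # branch: emit the opening tag and remember its name
            push @out, "$indent<$1>$2";
            push @open, $1;
        }
        elsif ($line =~ /^\s*\)\s*$/) {
            # closing parenthesis: close the most recently opened tag
            my $tag = pop @open;
            defined $tag or die "unbalanced ')'\n";
            push @out, ('  ' x @open) . "</$tag>";
        }
        else {
            die "unrecognised line: $line\n";
        }
    }
    @open and die "unclosed tags: @open\n";
    return join "\n", @out;
}

print to_xml(':rip (', ':metric (1)', ')'), "\n";
```

The stack gives the name for each closing tag, and its size doubles as the indentation level.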
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
|
XML was another approach I tried (sequentially with LDAP). It seems to come back to my inability to properly parse the nested data.
Assuming I ignore the tabs and focus on the matching parentheses, I could handle it. But how do I count the relative number of nested open/close parenthesis pairs while within such a pair?
Perhaps there is a math-related module that can help me!
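For what it's worth, the nesting level can be tracked with a single counter and no extra module. The sketch below assumes that parentheses never occur inside quoted strings, and the name depths is made up for this example:

```perl
use strict;
use warnings;

# Return the nesting depth in effect at the start of each line.
# Assumption: no parentheses appear inside quoted values.
sub depths {
    my @lines = @_;
    my $depth = 0;
    my @result;
    for my $line (@lines) {
        push @result, $depth;
        # tr/// with an empty replacement list only counts the characters
        $depth += ($line =~ tr/(//) - ($line =~ tr/)//);
        $depth >= 0 or die "too many ')' in: $line\n";
    }
    $depth == 0 or die "unclosed '('\n";
    return @result;
}

my @d = depths(':a (', ':b (x)', ')');
print "@d\n";    # prints 0 1 1
```

A line that opens and closes a pair leaves the counter unchanged, so only the unmatched parentheses move the depth up or down.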
THX!!!
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)
by idnopheq (Chaplain) on Oct 18, 2004 at 16:49 UTC
|
Hi, All!
Thanks everyone for your tips. Taking bits from each, I created my own parser for the file.
#!/usr/bin/perl -wT
#-*-cperl-*-
use strict;
my (
    $laststate,
    $state,
    $nextstate
);
open(SOURCE, "< $ARGV[0]")
    or die "Couldn't open $ARGV[0] for reading: $!\n";
# There are three possible line beginnings (ignoring whitespace): '(', ':', and ')'
# There are three possible line endings (ignoring whitespace): '(', ')', and neither one
while (<SOURCE>) {
    SWITCH: {
        # the first line in the file always starts with '('
        # no other line will match this
        /^\(/ && do {
            $state = $laststate = 0;
            $nextstate = 1; # elevate the state
            last SWITCH;
        };
        # lines beginning with ':' always have whitespace before it
        # these lines either maintain or elevate the state, never lower the state
        /^\s+:/ && do {
            $laststate = $state;
            $state = $nextstate;
            # here is where we analyse the line endings
            STATE: {
                # if the line ends with '(', elevate the state
                /\($/ && do {
                    $nextstate++;
                    last STATE;
                };
                # if the line ends with ')', maintain the state?
                /\)$/ && do {
                    last STATE;
                };
                # other line endings elevate the state
                $nextstate++;
            }
            last SWITCH;
        };
        # if the line contains only whitespace and ')', lower the state
        /^\s+\)$/ && do {
            $laststate = $state;
            $state--;
            $nextstate--;
            last SWITCH;
        };
    }
    # analyse our state
    if ( $state != $laststate ) {
        # we changed state
        print "State Change!\t\t";
    }
    print "$laststate\t$state\t$nextstate\n";
}
close SOURCE;
I love the idea of DFA::Simple. The documentation for it sux, however, so I rolled my own based upon a review of my source file (which is still in idnopheq's scratchpad). It's just a shell so far. I'll move on to playing with real data now.
THX everyone!
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.
Re: Logic trouble parsing a formatted text file into hashes of hashes (of hashes, etc.)
by idnopheq (Chaplain) on Nov 01, 2004 at 17:21 UTC
|
Hi, All!
Here is the working code. It needs a lot of love, sure. But here it is for the interested. When I'm done, this might be a module.
#!/usr/bin/perl -wT
#-*-cperl-*-
use strict;
use Data::Dumper;
my (
    $laststate,
    $state,
    $nextstate
);                   # some states
my $key;             # some key
my $value;           # some value
my %NML;             # the master hash
my @hohref = ();     # the hash reference array
my $href = \%NML;    # the master hash reference
open ( SOURCE, "< $ARGV[0]" )
    or die "Couldn't open $ARGV[0] for reading: $!\n";
# There are three possible line beginnings (ignoring whitespace):
# '(', ':', and ')'
# There are three possible line endings (ignoring whitespace):
# '(', ')', and neither one
while ( <SOURCE> ) {
    SWITCH: {
        # the first line in the file always starts with '('
        # no other line will match this
        /^\(/ && do {
            $state = $laststate = 0;               # set the state
            $nextstate = 1;                        # elevate the state
            ( $key, $value ) = ParseLine ( $_ );   # parse the line
            $href->{"filename"} = $key;            # slap it in the master hash reference
            last SWITCH;                           # move on to the next line
        };
        # lines beginning with ':' always have whitespace before it
        # these lines either maintain or elevate the state, never lower the state
        /^\s+:/ && do {
            $laststate = $state;                   # set our old state
            $state = $nextstate;                   # set our new state
            # here is where we analyse the line endings
            STATE: {
                # if the line ends with '(', elevate the state
                /\($/ && do {
                    $nextstate++;                          # elevate the state
                    ( $key, $value ) = ParseLine ( $_ );   # parse the line
                    push @hohref, $href;                   # track our master hash reference
                    if ( exists $$href{$key} ) {           # if we already have an anon hash ref
                        $href = $$href{$key};              # then reuse it
                    }
                    else {                                 # if we don't have an anon hash ref
                        $href = $$href{$key} = {};         # then make a new one
                    }
                    last STATE;                            # move on
                };
                # if the line ends with ')', maintain the state
                /\)$/ && do {
                    ( $key, $value ) = ParseLine ( $_ );   # parse the line
                    $$href{$key} = array_ref()             # get an anon array ref
                        unless ( ref ( $$href{$key} ) eq 'ARRAY' );   # unless we already have an array ref
                    push @{ $$href{$key} }, $value;        # add a value to the array ref
                    last STATE;                            # move on
                };
                # other line endings elevate the state
                $nextstate++;                              # elevate the state
                ( $key, $value ) = ParseLine ( $_ );       # parse the line
                push @hohref, $href;                       # track our master hash reference
                $href = $$href{$key} = {}                  # make a new anon hash ref tied to the master
                    unless defined $value;                 # unless we have a value
                $href = $$href{$value} = {}                # make a new anon hash ref tied to the master
                    if defined $value;                     # with the value as the key
            }
            last SWITCH;                           # next line
        };
        # if the line contains only whitespace and ')', lower the state
        /^\s+\)$/ && do {
            $laststate = $state;                   # set our old state
            $state--;                              # decrement our state
            $nextstate--;                          # decrement our next state
            $href = pop @hohref;                   # go back to the enclosing hash reference
            last SWITCH;                           # next line
        };
    }
    # analyse our state
    if ( $state != $laststate ) {
        # we changed state
        print "State Change!\t\t"
            if $ARGV[1];
    }
    print "$laststate\t$state\t$nextstate\n"
        if $ARGV[1];
}
close SOURCE;
# we only care about device files
# other file types are small, so I don't mind reading them in and then
# discarding them
die "$ARGV[0] is not a device file!\n"
    unless $NML{type}[0] eq "device";
print Dumper \%NML;
sub ParseLine {
    # this could use some severe help
    # note: $_[0] is an alias, so the caller's variable is modified in place
    # strip the leading colon
    $_[0] =~ s/://;
    # strip any quotes
    $_[0] =~ s/\"//g;
    # strip the open parenthesis
    $_[0] =~ s/\(//;
    # strip the closing parenthesis
    $_[0] =~ s/\)$//;
    # strip any whitespace
    $_[0] =~ s/^\s+//;
    $_[0] =~ s/\s+$//;
    # assign both values to variables
    if ( $_[0] =~ / /) {
        return ( split / /, $_[0], 2 );   # split on the white space and return two elements
    }
    else {
        return ( $_[0], undef );          # or return one and undef
    }
}
sub array_ref {
    # return an anonymous array reference
    my @array;
    return \@array;
}
HTH
--
idnopheq
Apply yourself to new problems without preparation, develop confidence in your ability to meet situations as they arise.