Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Parsing structured text, problem with parsing logic

by hperange (Beadle)
on May 10, 2012 at 11:36 UTC ( #969796=perlquestion: print w/ replies, xml ) Need Help??
hperange has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am trying to parse a text file, which has a certain structure, i would say its recursive; I tried to parse it using an iterative approach, going over each line and trying to figure out what action to take based on the current line and some state variables that I maintain/update as my program crunches over the lines. The problem I have is that I cannot figure out the correct logic to parse the input to the output format I desire. Could someone help me figure out the logic ( an algorithm, perhaps ) for parsing this structured text file ? I am not necessarily asking for code, just maybe a text description or ideas how to do it. I will attach my code ( which is not producing a correct output ), and try to explain it a few steps how it operates.

A sample input:
max_reformation_duration=300 cfs_online_timeout_changes=200 max_configured_packages=100 node:nodeA|name=nodeA node:nodeA|status=up node:nodeA|interface:lan900|name=lan900 node:nodeA|interface:lan900|status=up node:nodeA|interface:lan900|ip_address:IP_A|name=IP_A node:nodeA|interface:lan900|ip_address:IP_A|netmask=255.255.255.0 node:nodeA|interface:lan901|name=lan901 node:nodeA|interface:lan901|status=up node:nodeA|interface:lan901|ip_address:IP_C|name=IP_C node:nodeA|interface:lan901|ip_address:IP_C|netmask=255.255.255.0 node:nodeA|subnet:SUBNET_A|name=SUBNET_A node:nodeA|subnet:SUBNET_A|status=up node:nodeA|subnet:SUBNET_B|name=SUBNET_B node:nodeA|subnet:SUBNET_B|status=up node:nodeB|name=nodeB node:nodeB|status=up node:nodeB|interface:lan900|name=lan900 node:nodeB|interface:lan900|status=up node:nodeB|interface:lan900|ip_address:IP_B|name=IP_B node:nodeB|interface:lan900|ip_address:IP_B|netmask=255.255.255.0 node:nodeB|interface:lan901|name=lan901 node:nodeB|interface:lan901|status=up node:nodeB|interface:lan901|ip_address:IP_D|name=IP_D node:nodeB|interface:lan901|ip_address:IP_D|netmask=255.255.255.0 node:nodeB|subnet:SUBNET_A|name=SUBNET_A node:nodeB|subnet:SUBNET_A|status=up node:nodeB|subnet:SUBNET_B|name=SUBNET_B node:nodeB|subnet:SUBNET_B|status=up
Parsed, it should look like this:
%config = { max_reformation_duration => 300, cfs_online_timeout_changes => 200, max_configured_packages => 100, node => { nodeA => { name => nodeA, status => up, interface => { lan900 => { name => lan900, status => up, ip_adress => { IP_A => { name => IP_A, netmask => 255.255.255.0, }, }, }, lan901 => { name => lan901, status => up, ip_adress => { IP_C => { name => IP_C, netmask => 255.255.255.0, }, }, }, }, subnet => { SUBNET_A => { name => SUBNET_A, status => up, }, SUBNET_B => { name => SUBNET_B, status => up, }, }, }, nodeB => { name => nodeB, status => up, interface => { lan900 => { name => lan900, status => up, ip_adress => { IP_A => { name => IP_B, netmask => 255.255.255.0, }, }, }, lan901 => { name => lan901, status => up, ip_adress => { IP_C => { name => IP_D, netmask => 255.255.255.0, }, }, }, }, subnet => { SUBNET_A => { name => SUBNET_A, status => up, }, SUBNET_B => { name => SUBNET_B, status => up, }, }, }, }, }
My code:
#!/usr/bin/perl use strict; use warnings; use Data::Dumper; $Data::Dumper::Indent = 1; $Data::Dumper::Sortkeys = 1; sub read_input { open(my $io, '<', 'input.txt') or die $!; return <$io>; } my @lines = read_input(); my %config; my $hashref = \%config; # stack of hash references my @ref; # count of | characters in a respective line my $level = 0; # count of | characters in the previous line my $prev_level = 0; my $key; my $fkey; for (@lines) { chomp; my @f = split(/\|/); $level = @f - 1; if ($level == 0) { my @f = split(/=/); $hashref->{ $f[0] } = $f[1]; } else { my @q = split(/:/, $f[$level - 1]); if ($key && $level <= $prev_level) { if (/\Q$key\E/) { my @f = split(/=/, $f[$level]); $hashref->{ $f[0] } = $f[1]; $prev_level = $level; } elsif (/\Q$fkey\E/) { $hashref = $ref[-1]; $hashref->{$q[1]} = {}; $hashref = $hashref->{$q[1]}; $key = join(':', @q); $prev_level = $level; redo; } else { pop @ref; if ($level < $prev_level) { my $diff = $prev_level - $level; while ($diff--) { pop @ref; pop @ref; } } $hashref = pop @ref; $key = ''; $prev_level = $level; redo; } } else { if (exists $hashref->{$q[0]}) { push @ref, $hashref; $hashref->{$q[0]}->{$q[1]} = {}; $hashref = $hashref->{$q[0]}->{$q[1]}; } else { push @ref, $hashref; $hashref->{$q[0]} = {}; $hashref = $hashref->{$q[0]}; push @ref, $hashref; $hashref->{$q[1]} = {}; $hashref = $hashref->{$q[1]}; } $key = join(':', @q); $fkey = $q[0]; $prev_level = $level; redo; } } } print Dumper( \%config );
Output from the above code:
$VAR1 = { 'cfs_online_timeout_changes' => '200', 'max_configured_packages' => '100', 'max_reformation_duration' => '300', 'node' => { 'nodeA' => { 'interface' => { 'lan900' => { 'ip_address' => { 'IP_A' => { 'name' => 'IP_A', 'netmask' => '255.255.255.0' } }, 'name' => 'lan900', 'status' => 'up' }, 'lan901' => { 'ip_address' => { 'IP_C' => { 'name' => 'IP_C', 'netmask' => '255.255.255.0' } }, 'name' => 'lan901', 'status' => 'up' } }, 'name' => 'nodeA', 'status' => 'up' }, 'subnet' => { 'SUBNET_A' => { 'name' => 'SUBNET_A', 'status' => 'up' }, 'SUBNET_B' => { 'name' => 'SUBNET_B', 'status' => 'up' } } } };
Explanation:

  • the core of the algorithm/approach are $hashref, and @ref; $hashref being a reference to the hash where the key-value pairs will be inserted if found, @ref being a stack used to push and pop the current value of $hashref, for tracking the previous values, in case more elements will need to be inserted
  • if a line having the format 'key=value' is found it gets inserted in $hashref
  • if the line contains the | (pipe) character, it is split on this character, and the code tries to figure out whether to create a new subhash, or just insert the 'key=value' found at the end of this line

So, to sum it up, i am not looking for comments of my code - the reason for this is that even after several rewrites, it grows to a complexity level which is just too hard to understand, and debug ( i tried, i failed ); I am rather looking for a different approach to achieve my goal: transfer the input ( first code block ) to the desired output format ( second code block ). Thanks in advance to whoever takes the time to read through all of this.

Comment on Parsing structured text, problem with parsing logic
Select or Download Code
Re: Parsing structured text, problem with parsing logic
by BrowserUk (Pope) on May 10, 2012 at 12:02 UTC

    Try it this way:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; my %config; while( <> ) { chomp; my @bits = split '[:|=]'; my $ref = \%config; $ref = $ref->{ shift( @bits ) } //= {} while @bits > 2; $ref->{ pop @bits } = pop @bits; } pp \%config;

    Produces:

    C:\test>969796.pl 969796.dat { "cfs_online_timeout_changes" => 200, "max_configured_packages" => 100, "max_reformation_duration" => 300, node => { nodeA => { interface => { lan900 => { ip_address => { IP_A => +{ name => "IP_A", netmask => "255.255.255.0" } }, name => "lan900", status => "up", }, lan901 => { ip_address => { IP_C => +{ name => "IP_C", netmask => "255.255.255.0" } }, name => "lan901", status => "up", }, }, name => "nodeA", status => "up", subnet => { SUBNET_A => { name => "SUBNET_A", st +atus => "up" }, SUBNET_B => { name => "SUBNET_B", st +atus => "up" }, }, }, nodeB => { interface => { lan900 => { ip_address => { IP_B => +{ name => "IP_B", netmask => "255.255.255.0" } }, name => "lan900", status => "up", }, lan901 => { ip_address => { IP_D => +{ name => "IP_D", netmask => "255.255.255.0" } }, name => "lan901", status => "up", }, }, name => "nodeB", status => "up", subnet => { SUBNET_A => { name => "SUBNET_A", st +atus => "up" }, SUBNET_B => { name => "SUBNET_B", st +atus => "up" }, }, }, }, }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://969796]
Approved by BrowserUk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (7)
As of 2014-12-27 15:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (177 votes), past polls