http://www.perlmonks.org?node_id=595928

Moron has asked for the wisdom of the Perl Monks concerning the following question:

The context is efficiently parsing huge XML files or streams that may have single tags enclosing gigabytes of nested tags, which I'll call "enclosing tags". I have a recursive method called gettag that parses an XML tag into a hash structure that necessarily differs from XML::Simple. Another recursive method puttag converts back from this structure to XML. In order to "ignore" enclosing tags, I used instance variable/parameter DEPTH, to denote at which depth gettag is to seek and find tags per call (to avoid attempting to build multi-gigabyte hashes).

I am now focussing on making gettag and puttag into perfect mutual inverse functions. I needed a way for gettag to store the enclosing tag separately that could be picked up by puttag on the reverse run. So I decided to give the object an instance variable ENVELOPE to be used by gettag to dump out the enclosing tag and puttag to pick it up to reconvert back into XML. In order to make this functionality as generic as possible, DEPTH can be anything so that ENVELOPE can have any depth of nesting.

BUT, because gettag is recursive, it will have to process inner enclosing tags before outer ones, i.e.:

sub gettag { # ... # ... only an extract ... my $push = !defined( $self -> { DEPTH } ); $push ||= ( ( $self -> { DEPTH } ) <= ( $self -> { THIS }{ DEP +TH } ) ); while ( !$simple # already deemed as nested type && ( $$bufref =~ /^\<(.)/ ) && ($1 ne '/' ) ) { my $subtag = $self -> gettag(); $push and push @subtags, $subtag; Throw( $self ); } my $tref = { $tag => { ASSMNTS => $assignments, SUBTAGS => \@subtags } }; $push or PutEnvelope( $self, $tref ); # i.e. it is 'marked' as + an enclosing tag # ... end of extract ... # ... }
... the PutEnvelope call happens necessarily for the deepest tag first and repeating outwards to the outermost tag. So my first (and untested) draft of PutEnvelope uses $; as a temporary placeholder for the empty outer levels of the hash...
sub PutEnvelope { # store "ignored" outer tag for later use by puttag my $self = shift; my $tref = shift; my $envtag = keys %$tref; my $depth = 1; my $eref = ( ( $self -> { ENVELOPE } ) ||= {} ); for ( ;( $depth < ( $self -> { THIS }{ DEPTH } ) ); $depth++ ) + { my $k = ( keys %$eref ) || $;; $eref = ( $eref -> { $k }{ SUBTAGS }[0] ||= undef() ); } $eref -> { $envtag } ||= ( $eref -> { $; } ||= undef() ); $eref -> { $envtag }{ ASSMNTS } = $tref -> { ASSMNTS }; # enve +lope tags must preserve assignments delete $eref -> { $; }; # replace any $; at same level. }
Before I commit time beyond this draft level, I wondered if anyone had alternative suggestions for how to manage a hash "backwards" that are proven and/or more elegant/reliable/comprehensive/...?

The most general way to pose this problem is probably the most publicly useful one: How can we invert any hash so that it is stored with its deepest keys at the top and its outer keys at the bottom of some structure?

-M

Free your mind