Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Nesting below emptiness and/or inverting a hash.

by Moron (Curate)
on Jan 22, 2007 at 15:23 UTC ( #595928=perlquestion: print w/replies, xml ) Need Help??
Moron has asked for the wisdom of the Perl Monks concerning the following question:

The context is efficiently parsing huge XML files or streams that may have single tags enclosing gigabytes of nested tags, which I'll call "enclosing tags". I have a recursive method called gettag that parses an XML tag into a hash structure that necessarily differs from XML::Simple. Another recursive method puttag converts back from this structure to XML. In order to "ignore" enclosing tags, I used instance variable/parameter DEPTH, to denote at which depth gettag is to seek and find tags per call (to avoid attempting to build multi-gigabyte hashes).

I am now focussing on making gettag and puttag into perfect mutual inverse functions. I needed a way for gettag to store the enclosing tag separately that could be picked up by puttag on the reverse run. So I decided to give the object an instance variable ENVELOPE to be used by gettag to dump out the enclosing tag and puttag to pick it up to reconvert back into XML. In order to make this functionality as generic as possible, DEPTH can be anything so that ENVELOPE can have any depth of nesting.

BUT, because gettag is recursive, it will have to process inner enclosing tags before outer ones, i.e.:

sub gettag { # ... # ... only an extract ... my $push = !defined( $self -> { DEPTH } ); $push ||= ( ( $self -> { DEPTH } ) <= ( $self -> { THIS }{ DEP +TH } ) ); while ( !$simple # already deemed as nested type && ( $$bufref =~ /^\<(.)/ ) && ($1 ne '/' ) ) { my $subtag = $self -> gettag(); $push and push @subtags, $subtag; Throw( $self ); } my $tref = { $tag => { ASSMNTS => $assignments, SUBTAGS => \@subtags } }; $push or PutEnvelope( $self, $tref ); # i.e. it is 'marked' as + an enclosing tag # ... end of extract ... # ... }
... the PutEnvelope call happens necessarily for the deepest tag first and repeating outwards to the outermost tag. So my first (and untested) draft of PutEnvelope uses $; as a temporary placeholder for the empty outer levels of the hash...
sub PutEnvelope { # store "ignored" outer tag for later use by puttag my $self = shift; my $tref = shift; my $envtag = keys %$tref; my $depth = 1; my $eref = ( ( $self -> { ENVELOPE } ) ||= {} ); for ( ;( $depth < ( $self -> { THIS }{ DEPTH } ) ); $depth++ ) + { my $k = ( keys %$eref ) || $;; $eref = ( $eref -> { $k }{ SUBTAGS }[0] ||= undef() ); } $eref -> { $envtag } ||= ( $eref -> { $; } ||= undef() ); $eref -> { $envtag }{ ASSMNTS } = $tref -> { ASSMNTS }; # enve +lope tags must preserve assignments delete $eref -> { $; }; # replace any $; at same level. }
Before I commit time beyond this draft level, I wondered if anyone had alternative suggestions for how to manage a hash "backwards" that are proven and/or more elegant/reliable/comprehensive/...?

The most general way to pose this problem is probably the most publicly useful one: How can we invert any hash so that it is stored with its deepest keys at the top and its outer keys at the bottom of some structure?


Free your mind

Replies are listed 'Best First'.
Re: Nesting below emptiness and/or inverting a hash.
by philcrow (Priest) on Jan 22, 2007 at 16:18 UTC
    Excuse me while I completely ignoring your interesting question about hash inversions. But, have you tried XML::Twig, it is usually a good choice for dealing with large XML trees.

    Returning to your question... usually you need a push down stack to handle recursive parsing like you are describing. When you see an opening tag, push it. When you see a closing tag pop it. In between walk up or down the stack's array to see where you are at present.


      A stack may help to invert and reconstruct the hash from recursive context - I will certainly see what I can do with that suggestion, thanks!

      (Update: but I will want to avoid any walking around the stack - "where I am" is easy in the particular case because I have an instance variable for it. As a rule it is best to stick to pushing and popping.)

      re XML::Twig, there were other requirments that needed to be met by various projects for my client, not just the size of the XML files and these were not met by XML::Parser or Expat. It was obvious fairly early on that I needed to be able to code features into the main part of the parser so I had to have control over it and new requirments since then have also deviated from what is currently availabkle in the XML namespace, such as tolerating extra characters between tags that do not conform to "standard" XML whatever that might be.

      XML::Twig suffers from my point of view in that respect from being built on XML::Parser.

      At some point I will have to make a list of the special features and think up a new name for the module. XML::Deviant perhaps ;)


      Free your mind

Re: Nesting below emptiness and/or inverting a hash.
by BrowserUk (Pope) on Jan 22, 2007 at 16:20 UTC
    How can we invert any hash so that it is stored with its deepest keys at the top and its outer keys at the bottom of some structure?i>

    I don't know about other people, butI'm having a hard time trying to visualise what you mean. Is it possible for you to 'draw a picture' or post an example? Or is it a case of needing to know the solution before you could do that?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sure. Suppose that you can be given any hash at all, but only one slice at a time. So given the hash:
      { a => { b1=> { c11 => vc11, c12 => vc12 }, b2 => { c21 => vc21, c22 = +> vc22 } } }
      you receive it one slice at a time in reverse order but with the level at which it is stored in the hash:
      { c21 => vc21, c22 => vc22 }, level=3 { c11 => vc11, c21 => vc12 }, level=3 { b2 => ?, b1 => ? }, level=2 { a => ? }, level =1.
      You could simply test for level = 1 and then process that recursively down to level 3 (avoiding processing at level 4 being an issue). But sometimes there are reasons why you can't wait or more likely, it is just too hard or messy to stack up all the context (in my case, this is specifically the other control and literal data within the stack frame at the point the slice is first interpreted inwards) and then have to unravel it later. I might end up having to do that though anyway unless someone has a better suggestion - my initial placeholder idea (just what first popped into my head) is just too inelegant I feel.


      Free your mind

Re: Nesting below emptiness and/or inverting a hash.
by bibliophile (Parson) on Jan 23, 2007 at 15:19 UTC
    I may be (in fact, probably am) missing something significant, but... couldn't you dump the parsed XML data into a database (with some kind of context info (eg - a field indicating a tag's parent)), and then rebuild the data structure afterwards?

      That would be prohibitively slow. The existing version in development can parse a gigabyte in less than a minute. I don't want to slow it down with database calls otherwise I am defeated from the outset.


      Free your mind

        Ah. Yes, that would be the "something significant" that I missed.


Re: Nesting below emptiness and/or inverting a hash.
by Moron (Curate) on Jan 25, 2007 at 15:17 UTC
    Although the original placeholder approach used a bit less code, it seemed nevertheless apt to create future maintainance problems and philcrow was very nearly right with his stack idea. The stack for this situation has to be wound all the way to the top and then exhaustively popped during the recursive reconstruction of the hash. It musn't contain any subtag info or this could cause the whole file to be represented in the envelope - instant self-defeat - so not all the hash slice should be represented in the stack - the stack itself has to imply that part of the structure.

    Meanwhile the gettag function also acquired new functionality to pick up comments and xml version specifiers between normal tags and store them in the hash for the following tag, later to be reproduced if necessary by puttag. Because xml version is always in the envelope (if configured) the envelope stack and build functionality has to support that too:

    sub PutEnvelope { my $self = shift; # obj or hashref my $tref = shift; # tag hashref to be consigned to envelope my $eref = ( $self -> { ESTACK } ||= [] ); # envelope stack my $tag = each %$tref; push @$eref, { $tag => { ASSMNTS => $tref -> { ASSMNTS }, COMMENT => $tref -> { COMMENT }, DEPTH => $self -> { THIS }{ DEPTH } } }; # note SUBTAGS are not stacked up if ( $self -> { THIS }{ DEPTH } == 1 ) { # stack now at top. Unwind it:- MakeEnvelope( $self, $self -> { ENVELOPE } ||= {} ); } } sub MakeEnvelope { # recursively build envelope hash my ( $self, $eref ) = @_; # eref = where to build it $self -> { THIS }{ ENVDEPTH } ||= 0; $self -> { THIS }{ ENVDEPTH }++; my $frame = pop @{ $self -> { ESTACK }}; my $tag = each %$frame; unless( $frame ) { $self -> { THIS }{ EBVDEPTH }--; return; } while( $frame -> { $tag }{ DEPTH } < $self -> { THIS }{ ENVDEPTH } ) { XMLerror( $self, "multiple tags found at same " ."level of envelope (i.e. tags " ."above DEPTH) - ignoring tag $tag", 'WARNING' ); $frame = pop @{ $self -> { ESTACK } }; unless( $frame ) { $self -> { THIS }{ EBVDEPTH }--; return; } } delete ( $frame -> { $tag => { DEPTH } } ); # done its job $eref -> { $tag } = $frame -> { $tag }; if ( @{ $self -> { ESTACK } } ) { MakeEnvelope( $self, $eref -> { $tag => { SUBTAGS }} [0] = {} + ); # make sub-envelope and put back SUBTAGS } $self -> { THIS }{ ENVDEPTH }--; return; }


    Free your mind

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://595928]
Approved by Corion
Front-paged by Old_Gray_Bear
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (11)
As of 2018-04-20 15:39 GMT
Find Nodes?
    Voting Booth?