Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

RFC on Inline::C hack: Hash_Iterator

by tlm (Prior)
on Aug 01, 2005 at 06:12 UTC ( #479807=perlmeditation: print w/replies, xml ) Need Help??

Dear Perl's-internals-hacking monks,

I've implemented a Hash_Iterator class, as replacement for CORE::each. The reasons for this are 1) to enable multiple independent lexically-scoped iterators on the same hash; 2) to be able to apply to such iterators various functions that take iterators as arguments (one aspect of this being to extend the hash iterator's interface to cover the four iterator functions (start, exhausted, next, and value) described by Dominus in HOP); 3) as a beginner's exercise with Inline::C; and, last but not least, 4) to increase my Perl Crackpot Index.

One could have implemented this class by simply using keys to generate an array of all the hashkeys, and have the iterator object keep track of a position in this array. Such implementation, however, duplicates the storage that already exists for the hash. One of the features that makes iterators in general attractive is precisely that they can take up significantly less memory than the corresponding array. Iterators can be composed into more complex iterative structures, but such schemes can quickly become prohibitive unless the iterators are sufficiently lightweight. Therefore an array-based implementation of hash iterators would not be very powerful.

Be that as it may, it's done and I would very much appreciate your comments (I give the code below).

Given that I have not found this sort of iterator already, the main concern I have is that, despite appearances (i.e. the seemingly functioning module presented below) the problem may be much harder than I realize. Therefore, my top question is: is what I'm trying to do feasible in Perl (without going to heroic lengths)? Or are there fundamental reasons for why anything like this would fail in Perl5?

The basic idea of my implementation is that the iterator object is also a hash, except that its internal bucket array pointer points to the bucket array of the hash that's being iterated over. The underlying C struct has its own independent iteration-oriented fields (e.g. riter and eiter) that are unaffected by those of other iterator instances on the same hash, or even by using CORE::each on this hash.

For my implementation I cannibalized sizeable chunks of hv.c and the pp_each section of pp.c. Unfortunately my understanding of the code that I hacked is less than complete. Therefore, even though my implementation passes the toughest test suite I could come up with, I still feel rather uncertain about the soundness of the C code, and I would particularly welcome your comments and criticisms of it.

I realize that this is a fair bit of code to read. I hope that some of you may be willing to take a look, or perhaps advise me on how to get knowledgeable feedback on a longish piece of code like this.

Many thanks in advance.

SYNOPSIS use Hash_Iterator; my $it = Hash_Iterator->new( \%some_hash ); while ( defined( my $k = $it->each ) ) { # do something with $k } # $it->each restarts after hitting the end while ( my ( $k, $v ) = $it->each ) { # one more iteration } # alternatively, $it->start resets the iterator for ( $it->start; !$it->exhausted; $it->next ) { my $k = $it->value; # do something with $k } # here's another way to traverse the hash $it->start; until ( $it->exhausted ) { my ( $k, $v ) = $it->next; # do something with ( $k, $v ) } DESCRIPTION The API is as follows: new( $hashref ) The constructor, obviously. The argument is mandatory, and must be a hashref. $it->each Should behave exactly like CORE::each. $it->start Resets the iterator. (It also returns the first value. This is as described in HOP.) $it->value Returns the current value, according to the same rules as CORE::each (i.e. both key and value are returned in list context); does not advance the iterator. $it->next Advances the iterator. (It also returns the results of $it->value *before* advancing the iterator, as described in HOP.) $it->nextval An alias for $it->next(). $it->exhausted Returns true iff the iterator is exhausted. NB: The only difference between the next() (or nextval()) and each() methods is that the latter (but not the former) automatically resets the iterator after hitting the end once.

use strict; use warnings; package Hash_Iterator; use Carp (); BEGIN { *_croak = \&Carp::croak; } my %Hashes; *nextval = \&next; !! 'keep require happy'; sub new { my $class = shift; _croak "Bad arguments to constructor" unless @_ == 1 and ref $_[ 0 ] eq 'HASH'; my $hash = shift; my $self = bless +{}, $class; $Hashes{ $self } = $hash; $self->start; return $self; } sub start { my $self = shift; _start_iter( $self, $Hashes{ $self } ); return $self->value; } sub next { my $self = shift; if ( wantarray ) { my @ret = $self->value; $self->_advance; return @ret; } else { my $ret = $self->value; $self->_advance; return $ret; } } sub each { my $self = shift; if ( wantarray ) { my @ret = $self->value; $self->exhausted ? $self->start : $self->_advance; return @ret; } else { my $ret = $self->value; $self->exhausted ? $self->start : $self->_advance; return $ret; } } sub DESTROY { my $self = shift; delete $Hashes{ $self }; _DESTROY_iter( $self ); return; } sub _advance { my $self = shift; _synch_and_advance( $self, $Hashes{ $self } ) unless $self->exhausted; return; } use Inline C => <<EOC; void _start_iter( HV *hv, HV *ohv ) { register XPVHV* xhv; if ( SvMAGICAL( ( SV * ) ohv ) ) Perl_croak( "Not implemented for tied/magical hashes." ); xhv = ( XPVHV* ) SvANY ( hv ); xhv->xhv_riter = -1; xhv->xhv_eiter = NULL; _synch_and_advance( hv, ohv ); return; } I32 exhausted( HV *hv ) { register XPVHV* xhv; if ( !hv ) Perl_croak( "Bad hash" ); xhv = ( XPVHV* ) SvANY ( hv ); return xhv->xhv_riter < 0; } void _synch_and_advance( HV *hv, HV *ohv ) { register XPVHV* xhv; register XPVHV* oxhv; if ( !hv || !ohv ) Perl_croak( "Bad hash" ); xhv = ( XPVHV* ) SvANY ( hv ); oxhv = ( XPVHV* ) SvANY ( ohv ); __synch( xhv, oxhv ); __advance( xhv ); return; } void __advance( register XPVHV* xhv ) { register HE *entry; if ( !xhv->xhv_array ) { xhv->xhv_riter = -1; xhv->xhv_eiter = NULL; return; } entry = xhv->xhv_eiter; /* At start of hash, entry is NULL. */ if ( entry ) { entry = HeNEXT( entry ); while ( entry && HeVAL( entry ) == &PL_sv_placeholder ) entry = HeNEXT( entry ); } while ( !entry ) { xhv->xhv_riter++; if ( xhv->xhv_riter > ( I32 ) xhv->xhv_max ) { xhv->xhv_riter = -1; break; } entry = ( ( HE** ) xhv->xhv_array )[ xhv->xhv_riter ]; while ( entry && HeVAL( entry ) == &PL_sv_placeholder ) entry = HeNEXT( entry ); } xhv->xhv_eiter = entry; return; } void value( HV *hv ) { Inline_Stack_Vars; register XPVHV* xhv; HE *entry; I32 gimme = GIMME_V; I32 n_items = 1; xhv = ( XPVHV* ) SvANY( hv ); entry = xhv->xhv_eiter; if ( gimme == G_VOID || !entry && gimme == G_ARRAY ) Inline_Stack_Void; /* returns (nothing) */ Inline_Stack_Reset; if ( entry ) { Inline_Stack_Push( hv_iterkeysv( entry ) ); if(gimme == G_ARRAY) { Inline_Stack_Push( hv_iterval( hv, entry ) ); ++n_items; } } else { /* scalar context, no entry */ Inline_Stack_Push( sv_2mortal( &PL_sv_undef ) ); } Inline_Stack_Done; Inline_Stack_Return(n_items); } void _DESTROY_iter( HV *hv ) { register XPVHV* xhv; if ( !hv ) return; xhv = ( XPVHV* ) SvANY ( hv ); if ( xhv->xhv_name ) { if ( PL_stashcache ) hv_delete( PL_stashcache, xhv->xhv_name, strlen( xhv->xhv_name ) +, G_DISCARD ); Safefree( xhv->xhv_name ); xhv->xhv_name = 0; } xhv->xhv_max = 7; /* (it's a normal hash) */ xhv->xhv_array = 0; xhv->xhv_placeholders = 0; } void __synch( register XPVHV* xhv, register XPVHV* oxhv ) { xhv->xhv_max = oxhv->xhv_max; xhv->xhv_fill = oxhv->xhv_fill; xhv->xhv_keys = oxhv->xhv_keys; xhv->xhv_array = ( HE ** ) oxhv->xhv_array; } EOC __END__

Update: Minor, mostly cosmetic, edits to the code.

the lowliest monk

Replies are listed 'Best First'.
Re: RFC on Inline::C hack: Hash_Iterator
by esskar (Deacon) on Aug 01, 2005 at 12:27 UTC
    nice job;
    is it possible to add a "prev" method?
    and where can I find the documentation to HeNEXT ?

      ...and where can I find the documentation to HeNEXT ?

      HeNEXT is a macro (surprise!), defined in hv.h as

      #define HeNEXT(he) (he)->hent_next
      where hent_next is a field in the following struct (also from hv.h):
      typedef struct he HE; /* ... */ /* entry in hash value chain */ struct he { HE *hent_next; /* next entry in chain */ HEK *hent_hek; /* hash key */ SV *hent_val; /* scalar value that was hashed */ };
      So basically HeNEXT scoots along a linked list of hash entries (those belonging to a given bucket, to be precise). As you can see, the data structure only provides for travel in one direction. Therefore, I don't see an easy way to implement a prev method. If I had to have it, I think I'd just accept the memory hit and use the keys array-based implementation I described briefly in my OP. Then prev reduces to decrementing a cursor variable.

      the lowliest monk

        So basically HeNEXT scoots along a linked list of hash entries (those belonging to a given bucket, to be precise).

        Well.. you could always make use of a doubly-linked list to get around this. For the sake of the storage of a second pointer in your structure, this would seem a reasonable trade-off to give a more flexible iteration method.

        Just on a scalability note, and to raise awareness of a possible stumbling block if this code were to be used as anything other than an iterator, I'm always wary of a hash bucket containing a linked list. In this example, there's little to no issue - it's designed to be an iterator, and hence the list will always be traversed from start to end.

        For (most) implementations, however, where more random access of hashed elements is required, an iterative method is really quite inefficient, requiring up to time const + N to look up any entry. In these cases, it's often better to use something akin to a binary tree, offering at worst time const + log N to find any entry.

        For true flexibility, a linked list threaded through binary tree has been my structure of choice for awhile - offering a sane method of iterating through the structure while retaining a reasonable random-element-access time.

        Nice work - ++tlm.

Re: RFC on Inline::C hack: Hash_Iterator
by spectraphonic (Scribe) on Aug 01, 2005 at 11:57 UTC
    nifty stuff.... *bows* (in a monklike fashion)

    i like the iterator concept for processing data structures. it really simplifies the code and improves its readability. independent iterators to the same structure seems handy too. kinda reminds me of C++ STL

    i aspire to produce ninja code like this

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://479807]
Approved by Corion
Front-paged by Arunbear
[choroba]: Good morning from home!
[choroba]: (took a sickday to stay with my son)

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2017-04-27 07:49 GMT
Find Nodes?
    Voting Booth?
    I'm a fool:

    Results (501 votes). Check out past polls.