Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

recursive parsing techniques

by blahblah (Friar)
on Dec 29, 2003 at 07:48 UTC ( [id://317376]=perlquestion: print w/replies, xml ) Need Help??

blahblah has asked for the wisdom of the Perl Monks concerning the following question:

I searched around but could not find any clues.
I am messing around with the vCard RFC and have stumped myself trying to come up with a recursive parser that allows for (nearly) infinite depth. This is required by the RFC: some of the properties of a vCard allow another vCard as their value -- wha wha :/.
In simplicity the structure is like this (not a correct vCard by a long shot but illustrates the problem):

BEGIN:VCARD FN:Fee AGENT: BEGIN:VCARD FN:Fie AGENT: BEGIN:VCARD FN:Foe AGENT: BEGIN:VCARD FN:Fum END:VCARD END:VCARD END:VCARD END:VCARD

and what I am looking for is each AGENT to be a hash reference to the card inside of it such that I can refer to Fum like:

$card{AGENT}{AGENT}{AGENT}{FN};

I don't need the exact solution as I am willing to learn it (although I won't stop you) -- I just need some pointers to recursive parsing techniques.

Thanks!

Replies are listed 'Best First'.
Re: recursive parsing techniques
by castaway (Parson) on Dec 29, 2003 at 08:10 UTC
    Recursion means writing a sub that calls itself.. Write a parser that parses one vCard, and if it encounters a vCard inside the vCard, calls itself with the start of the new card? Passing out the result as a hashref, which can be added to the hash in the calling function..

    Eg:

    sub parsevcard { my ($inputstr, $inputpos) = @_; my $currentcard = {}; # Parse current vCard starting at $inputpos position in $inputstr s +tring.. # enter values in $currentcard hashref if($inputstr =~ /\GBEGIN:VCARD/) { # We found another vcard, set $pos to beginnnig of new card, an +d call ourselves again.. $currentcard->{AGENT} = parsevcard($inputstr, $pos); } return $currentcard; }
    Untested!

    C.

      In addition to the excellent comments by castaway above, I would only add that it is additionally advantageous to include an escape clause within the recursive subroutine in order to ensure that the parsing routine does pass beyond a specified recursion depth - This is a relatively simple safeguard to implement from a defensive programming standpoint.

      For example, a one-liner which (eventually) exhausts the memory available to the Perl process:

      rob@development:/home/rob/workspace$ perl -e 'sub a { &a }; a' Out of memory!

      This can be prevented by the incorporation of a maximum depth for subroutine recursion such as follows:

      sub parsevcard { my ($inputstr, $inputpos, $depth) = @_; if ($depth > $max_depth) { # Recursion beyond allowable depth error handling here } . . $currentcard->{AGENT} = parsevcard($inputstr, $pos, ++$depth); }

      Depending upon the recursive problem at hand, this error trapping can be performed by way of examination of the output data structure alone, for example, the number of elements within an output array, rather than by a separate counter per se.

       

      perl -le "print+unpack'N',pack'B32','00000000000000000000001010011110'"

        I originally thought this was pointless, since you can only recursively store data as far as it's actually done. But it all depends. ala...
        node1 subnode1 subsubnode1 ...
        But if your language can refer to itself...
        nodeA nodeB nodeC referenceToNodeA
        If the data's relations can be put into a graph that is not in the form of a tree, some recursive checks like the above can be invaluable. If it's in the form of a tree, w/ no cycles, things should be pretty safe so long as the data can be loaded into memory :)

        Play that funky music white boy..
      Thank you all for your responses. The answer seems so obvious when I read them. 8 hours sleep doesn't hurt either :)

      The $pos thing is throwing me a little bit. If I wanted to parse the vCard file in a while loop instead of slurping the whole thing into memory, how can I obtain the position of the file during the while iteration - and how could I advance to the next line of the file without breaking out of the recurse or the while?

      Outside of "while(<FILEHANDLE>)" contructs I think I need a little more practice manipulating files on disk without loading everything into memory.../me goes off to read more...

      Thanks again!
        I was thinking to read in the entire file, concatenate into one big string, and work on that.. If you want to do it line by line, and each item is on a line by itself (as it appears to be from your example), then you wont need the position bit, just call the function with the filehandle and the current line instead. Thus you get:
        sub parsevcard { my ($fh, $line) = @_; my $currentvcard = {}; # check that $line is really the beginning of a vcard, else die # loop reading lines from <$fh> and putting them in $currentvcard, + call $currentvcard->{AGENT} = parsevcard($fh, $line) if a BEGIN:VCAR +D is encountered return $currentvcard; # check for end vcard? }
        .. Actual contents left as an exercise for the reader..

        C.

Re: recursive parsing techniques
by melora (Scribe) on Dec 29, 2003 at 15:34 UTC
    Naturally, being the old fogey that I am, I'd probably write my own recursive parser for this sort of thing. I know, I know, is it false laziness or false impatience? Well, anyhow, I can make it work. Writing my own in the past, I've noticed a couple of things: one, you need to have a clear mental picture of when the recursive call needs to happen; and two, you need to make quite sure that the routine called recursively will exit! Both of those have bitten me over the years. Let me give you a trivial, useless example that doesn't relate to what you're doing:
    a * (b - d + (e / f) - 7) + 3
    This is trivial for recursive parsing, I know, but it makes the point: you've got to call the routine when you come to an '(', and you've got to exit when you come to an ')'. Not doing either of those will most likely result in hangs and/or rubbish. I know, I've tried. In the case you're showing, looks to me like you want to recurse (call the routine you're in) on "BEGIN:VCARD" and return on "END:VCARD". What that ends up meaning is that you write the routine to parse one VCARD, and to add that one to the hash. Oh, yeah, and you may want to verify that your data is healthy, so there are as many ENDs as BEGINs. Remember, "To iterate is human, to recurse divine."
Re: recursive parsing techniques
by extremely (Priest) on Dec 29, 2003 at 20:20 UTC
    You should dig into the lore surrounding Parse::RecDescent and try that on for size. Once you get over the initial hump of thinking sideways at parse problems you'll find it a valuable tool, I promise.

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: recursive parsing techniques
by Roger (Parson) on Dec 30, 2003 at 02:19 UTC
    TIMTOWTDI, you don't really have to use recursion here, I have written a non-recursive demo that does what you described...
    use strict; use warnings; use Data::Dumper; my $card = load_vcard(); print Dumper($card); sub load_vcard { my $n = 0; my @var = (); foreach (<DATA>) { chomp; $n++, next if /^BEGIN:VCARD/; $n-- ? last : next if /^END:VCARD/; next if /^AGENT/; push @var, /FN:(\w+)/; } my %card; eval '$card' . '{AGENT}' x $_ . '{FN} = "' . $var[$_] . '"' for (0..$#var); return \%card; } __DATA__ BEGIN:VCARD FN:Fee AGENT: BEGIN:VCARD FN:Fie AGENT: BEGIN:VCARD FN:Foe AGENT: BEGIN:VCARD FN:Fum END:VCARD END:VCARD END:VCARD END:VCARD
    And the output is -
    $VAR1 = { 'AGENT' => { 'AGENT' => { 'AGENT' => { 'FN' => 'Fum' }, 'FN' => 'Foe' }, 'FN' => 'Fie' }, 'FN' => 'Fee' };
Re: recursive parsing techniques
by clscott (Friar) on Dec 30, 2003 at 21:34 UTC

    Are sure that Net::vCard or Text::vCard won't solve your needs?

    --
    Clayton

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://317376]
Approved by castaway
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-09-08 16:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.