PerlMonks  

Parsing C Functions..

by jmmistrot (Sexton)
on Apr 27, 2008 at 19:43 UTC (#683184, perlquestion)
jmmistrot has asked for the wisdom of the Perl Monks concerning the following question:

So I have been slowly building up some methods to break apart pre-processed headers. I recently began trying to take an inventory of the various functions actually being used in our code base and am stumped. So here I am again, back at the monastery seeking enlightenment. Here is the test function I am trying to parse:
    float myFunction(float myfunc_arg0, float myfunc_arg1 ) {
        if(myfunc_arg0>1){
            print "arg0 action\n";
        }
        elsif(myfunc_arg1){
            print "arg1 action\n";
        }
        else{
            print "take no action\n";
        }
    }
So I think I have the regex to parse the function signature:

    /\s*((?:\S+\s+){0,3})(?:(\S+)\(\s*((?:\S+\s+\S+)*)\s*\))\s*/

I want to remove each complete function from the file as I go, so that I can come back for a second pass to grab all globally scoped variables in the C code. My problem is advancing through the file while keeping track of nesting, so that I know when I have left the function. As you may be aware, there is no rule that requires the C function definition or its scoping delimiters "{" and "}" to be on one line.

I consider myself an intermediate regex acolyte and am sure that there is an easy way to do this, but I have yet to stumble upon it. It seems to me I am unable to use a replace to slurp up the function because I don't know the depth of the delimiter nesting. I can't come up with an expression to handle arbitrary cases in a replace anyway... I have tried removing all newlines and running through the file that way. I can successfully grab the function signature but have yet to figure out a way to know when I have passed the end of the function.

Any "perls" of wisdom greatly appreciated.

Your humble student,
-jmm

Re: Parsing C Functions..
by chromatic (Archbishop) on Apr 27, 2008 at 19:55 UTC

    The best advice anyone can give you is to use a parser, not regular expressions. While it's probably possible to build a state machine with Perl 5 regular expressions which handles all of the possible tokens and states and state transitions appropriately, no one I know capable of writing such a thing has actually done it.

    You might be able to get by with regular expressions and Text::Balanced -- we use that in Parrot to build PMCs out of our PMC mini-language which includes C code -- but for anything more complex, you'll have a much easier time using a parser.
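For readers who want to try the Text::Balanced route, here is a minimal sketch, not the Parrot code. It assumes the text handed to `extract_bracketed` starts at the opening brace and that no braces hide inside string literals or comments. The sample source and the helper name `first_brace_block` are made up for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Illustrative input: one function followed by a file-scope variable.
my $src = <<'END_C';
float myFunction(float a, float b)
{
    if (a > 1) {
        return a;
    } else {
        return b;
    }
}
int g_count = 0;
END_C

# Hypothetical helper: returns (block, remainder) for the first "{...}"
# block in $text, or an empty list if there is no opening brace.
sub first_brace_block {
    my ($text) = @_;
    my $open = index($text, '{');
    return if $open < 0;

    # Hand extract_bracketed a copy that begins at the brace. With the
    # delimiter set '{}' only braces are counted for nesting.
    my $tail = substr($text, $open);
    my ($block, $remainder) = extract_bracketed($tail, '{}');
    return ($block, $remainder);
}

my ($block, $remainder) = first_brace_block($src);
print "function body:\n$block\n";
print "what is left:$remainder";
```

Because the module does the depth counting, the caller only needs to locate the start of each body, for example with a signature regex.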

Re: Parsing C Functions..
by Your Mother (Canon) on Apr 27, 2008 at 19:56 UTC

    Just about anything is possible with regexes if you go far enough but take one look at the back pages of Jeffrey Friedl's "Mastering Regular Expressions" to see the horrifying 2 page regex that results from parsing plain old email addresses correctly.
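As one small illustration of how far a regex can be pushed here: Perl 5.10 and later support recursive patterns, which can match braces nested to any depth. The sketch below is only that, a sketch. It assumes no braces appear inside string literals or comments, and the sample source and variable names are invented for the example.

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?&name) recursion need Perl 5.10+

# Illustrative input with a nested block and a file-scope variable.
my $src = <<'END_C';
float f(float a) { if (a > 1) { return a; } return 0; }
int g_count = 0;
END_C

# Replace every outermost "{ ... }" with a placeholder. The named group
# "block" calls itself via (?&block) whenever it meets a nested brace pair.
(my $stripped = $src) =~ s/
    (?<block>
        \{
        (?: [^{}]++         # a run of non-brace characters
          | (?&block)       # or a complete nested block
        )*
        \}
    )
/{...}/xg;

print $stripped;
```

This handles arbitrary nesting in a single substitution, but it knows nothing about C tokens, so a brace inside a string or comment would still throw it off.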

    You might be able to do what you want the way you're trying but I would suggest instead giving Parse::RecDescent a shot. Bit of a learning curve but the resulting code will be much easier to read, tweak, and reuse.

    (update: switched ISBN to 2nd ed.)

        It took me a while before I realized that csourceparser.pl was actually stored in demo/demo_another_Cgrammar.pl of Parse::RecDescent.
Re: Parsing C Functions..
by pc88mxer (Vicar) on Apr 27, 2008 at 20:02 UTC
    How about just matching open/close brace pairs that are at column 0? Will that work for you? Something like:  m/^\{.*?^\}/ms
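A minimal sketch of the column-0 idea, assuming the source is already in a string and that every function body opens and closes with a brace in the first column. The sample source is invented for illustration. It will miss any function whose opening brace shares a line with the signature.

```perl
use strict;
use warnings;

# Illustrative input: two functions whose braces sit in column 0.
my $src = <<'END_C';
int g_count = 0;

float scale(float x)
{
    if (x > 1) {
        x = x * 2;
    }
    return x;
}

int helper(void)
{
    return g_count;
}
END_C

# /m lets ^ match at the start of each line, /s lets . cross newlines.
# The non-greedy .*? stops at the first "}" found in column 0, so the
# indented inner braces are skipped over.
my @bodies = $src =~ m/^\{.*?^\}/msg;
printf "%d bodies found\n", scalar @bodies;

# Deleting the bodies leaves signatures and file-scope declarations.
(my $rest = $src) =~ s/^\{.*?^\}\n?//msg;
print $rest;
```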

    Another option is to fully parse the C code. There's a good start at this described here: Converting C to English with Perl

    Update: Here's another grammar which might help: Parsing C

Re: Parsing C Functions..
by jmmistrot (Sexton) on Apr 27, 2008 at 20:17 UTC
    Sighhh, OK :) I had seen the references mentioned, except for Text::Balanced. I was hoping to get this done a bit more quickly and easily... Unfortunately the column-0 suggestion only works in one case, and there is no guarantee that the delimiters will be there. So off to C parser land, after a visit to the modules mentioned. Thanks for the advice!
Re: Parsing C Functions..
by pc88mxer (Vicar) on Apr 27, 2008 at 20:39 UTC
    What's your end goal? If you just want to know what global variables are defined in a file, perhaps you can compile it and use objdump from GNU binutils to tell you that info.
Re: Parsing C Functions..
by syphilis (Canon) on Apr 27, 2008 at 23:41 UTC
    You might find something useful in the code in Inline's ParseRegExp.pm (and ParseRecDescent.pm).

    Cheers,
    Rob
Re: Parsing C Functions.. (tokens)
by tye (Cardinal) on Apr 28, 2008 at 04:55 UTC

    This isn't particularly hard if you pick the right approach. I think the easiest way is to tokenize the language you are parsing and then build a simple state machine to parse just enough of the tokens to get the information you want.

    Since type declarations can contain parens (though they usually don't), I settled on considering ") {" as the start of a function body. I think that can't happen elsewhere in valid C code. I didn't check my POD syntax, and the code is very incomplete, intentionally.

    =for state_machine

    START
        \n      -> {flush} START
        ws      -> START
        else    -> DECL
    DECL
        ;       -> {flush} START
        )       -> CLOSE
    CLOSE
        {       -> {block=1} FUNC
        else    -> DECL
    FUNC
        {       -> {++block} FUNC
        }       -> { --block ? FUNC : END }
    END
        ws      -> END
        \n      -> {save} START
        else    -> {save} DECL

    =end

    use enum qw( START DECL CLOSE FUNC END );

    my $state= START();
    my $part= '';
    while( $code =~ m{
        \G(
            \n                      # newline
        |   [^\S\n]+                # horizontal whitespace
        |   [a-z]\w*                # identifier
        |   \d+                     # digit string
        |   "(?:[^\\"]+|\\[^\n])*"
        |   '(?:[^\\']+|\\[^\n])*'
        |   /\*.*?\*/               # C comment
        |   //[^\n]*\n              # C++ comment
        |   [^\w\s]                 # punctuation mark
        )
    }xsg ) {
        my $token= $1;
        if( START() == $state ) {
            if( $token eq "\n" ) {
                $token= $part= '';
            } elsif( $token =~ /\S/ ) {
                $state= DECL();
            }
            ...
        } elsif( END() == $state ) {
            if( $token =~ /\S/ ) {
                push @func, $part;
                $part= '';
                ...
            }
        }
        $part .= $token;
    }

    And it is pretty easy to extend the state machine to track nesting of parens so you can extract the identifier before the last outer open paren before the body of the function is detected, for example (which would be the name of the function).

    - tye        
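For anyone wanting something runnable, here is a simplified sketch of the tokenize-then-state-machine approach. It is not tye's code: it drops the END state, uses core `use constant` in place of the `enum` module, widens the identifier pattern to `[A-Za-z_]\w*`, and adds the paren tracking mentioned above so the function name can be picked up. The helper name `split_functions` and the sample input are invented for illustration. One known gap is that a C99 compound literal at file scope, such as `(int[]){1,2}`, would be mistaken for a function body.

```perl
use strict;
use warnings;

use constant { START => 0, DECL => 1, CLOSE => 2, FUNC => 3 };

# Strings and comments are single tokens, so braces inside them never
# reach the state machine as punctuation.
my $TOKEN = qr{
      \n                        # newline
    | [^\S\n]+                  # horizontal whitespace
    | [A-Za-z_]\w*              # identifier
    | \d+                       # digit string
    | "(?:[^\\"\n]+|\\.)*"      # string literal
    | '(?:[^\\'\n]+|\\.)*'      # character literal
    | /\*.*?\*/                 # C comment
    | //[^\n]*                  # C++ comment
    | [^\w\s]                   # any other single character
}xs;

# Hypothetical helper: returns (\@functions, \@other_declarations).
sub split_functions {
    my ($code) = @_;
    my (@funcs, @other, $name, $last_ident);
    my ($state, $part, $braces, $parens) = (START, '', 0, 0);

    while ($code =~ /\G($TOKEN)/gc) {
        my $tok   = $1;
        my $blank = $tok !~ /\S/ || $tok =~ m{\A/[/*]};  # space or comment
        if ($state == START) {
            next if $blank;              # drop filler between declarations
            $state = DECL;
        }
        $part .= $tok;
        next if $blank;

        if ($state == FUNC) {            # inside a body: only count braces
            $braces++ if $tok eq '{';
            if ($tok eq '}' && --$braces == 0) {
                push @funcs, { name => $name, text => $part };
                ($state, $part, $name, $last_ident) = (START, '');
            }
            next;
        }
        if ($state == CLOSE) {
            if ($tok eq '{' && $braces == 0) {   # ") {" opens a body
                ($state, $braces) = (FUNC, 1);
                next;
            }
            $state = DECL;               # prototype or other declarator
        }
        if    ($tok eq '(') { $name = $last_ident if $parens++ == 0 && $braces == 0 }
        elsif ($tok eq ')') { $state = CLOSE if --$parens == 0 }
        elsif ($tok eq '{') { $braces++ }
        elsif ($tok eq '}') { $braces-- }
        elsif ($tok =~ /\A[A-Za-z_]/) { $last_ident = $tok if $parens == 0 }
        elsif ($tok eq ';' && $parens == 0 && $braces == 0) {
            push @other, $part;          # a complete non-function declaration
            ($state, $part, $name, $last_ident) = (START, '');
        }
    }
    push @other, $part if $part =~ /\S/;
    return (\@funcs, \@other);
}

my $src = <<'END_C';
/* globals { not a block } */
int g_count = 0;
static const char *g_name = "brace } in string";

float myFunction(float a, float b)
{
    if (a > 1) { puts("{"); }
    return b;
}

int helper(void);
int helper(void) { return g_count; }
END_C

my ($funcs, $other) = split_functions($src);
print "function: $_->{name}\n" for @$funcs;
print "other:    $_\n"         for @$other;
```

The second list is what a later pass for globally scoped variables would look at; it still contains prototypes, which that pass would need to filter out.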
