Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
I'm not very comfortable with recursive patterns, they're new in Perl 5.10 and I doubt that PHP/PCRE support them... In that case, I'd take a 2-step approach, very much like the traditional lex/yacc approach, but simplified:
  1. tokenize
  2. parse (balance braces)

1. Tokenize

Using regular expressions, you can pull out the tokens: quoted strings, words, parens/braces/brackets, other symbols. That way you will not accidently mistake braces in quoted strings for syntactically meaningful braces. Your regex engine needs to be capable of continue matching where you left off last time, in Perl you use //g in scalar context, in Javascript you can use //g.exec(string). Likely PCRE supports something like it in PHP, but I don't actually know.

The regex can look something like this (from the top of my head, not thoroughly tested):

/\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/.*|\/(? +:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S]/g
Note that I skip whitespace except newlines, which are meaningful in Javascript, as they can terminate the current stamement. Maybe (likely) you just don't care.

Here's some (Perl) code to test it with — load the Javascript into $_ first:

while(/(\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/ +.*|\/(?:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S])/g) { unless($1 eq "\n") { print "Token: $1\n"; } else { print "Newline\n"; } }
I only display newlines differently because a bare newline as a token doesn't print so clearly.

2. Parsing – balancing braces

As you got through the tokens you extract one by one, you keep track of the nesting level: increment it if you encounter a bare "{", decrement it for a bare "}". As soon as it is decremented back to the same level as you started on for this function (usually 0, but it could be higher for nested functions), you found its end.

Here's the same code again, extended to keep track of the nesting level. As I assume the Javascript is syntactically valid, I just keep a common $level for every type of bracket, it's just simpler this way.

my $level = 0; while(/(\d[\w.]*|[\w\$]+|'(?:\\?.)*'|"(?:\\?.)*"|\/\*(?s:.*?)\*\/|\/\/ +.*|\/(?:\\?.)*\/[a-z]*|\+\+|\-\-|[\n\S])/g) { if($1 eq "\n") { print "Newline\n"; } elsif(grep $1 eq $_, '(', '{', '[') { print "Token: $1 level $level\n"; $level++; } elsif(grep $1 eq $_, ')', '}', ']') { $level--; print "Token: $1 level $level\n"; print "Found the end of a top level block\n" if $level==0; } else { print "Token: $1\n"; } }

This should suffice to get you started.

update

  1. I changed the way multiline comments (/* ... */) are handled: now the whole comment is one token. I don't know if Javascript supports nested comments, but as it is, my code doesn't support them: It just searches for the next "*/". I found nothing on the internet about them, so I suppose, if allowed, that they are very rare.
  2. I had forgotten about regexes. Added, handled the same like quoted strings (with "/" as delimiter, backslash escapes anything) but with a possible suffix of lower case letters for the modifiers.

In reply to Re: Recursive Regular Expression Help by bart
in thread Recursive Regular Expression Help by me

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2024-04-19 22:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found