
Regexps for Parsing Brackets in Chemical Formulae

by Elgon (Curate)
on Nov 03, 2001 at 20:00 UTC (#123045=perlquestion)
Elgon has asked for the wisdom of the Perl Monks concerning the following question:

Hi Folks,

I've got a dandy little regexp-related problem for you all: I am writing a little module which takes a molecular formula and converts it into a hash where the keys are a unique list of elemental constituents and the values are the number of atoms present in the molecule. Sounds easy - believe me, if you're as naff as I am at regexps it ain't!

So we have our formula in $formula... First we want to get rid of bracket pairs without coefficients next to them, so I thought something like this...

1 while $formula =~ s/(\()(\[A-Za-z0-9()]+)(\)\D)/$2/e;

but this can be wrong as in some cases the maximal matching will chop out brackets which don't match...Help!
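
A minimal sketch of the failure mode described here, assuming the intended character class was [A-Za-z0-9()]. The names strip_once, $greedy and $inner are hypothetical and only serve the illustration. Because the class admits parentheses and + is greedy, the match runs from the first "(" to the last ")" that is not followed by a digit, so the pair that gets stripped is not a matching pair. Forbidding parentheses inside the group restricts the match to an innermost pair.

```perl
use strict;
use warnings;

# Assumed pattern: parens allowed inside the group, no digit after ")".
my $greedy = qr/\(([A-Za-z0-9()]+)\)(?=\D|$)/;
# Variant: no parens inside the group, so only an innermost pair can match.
my $inner  = qr/\(([^()]+)\)(?=\D|$)/;

# Apply one substitution and return the result (hypothetical helper).
sub strip_once {
    my ($formula, $re) = @_;
    $formula =~ s/$re/$1/;
    return $formula;
}

my $formula = 'Mo(PH3)4(CO)(NH2C2H5)';
print strip_once($formula, $greedy), "\n";   # MoPH3)4(CO)(NH2C2H5
print strip_once($formula, $inner),  "\n";   # Mo(PH3)4CO(NH2C2H5)
```

The first result has lost the opening bracket of (PH3) and the closing bracket of (NH2C2H5), which is the mismatch described above.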

Then we want to expand brackets which are followed by a coefficient of two or more. (If they're followed by 1 as a coefficient, and they shouldn't really be, then they effectively don't have a coefficient and should just have the brackets removed.) In this case we should repeat what's inside the brackets that many times when we multiply them out, which the following may or (more likely) may not do!

1 while $formula =~ s/(\()(\[A-Za-z0-9()]+)(\)\)([0-9]+)/$4x$2/e;

Once these two tasks are done I have a way of doing the rest, but I cannot work out the correct regexps for the tasks above. I just don't have the knowledge, the experience or a copy of "Mastering Regular Expressions"!

Just to clarify, if we have the following formula... Mo(PH3)4(CO)(NH2C2H5) for example, it should become... Mo(PH3)4CONH2C2H5 after the first regexp and then MoPH3PH3PH3PH3CONH2C2H5 at the end, which I can parse nicely myself. Note that if you have a series of brackets... (...(...)...(...)...) they need to be processed in the correct order, which really has me scratching my head I can tell you.

I will bow in deep respect to anyone who can give me a hand on this one as it has got me a bit stumped. (For the record it is not for an assessed piece of work - I am a chemist after all - but a mixture of general interest and boredom.) Virtual beer to you!

"Without evil there can be no good, so it must be good to be evil sometimes.
--Satan, South Park: Bigger, Longer, Uncut.

Replies are listed 'Best First'.
Re: Regexps for Parsing Brackets in Chemical Formulae
by chipmunk (Parson) on Nov 03, 2001 at 20:38 UTC
    I think you've got the right idea using 1 while s///, because you're matching from inside out rather than left to right. Here's one way to do the whole substitution all at once:

    1 while s/\(([^\(\)]+)\)((?:\d+)?)/ $1 x ($2 || 1) /ge;

    This matches a parenthesized substring that does not itself contain any parentheses, and optionally a subsequent number, and replaces it with the substring, minus the parentheses, repeated the appropriate number of times.
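
To make the inside-out order visible, here is a small sketch that records the formula after every pass. The wrapper name expand is hypothetical; the substitution is the same idea as in this reply, written with an unescaped character class and \d* for the optional number.

```perl
use strict;
use warnings;

# Expand innermost groups pass by pass and return every intermediate
# formula, starting with the input. expand() is a hypothetical name.
sub expand {
    my ($formula) = @_;
    my @steps = ($formula);
    # Each pass expands every group that contains no parentheses itself.
    while ($formula =~ s/\(([^()]+)\)(\d*)/$1 x ($2 || 1)/ge) {
        push @steps, $formula;
    }
    return @steps;
}

print "$_\n" for expand('Mo(PH3)4(CO)(NH2C2H5)');
# Mo(PH3)4(CO)(NH2C2H5)
# MoPH3PH3PH3PH3CONH2C2H5

print "$_\n" for expand('Mo(P(H)3)4(CO)');
# Mo(P(H)3)4(CO)
# Mo(PHHH)4CO
# MoPHHHPHHHPHHHPHHHCO
```

The nested case needs two passes: the outer group only becomes matchable once the inner one has been expanded, which is why the substitution sits in a loop.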

      Many thanks to Chipmunk and other folks,

      I'll go away and play with these suggestions, which seem quite groovy (insofar as I can tell which ain't that far!) The reason for all of this is sort of related to my final-year project but not actually included in it (the project is in PHP): My tutor wrote a routine to do this kind of thing, which took him ages in some other language and I'm trying to introduce him to the power of Perl (and by extension, Perlmonks.)

      In the virtual bar of pm I owe you all a pint.


      "Without evil there can be no good, so it must be good to be evil sometimes.
      --Satan, South Park: Bigger, Longer, Uncut.

        You were close. That should do it:
        use strict;
        my %count;
        # added gratuitous parentheses for embedded formula testing sake.
        $_='Mo(P(H)3)4(CO)(NH2C2(H)5)';
        # at each iteration do subformula with rightmost left parenthesis.
        # quit when no more parenthesis
        s/(.*)\((.*?)\)(\d*)/$1 . $2 x ($3 ? $3 : 1) /e while m/\(/;
        s/([A-Z](?:[a-z])?)(\d*)/ $count{$1} += $2 ? $2 : 1 ;''/eg;
        printf "%-2s %3d\n", $_, $count{$_} for sort keys %count;

        It prints:

        C    3
        H   19
        Mo   1
        N    1
        O    1
        P    4

        -- stefp

      Chipmunk, nice solution, but you don't need to escape the ()'s in the brackets. They are treated as literals inside brackets.

      -monkfish (The Fishy Monk)
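
A quick check of the point about escaping, as a sketch: both patterns below are compiled separately and run against the same formulae. The variable names $escaped and $unescaped are made up for the example.

```perl
use strict;
use warnings;

my $escaped   = qr/\(([^\(\)]+)\)/;   # parens escaped inside the class
my $unescaped = qr/\(([^()]+)\)/;     # same class without the backslashes

for my $formula ('Mo(PH3)4(CO)', 'Fe(C5(H)5)2') {
    # In list context with /g, each match returns its captured group.
    my @a = $formula =~ /$escaped/g;
    my @b = $formula =~ /$unescaped/g;
    print "$formula: [@a] vs [@b]\n";
}
# Mo(PH3)4(CO): [PH3 CO] vs [PH3 CO]
# Fe(C5(H)5)2: [H] vs [H]
```

Outside the character class the backslashes are still required, since a bare ( there opens a capture group.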

Re: Regexps for Parsing Brackets in Chemical Formulae
by monkfish (Pilgrim) on Nov 03, 2001 at 20:36 UTC
    I wish you had provided some more sample data for me to play with, because I am not very familiar with chemical formulas, but the following should do the trick:

    1 while $formula =~ s/\(([^()]+)\)(\D|$)/$1$2/g;
    1 while $formula =~ s/\(([^()]+)\)(\d+)/$1 x $2/e;
    In the first line we are replacing any parens and their contents (which may not include parens) if followed by a non number with just the contents $1 and the non number $2. The |$ is to get a paren as the last character of a line.

    In line 2 we find parens followed by a number and multiply them out.

    -monkfish (The Fishy Monk)

Re: Regexps for Parsing Brackets in Chemical Formulae
by Masem (Monsignor) on Nov 03, 2001 at 20:34 UTC
    A partial suggestion:

    Do a while loop on $formula =~ s/\(([A-Za-z0-9]*)\)/Q$i/. This will only match innermost compositions (no additional parens). You'll replace these with Q1, Q2, etc. (or if you're worried about more, you can use Qa, Qb, or QA, QB, etc. since I suspect you're considering chemical symbols with no more than 2 letters). $1 will capture the inner composition, which you should work out and associate in a hash with the Q variable. Note that you remove those inner parens when you do this.

    The next time around, if there are still more parens, you'll capture those; now you can consider the Q series and do any necessary multiplication from those as well.

    Once you exit this while loop, you'll have no more parens, so you can calculate the final composition with no problems.

    Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important
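
The placeholder idea in this reply comes without code, so here is one possible sketch of it. Everything below is an assumption layered on the description: the sub names count_flat and count_atoms are hypothetical, and the placeholders are Qa, Qb, ... rather than Q1, Q2, ... so that a following coefficient cannot run into the placeholder's own number. No element symbol starts with Q, so a placeholder can be tokenized like a two-letter symbol and then looked up in the hash.

```perl
use strict;
use warnings;

# Count atoms in a parenthesis-free fragment. A token that names a stored
# placeholder (Qa, Qb, ...) contributes that group's counts instead.
sub count_flat {
    my ($fragment, $groups) = @_;
    my %count;
    while ($fragment =~ /([A-Z][a-z]?)(\d*)/g) {
        my ($symbol, $n) = ($1, $2 eq '' ? 1 : $2);
        if (my $group = $groups->{$symbol}) {
            $count{$_} += $group->{$_} * $n for keys %$group;
        } else {
            $count{$symbol} += $n;
        }
    }
    return \%count;
}

# Replace innermost groups by placeholders until no parentheses remain,
# then count the flat remainder.
sub count_atoms {
    my ($formula) = @_;
    my %groups;
    my $label = 'a';    # this sketch supports at most 26 groups (a..z)
    while ($formula =~ s/\(([A-Za-z0-9]*)\)/Q$label/) {
        my $inner = $1;  # copy before another match overwrites $1
        $groups{"Q$label"} = count_flat($inner, \%groups);
        $label++;
    }
    return count_flat($formula, \%groups);
}

my $count = count_atoms('Mo(P(H)3)4(CO)(NH2C2(H)5)');
printf "%-2s %3d\n", $_, $count->{$_} for sort keys %$count;
# Counts: C 3, H 19, Mo 1, N 1, O 1, P 4
```

Compared with expanding the string, this keeps the formula short even for large coefficients, at the cost of carrying a hash of group compositions.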
