Recursive capture of a variable number of elements using regexp

by seaver (Pilgrim)
on Apr 09, 2009 at 17:46 UTC ( [id://756667]=perlquestion )

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I'm tackling a problem for which I could manually hard-code the result, but I'm very aware that I can achieve the same results with a regexp string, which would be more useful...

I have a list of chemical formulas, a sample of which is listed below. For each formula, I simply want to extract the elements it contains.

This is what I've got for my one-liner:
perl -ne 'chomp;split /\s+/,$_;print $_[1],"\n";while($_[1] =~ /([A-Z][a-z]?(\d*))/g){print "\t",$1,"\t",$2,"\n";}'
My question is: is this the only way to get through the variable number of groups? I feel like I could write it into the regular expression itself, so that the variable number of groups gets inserted directly into an array or a hash and I can drop the while loop. Is this possible?

Another question would be, if there is only one atom of an element, then there wouldn't be any output for the second group, but can I convert that empty output into a zero string, "inline"?

Thanks
Sam
__DATA__
CH4N2O
C9H12N2O6
C5H11NO2
C5H4N4O2
C10H11N4O9P
C10H12N4O6
C5H10O5
C5H12O5
C5H10O5
C27H44O
C1694H2993O101

Replies are listed 'Best First'.
Re: Recursive capture of a variable number of elements using regexp
by kennethk (Abbot) on Apr 09, 2009 at 17:58 UTC
    When used in list context with the /g modifier, a regex match returns a list of all the captured substrings - see perlre. Since you have two terms to capture here (element and abundance), you could store the results straight into a hash. Consider the following:

    $_ = 'CH4N2O';
    print $_,"\n";
    %hash = /([A-Z][a-z]?)(\d*)/g;
    while ( ($key, $value) = each %hash) {
        $value ||= 1;
        print "\t$key\t$value\n";
    }

    Note I also changed your grouping a little.

    Update: If your chemical formulas encode structural information (e.g. HOH for water), then keys in a hash will get clobbered. You can, of course, substitute an array for the hash, and compensate appropriately. Thanks jwkrahn for reminding me to include a warning.
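A minimal sketch of the array substitution mentioned in the update, assuming you want to keep repeated elements and their order. The helper name element_pairs is hypothetical and not from the thread; it chunks the flat list returned by the list-context match into [element, count] pairs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of
# [element, count] pairs in the order they appear in the formula.
sub element_pairs {
    my ($formula) = @_;

    # In list context with /g the match returns a flat list:
    # element, count, element, count, ...
    my @flat = $formula =~ /([A-Z][a-z]?)(\d*)/g;

    my @pairs;
    while ( my ( $element, $count ) = splice @flat, 0, 2 ) {
        push @pairs, [ $element, $count || 1 ];    # empty count means one atom
    }
    return @pairs;
}

# HOH keeps both hydrogens, which a hash keyed on the element would not.
for my $pair ( element_pairs('HOH') ) {
    print "\t$pair->[0]\t$pair->[1]\n";
}
```

Because nothing is keyed on the element symbol, the second H in HOH survives as its own entry.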

      If I remember my chemistry correctly (and I probably don't), can't the same element appear more than once in a formula? If so, using a hash would elide some of the elements.

        As long as the formulas are sum formulas, there should be no problem, as each element should be summed up and shouldn't appear again in the same formula. So it should be fine with C2H6O.

        If the formula tries to represent some kind of molecular structure, you may be right: CH3CH2OH

        (Both formulas represent ethanol).
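One way to make a hash safe for structural formulas is to add the counts up rather than assign them. The sketch below assumes flat formulas with no parentheses or charges; the helper name sum_formula is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: add up the atoms of each element, so that a
# repeated element accumulates instead of overwriting its earlier entry.
sub sum_formula {
    my ($formula) = @_;
    my %count;
    while ( $formula =~ /([A-Z][a-z]?)(\d*)/g ) {
        $count{$1} += $2 || 1;    # a missing count means one atom
    }
    return \%count;
}

# Both spellings of ethanol should reduce to the same totals.
for my $formula (qw(CH3CH2OH C2H6O)) {
    my $total = sum_formula($formula);
    print "$formula: ", join( ' ', map { "$_$total->{$_}" } sort keys %$total ), "\n";
}
```

With this approach CH3CH2OH and C2H6O give the same totals per element, so the hash no longer depends on the input being a sum formula.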

        What linuxer said. Given the sample data, I was assuming the OP was just interested in Hill Order formulas. Technically speaking though, you are correct, and I will admonish appropriately.
      Thanks kennethk!
Re: Recursive capture of a variable number of elements using regexp
by BrowserUk (Patriarch) on Apr 09, 2009 at 18:08 UTC

    Is this the kind of output you're looking for?

    perl -ne"print $1,'-',$2||1,' 'while /([A-Z][a-z]*)(\d+)?/g;print qq[\n]" CON
    CH4N2O
    C-1 H-4 N-2 O-1
    C9H12N2O6
    C-9 H-12 N-2 O-6
    C5H11NO2
    C-5 H-11 N-1 O-2
    C5H4N4O2
    C-5 H-4 N-4 O-2
    C10H11N4O9P
    C-10 H-11 N-4 O-9 P-1
    C10H12N4O6
    C-10 H-12 N-4 O-6
    C5H10O5
    C-5 H-10 O-5
    C5H12O5
    C-5 H-12 O-5
    C5H10O5
    C-5 H-10 O-5
    C27H44O
    C-27 H-44 O-1
    C1694H2993O101
    C-1694 H-2993 O-101
    ^Z

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
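For readers who prefer a script to a one-liner, here is a rough equivalent of the `$2||1` default used in BrowserUk's reply. The helper name format_counts is hypothetical. This sketch tests `defined $2` instead of using `||`, on the assumption that an explicit count of 0 (not present in the sample data) should be kept rather than turned into 1.

```perl
use strict;
use warnings;

# Hypothetical helper: format one formula as "C-1 H-4 N-2 O-1".
sub format_counts {
    my ($formula) = @_;
    my @parts;
    while ( $formula =~ /([A-Z][a-z]*)(\d+)?/g ) {
        # $2 is undef when the optional group took no part in the match
        my $count = defined $2 ? $2 : 1;
        push @parts, "$1-$count";
    }
    return join ' ', @parts;
}

print format_counts($_), "\n" for qw(CH4N2O C5H11NO2 C10H11N4O9P);
```

Unlike the one-liner, joining the parts leaves no trailing space at the end of each line.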
Re: Recursive capture of a variable number of elements using regexp
by jwkrahn (Abbot) on Apr 09, 2009 at 18:11 UTC

    Perhaps this is what you want:

    $ echo "CH4N2O
    C9H12N2O6
    C5H11NO2
    C5H4N4O2
    C10H11N4O9P
    C10H12N4O6
    C5H10O5
    C5H12O5
    C5H10O5
    C27H44O
    C1694H2993O101" | perl -lne'print; print "\t$1\t$2" while /([A-Z][a-z]?(\d*))/g'
    CH4N2O
            C
            H4      4
            N2      2
            O
    C9H12N2O6
            C9      9
            H12     12
            N2      2
            O6      6
    C5H11NO2
            C5      5
            H11     11
            N
            O2      2
    C5H4N4O2
            C5      5
            H4      4
            N4      4
            O2      2
    C10H11N4O9P
            C10     10
            H11     11
            N4      4
            O9      9
            P
    C10H12N4O6
            C10     10
            H12     12
            N4      4
            O6      6
    C5H10O5
            C5      5
            H10     10
            O5      5
    C5H12O5
            C5      5
            H12     12
            O5      5
    C5H10O5
            C5      5
            H10     10
            O5      5
    C27H44O
            C27     27
            H44     44
            O
    C1694H2993O101
            C1694   1694
            H2993   2993
            O101    101
      Or without the while loop (the file chem.data holds the data specified in the OP):

      >perl -wMstrict -ne "print qq{$_\t}, join(qq{\t}, /([A-Z][a-z]?(\d*))(?=.*(\n?))/g); " chem.data
      CH4N2O
              C
              H4      4
              N2      2
              O
      C9H12N2O6
              C9      9
              H12     12
              N2      2
              O6      6
      C5H11NO2
              C5      5
              H11     11
              N
              O2      2
      C5H4N4O2
              C5      5
              H4      4
              N4      4
              O2      2
      C10H11N4O9P
              C10     10
              H11     11
              N4      4
              O9      9
              P
      C10H12N4O6
              C10     10
              H12     12
              N4      4
              O6      6
      C5H10O5
              C5      5
              H10     10
              O5      5
      C5H12O5
              C5      5
              H12     12
              O5      5
      C5H10O5
              C5      5
              H10     10
              O5      5
      C27H44O
              C27     27
              H44     44
              O
      C1694H2993O101
              C1694   1694
              H2993   2993
              O101    101

      Note, however, that this approach:
      • produces a superfluous tab immediately before the newline in each 'individual chemical component' output line;
      • produces oddball output for the last record (i.e., the last line) in the data file if it is not newline-terminated.
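The two drawbacks listed above come from capturing the newline inside the lookahead. A possible alternative, still without an explicit while loop, is to match whole components with no capture groups and build each output line in a map. This is only a sketch; the helper name component_lines is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: one output line per component, with no explicit
# while loop and no dependence on a trailing newline in the record.
sub component_lines {
    my ($line) = @_;
    return map {
        my ($count) = /(\d*)$/;      # trailing digits, or '' if there are none
        "\t$_\t$count";
    } $line =~ /[A-Z][a-z]?\d*/g;    # no capture groups: /g returns each whole match
}

for my $line ( "CH4N2O\n", "C5H11NO2" ) {    # the second record has no newline
    ( my $formula = $line ) =~ s/\s+\z//;
    print join( "\n", $formula, component_lines($line) ), "\n";
}
```

Each component line has the same "\t$1\t$2" shape as in jwkrahn's while-loop version, and a final record without a newline is handled like any other.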
