how to find what's not there with a regex?

by samizdat (Vicar)
on Aug 24, 2005 at 13:23 UTC (#486173, perlquestion)
samizdat has asked for the wisdom of the Perl Monks concerning the following question:

Hi, all - Esteemed codegrokkers, I need to create a regex which will parse either

      xyz = 'some     long   exp'

or

      xyz = some long exp with no more than one space separating parts

from a long string containing multiple examples of this.

TIA for your assistance! :D

UPDATE: example as requested:

drsubc  = agauss(0, 1, 3)   delm1   = '0 + 0.045u*distm1'               delm2   = '0 + 0.07u*distm2'        delm3   = '0 + 0.07u*distm3'                delm4   = '0 + 0.07u*distm4'                delmt   = '0 + 0.07u*distmt'                delml   = '0.16u + 0.43u*distml'            delam   = '0.32u + 0.86u*distam'            dele1   = '0 + 0.25u*diste1'                dele2   = '0 + 0.25u*diste2'                delma   = '0.16u + 0.6u*distma'             pmsxt   = 'npmsxt + 12.5u*dpmsxt'           tih     = 0.35u                capct   = '0.50u + 0.13u*xdcapct'           capcti  = '0.55u + 0.13u*xdcapct'           m1t     = '0.41u + 0.05u*xdm1t'    m1ti    = '0.36u + 0.05u*xdm1t'             m2t     = '0.48u + 0.057u*dm2t'             m3t     = '0.48u + 0.057u*dm3t'             m4t     = '0.48u + 0.057u*dm4t'             mtt     = '0.48u + 0.057u*dmtt'             qtt     = '0.242u + 0.0202u*dqtt'           htt     = '0.242u + 0.0202u*dhtt'           mlt     = '2.0u + 0.2u*dmlt'                amt     = '4.0u + 0.4u*damt'                e1t     = '3.0u + 0.5u*de1t'            e2t     = '4.0u + 0.5u*xde1mat'             mat     = '4.0u + 0.4u*dmat'                m1m2t   = '0.35u + 0.05u*dm1m2t'

The goal is to separate out all the parameters (or function definitions) and their value expressions.

Replies are listed 'Best First'.
Re: how to find what's not there with a regex?
by pbeckingham (Parson) on Aug 24, 2005 at 13:42 UTC

    How about this:

    #! /usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        print "[$_]\n"
            for /\s*([^=]+\s+=\s+'[^']+'|\S+\s+=(?:\s+[^=]+)+(?:(?=\s+\S+\s+=)|$))/msg;
    }

    __DATA__
    xyz = 'some long exp'
    xyz = some long exp with no more than one space separating parts
    a = b c d = e
    drsubc  = agauss(0, 1, 3)   delm1   = '0 + 0.045u*distm1'               delm2   = '0 + 0.07u*distm2'        delm3   = '0 + 0.07u*distm3'                delm4   = '0 + 0.07u*distm4'                delmt   = '0 + 0.07u*distmt'                delml   = '0.16u + 0.43u*distml'            delam   = '0.32u + 0.86u*distam'            dele1   = '0 + 0.25u*diste1'                dele2   = '0 + 0.25u*diste2'                delma   = '0.16u + 0.6u*distma'             pmsxt   = 'npmsxt + 12.5u*dpmsxt'           tih     = 0.35u                capct   = '0.50u + 0.13u*xdcapct'           capcti  = '0.55u + 0.13u*xdcapct'           m1t     = '0.41u + 0.05u*xdm1t'    m1ti    = '0.36u + 0.05u*xdm1t'             m2t     = '0.48u + 0.057u*dm2t'             m3t     = '0.48u + 0.057u*dm3t'             m4t     = '0.48u + 0.057u*dm4t'             mtt     = '0.48u + 0.057u*dmtt'             qtt     = '0.242u + 0.0202u*dqtt'           htt     = '0.242u + 0.0202u*dhtt'           mlt     = '2.0u + 0.2u*dmlt'                amt     = '4.0u + 0.4u*damt'                e1t     = '3.0u + 0.5u*de1t'            e2t     = '4.0u + 0.5u*xde1mat'             mat     = '4.0u + 0.4u*dmat'                m1m2t   = '0.35u + 0.05u*dm1m2t'

    Generates the output:

    [xyz = 'some long exp']
    [xyz = some long exp with no more than one space separating parts]
    [a = b c]
    [d = e]
    [drsubc = agauss(0, 1, 3) ]
    [delm1 = '0 + 0.045u*distm1']
    [delm2 = '0 + 0.07u*distm2']
    [delm3 = '0 + 0.07u*distm3']
    [delm4 = '0 + 0.07u*distm4']
    [delmt = '0 + 0.07u*distmt']
    [delml = '0.16u + 0.43u*distml']
    [delam = '0.32u + 0.86u*distam']
    [dele1 = '0 + 0.25u*diste1']
    [dele2 = '0 + 0.25u*diste2']
    [delma = '0.16u + 0.6u*distma']
    [pmsxt = 'npmsxt + 12.5u*dpmsxt']
    [tih = 0.35u ]
    [capct = '0.50u + 0.13u*xdcapct']
    [capcti = '0.55u + 0.13u*xdcapct']
    [m1t = '0.41u + 0.05u*xdm1t']
    [m1ti = '0.36u + 0.05u*xdm1t']
    [m2t = '0.48u + 0.057u*dm2t']
    [m3t = '0.48u + 0.057u*dm3t']
    [m4t = '0.48u + 0.057u*dm4t']
    [mtt = '0.48u + 0.057u*dmtt']
    [qtt = '0.242u + 0.0202u*dqtt']
    [htt = '0.242u + 0.0202u*dhtt']
    [mlt = '2.0u + 0.2u*dmlt']
    [amt = '4.0u + 0.4u*damt']
    [e1t = '3.0u + 0.5u*de1t']
    [e2t = '4.0u + 0.5u*xde1mat']
    [mat = '4.0u + 0.4u*dmat']
    [m1m2t = '0.35u + 0.05u*dm1m2t']



    pbeckingham - typist, perishable vertebrate.
      Almost. I think you're on the right track, because your solution's caught all but drsubc and tih correctly. Let me study what you've done, and thanks very much!

        Fixed the drsubc and tih.



        pbeckingham - typist, perishable vertebrate.
Re: how to find what's not there with a regex?
by Eimi Metamorphoumai (Deacon) on Aug 24, 2005 at 13:29 UTC
    Please read How (Not) To Ask A Question. In particular, could you please post some sample data, along with exactly what parts you're trying to extract, and what the criteria are? Reread your question from the point of view of someone who doesn't already know what you want, and I think you'll see that you're leaving out pretty much everything we need to know.

    Update: Now there's some data, but still no real specification of how your parameters are separated or what's really going on here. It looks like this may do what you want, but if not you'll have to step back for a moment and think about what you're doing.

    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Data::Dumper;

    my %variables;
    undef $/;
    $_ = <DATA>;
    while (s/\s*(\w+)\s*=\s*([^=]+)\s*\z//){
        $variables{$1} = $2;
    }
    print Dumper(\%variables);

    __DATA__
    drsubc  = agauss(0, 1, 3)   delm1   = '0 + 0.045u*distm1'               delm2   = '0 + 0.07u*distm2'        delm3   = '0 + 0.07u*distm3'                delm4   = '0 + 0.07u*distm4'                delmt   = '0 + 0.07u*distmt'                delml   = '0.16u + 0.43u*distml'            delam   = '0.32u + 0.86u*distam'            dele1   = '0 + 0.25u*diste1'                dele2   = '0 + 0.25u*diste2'                delma   = '0.16u + 0.6u*distma'             pmsxt   = 'npmsxt + 12.5u*dpmsxt'           tih     = 0.35u                capct   = '0.50u + 0.13u*xdcapct'           capcti  = '0.55u + 0.13u*xdcapct'           m1t     = '0.41u + 0.05u*xdm1t'    m1ti    = '0.36u + 0.05u*xdm1t'             m2t     = '0.48u + 0.057u*dm2t'             m3t     = '0.48u + 0.057u*dm3t'             m4t     = '0.48u + 0.057u*dm4t'             mtt     = '0.48u + 0.057u*dmtt'             qtt     = '0.242u + 0.0202u*dqtt'           htt     = '0.242u + 0.0202u*dhtt'           mlt     = '2.0u + 0.2u*dmlt'                amt     = '4.0u + 0.4u*damt'                e1t     = '3.0u + 0.5u*de1t'            e2t     = '4.0u + 0.5u*xde1mat'             mat     = '4.0u + 0.4u*dmat'                m1m2t   = '0.35u + 0.05u*dm1m2t'
Re: how to find what's not there with a regex?
by inman (Curate) on Aug 24, 2005 at 15:43 UTC
    Reversing the initial input makes the regex easier. The resulting array needs reversing and every item in the array needs reversing.

    my $data = reverse <DATA>;
    my @answers = map {scalar reverse $_} reverse $data =~ /(.*?\s*?=\s.*?\w+)/g;
    print "$_\n" foreach (@answers);

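To make the double reversal easier to follow, here is a small self-contained sketch of the same idea on a short string. The helper name `split_pairs_reversed` and the sample input are made up for illustration, and the regex is a simplified variant of the one above, so treat it as a sketch rather than a drop-in replacement. It assumes a single line of input and no `=` inside a value.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split "name = value" pairs by
# scanning the reversed string, where every pair ends with its reversed name.
sub split_pairs_reversed {
    my ($line) = @_;
    my $rev = scalar reverse $line;

    # In the reversed text a pair reads: value, "=", name.
    my @chunks = $rev =~ /(.*?=\s*\w+)/g;

    # Undo both reversals: the order of the list and the characters of each item.
    my @pairs = map { scalar reverse $_ } reverse @chunks;

    # Trim the separator whitespace left over on each pair.
    s/^\s+|\s+$//g for @pairs;
    return @pairs;
}

my @pairs = split_pairs_reversed("a = b c   d = 'e f'   g = h(1, 2)");
print "[$_]\n" for @pairs;
```

The reason this works is that, read backwards, each name comes last in its pair, so a lazy match up to `=` followed by one word finds the boundary without any lookahead.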
      That's brilliant. You're absolutely right, that makes it much simpler!!!
Re: how to find what's not there with a regex?
by ikegami (Pope) on Aug 24, 2005 at 13:50 UTC
    This works with your data:

    while (<>) {
       chomp;
       while (
          /
             (\w+)          # Name ($1)
             \s*            # Spaces (optional)
             =              # Equal sign
             \s*            # Spaces (optional)
             (
                '           # Quote
                [^']*       # Non-quotes
                '           # Quote
             |              # -or-
                [^'\s]+     # Non-spaces|quotes
             )
          /xg
       ) {
          my ($name, $expr) = ($1, $2);
          $expr = substr($expr, 1, -1)
             if substr($expr, 0, 1) eq "'";
          print("var: $name, expr: $expr\n");
       }
    }

    Updated to catch unquoted expressions.

    Output:

    var: drsubc, expr: agauss(0,    <--- Doesn't work :(
    var: delm1, expr: 0 + 0.045u*distm1
    var: delm2, expr: 0 + 0.07u*distm2
    var: delm3, expr: 0 + 0.07u*distm3
    var: delm4, expr: 0 + 0.07u*distm4
    var: delmt, expr: 0 + 0.07u*distmt
    var: delml, expr: 0.16u + 0.43u*distml
    var: delam, expr: 0.32u + 0.86u*distam
    var: dele1, expr: 0 + 0.25u*diste1
    var: dele2, expr: 0 + 0.25u*diste2
    var: delma, expr: 0.16u + 0.6u*distma
    var: pmsxt, expr: npmsxt + 12.5u*dpmsxt
    var: tih, expr: 0.35u    <--- Works :)
    var: capct, expr: 0.50u + 0.13u*xdcapct
    var: capcti, expr: 0.55u + 0.13u*xdcapct
    var: m1t, expr: 0.41u + 0.05u*xdm1t
    var: m1ti, expr: 0.36u + 0.05u*xdm1t
    var: m2t, expr: 0.48u + 0.057u*dm2t
    var: m3t, expr: 0.48u + 0.057u*dm3t
    var: m4t, expr: 0.48u + 0.057u*dm4t
    var: mtt, expr: 0.48u + 0.057u*dmtt
    var: qtt, expr: 0.242u + 0.0202u*dqtt
    var: htt, expr: 0.242u + 0.0202u*dhtt
    var: mlt, expr: 2.0u + 0.2u*dmlt
    var: amt, expr: 4.0u + 0.4u*damt
    var: e1t, expr: 3.0u + 0.5u*de1t
    var: e2t, expr: 4.0u + 0.5u*xde1mat
    var: mat, expr: 4.0u + 0.4u*dmat
    var: m1m2t, expr: 0.35u + 0.05u*dm1m2t

      That works with the quoted variant, ikegami, but not the unquoted variant, like the first function. How do I say 'anything including spaces up to the first occurrence of more than one space in a row'?

        What follows is a solution which requires the minimum knowledge of the format. It works with the two special cases. Sorry, I must be tired today.

        while (<>) {
           chomp;
           while (
              /
                 (\w+)             # An identifier.
                 \s* = \s*         # Equal with opt spaces.
                 (
                    (?:
                       (?! \s+ \w+ \s* = )  # Stop if we see the next formula.
                       .                    # A character.
                    )+
                 )
              /xg
           ) {
              my ($name, $expr) = ($1, $2);
              $expr = substr($expr, 1, -1)
                 if substr($expr, 0, 1) eq "'";
              print("var: $name, expr: $expr\n");
           }
        }

        Output:

        var: drsubc, expr: agauss(0, 1, 3)    <- Works
        var: delm1, expr: 0 + 0.045u*distm1
        var: delm2, expr: 0 + 0.07u*distm2
        var: delm3, expr: 0 + 0.07u*distm3
        var: delm4, expr: 0 + 0.07u*distm4
        var: delmt, expr: 0 + 0.07u*distmt
        var: delml, expr: 0.16u + 0.43u*distml
        var: delam, expr: 0.32u + 0.86u*distam
        var: dele1, expr: 0 + 0.25u*diste1
        var: dele2, expr: 0 + 0.25u*diste2
        var: delma, expr: 0.16u + 0.6u*distma
        var: pmsxt, expr: npmsxt + 12.5u*dpmsxt
        var: tih, expr: 0.35u    <- Works
        var: capct, expr: 0.50u + 0.13u*xdcapct
        var: capcti, expr: 0.55u + 0.13u*xdcapct
        var: m1t, expr: 0.41u + 0.05u*xdm1t
        var: m1ti, expr: 0.36u + 0.05u*xdm1t
        var: m2t, expr: 0.48u + 0.057u*dm2t
        var: m3t, expr: 0.48u + 0.057u*dm3t
        var: m4t, expr: 0.48u + 0.057u*dm4t
        var: mtt, expr: 0.48u + 0.057u*dmtt
        var: qtt, expr: 0.242u + 0.0202u*dqtt
        var: htt, expr: 0.242u + 0.0202u*dhtt
        var: mlt, expr: 2.0u + 0.2u*dmlt
        var: amt, expr: 4.0u + 0.4u*damt
        var: e1t, expr: 3.0u + 0.5u*de1t
        var: e2t, expr: 4.0u + 0.5u*xde1mat
        var: mat, expr: 4.0u + 0.4u*dmat
        var: m1m2t, expr: 0.35u + 0.05u*dm1m2t

        How do I say 'anything including spaces up to the first occurrence of more than one space in a row'?
        A literal translation (untested) would be /(?>.*?(?=  ))/s.
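A runnable variation on that idea, as a sketch only: the helper name `up_to_double_space` is made up here, and the `\A` anchor and the `\z` fallback (so a value at the very end of the string still matches) are additions that the one-liner above does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything from the start of $text up to
# (but not including) the first run of two or more spaces.
sub up_to_double_space {
    my ($text) = @_;
    # Lazy match, stopped by a lookahead for two spaces or the end of string.
    my ($head) = $text =~ /\A(.*?)(?= {2}|\z)/s;
    return $head;
}

print "[", up_to_double_space("agauss(0, 1, 3)   delm1 = '0 + 0.045u*distm1'"), "]\n";
print "[", up_to_double_space("0.35u"), "]\n";
```

Single spaces inside the expression are consumed by the lazy `.*?`, because the lookahead only succeeds where two spaces sit side by side.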

      Sorry - this doesn't handle the non-quoted element.



      pbeckingham - typist, perishable vertebrate.
        I fixed it while you were replying :)
Re: how to find what's not there with a regex?
by BrowserUk (Pope) on Aug 24, 2005 at 13:57 UTC

    Updated: Simplified.

    #! perl -slw
    use strict;

    while( <DATA> ) {
        print "$1 : ", $2||$3 while m[
            (\w+)               ## the name
            \s+=\s+             ## the =
            (?:                 ## Either
                ' ( [^']+ ) '   ## all the non-quotes between quotes
                |               ## or
                (.*?)           ## the minimum
            )
            \s{2,}              ## absorb the two or more spaces
        ]gx;
    }

    =results

    P:\test>junk
    drsubc : agauss(0, 1, 3)
    delm1 : 0 + 0.045u*distm1
    delm2 : 0 + 0.07u*distm2
    delm3 : 0 + 0.07u*distm3
    delm4 : 0 + 0.07u*distm4
    delmt : 0 + 0.07u*distmt
    delml : 0.16u + 0.43u*distml
    delam : 0.32u + 0.86u*distam
    dele1 : 0 + 0.25u*diste1
    dele2 : 0 + 0.25u*diste2
    delma : 0.16u + 0.6u*distma
    pmsxt : npmsxt + 12.5u*dpmsxt
    tih : 0.35u
    capct : 0.50u + 0.13u*xdcapct
    capcti : 0.55u + 0.13u*xdcapct
    m1t : 0.41u + 0.05u*xdm1t
    m1ti : 0.36u + 0.05u*xdm1t
    m2t : 0.48u + 0.057u*dm2t
    m3t : 0.48u + 0.057u*dm3t
    m4t : 0.48u + 0.057u*dm4t
    mtt : 0.48u + 0.057u*dmtt
    qtt : 0.242u + 0.0202u*dqtt
    htt : 0.242u + 0.0202u*dhtt
    mlt : 2.0u + 0.2u*dmlt
    amt : 4.0u + 0.4u*damt
    e1t : 3.0u + 0.5u*de1t
    e2t : 4.0u + 0.5u*xde1mat
    mat : 4.0u + 0.4u*dmat
    m1m2t : 0.35u + 0.05u*dm1m2t

    =cut

    __DATA__
    drsubc  = agauss(0, 1, 3)   delm1   = '0 + 0.045u*distm1'               delm2   = '0 + 0.07u*distm2'        delm3   = '0 + 0.07u*distm3'                delm4   = '0 + 0.07u*distm4'                delmt   = '0 + 0.07u*distmt'                delml   = '0.16u + 0.43u*distml'            delam   = '0.32u + 0.86u*distam'            dele1   = '0 + 0.25u*diste1'                dele2   = '0 + 0.25u*diste2'                delma   = '0.16u + 0.6u*distma'             pmsxt   = 'npmsxt + 12.5u*dpmsxt'           tih     = 0.35u                capct   = '0.50u + 0.13u*xdcapct'           capcti  = '0.55u + 0.13u*xdcapct'           m1t     = '0.41u + 0.05u*xdm1t'    m1ti    = '0.36u + 0.05u*xdm1t'             m2t     = '0.48u + 0.057u*dm2t'             m3t     = '0.48u + 0.057u*dm3t'             m4t     = '0.48u + 0.057u*dm4t'             mtt     = '0.48u + 0.057u*dmtt'             qtt     = '0.242u + 0.0202u*dqtt'           htt     = '0.242u + 0.0202u*dhtt'           mlt     = '2.0u + 0.2u*dmlt'                amt     = '4.0u + 0.4u*damt'                e1t     = '3.0u + 0.5u*de1t'            e2t     = '4.0u + 0.5u*xde1mat'             mat     = '4.0u + 0.4u*dmat'                m1m2t   = '0.35u + 0.05u*dm1m2t'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      dnwrs = agauss('cnr_res/3',1,3)

      An even loonier case... thanks, all, for the help. I think I'm going to have to go back to the original multiline source and see if these are more identifiable there.

        Any other variations?

        #! perl -slw
        use strict;

        while( <DATA> ) {
            m[(\w+)\s+=\s+'?(.+)'?] and print "$1 : $2"
                for split /\s{2,}(?=\w+\s+=)/, $_;
        }

        __END__
        P:\test>junk
        drsubc : agauss(0, 1, 3)
        delm1 : 0 + 0.045u*distm1'
        dnwrs : agauss('cnr_res/3',1,3)
        delm2 : 0 + 0.07u*distm2'
        delm3 : 0 + 0.07u*distm3'
        delm4 : 0 + 0.07u*distm4'
        delmt : 0 + 0.07u*distmt'
        delml : 0.16u + 0.43u*distml'
        delam : 0.32u + 0.86u*distam'
        dele1 : 0 + 0.25u*diste1'
        dele2 : 0 + 0.25u*diste2'
        delma : 0.16u + 0.6u*distma'
        pmsxt : npmsxt + 12.5u*dpmsxt'
        tih : 0.35u
        capct : 0.50u + 0.13u*xdcapct'
        capcti : 0.55u + 0.13u*xdcapct'
        m1t : 0.41u + 0.05u*xdm1t'
        m1ti : 0.36u + 0.05u*xdm1t'
        m2t : 0.48u + 0.057u*dm2t'
        m3t : 0.48u + 0.057u*dm3t'
        m4t : 0.48u + 0.057u*dm4t'
        mtt : 0.48u + 0.057u*dmtt'
        qtt : 0.242u + 0.0202u*dqtt'
        htt : 0.242u + 0.0202u*dhtt'
        mlt : 2.0u + 0.2u*dmlt'
        amt : 4.0u + 0.4u*damt'
        e1t : 3.0u + 0.5u*de1t'
        e2t : 4.0u + 0.5u*xde1mat'
        mat : 4.0u + 0.4u*dmat'
        m1m2t : 0.35u + 0.05u*dm1m2t'
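The output above keeps a trailing quote on the quoted values, because the greedy `(.+)` swallows it before the optional `'?` gets a chance. One possible way around that, sketched here under the assumption that pairs are always separated by two or more spaces: split first, then strip quotes only when they wrap the whole expression. The helper name `parse_params` and the sample line are illustrative, not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [name, expr] array refs.
sub parse_params {
    my ($line) = @_;
    my @out;
    # Split wherever 2+ spaces are followed by the next "name =".
    for my $chunk (split /\s{2,}(?=\w+\s*=)/, $line) {
        next unless $chunk =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/s;
        my ($name, $expr) = ($1, $2);
        # Strip quotes only when one pair encloses the entire expression,
        # so agauss('cnr_res/3',1,3) keeps its inner quotes.
        $expr = $1 if $expr =~ /^'([^']*)'$/;
        push @out, [$name, $expr];
    }
    return @out;
}

my $line = "drsubc  = agauss(0, 1, 3)   dnwrs   = agauss('cnr_res/3',1,3)"
         . "   delm1   = '0 + 0.045u*distm1'   tih     = 0.35u";
printf "%s : %s\n", @$_ for parse_params($line);
```

An unquoted expression that itself contains two spaces followed by something that looks like `name =` would still be split in the wrong place; the sample data does not seem to have such a case.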

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: how to find what's not there with a regex?
by davidrw (Prior) on Aug 24, 2005 at 13:53 UTC
    Maybe something like this (you could do %matches instead of @matches if desired, as well):
    my @matches = $input =~ m/\b(\w+)\s+=( '.*?'|( \S+)+)/sg;
    Match the LHS and then the equals sign, and then either a single-quoted string or a sequence of single_space-word sets.
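For the `%matches` variant, note that the pattern above has three capturing groups, so a list-context match returns three items per pair and would not load cleanly into a hash. A sketch of one way to adapt it, with the inner groups made non-capturing and the space after `=` moved outside the capture (both are changes to the pattern above, and `parse_to_hash` is a made-up name):

```perl
use strict;
use warnings;

# Hypothetical helper: build a name => value hash from one line of input.
sub parse_to_hash {
    my ($input) = @_;
    # Exactly two captures per match: the name, then either a quoted
    # string or a run of words separated by single spaces.
    my %matches = $input =~ m/\b(\w+)\s+=\s+('.*?'|\S+(?: \S+)*)/sg;
    return %matches;
}

my %matches = parse_to_hash(
    "drsubc  = agauss(0, 1, 3)   delm1   = '0 + 0.045u*distm1'   tih     = 0.35u"
);
print "$_ => $matches{$_}\n" for sort keys %matches;
```

Quoted values keep their quotes here, and a hash drops the original order and any repeated names, so `@matches` may still be the better fit if either matters.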
Re: how to find what's not there with a regex?
by QM (Parson) on Aug 24, 2005 at 15:17 UTC
    How to find what's not there with a regex?
    <facetious_mode>

    Don't look!

    </facetious_mode>

    Sorry, couldn't resist. Must be the lack of sleep ;)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      no, that's how to not find what _is_ there... sorry, couldn't resist!
