
How do I extract named variable names from regex string

by maspsr (Initiate)
on Jan 30, 2012 at 13:39 UTC ( #950770=perlquestion: print w/replies, xml )
maspsr has asked for the wisdom of the Perl Monks concerning the following question:


I have the following regex, which I use when extracting information from different logfiles:

(?i-xsm:(?<mon>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<min>\d+):(?<sec>\d+)\s*(?<host>\S+)\s*(?<prog>\S+)\/(?<proc>\S+)\[(?<pid>\d+)\]:\s*(?<id>\S+):\s*to=<(?<to>\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)>,\s*relay=(?<name>\S+)\[(?<ip>\d+.\d+.\d+.\d+)\]:(?<port>[0-9]+),\s*delay=(?<time>\d+\.\d+),\s*delays=(?<beforequeue>[0-9.]+)\/(?<inqueue>[0-9.]+)\/(?<connect>[0-9.]+)\/(?<transmission>[0-9.]+),\s*dsn=(?<status>[0-9.]+),\s*status=sent\s*\((?<code>[0-9.]+)(?<message>.*)\))

I would like to extract all the named parameters, like ?<mon>, ?<day>, ..., to make sure that I haven't used the same name twice. The string is composed from substrings I have in a database. I have tried different approaches, but so far without success. I thought I could use some repetitive pattern like


But I can't make it work. I could use the Perl split function and probably get a list of the names.

Any ideas?

Best regards

Peter Sørensen/Univ.Of.Southern.Denmark

Replies are listed 'Best First'.
Re: How do I extract named variable names from regex string
by McA (Priest) on Jan 30, 2012 at 14:00 UTC
    I'm not sure whether you are searching for something like that.
    my $string = 'yourregex here';
    while ($string =~ m/\(\?\<([a-zA-Z0-9]+)\>/g) {
        print $1, "\n";
    }
    Best regards
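
A minimal sketch of the global-match idea above, wrapped in a sub. The sub name capture_names and the widened name pattern [A-Za-z_]\w* (Perl group names may contain underscores) are assumptions, not part of the reply. Like the loop above, it works on the regex text only, so it would also pick up an escaped literal such as \(?<x>.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all (?<name>...) groups in a
# regex source string, in order of appearance. Purely textual: it does
# not compile the pattern, and it ignores (?'name'...) and (?P<name>...).
sub capture_names {
    my ($source) = @_;
    return $source =~ m/\(\?<([A-Za-z_]\w*)>/g;
}

my $pattern = '(?<mon>\w+)\s+(?<day>\d+)\s+(?<hour>\d+):(?<min>\d+)';
print join(',', capture_names($pattern)), "\n";   # mon,day,hour,min
```

Lookbehind assertions such as (?<=...) and (?<!...) are skipped, because the character after (?< must be a letter or underscore.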
Re: How do I extract named variable names from regex string
by chessgui (Scribe) on Jan 30, 2012 at 14:02 UTC
    Note that a-z+ should be [a-z]+ and the second '+' refers only to the character '>'.
    my @names;
    while ($string =~ s/\?<([a-z]+)>//) {
        push(@names, $1);
    }
    print "Names: ", join(',', @names), "\n";
      This is the confusion which results when someone does not use code tags. The OP actually did use [a-z]+, as can be seen when clicking on the xml link, but it rendered poorly.
Re: How do I extract named variable names from regex string
by ww (Bishop) on Jan 30, 2012 at 14:18 UTC
    I'm having trouble imagining the range of possible content in the string against which you wish to use your regex. I don't see any consistent separator between the fields you want to capture, which tends to rule out a "repetitive pattern" of characters similar to what you've shown (and pay special attention to chessgui's catch).

    Please post a few lines of (representative) sample data.

Re: How do I extract named variable names from regex string
by JavaFan (Canon) on Jan 30, 2012 at 17:09 UTC
    Easy, you can let Perl do all the work. Assume your pattern is in $str, then do:
    use 5.010;
    "" =~ /(?:$str)?/;
    while (my ($key, $matches) = each %-) {
        if (@$matches > 1) {
            say "Duplicate name $key";
        }
    }
      Sorry for nitpicking, but in practice this only works with a string where all groups match; "" isn't enough.

      use 5.010;
      "" =~ /(?<mon>\w+)\s(?<mon>\w+)/;
      print scalar keys %-;   # 0

      Now, constructing such a string is in general even more difficult than just parsing for named capture labels.

      Good idea anyway! :)

      Cheers Rolf

        Thank you for playing, but you fail.

        "" =~ /(?:(?<mon>\w+)\s(?<mon>\w+))?/;
        say scalar keys %-;   # 1
        All that's required is for the pattern to match. Taking the pattern, and wrapping it inside a (?: )? will make it match against "" (except for some degenerate cases). If you look back at my code, this is exactly what I did.
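
A minimal sketch of the %- approach from this sub-thread, packaged as a sub. The name duplicate_names is an assumption for illustration. It relies on the wrapping trick described above: putting the pattern inside (?: )? lets it match the empty string, after which %- lists every group name the pattern declares.

```perl
use strict;
use warnings;

# Hypothetical helper built on the %- idea above: compile the pattern
# inside an optional group so it matches the empty string, then read %-.
# Returns the group names that are declared more than once.
sub duplicate_names {
    my ($source) = @_;
    my @dups;
    if ("" =~ /(?:$source)?/) {
        # %- maps each group name to an array ref with one slot per
        # group carrying that name, whether or not the group took part.
        @dups = sort grep { @{ $-{$_} } > 1 } keys %-;
    }
    return @dups;
}

print join(',', duplicate_names('(?<mon>\w+)\s(?<mon>\w+)\s(?<day>\d+)')), "\n";   # mon
```

Because Perl itself parses the pattern, this variant is not fooled by escaped parentheses, at the cost of requiring a pattern that compiles.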
Re: How do I extract named variable names from regex string
by pklausner (Scribe) on Jan 30, 2012 at 16:31 UTC

    This looks like a syslog message line. If you have access to the syslog daemon configuration and already use syslog-ng, you already have macros to split all the standard fields. Then you could pipe the payload message to your perl script. In there a split /[ =]+/ ... looks like it would yield the proper {name, value} pairs. I gather rsyslog has similar features.

    Main advantage of this approach: your actions run in real-time, as the events come in.

Re: How do I extract named variable names from regex string
by TomDLux (Vicar) on Jan 30, 2012 at 17:04 UTC

    Right now the regex is huge and complicated. Imagine trying to prove it captures everything you need, with no lost data, and that it captures no more than you need, to maximize the data you can process in available memory. Or imagine you need to hire someone to make changes. You can hire a first year student at X kroner / hour, a second year student at X^2 k/h, a senior at X! k/h .... or even worse, imagine having to figure out next year what you had in mind when you programmed it.

    If split partitions your problem into smaller, more easily understandable problems, that sounds like a good idea to me. Split is fast and efficient.

    Alternatively, define a number of small regexes, and then assemble them into a larger structure. That way the components are named, so you have an idea what each is intended to achieve, and decipherment is bounded, with a limited number of characters to figure out. Compare that to the regex-as-a-whole, where you can't tell where part one ends and part two begins.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.
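
The "small regexes assembled into a larger structure" suggestion above could look roughly like the following sketch. The component names in %part and the sample log line are hypothetical and are not taken from the original poster's database.

```perl
use strict;
use warnings;

# Hypothetical component patterns; the names and the sample line are
# illustrative and not taken from the original regex database.
my %part = (
    date => qr/(?<mon>\w+)\s+(?<day>\d+)/,
    time => qr/(?<hour>\d+):(?<min>\d+):(?<sec>\d+)/,
    host => qr/(?<host>\S+)/,
);

# Assemble the named pieces into one pattern.
my $line_re = qr/^$part{date}\s+$part{time}\s+$part{host}/;

my $line = 'Jan 30 13:39:05 mailhost postfix/smtp[1234]: ...';
if ($line =~ $line_re) {
    print "$+{mon} $+{day} at $+{hour}:$+{min} on $+{host}\n";
}
```

Each piece can be tested on its own, and a duplicate capture name is easier to spot when every component declares only a few groups.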

Re: How do I extract named variable names from regex string
by LanX (Chancellor) on Jan 31, 2012 at 00:04 UTC
    I'm too lazy to work out a complete solution but a clean approach would be to deparse the regex-compilation:

    > perl -e 'use re 'debug';/(?i-xsm:(?<mon>\w+)\s+(?<day>\d+)\s+)/'
    Compiling REx "(?i-xsm:(?<mon>\w+)\s+(?<day>\d+)\s+)"
    Final program:
       1: OPEN1 'mon' (3)
       3: PLUS (5)
       4: ALNUM (0)
       5: CLOSE1 'mon' (7)
       7: PLUS (9)
       8: SPACE (0)
       9: OPEN2 'day' (11)
      11: PLUS (13)
      12: DIGIT (0)
      13: CLOSE2 'day' (15)
      15: PLUS (17)
      16: SPACE (0)
      17: END (0)
    stclass ALNUM plus minlen 4
    Freeing REx: "(?i-xsm:(?<mon>\w+)\s+(?<day>\d+)\s+)"

    Now fetching all /OPEN\d+ '(\w+)'/-opcodes shouldn't be too difficult.

    See perldoc re for more options.

    Cheers Rolf

    UPDATE: shrank example.
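
A minimal sketch of the "fetch the OPEN opcodes" step, run against dump lines copied from the output above rather than against a live compilation. The sub name names_from_dump is an assumption. The debug dump is written to STDERR and its layout is not guaranteed to be the same in every Perl version, so treat the pattern as tied to the format shown here.

```perl
use strict;
use warnings;

# Dump text copied from the reply above (abridged to the OPEN/CLOSE lines).
my $dump = <<'END';
   1: OPEN1 'mon' (3)
   5: CLOSE1 'mon' (7)
   9: OPEN2 'day' (11)
  13: CLOSE2 'day' (15)
END

# Hypothetical helper: pull the group names out of the OPENn opcodes.
sub names_from_dump {
    my ($text) = @_;
    return $text =~ /\bOPEN\d+ '(\w+)'/g;
}

print join(',', names_from_dump($dump)), "\n";   # mon,day
```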

Re: How do I extract named variable names from regex string
by mcdave (Beadle) on Feb 02, 2012 at 02:43 UTC
    I like the nice, simple my @items = ($pat =~ m/\?<([a-z]+)>/g) up above, but the "split" comment got me amused so I wrote down
    map { s/>.*// } (my @items = split /\?</, $pat) ;
    which seems just about cryptic enough, then
    my %freq = () ; map { $freq{$_}++ } @items ;
    and pull out the keys with value bigger than 1 for duplicates.

    I couldn't come up with a way to scan for duplicates all in one line, which is a little disappointing, but maybe there's something clever to do.
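
One possible way to do the duplicate scan in a single statement, sketched under the assumption that names consist of \w characters only. The %seen hash and the sample value of $pat are hypothetical. The grep keeps a name only on its second sighting, so a name used three times is still reported once.

```perl
use strict;
use warnings;

my $pat = '(?<mon>\w+)\s(?<day>\d+)\s(?<mon>\w+)\s(?<mon>\w+)';

# One statement: extract every name, keep a name the second time it is seen.
my %seen;
my @dups = grep { $seen{$_}++ == 1 } $pat =~ /\(\?<(\w+)>/g;

print "Duplicates: @dups\n";   # Duplicates: mon
```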
