http://www.perlmonks.org?node_id=1226708

Djay has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I've been working on building a small script, equivalent to a Powershell Script I've already written (but this time for Linux), which essentially takes the output of a command, regex's it, chucks all the capture groups into separate arrays and counts the number of objects in each array. This is reasonably complicated (for me) in Powershell, but can be accomplished through clever use of hashtables.

I am a complete and utter Perl novice, with literally no experience in the language. I've tried to do the above with Bash but it seems so limited, and my regex is using PCRE. I've tried using a Perl one-liner, but I can't seem to get the capture group variables into an array in bash to count them.

Below is a sample output from the command

      JobID          Type    State    Status Policy           Schedule Client      Dest Media Svr Active PID
      41735          Backup  Done      0     Policy_name_here    daily hostname001 MediaSvr1       8100
      41734          Backup  Done      0     Policy_name_here    daily hostname002 MediaSvr1       7803
      41733          Backup  Done      0     Policy_name_here    daily hostname004 MediaSvr1       7785
      41732          Backup  Done      0     Policy_name_here    daily hostname005 MediaSvr1       27697
      41731          Backup  Done      0     Folicy_name_here    daily hostname006 MediaSvr1       27523
      41730          Backup  Done      0     Policy_name_here    daily hostname007 MediaSvr1       27834
      41729          Backup  Done      0     Policy_name_here    -     hostname008 MediaSvr1       27681
      41728          Backup  Done      0     Policy_name_here    -     hostname009 MediaSvr1       27496
      41727  Catalog Backup  Done      0     catalog             full  hostname010 MediaSvr1       27347
      41712  Catalog Backup  Done      0     catalog             -     hostname004                 30564

I'm terrible with HTML so I don't know how to fix the text wrapping in the above output

Below is my regex (which works in Powershell)

/(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g

In the script I have for Windows, each capture group corresponds to a column of this table, and my script counts the objects within each capture group (thus counting the number of failed, successful, running backups etc.).

I have no code written in Perl other than the following (which doesn't work for my purpose and was Frankenstein'd from much googling)

#!/usr/bin/perl
#
use strict;
use warnings;

my $output = `bpdbjobs`;
while (my $line = $output) {
    chomp $line;
    my @array = $line =~ /(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g;
    foreach my $s (@array) {
        print "'$s'\n";
    }
}

Any help would be greatly appreciated - cheers

Re: Turning regex capture group variables into arrays, then counting the number of objects in the array
by markong (Pilgrim) on Dec 04, 2018 at 12:50 UTC

    Assuming that the big regex you posted is *actually* tested against one line of command output and bug free, I'm gonna give you a skeletal example of one simple way to proceed.

    But first, Perl is influenced by something called context:

    $output = `program args`;   # collect output into one multiline string
    @output = `program args`;   # collect output into array, one line per element
    So your code probably ends up in an infinite loop: $output is a non-empty string, which evaluates as true inside the while() test on every iteration, and nothing in the loop ever changes it.

    Try to start with something like this instead:
    #!/usr/bin/perl
    #
    use strict;
    use warnings;

    my @output = `bpdbjobs`;
    for my $line (@output) {
        chomp $line;
        my @matches = $line =~ /(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g;    ## <<--- Beware of the global modifier g here!
        if (@matches) {
            # @matches now is an array containing the captured matches
            # Pretty printing time ?
        }
    }

    It now should be a matter of double checking the regex correctness and maybe using some modules to help with printing, e.g.: Text::Table ?

    Good luck!
      It's funny, this is actually very close to the first portion of my Powershell script. Does this allow me to call and count the matches in an individual capture group? This is how I do it in Powershell (sorry to bring this here, but I find it easier to explain code in code form):
      $output = ./bpdbjobs
      $Results = @()
      $ColumnName = @()
      foreach ($match in $OUTPUT) {
          $matches = $null
          $match -match "(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+`-\w`-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+`_\w+)|(\w+`_\w+`_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?\s+(?<FATPipe>\b[^\d\W]+\b)?"
          $Results += $matches
      }
      foreach ($result in $results) {
          $Object = New-Object psobject -Property @{
              JobID          = $Result.jobID
              Type           = $Result.Type
              State          = $Result.State
              Status         = $Result.Status
              Policy         = $Result.Policy
              Schedule       = $Result.Schedule
              Client         = $Result.Client
              Dest_media_svr = $Result.dest_media_svr
              Active_PID     = $Result.Active_PID
              FATPipe        = $Result.FATPipe
          }
          $ColumnName += $Object
      }
      Powershell already understands that $Result.jobID is referring to the jobID capture group. All this does is put the capture group and results of said capture into an object which I can put into a variable and use any time using the below code as an example
      $Successful = ($ColumnName | where {$_.Status -eq "0"}).count
      This creates a variable from the $ColumnName variable built by the previous code: it pulls out only the objects whose Status column ($_.Status) equals 0 and counts them.

        If you have fixed format records, then perhaps using unpack is a simpler option. For example

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my $fmt = 'A13 A8 A8 A10 A6 A20 A6 A12 A16 A5';
        my %counts = ();
        my @col = ('JobID', 'Col2', 'Type', 'State',
                   'Status', 'Policy', 'Schedule', 'Client',
                   'Dest Media Svr', 'Active PID');    # 10 cols

        while (<DATA>) {
            next unless /\S/;         # skip blank lines
            next if /^\s+JobID/;      # skip header
            chomp;
            my @f = unpack $fmt, $_;  # cut the line into fixed-width fields
            s/^\s+|\s+$//g for @f;    # trim spaces
            # count each column
            for my $n (0 .. $#col) {
                ++$counts{ $col[$n] }{ $f[$n] };
            }
            print join('|', @f), "\n";    # check
        }
        print Dumper \%counts;
        printf "Successful = %d\n", $counts{'Status'}{'0'};

        __DATA__
              JobID          Type    State    Status Policy           Schedule Client      Dest Media Svr Active PID
              41735          Backup  Done      0     Policy_name_here    daily hostname001 MediaSvr1       8100
              41734          Backup  Done      0     Policy_name_here    daily hostname002 MediaSvr1       7803
              41733          Backup  Done      0     Policy_name_here    daily hostname004 MediaSvr1       7785
              41732          Backup  Done      0     Policy_name_here    daily hostname005 MediaSvr1       27697
              41731          Backup  Done      0     Folicy_name_here    daily hostname006 MediaSvr1       27523
              41730          Backup  Done      0     Policy_name_here    daily hostname007 MediaSvr1       27834
              41729          Backup  Done      0     Policy_name_here    -     hostname008 MediaSvr1       27681
              41728          Backup  Done      0     Policy_name_here    -     hostname009 MediaSvr1       27496
              41727  Catalog Backup  Done      0     catalog             full  hostname010 MediaSvr1       27347
              41712  Catalog Backup  Done      0     catalog             -     hostname004                 30564
        poj

        I have zero PowerShell knowledge, but from what you describe you probably want to use the named capture groups feature (?<NAME>...) (the original regex you posted didn't contain any...).
        This is still a capture group just like a regular parenthesized grouping, but its name is NAME.

        I'd recommend you to look at some examples and while you're at it, you should also read part 1 of that wonderful tutorial to understand how to access the various bits of information from a successful match!

        It's all there!
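
        For illustration, here is a minimal sketch of named captures. It needs Perl 5.10 or newer, and both the sample line and the shortened pattern are assumptions made for the example, not the full regex from the question. After a successful match the named groups are available through the special hash %+.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or newer

# Made-up sample line and a deliberately simplified pattern,
# not the full regex from the question.
my $line = '41735 Backup Done 0 Policy_name_here daily hostname001 MediaSvr1 8100';

my %job;
if ($line =~ /^\s*(?<JobID>\d+)\s+(?<Type>\w+)\s+(?<State>\w+)\s+(?<Status>\d+)/) {
    %job = %+;    # copy the named captures into an ordinary hash
}

# Print each captured field by name.
print "$_ = $job{$_}\n" for sort keys %job;
```

        Copying %+ into a plain hash right away keeps the values safe, because the next successful match overwrites %+.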

Re: Turning regex capture group variables into arrays, then counting the number of objects in the array
by 1nickt (Canon) on Dec 04, 2018 at 12:46 UTC

    Hi, welcome to Perl, the One True Religion. You don't show the output of your code, but if the input from the program is as you showed, while (my $line = $output) ... ain't gonna work.

    Either you would need to split your input into a list of lines and loop through them with for, or else you need to make your regexp handle a multi-line string with \s (edit: or get the input in multiple lines, as shown by markong below).
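
    A minimal sketch of the first option, assuming the whole command output has already been read into one string. The sample text below is made up and stands in for what `bpdbjobs` would return in scalar context.

```perl
use strict;
use warnings;

# Stand-in for the multi-line string that backticks return in
# scalar context (the sample text is made up).
my $output = "41735 Backup Done 0\n41734 Backup Done 0\n";

# Split on newlines to get one element per line; split drops the
# trailing empty field, so no chomp is needed afterwards.
my @lines = split /\n/, $output;

for my $line (@lines) {
    print "line: '$line'\n";
}
printf "%d lines\n", scalar @lines;
```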

    But if your input is as shown, with no (unquoted) spaces in the values, you would be better off parsing it as a delimited file, e.g. with Text::CSV_XS or Spreadsheet::Read.

    (And if you wanted to get clever, there are ways to write a custom "grammar" to parse your data format.)

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Turning regex capture group variables into arrays, then counting the number of objects in the array
by rsFalse (Chaplain) on Dec 05, 2018 at 14:05 UTC
    I recommend using the /x modifier for such a long regex and spreading the regex over several lines for better readability. Docs about the /x modifier -> /x and /xx.
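
    As a rough sketch of what that can look like: under /x, whitespace inside the pattern is ignored and # starts a comment, so each column can sit on its own line. The pattern below is a shortened illustration with assumed field shapes, not a drop-in replacement for the regex in the question.

```perl
use strict;
use warnings;

# Simplified illustration only; the field patterns are assumptions.
my $re = qr/
    ^\s*
    (\d+)   \s+    # JobID
    (\w+)   \s+    # Type (a single word in this sketch)
    (\w+)   \s+    # State
    (\d+)          # Status
/x;

my $line = '41735 Backup Done 0 Policy_name_here daily';

# In list context a successful match returns the captured groups.
my ($id, $type, $state, $status) = $line =~ $re;
print "$id / $type / $state / $status\n";    # prints "41735 / Backup / Done / 0"
```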
Re: Turning regex capture group variables into arrays, then counting the number of objects in the array
by Anonymous Monk on Dec 04, 2018 at 14:27 UTC

    You might get better help if you told us what you were trying to do rather than just giving us some code to ponder on.

    One reason for this is that PowerShell and Perl may interpret your given regular expression differently. In Perl, your regular expression has 24 capture groups. Is this really what you want?

    Another reason is that there might be a better Perl implementation than just translating PowerShell. For example, if all you are trying to do with your input is to split it into columns and you are sure that none of the columns will be empty,

    my @array = split qr/\s+/, $line;

    is cleaner and more understandable than a 260-character (more or less) regular expression.
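
    To show how that could feed a count, here is a small sketch. It rests on the same assumption that no column is ever empty, and the rows and the choice of the fourth column as Status are made up for the example.

```perl
use strict;
use warnings;

# Made-up rows in which every column is filled in.
my @lines = (
    '41735 Backup Done 0 Policy_name_here daily hostname001 MediaSvr1 8100',
    '41734 Backup Done 1 Policy_name_here daily hostname002 MediaSvr1 7803',
    '41733 Backup Active 0 Policy_name_here daily hostname004 MediaSvr1 7785',
);

my %status_count;
for my $line (@lines) {
    my @array = split qr/\s+/, $line;
    $status_count{ $array[3] }++;    # fourth column is Status in this sketch
}

print "Status $_: $status_count{$_}\n" for sort keys %status_count;
```

    A hash keyed by the column value does the counting in one pass, so no separate array per column is needed.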

      Some columns are blank, some show hostnames, some underscores, hyphens, etc. I use regex 101 to test all my regex and this regex is functional with every outcome I've seen so far on that website, which uses base PCRE. This translated into Powershell with only one change (escaping the -)

      In my Powershell code I have named capture groups, which I can't use here (I don't think Perl 5.8.5 supports them).

      The script itself is to parse the output (sometimes hundreds of lines) of a command which shows information about backups being run. In Powershell, it correctly parses the data, counts the number of Successful, Failed, Running, Partial backups based on the output of the "Status" column and the "State" column. It then shows this in a format our Monitoring tool (SolarWinds) understands, and displays a message about any failed backups, referencing some of the other columns.

        In my Powershell code I have named capture groups ... (I don't think Perl 5.8.5 supports them)

        It's good to know the version of Perl you're using. Indeed, Perl 5.8 does not support named capture groups.
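
        On 5.8 something similar can be approximated with a hash slice, which assigns the positional captures to a list of names. The sketch below uses nothing newer than 5.8; the column names, sample rows and shortened pattern are assumptions made for the example.

```perl
use strict;
use warnings;

# Names for the positional captures, in order (chosen for illustration).
my @names = qw(JobID Type State Status);

# Made-up sample rows.
my @lines = (
    '41735 Backup Done 0 Policy_name_here daily hostname001 MediaSvr1 8100',
    '41734 Backup Done 1 Policy_name_here daily hostname002 MediaSvr1 7803',
    '41733 Backup Done 0 Policy_name_here - hostname004 MediaSvr1 7785',
);

my @jobs;
for my $line (@lines) {
    # Simplified pattern, not the full regex from the question.
    my @captures = $line =~ /^\s*(\d+)\s+(\w+)\s+(\w+)\s+(\d+)/
        or next;                 # skip lines that do not match
    my %job;
    @job{@names} = @captures;    # hash slice: name => captured value
    push @jobs, \%job;
}

# grep in scalar context returns the number of matching elements.
my $successful = grep { $_->{Status} eq '0' } @jobs;
print "Successful = $successful\n";    # prints "Successful = 2"
```

        Each element of @jobs then plays the role of one PowerShell object, and grep does the job of the where filter followed by .count.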


        Give a man a fish:  <%-{-{-{-<