Filter output based on values

by LexPl (Sexton)
on Nov 18, 2024 at 17:24 UTC

LexPl has asked for the wisdom of the Perl Monks concerning the following question:

I have a large number of specific regexes stored in an array @regexes. My aim is to get statistics that tell me which regexes occur in the input file and how often each one occurs.

A count loop accumulates the number of occurrences for each $regex from @regexes in the array @tally.

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    ++$tally[$i] while $xml =~ /$regex/g;
}

Later I have an output loop that prints each $regex and its number of occurrences ($tally[$i]).

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    $regex =~ s/^\(\?\^://;
    $regex =~ s/\)$//;
    printf {$out} "%-20s %30d \n", $regex, $tally[$i] // 0;
}

Now I would like to exclude from the output any regex that doesn't occur in my file, i.e. print a line only if $tally[$i] ne '0'. But my idea of wrapping the printf statement in an if statement doesn't work.

if ($tally[$i] ne '0') {
    printf {$out} "%-20s %30d \n", $regex, $tally[$i] // 0;
}

Please bear with me while I add a small secondary problem to this question. How can I specify the output format in printf so that all counts are aligned properly regardless of the length of the regex string?

Re: Filter output based on values
by Corion (Patriarch) on Nov 18, 2024 at 18:26 UTC

    Have you looked at the value of $tally[$i]? It may help you formulate your if condition if you inspect and print that value.

    For your printf question, see printf, which points you to sprintf, but sprintf can only do fixed-width. I often use Text::Table instead.

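    use Text::Table;

    my $tb = Text::Table->new( "Regex", "Tally" );
    my @rows;
    for my $regex (@regexes) {
        my $tally = 0;    # start at 0 so a regex without matches is not undef
        $tally++ while $xml =~ /$regex/g;
        # Strip the qr// wrapper from a copy; $regex is an alias into @regexes
        my $label = "$regex";
        $label =~ s/^\(\?\^://;
        $label =~ s/\)$//;
        push @rows, [$label, $tally];
    }
    $tb->load( @rows );
    print $tb;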
Re: Filter output based on values
by haj (Vicar) on Nov 18, 2024 at 21:17 UTC

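    The check $tally[$i] ne '0' assumes that you initialized the elements of @tally. The elements that were never incremented are probably still undef, and in a string comparison undef behaves like the empty string, which is not equal to '0'. So the condition is true for exactly the entries you want to skip.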

    You can simplify your check:

    if ($tally[$i]) {
        printf {$out} "%-20s %30d \n", $regex, $tally[$i] // 0;
    }
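
    A minimal sketch of the difference between the two checks, assuming a @tally in which one element was never incremented. The helper names passes_string_check and passes_truth_check and the sample counts are made up for illustration.

```perl
use strict;
use warnings;

# String comparison as in the original attempt; undef is treated as ''.
sub passes_string_check {
    my ($count) = @_;
    no warnings 'uninitialized';    # undef in 'ne' would otherwise warn
    return $count ne '0' ? 1 : 0;
}

# Plain truth test: undef, 0 and '0' are all false.
sub passes_truth_check {
    my ($count) = @_;
    return $count ? 1 : 0;
}

# Hypothetical tallies: index 1 was never incremented, so it is still undef.
my @tally;
$tally[0] = 3;
$tally[2] = 1;

for my $i (0 .. 2) {
    printf "index %d: ne '0' -> %d, truth test -> %d\n",
        $i, passes_string_check($tally[$i]), passes_truth_check($tally[$i]);
}
```

    For the untouched index 1 the string comparison still lets the line through, while the truth test skips it.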

    To align the occurrences you can find out the longest regex in advance and use this length in your format. The formats for printf are strings, so Perl will happily interpolate a variable into them!
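
    As an alternative to interpolating the length into the format string, printf and sprintf also accept * as the width and take the value from the argument list. A small sketch, with made-up labels and counts standing in for the stripped regexes and @tally; format_row is a hypothetical helper name:

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hypothetical rows: [ label, count ]
my @rows = ( [ '\bmonk', 1 ], [ 'Douglas Adams', 1 ], [ 'x', 12 ] );

# Width of the longest label, computed before printing anything.
my $width = max( map { length $_->[0] } @rows );

# '%-*s' left-justifies the label in a field whose width is the next argument.
sub format_row {
    my ($w, $label, $count) = @_;
    return sprintf "%-*s %3d", $w, $label, $count;
}

print format_row($width, @$_), "\n" for @rows;
```

    Both forms produce the same layout; the * form just keeps the format string free of interpolated variables.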

    Here's a complete example. I use a bit of map and grep magic to get the longest regex which actually occurs in the text.

    use 5.032;
    use autodie;
    use List::Util qw( max );
    use File::Temp qw( tempfile );

    my $xml = <<END;
    Electric monks believed things for you, thus saving you what was
    becoming an increasingly onerous task, that of believing all the
    things the world expected you to believe...
    The new improved Monk Plus models were twice as powerful, had an
    entirely new multi-tasking Negative Capability feature that allowed
    them to hold up to 16 entirely different and contradictory ideas in
    memory simultaneously without generating any irritating system errors.
    -- Douglas Adams, Dirk Gently's Holistic Detective Agency
    END

    my @tally;
    my @regexes = (
        qr/\bmonk/,
        qr/\bmonk\b/,
        qr/\bmonk\b/i,
        qr/Douglas Adams, Dirk Gently's Holistic Detective Agency/,
        qr/Douglas Adams, The Hitchhikers Guide To The Galaxy (A trilogy in five)/
    );

    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $xml =~ /$regex/g;
    }

    my $max_length =
        max(map { length "$_" } @regexes[ grep { $tally[$_] } (0 .. $#regexes) ]);

    my ($out,$path) = tempfile( CLEANUP => 0);
    say "Your output is available at '$path'";
    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        $regex =~ s/^\(\?\^://;
        $regex =~ s/\)$//;
        if ($tally[$i]) {
            printf {$out} "%-${max_length}s %3d\n", $regex, $tally[$i] // 0;
        }
    }
Re: Filter output based on values
by InfiniteSilence (Curate) on Nov 20, 2024 at 07:33 UTC

    Not a great improvement on the other solutions (below). However, making a single structure saves space and allows for a way to easily see things in the debugger (as shown). Also, no need to scan the list again to obtain the max length.

    #!/usr/bin/perl -w
    use strict;

    my $maxLen = 0;
    my $aline  = qq~an interesting line of text I would like to scan I see I see I see~;
    my @regexes = (
        ['first',  qr/should fail/, 0],
        ['second', 'like',          0],
        ['third',  qr/I/,           0],
        ['fourth', qr/scan/,        0],
    );

    # Print only the patterns that matched at least once.
    sub niceOutput {
        for (@regexes) {
            printf "%-${maxLen}s %30d\n", $_->[1], $_->[2] if $_->[2] > 0;
        }
    }

    sub resetRegexes { $_->[2] = 0 for @regexes }

    # Count matches and track the longest pattern in the same pass.
    # Uses the lexical $maxLen throughout instead of mixing it with $main::maxLen.
    sub rexyScan {
        for my $n (@regexes) {
            ++$n->[2] while $aline =~ m/$n->[1]/g;
            my $Len = length $n->[1];
            $maxLen = $Len if $Len > $maxLen;
        }
    }

    rexyScan();
    niceOutput();

    Debugger shows @regexes:

    main::(-e:1):   0
      DB<1> @regexes = (['first',qr/interesting/,0] , ['second','like',0] , ['third',qr/I/,0] , ['fourth',qr/scan/,0]);
      DB<2> x @regexes
    0  ARRAY(0x569bf962c388)
       0  'first'
       1  (?^u:interesting)
          -> qr/(?^u:interesting)/
       2  0
    1  ARRAY(0x569bf95f6650)
       0  'second'
       1  'like'
       2  0
    2  ARRAY(0x569bf95f68d8)
       0  'third'
       1  (?^u:I)
          -> qr/(?^u:I)/
       2  0
    3  ARRAY(0x569bf95f6428)
       0  'fourth'
       1  (?^u:scan)
          -> qr/(?^u:scan)/
       2  0
      DB<3>

    Celebrate Intellectual Diversity

