Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Regexp substitution on variable-length ranges with embedded code?

by Polyglot (Friar)
on May 26, 2021 at 11:02 UTC ( #11133054=perlquestion: print w/replies, xml ) Need Help??

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

I have data in a format like this:

43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; 27:3:9 +; 65:1:4; 65:1:18

I'm attempting to condense these to ranges for every sequence in which the first two colon-delimited numbers are the same and the third numbers are in chronological sequence. For this range, my output should be:

43:1:1-5; 27:3:7-9; 65:1:4; 65:1:18

The variability in potential sequence lengths is what is throwing me on this one. While I can match it with a regex expression, I don't know how to make the substitution with only the last matched number in the sequence, i.e., how to know which capture group is the last numbered capture.

Here is what I was trying:

$seq =~ s/ (\d+):(\d+):(\d+) (?:;|\s)* (\1):(\2):(?{1+($3|$6)}) /$1:$2:$3-$6/xg;

This leaves me with the wrong output:

#OUTPUT 43:1:1-2; 43:1:3-4; 43:1:5-6; 27:3:7-8; 27:3:9; 65:1:4-18

Am I attempting something beyond the bounds of regex?

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re: Regexp substitution on variable-length ranges with embedded code?
by choroba (Archbishop) on May 26, 2021 at 11:18 UTC
    You can use regex to extract the data, but I'd reach for something like Set::IntSpan to handle the intervals.
    #!/usr/bin/perl use warnings; use strict; use Set::IntSpan; my $data = '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27 +:3:8; 27:3:9; 65:1:4; 65:1:18'; my %result; my $out = sub { my $key = (keys %result)[0] or return; print "$key:", delete $result{$key}, ';'; }; while ($data =~ /([0-9]+:[0-9]+):([0-9]+)/g) { my ($key, $value) = ($1, $2); if (exists $result{$key}) { $result{$key}->U($value); } else { $out->(); $result{$key} = 'Set::IntSpan'->new($value); } } $out->();
    Output:
    43:1:1-6;27:3:7-9;65:1:4,18;

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      I saw this and was going to reply with something to the effect of Set::IntSpan but I was too late and mine wouldn't have had as clever a trick with that closure coderef . . . bah.

      Also as far as the original question, IMHO most times if you're thinking about embedding code inside your regexen you're pretty well already deep into (quasi-apocryphical) JWZ-quote / xkcd 1171 territory.

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Regexp substitution on variable-length ranges with embedded code?
by tybalt89 (Prior) on May 26, 2021 at 18:28 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11133054 use warnings; $_ = '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; +27:3:9; 65:1:4; 65:1:18'; print "$_\n"; s/\b(\d+:\d+:)(\d+)\K(; \1((??{$+ + 1}))\b)+/-$+/g; print "$_\n";

    Outputs:

    43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; 27:3:9 +; 65:1:4; 65:1:18 43:1:1-6; 27:3:7-9; 65:1:4; 65:1:18

      This also works very nicely. Very nice one-liner. Now I'm wondering if the last two could be joined with a comma in the same operation to yield:

      43:1:1-6; 27:3:7-9; 65:1:4,18

      I hadn't really considered such possible, but you experts make me think I may have underestimated.

      Blessings,

      ~Polyglot~

        Simple with a second line. You don't really need it in one line, do you?

        #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11133054 use warnings; $_ = '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 43:1:9; 43:1:10; + 27:3:2; 27:3:7; 27:3:8; 27:3:9; 65:1:4; 65:1:18; 65:1:28'; print "$_\n"; s/\b(\d+:\d+:)(\d+)\K(; \1((??{$+ + 1}))\b)+/-$+/g; 1 while s/\b(\d+:\d+:)[\d,-]+\K; \1([\d,-]+)/,$2/; print "$_\n";

        Outputs:

        43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 43:1:9; 43:1:10; 27:3: +2; 27:3:7; 27:3:8; 27:3:9; 65:1:4; 65:1:18; 65:1:28 43:1:1-6,9-10; 27:3:2,7-9; 65:1:4,18,28
        One regex for both would be very complicated. But two regexes almost trivial now.

        First regex should transform in comma separated list: 65:1:2,3,4,18

        Second regex should apply shown technique to condense lists to ranges: 65:1:2-4,18

        - Ron
Re: Regexp substitution on variable-length ranges with embedded code? (updated)
by AnomalousMonk (Bishop) on May 26, 2021 at 13:18 UTC

    I agree "pure" regex isn't the way to go, but...

    Win8 Strawberry 5.30.3.1 (64) Wed 05/26/2021 9:07:19 C:\@Work\Perl\monks >perl use 5.018; # need lexicals in regexes, regex extensions use strict; use warnings; my @Test = ( '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; 27: +3:9; 65:1:4; 65:1:18', '987:23:45; 987:23:46; 65:1:17; 65:1:19', ); for my $data (@Test) { print "'$data' \n"; my $rx_base = qr{ (?> \d+ : \d+ :) }xms; my $rx_tail = qr{ (?> \d+) }xms; my $rx_sep = qr{ (?> ;? \s*) }xms; my @run; $data =~ s{ ($rx_base) ($rx_tail) (?{ push @run, $^N }) (?: $rx_sep \1 ($rx_tail) (?{ push @run, $^N }) (?(?{ $run[-1] - $run[-2] != 1 }) (*F)) )+ } {$1$2-$3}xmsg; print "'$data' \n\n"; } ^Z '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; 27:3: +9; 65:1:4; 65:1:18' '43:1:1-6; 27:3:7-9; 65:1:4; 65:1:18' '987:23:45; 987:23:46; 65:1:17; 65:1:19' '987:23:45-46; 65:1:17; 65:1:19'
    (I think this could be scaled back to pre-5.10 regexes if necessary.)

    Update: Here's another version that I think is a bit nicer. It avoids "absolute" capture group variables and backreferences. It is also not push-y, using plain scalars that are self-initializing.


    Give a man a fish:  <%-{-{-{-<

      Your solution works for me, with the addition of:
      use re 'eval';
      (I'm on Perl 5.12.)

      ...but I don't understand it. Specifically, this line is the most difficult one for me:

      (?(?{ $run[-1] - $run[-2] != 1 }) (*F))

      Is that double-eval'ed or executed? What is the '*F' referencing, and what is the '$^N' from the lines above? I've never seen this kind of regex before.

      Blessings,

      ~Polyglot~

        ... I don't understand ... this line ...

        (?(?{ $run[-1] - $run[-2] != 1 }) (*F))

        Is that double-eval'ed or executed? What is the '*F' referencing ...

        The embedded code is executed.

        +----------------------+------ embedded Perl code | | v v (?(?{ $run[-1] - $run[-2] != 1 }) (*F))
        This is the "(?(*condition*)*yes-pattern*)" regex expression added with Perl version 5.10 (see Extended Patterns in perlre). In this case, the *condition* is the true/false result of evaluating the code. If true, the (*F) (a.k.a. (*FAIL)) backtracking control verb is executed and the match fails and backtracks to the most recent successfully matched substring: a sequence with incrementing values for $3.

        ... what is the '$^N' ...

        The $^N Perl special variable (see Variables related to regular expressions in perlvar) returns the value of the most recently closed capture group.


        Give a man a fish:  <%-{-{-{-<

Re: Regexp substitution on variable-length ranges with embedded code?
by kcott (Bishop) on May 27, 2021 at 11:18 UTC

    G'day Polyglot,

    I see you already have a number of solutions. Here's another with no complex regexes nor any CPAN requirements.

    #!/usr/bin/env perl use strict; use warnings; my $data = '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27 +:3:8; 27:3:9; 65:1:4; 65:1:18'; $data .= '; 43:1:8; 65:1:9; 65:1:3; 1:1:2; 1:1:1'; my %seqs_for; my @key_order; for (split /; /, $data) { my ($key, $seq) = /^(\d+:\d+):(\d+)$/; push @key_order, $key unless exists $seqs_for{$key}; push @{$seqs_for{$key}}, $seq; } my @output; for my $key (@key_order) { my ($out, $last_seq, $end_range) = ('', -2, ''); my @sorted_seqs = sort { $a <=> $b } @{$seqs_for{$key}}; for my $i (0 .. $#sorted_seqs) { my $seq = $sorted_seqs[$i]; if (length $out) { if ($seq == $last_seq + 1) { $end_range = $seq; } else { $out .= "-$end_range" if $end_range; push @output, $out; $out = "$key:$seq"; $end_range = ''; } } else { $out = "$key:$seq"; } $last_seq = $seq; if ($i == $#sorted_seqs) { $out .= "-$end_range" if $end_range; push @output, $out; } } } print join('; ', @output), "\n";

    Note that I added a few more data elements mainly for testing purposes:

    $data .= '; 43:1:8; 65:1:9; 65:1:3; 1:1:2; 1:1:1';

    Output:

    43:1:1-6; 43:1:8; 27:3:7-9; 65:1:3-4; 65:1:9; 65:1:18; 1:1:1-2

    — Ken

Re: Regexp substitution on variable-length ranges with embedded code?
by LanX (Sage) on May 26, 2021 at 13:03 UTC
    IF sequences are always consecutive and never interrupted, one could try to grab the first and the last of each sequence and replace each sequence in a loop.

    BUT you need to be sure that neither

    • neither 43:1:1; 43:1:2; 43:1:5; 43:1:6;
    • nor 43:1:1; 43:1:2; 27:3:7; 43:1:3; 43:1:4;
    happens.

    Still I wouldn't choose to embed Perl into a regex to solve that.

    That's only complicating things, because the complex logic would be only obfuscated

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Regexp substitution on variable-length ranges with embedded code?
by AnomalousMonk (Bishop) on May 26, 2021 at 13:27 UTC
    $seq =~ s/ (\d+):(\d+):(\d+) (?:;|\s)* (\1):(\2):(?{1+($3|$6)}) /$1:$2:$3-$6/xg;

    FWIW, the (?{1+($3|$6)}) expression | embedded code expression in the quoted regex just adds 1 to the bitwise-or of $3 and $6 (treated as numbers... I think) and then throws away the result; it's a no-op.


    Give a man a fish:  <%-{-{-{-<

Re: Regexp substitution on variable-length ranges with embedded code? ( $^N )
by Anonymous Monk on May 26, 2021 at 12:46 UTC
      > $^N

      indeed

      use v5.12; use warnings; my $str ='43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3 +:8; 27:3:9; 65:1:4; 65:1:18'; say $str; $str =~ s#(\d+:\d+:)(\d);(?: \1((??{$^N+1}));)+#$1$2-$3#g; say $str;

      C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/sequencer.pl 43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27:3:8; 27:3:9 +; 65:1:4; 65:1:18 43:1:1-6 27:3:7-9 65:1:4; 65:1:18 Compilation finished at Wed May 26 15:46:31

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      PS: beautifying with /x modifier and adding missing ; in replacement left for the interested reader. :)

        > PS: beautifying with /x modifier and adding missing ; in replacement left for the interested reader. :)

        well since my editor has a (mostly functioning) cperl-beautify-regexp

        use v5.12; use warnings; use Test::More; my $str = '43:1:1; 43:1:2; 43:1:3; 43:1:4; 43:1:5; 43:1:6; 27:3:7; 27: +3:8; 27:3:9; 65:1:4; 65:1:18'; my $exp = '43:1:1-6; 27:3:7-9; 65:1:4; 65:1:18'; my $got = $str; $got =~ s/ ( \d+:\d+: ) # $1 ( \d+ # (updated +) ) # $2 = first $^N ; (?: # group w/o match \s \1 ( (??{ $^N+ 1 }) # include last_match + 1 ) # $3 = next $^N ; )+ # repeat /$1$2-$3;/xg; is($got, $exp, "fits"); done_testing;

        Worth noting that the OP was wrong with his expectation, it's 43:1:1-6; not 43:1:1-5;

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        PS: still not perfect, because the input isn't terminated with a semicolon, but I'm tired now. :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11133054]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2021-09-19 15:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?