Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re: Regex to match range of characters broken by dashes

by Cristoforo (Curate)
on Jul 17, 2016 at 19:20 UTC ( [id://1167921]=note: print w/replies, xml ) Need Help??


in reply to Regex to match range of characters broken by dashes

Here is a solution using a dynamic regex construct. I've never used it before and learned more about it here.

A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.

This is fairly readable and should work for any number of groups (provided they don't exceed the count of fasta characters in a string of them. I didn't test for that to see how it behaved).

The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).

The printout after the __END__ token shows the results of the run.

Update: Added a final substitution to remove dashes preceding and following the double underscore (as he desired in his post. Missed that.)

#!/usr/bin/perl use strict; use warnings; use 5.014; my @stops = (2,6); # group by 2 then 4 (6 == 2 + 4) my $tag = '___'; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; for our $s (@stops) { # necessary to use 'our' instead of 'my' $seq =~ s/ ( # begin capture (??{ # dynamic regex "(?:[TAGC][^TAGC]*)" . # group to apply quantifi +er to "{" . ($s-1) . "}" . # quantifier "[TAGC]" # end token }) # end dynamic reference ) # end capture /$1$tag/x; # end of substitution } $seq =~ s/__-+/__/g; say $seq; } __END__ C:\Old_Data\perlp>perl dynamic_regex.pl AT___CGGA___TCTGGC A-C___G--CTG___GC C:\Old_Data\perlp>

Replies are listed 'Best First'.
Re^2: Regex to match range of characters broken by dashes
by AnomalousMonk (Archbishop) on Jul 17, 2016 at 23:47 UTC
    A fly in the ointment was in the code for our $s (@stops). The code wouldn't work here with a 'my' declaration. 'our' was necessary.

    This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in  "(?{ code })" and  "(??{ code })" regex constructs.

    The dynamic regex form was necessary because the count of the quantifier changed for each iteration of the 'for' loop ($s-1).

    I don't see the necessity here. Except for the fact that aliasing into the  @stop array makes calculating the quantifier a bit tedious, it can all be written normally, given that the  s/// match regex is, by default, re-compiled on each  s/// execution:

    c:\@Work\Perl\monks>perl -wMstrict -le "my @stops = (2,6); ;; my $tag = '___'; ;; for ('ATCGGATCTGGC', 'A-C-G--CTGGC') { my $seq = $_; printf qq{'$seq' -> }; ;; for our $s (@stops) { local our $q = $s - 1; $seq =~ s/ ((?:[TAGC][^TAGC]*){$q} [TAGC]) /$1$tag/x; } print qq{'$seq'}; } " 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' 'A-C-G--CTGGC' -> 'A-C___-G--CTG___GC'
    And except for say, it works under Perl version 5.8.9. See also Re: Regex to match range of characters broken by dashes Update 2 for another for-loop example.

    Update: I've based my code example on your original code, prior to adding the second  s/// fixup.


    Give a man a fish:  <%-{-{-{-<

      Thanks for pointing out points in my solution that can be stated cleaner.

      This is a bug that was corrected in Perl version 5.18 IIRC. With this correction, lexical variables always work as expected in "(?{ code })" and "(??{ code })" regex constructs.

      I wasn't aware of that bug. And your local our $q = $s - 1; fixes that.

      it can all be written normally, given that the s/// match regex is, by default, re-compiled on each s/// execution

      That is a nice solution! The (??{. . .}) construct wasn't necessary.

        ... local our $q = $s - 1; fixes [the lexical bug].

        It can even be fixed a bit more cleanly, and also fold in the added  s/// fixup at the end (still runs under 5.8.9):

        c:\@Work\Perl\monks\Q.and>perl -wMstrict -le "my @stops = (2,6); ;; my $tag = '___'; ;; for (qw(ATCGGATCTGGC A-C-G--CTGGC)) { my $seq = $_; printf qq{'$seq' -> }; $seq =~ s{ ((?: [TAGC] [^TAGC]*){$_} [TAGC]) [^TACG]* }{$1$tag}xms for map $_-1, @stops; print qq{'$seq'}; } " 'ATCGGATCTGGC' -> 'AT___CGGA___TCTGGC' 'A-C-G--CTGGC' -> 'A-C___G--CTG___GC'


        Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1167921]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-04-25 12:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found