PerlMonks  

regex issue

by Anonymous Monk
on Aug 03, 2016 at 16:39 UTC (#1169091)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am working on my regexes, in this case backreferences. I found this code:

/\b(\w\w\w)\s\g1\b/;/

This code should find all three-letter doubles. Unfortunately I can't get it to work, because it came with little further explanation. I was hoping someone could comment on this line of code, and maybe give a short example. I have added some code that I wrote, which finds all three-letter words and reports how many times each word occurs. Thanks

$term = 'Dit is het eerste het is niet het laatste Dit';
@woorden = split / /, $term;
@let = grep (length($_)=3,@woorden);
foreach(@let){
    $aantal = $term =~ s/$_//g;
    if($aantal==0){next;}
    print "$_:"."$aantal\n";
};

Replies are listed 'Best First'.
Re: regex issue
by AnomalousMonk (Bishop) on Aug 03, 2016 at 19:03 UTC

    First of all, I don't understand if you want all three-letter immediately repeated words, as your OPed regex
        /\b(\w\w\w)\s\g1\b/;/
    implies, or all three-letter words that are repeated anywhere else in the string, as your code implies.
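For the first interpretation, here is a minimal sketch, assuming the trailing / in the OPed regex is extraneous. The helper name adjacent_doubles and the sample sentence are made up for illustration and do not come from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns every three-letter
# word that is immediately followed by a copy of itself.
sub adjacent_doubles {
    my ($text) = @_;
    # \g1 (Perl 5.10+) refers back to whatever the first group captured.
    my @found = $text =~ /\b(\w\w\w)\s\g1\b/g;
    return @found;
}

# Made-up sample sentence containing one adjacent double ("het het").
print join(' ', adjacent_doubles('Dit is het het eerste het')), "\n";
```

Note that the \b on either side keeps longer words out: in this sketch "xhet het" and "het hetx" should not count as doubles.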

    In the spirit of the second interpretation (and counting them) (requires Perl version 5.10+ for  \g-1 construct):

    c:\@Work\Perl>perl -wMstrict -MData::Dump -le
    "use 5.010;
     ;;
     my $term = 'Dit is het eerste het is xhetx xhet hetx niet het laatste Dit';
     ;;
     my $word = qr{ \b \w{3} \b }xms;
     ;;
     my %repeats;
     while ($term =~ m{ ($word) (?= .*? (?= $word) \g-1) }xmsg) {
         $repeats{$1}++;
         }
     ;;
     dd \%repeats;
    "
    { Dit => 1, het => 2 }
    Change the definition of  $word to whatever best suits your requirements (but see Update 3 below).

    Updates:

    1. Added info about 5.10+ requirement.
    2. BTW, the "regex"  /\b(\w\w\w)\s\g1\b/;/ doesn't actually compile. It looks like it might be a piece of something else, e.g., a substitution:
          s/\b(\w\w\w)\s\g1\b/;/
      (update: or maybe the / at the end is completely extraneous and the statement  /\b(\w\w\w)\s\g1\b/; was intended — that would work)
    3. When I wrote "Change the definition of  $word to whatever best suits your requirements" above, what I had in mind was that any  $word definition used in the context of the
          m{ ($word) (?= .*? (?= $word) \g-1) }xmsg
      match would be assured to match repeated words per my understanding of the OP. Not so, and it's easy to manufacture a counterexample. Of course, it's also easy to fix the counterexample to avoid the problem, but the fix requires knowledge of internal details of the  $word definition, and this is exactly what I was trying to avoid. In a further iteration, I can come up with a match regex that seems to fulfill all my (admittedly rather arbitrary) requirements, but it's not well tested and I don't really love it as I should. So as always, Caveat Programmor.
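The post does not show the counterexample mentioned in Update 3. The sketch below is one way such a case might look, under the assumption that $word is redefined as a word of any length. The wrapper count_repeats and the test string are hypothetical:

```perl
use strict;
use warnings;
use 5.010;    # needed for \g-1

# Hypothetical wrapper (not from the post) around the match under
# discussion, so that different $word definitions can be tried.
sub count_repeats {
    my ($term, $word) = @_;
    my %repeats;
    while ($term =~ m{ ($word) (?= .*? (?= $word) \g-1) }xmsg) {
        $repeats{$1}++;
    }
    return \%repeats;
}

# Made-up test string: 'het' occurs as a whole word only once.
my $sample = 'het is hetx';

# Definition from the post: exactly three word characters.
my $fixed = count_repeats($sample, qr{ \b \w{3} \b }xms);

# Assumed alternative definition: a word of any length.
my $any_length = count_repeats($sample, qr{ \b \w+ \b }xms);

say 'fixed length: ', join ' ', sort keys %$fixed;
say 'any length:   ', join ' ', sort keys %$any_length;
```

With the any-length definition, the inner lookahead is satisfied by the whole word 'hetx' while \g-1 only has to match its first three characters, so 'het' is reported as repeated. Appending \b after \g-1 would avoid this particular case, but only because we know this $word ends on a word boundary, which is the kind of internal detail the update refers to.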


    Give a man a fish:  <%-{-{-{-<

      Why are you using ';;' for every blank line? If you want a blank line, leave a blank line. If you want a comment, use the perl comment character, '#'.

      As Occam said: Entia non sunt multiplicanda praeter necessitatem.

        What you're seeing is a  perl -e " ... code ... " Windoze command line padded out with spaces to emulate the appearance of multi-line source. The code comes from the Windoze clipboard. The original intent was to quickly cut/paste, possibly modify, and test posted code snippets, so I decided to eliminate blank lines. If I want to have something that looks like a blank line, "something" has to be there, and by convention, I use  ;; as that something.

        By the same token, because what's after the  -e switch is just a single string/line, any  # comment-to-end-of-line just clobbers the entire remainder of the line, even though the remainder appears multi-line. So, no comments.
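The comment problem can be imitated inside Perl itself. This is only a rough sketch: it uses string eval as a stand-in for what perl -e receives from the shell, and the helper run_snippet and the three statements are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet given as a string,
# roughly as perl -e does with its argument.
sub run_snippet {
    my ($code) = @_;
    my $value = eval $code;
    die $@ if $@;
    return $value;
}

my @statements = ('my $n = 1;  # start at one', '$n = 2;', '$n;');

# Real multi-line source: the comment ends at the newline.
my $multi_line = run_snippet(join "\n", @statements);

# One physical line padded with spaces: the comment swallows the rest.
my $single_line = run_snippet(join '     ', @statements);

print "multi-line: $multi_line, single line: $single_line\n";
```

In the single-line version everything after the # is part of the comment, so the assignment of 2 never runs.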


        Give a man a fish:  <%-{-{-{-<

Re: regex issue
by Laurent_R (Canon) on Aug 03, 2016 at 19:14 UTC
    Please define more precisely what you mean by "three letter-doubles". The two monks who responded previously each understood something different, and I understood yet a third possibility.

    Please explain and/or provide an example.

      I meant all three-letter words that occur more than once.

        In that case:

        use strict;
        use warnings;

        my $term = 'Dit is het eerste het is niet het laatste Dit';
        my @tlw = $term =~ /\b(\w{3})\b/g;
        # Now you have all the three-letter words, so count them
        my %seen = ();
        $seen{$_}++ for @tlw;
        for my $k (keys %seen) {
            print "$k occurs $seen{$k} times\n" if $seen{$k} > 1;
        }
Re: regex issue
by Anonymous Monk on Aug 03, 2016 at 16:49 UTC

    I made a mistake: it should be length($_)==3.

      A slightly different approach, using a lookahead, can give you what you are looking for.

      $term = 'Dit is het eerste het is niet het laatste Dit';
      @captured = $term =~ /\b(\w\w\w)\b(?=.*\1\b)/g;
      print join ' ', @captured;
      _____________
      Dit het het

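      The \b are word boundaries (a change from a letter/number/underscore to a non-letter/number/underscore, or vice versa).

      The (?= is a lookahead: it checks what comes after it without consuming it, so the next match resumes right after the captured word.

      The \1 is the same as your \g1 (I unfortunately have an older perl.)

      The g at the end means find all matches; in list context the match returns every captured word.

      het appears twice because it occurs three times: each of the first two occurrences is followed by a later one.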

        Small correction, you missed a word-boundary assertion in the look-ahead. The regex should be:
        /\b(\w\w\w)\b(?=.*\b\1\b)/g
        Without the additional \b before \1, three-letter words that are trailing substrings of other words would also match.
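The difference between the two regexes can be checked with a small sketch. The helper names repeated_loose and repeated_strict and the test string 'het is xhet' are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two regexes from the posts above.
sub repeated_loose {
    my ($text) = @_;
    my @found = $text =~ /\b(\w\w\w)\b(?=.*\1\b)/g;      # no \b before \1
    return @found;
}

sub repeated_strict {
    my ($text) = @_;
    my @found = $text =~ /\b(\w\w\w)\b(?=.*\b\1\b)/g;    # \b before \1
    return @found;
}

# Made-up string: 'het' is a whole word once, plus the tail of 'xhet'.
my $term = 'het is xhet';

print 'loose:  ', join(' ', repeated_loose($term)),  "\n";
print 'strict: ', join(' ', repeated_strict($term)), "\n";
```

The loose version reports 'het' because the lookahead is satisfied by the tail of 'xhet'; the strict version should report nothing for this string. On the original sentence both versions give the same result.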

        Thank you
