
Words, no consecutive doubled letters but repeated letters

by wlegrand (Initiate)
on Oct 27, 2022 at 19:01 UTC (#11147753, perlquestion)

wlegrand has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm on Perl v5.18.2, but I would prefer to have this work with Perl v5.10. I can use Perl to find words in my dictionary that have doubled letters in various positions, using variations of the following one-liner that I manually adjust to fit the word.

 perl -wnl -e '/(?i)\A(?=\w{10}\z)[a-z]{2}([a-z])\1[a-z]([a-z])\2/ and print;' filen

This will search for 10 letter words with doubled letters at positions 3, 4 and 6, 7. It has a letter pattern of abCCdEEfgh where C stands for the 3rd and 4th letter position and D stands for the 6th and 7th letter position. I get 181 words.

babbittess .......... yellowwort

Now I am searching for the type of word that is 12 letters long and has no consecutive doubled letters but does have repeated letters. For example, reservations, which has a letter pattern of ABcBAefghijh where A stands for the 1st and 5th letter position and B stands for the 2nd and 4th letter position. There are 25,176 twelve-letter words in my dictionary and I need some way to extract the words that match that type of pattern. Perl can do any text manipulation but I can't. A one-liner or a script would do. I will adjust the one-liner or script manually for other pattern words. Can you help?


Replies are listed 'Best First'.
Re: Words, no consecutive doubled letters but repeated letters
by GrandFather (Saint) on Oct 27, 2022 at 21:10 UTC

    Solving this sort of problem can be made much easier if you break it into parts. Use a simple regex to drop words containing doubled letters then you can use a simple regex to find repeated letters.

    use warnings;
    use strict;

    my %words = map {$_ => 1} split ' ',
        "This will search for words with doubled letters at positions";
    print "$_\n" for grep {!/(\w)\1/ && /(\w).+\1/} sort keys %words;

    Prints:

    doubled
    positions
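
The same two-step filter could be wrapped in a small sub and combined with a length check for the 12-letter case from the question. This is only a sketch: the sub name `has_spaced_repeat` and the sample word list are made up for illustration, and in practice the words would come from a dictionary file.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: true if $word has exactly $len characters, no doubled
# (adjacent identical) letters, and at least one letter that occurs again
# later in the word.
sub has_spaced_repeat {
    my ($word, $len) = @_;
    return 0 unless length($word) == $len;
    return 0 if $word =~ /(\w)\1/;         # step 1: drop doubled letters
    return $word =~ /(\w).+\1/ ? 1 : 0;    # step 2: require a repeated letter
}

# Sample words for illustration only.
my @words = qw(reservations accidentally ambidextrous doubled);
say for grep { has_spaced_repeat($_, 12) } @words;
```

Of the four samples only "reservations" passes: "accidentally" has doubled letters, "ambidextrous" has no repeated letter, and "doubled" is the wrong length.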
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Words, no consecutive doubled letters but repeated letters
by kcott (Archbishop) on Oct 28, 2022 at 08:40 UTC

    G'day Willi,

    Firstly, I'd recommend that you move away from the idea of adjusting code whenever search criteria changes: it's tedious, error-prone, and likely to involve many duplications of effort. Instead, write one script that does everything for you.

    You've already got a good start in this direction with your letter patterns. In the code below, I changed abCCdEEfgh to ..AA.BB... and ABcBAefghijh to AB.BA.......: I think this makes it a bit clearer without changing the underlying principle. You can also use this for a length check.

    The code below shows a technique: you'll need to make some changes to suit your needs. I've added comments that indicate the types of modifications that might be required.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;
    use autodie;

    # Possibly read from command line, database, file, etc.
    my @in_patterns = qw{..AA.BB... AB.BA.......};

    # Prefer exclusion (blacklist) over inclusion (whitelist).
    # Here, exclude proper nouns
    # and any words with non-alphabetic characters.
    # Note: in other scenarios, whitelists are better;
    # e.g. only allow access to X, Y & Z.
    my $blacklist_re = qr{(?:^[A-Z]|[^A-Za-z])};

    # Point to your preferred dictionary. Some on my system:
    #my $dict = '/usr/share/dict/words';  # --> linux.words
    my $dict = '/usr/share/dict/strine';  # --> australian-english

    open my $dict_fh, '<', $dict;

    for my $in_pat (@in_patterns) {
        say "*** Input Pattern: $in_pat";
        my $len = length $in_pat;
        my $match_re = '';
        my %seen;
        my $count = 0;

        for my $char (split //, $in_pat) {
            if ($char eq '.') {
                $match_re .= '.';
            }
            elsif (! $seen{$char}) {
                $match_re .= '(.)';
                $seen{$char} = ++$count;
            }
            else {
                $match_re .= "\\$seen{$char}";
            }
        }

        say "*** Match Pattern: $match_re";
        my $qr_re = qr{^$match_re$};
        say "*** QR Regex: $qr_re";

        seek $dict_fh, 0, 0;

        while (<$dict_fh>) {
            chomp;
            next unless length($_) eq $len;
            next if $_ =~ $blacklist_re;
            next unless $_ =~ $qr_re;
            say;
        }
    }

    Abridged output:

    *** Input Pattern: ..AA.BB...
    *** Match Pattern: ..(.)\1.(.)\2...
    *** QR Regex: (?^:^..(.)\1.(.)\2...$)
    barrelling
    barrenness
    ...
    tunnellers
    tunnelling
    *** Input Pattern: AB.BA.......
    *** Match Pattern: (.)(.).\2\1.......
    *** QR Regex: (?^:^(.)(.).\2\1.......$)
    minimisation
    monomaniacal
    ...
    reverberates
    secessionist
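
For experimenting without a dictionary file, the pattern-to-regex translation could be pulled out into a sub of its own. The following is a rough sketch, not part of the script above: the name `pattern_to_regex` and the sample words are assumptions made for illustration, and the translation logic simply mirrors the inner loop of the script.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: turn a letter pattern such as 'AB.BA.......' into an
# anchored regex. '.' matches any character; the first use of a letter
# becomes a capture group and later uses become backreferences to it.
sub pattern_to_regex {
    my ($in_pat) = @_;
    my $match_re = '';
    my %seen;
    my $count = 0;

    for my $char (split //, $in_pat) {
        if ($char eq '.') {
            $match_re .= '.';
        }
        elsif (! $seen{$char}) {
            $match_re .= '(.)';
            $seen{$char} = ++$count;
        }
        else {
            $match_re .= "\\$seen{$char}";
        }
    }

    return qr{^$match_re$};
}

# Sample words for illustration only.
my $re = pattern_to_regex('AB.BA.......');
say for grep { $_ =~ $re } qw(reservations minimisation accidentally);
```

Because the regex is anchored at both ends, it also enforces the word length, so "reservations" and "minimisation" are printed and "accidentally" is not.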

    — Ken

Re: Words, no consecutive doubled letters but repeated letters (updated x2)
by AnomalousMonk (Archbishop) on Oct 27, 2022 at 20:41 UTC

    /\A (?= [[:alpha:]]{12} \Z) (?! .* (.) \g-1) (?= .* (.) .+ \g-1)/x and print

    But if you're scanning a dictionary, don't you know to begin with that all the words are words? It might be faster to scan with
        /\A (?= .{12} \Z) (?! .* (.) \g-1) (?= .* (.) .+ \g-1)/x and print
    (OTOH, I know that some dictionaries include apostrophized and hyphenated words, so maybe you need to be specific. (Update: Upon further investigation, I find that the dictionary file I'm using does, indeed, include words like "wristwatch's", so [[:alpha:]]{12} vs. .{12} makes a difference for me.))

    Update 1: Note that the regexes above need Perl version 5.10+ to support the \g{n} backreference operator.
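
The difference between the `[[:alpha:]]{12}` and `.{12}` variants can be seen on a few sample words. This is a small sketch only: the sub name `is_candidate` and its `$letters_only` flag are hypothetical, and the sample words are chosen for illustration.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper wrapping the two regexes from the post.
# With $letters_only true, the [[:alpha:]]{12} variant is used;
# otherwise the looser .{12} variant.
sub is_candidate {
    my ($word, $letters_only) = @_;
    my $re = $letters_only
        ? qr/\A (?= [[:alpha:]]{12} \Z) (?! .* (.) \g-1) (?= .* (.) .+ \g-1)/x
        : qr/\A (?= .{12} \Z) (?! .* (.) \g-1) (?= .* (.) .+ \g-1)/x;
    return $word =~ $re ? 1 : 0;
}

# Sample words for illustration only.
for my $word (qw(reservations wristwatch's accidentally)) {
    printf "%-14s letters-only:%d any-char:%d\n",
        $word, is_candidate($word, 1), is_candidate($word, 0);
}
```

Here "reservations" passes both variants, "wristwatch's" passes only the `.{12}` variant because of the apostrophe, and "accidentally" fails both because of its doubled letters.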

    Update 2: I second GrandFather's point about splitting up the components of a complex filter regex into individual regexes: they become simpler and easier to understand and manage. In that vein:

    c:\@Work\Perl>perl -wMstrict -n -le "BEGIN { $::n = 0 } /\A .{12} \z/x && !/(.) \g-1/x && /(.) .+ \g-1/x and ++$::n and print; END { print \"$::n found\" } " ..\moby\mwords\354984si.ngl
    (All the stuff with $::n is for development/debug only; it can be discarded for end use.)

    Give a man a fish:  <%-{-{-{-<

Re: Words, no consecutive doubled letters but repeated letters
by LanX (Sage) on Oct 27, 2022 at 19:35 UTC
    You already know how to do the A and B part with backreferences (it's even straightforward with named captures called A and B).

    The harder part is to exclude repetition in c..efghijh

    ... one way is repeated negative look-aheads, another splitting and counting.

    We had a similar question just recently: Nonrepeating characters in an RE

    You'll find these and many other helpful suggestions discussed there!
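
As a rough illustration of the two ideas mentioned above, named captures for the A and B part and splitting and counting for the rest, something like the following might serve as a starting point. The sub names `ab_prefix` and `letter_counts` are made up for this sketch, and what to do with the counts is left open, since that depends on the exact pattern wanted.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: match the AB.BA prefix with named captures
# (Perl 5.10+) and return the letters bound to A and B, or an empty list.
sub ab_prefix {
    my ($word) = @_;
    return $word =~ /\A(?<A>.)(?<B>.).\k<B>\k<A>/ ? ($+{A}, $+{B}) : ();
}

# Hypothetical helper: split a word into characters and count each one.
sub letter_counts {
    my ($word) = @_;
    my %count;
    $count{$_}++ for split //, $word;
    return \%count;
}

my $word = 'reservations';    # sample word from the question
if (my ($A, $B) = ab_prefix($word)) {
    my $counts   = letter_counts($word);
    my @repeated = sort grep { $counts->{$_} > 1 } keys %$counts;
    say "$word: A=$A B=$B repeated=@repeated";
}
```

For "reservations" this reports A=r, B=e and the repeated letters e, r and s; a filter could then accept or reject the word based on which letters are allowed to repeat.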

    HTH! :)

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery


    > abCCdEEfgh ... and D stands for the 6th and 7th letter position

    I think you meant E

Re: Words, no consecutive doubled letters but repeated letters
by tybalt89 (Monsignor) on Oct 28, 2022 at 18:17 UTC

    Posted just to see if it's what you are looking for, since it seems a little unclear (see other replies :)

    perl -ne '/^(?!.*(.)\1)(?=.*(.).+\2)(?=.*(?!\2)(.).+\3)\w{12}$/ and print' /usr/share/dict/words

    This finds 2657 entries in my 'words' file with no adjacent duplicates but at least two different duplicated letters.
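
To see what the extra lookahead changes, the same regex can be tried on a few sample words. This is just a sketch: the sub name `two_spaced_repeats` and the sample words are assumptions made for illustration.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the one-liner above:
# 12 word characters, no adjacent duplicates, and at least two
# different characters that each occur more than once.
sub two_spaced_repeats {
    my ($word) = @_;
    return $word =~ /^(?!.*(.)\1)(?=.*(.).+\2)(?=.*(?!\2)(.).+\3)\w{12}$/
        ? 1 : 0;
}

# Sample words for illustration only.
say for grep { two_spaced_repeats($_) }
    qw(reservations spaceflights accidentally);
```

Only "reservations" is printed. "spaceflights" has no adjacent duplicates but only one repeated letter (s), so it would pass a plain "some letter repeats" test yet fails here, and "accidentally" is rejected for its doubled letters.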

Re: Words, no consecutive doubled letters but repeated letters
by Anonymous Monk on Oct 29, 2022 at 16:37 UTC

    I regret to say that I can't figure out how to pm you all individually. So I will do it here.

    I want to thank you all for unselfishly volunteering your time and talent to people like me. My query was not for a class assignment or commercial purposes. I am 88 years old and I agitate my brain cells by solving cryptograms. These tools you have given me will aid me in solving the difficult ones. You have given me much to study and I will enjoy doing it. I appreciate your generosity.

