http://www.perlmonks.org?node_id=111621

artist has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Anyone has easy idea as how to find patterns in a given string where only number of patterns are known? One pattern can appear any number of times in the string. String contains nothing but patterns. Number of patterns is always more than 1.
ex. string: helloworldhellohellohihellohiworld
    number of patterns:3
    ========================================
    Deduce patterns:hello, world, hi

Edit: chipmunk 2001-09-10

Replies are listed 'Best First'.
Re: Patter Finding
by George_Sherston (Vicar) on Sep 11, 2001 at 04:53 UTC
    Yow. I'm not sure who's crazier - you for suggesting this might be something one would want to do, or me for trying to do it ;)

    Because It's There, as the man said.

    Having struggled with it a bit I realised one thing about the question itself, which is that we can't say there are only three patterns. In fact there are a lot more - "hell", "hel", and "he" to name but the most obvious additions. That's unless we want to match against a dictionary, in which case it's just a matter of processing power.

    Assuming we are interested in patterns rather than specific words I think the following does it. I should say at the outset that the clever bit in this comes from japhy's regex book which is referred to in this node.
    my $string = "helloworldhellohellohihellohiworld"; my $length = length $string; my $window = int (($length - 2) / 2); # use japhy's regex to hoover up all char # sequences that MIGHT be patterns: my @pats; my $regex; while ($window > 1) { $regex = '(?=(' . '.' x $window . '))'; push @pats, ($string =~ /$regex/g); $window --; } # now go through @pats to find the duplicates # and print the final result @pats = sort @pats; my %dups; for (2 .. $#pats) { $dups{$pats[$_]} ++ if ($pats[$_] eq $pats[$_ - 1]) } $dups{$_} ++ for keys %dups; for (keys %dups) {print $dups{$_},' occurrences of "',$_,'"',"\n"}
    This throws up 31 patterns, with up to four occurrences each. (BTW, in case $window doesn't make sense, I assumed (A) there must be at least two occurrences of each pattern, otherwise it wouldn't really be a pattern; (B) each pattern must be at least 2 chars and (C) there must be at least 2 patterns.)

    Thanks for making me think. Can I stop now?

    § George Sherston
      Well George I came up with the same number of 'patterns' but I didnt need a regex. I thought you might like to see it:
      my $s='helloworldhellohellohihellohiworld'; #determine every substring in the original my %hash; for my $i (0..length($s)-1) { $hash{substr($s,$i,$_)}++ for (1..length($s)-$i); } #filter out singles and the chars %hash=map { ($hash{$_}>1 && length($_)>1) ? ($_,$hash{$_}) :() } keys %hash; #yes this is how i format maps #and ternary ops.. :-) #print the results use Data::Dumper; print Dumper(\%hash);
      Id love to know how the OP wanted the computer to tell that 'hello' is a word but 'elloh' isnt... (forgetting real english words that are embedded like 'low' 'el' 'hell')

      Incidentally get the following results (reformatted):

      el,ell,ello,elloh,ellohi, he,hel,hell,hello,helloh,hellohi,hi, ld,ll,llo,lloh,llohi,lo,loh,lohi, oh,ohi,or,orl,orld, rl,rld,wo,wor,worl,world
      I have a feeling there isn't really a way to do what the OP wants to do. Its not really prefix matching, nor suffix matching....

      To the OP what should happen here if said 7 words? 'hellohiothellobrakerakerashash'

      Yves
      --
      You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      Update minor bugfixes and challenge to Op

Re: Patter Finding
by lemming (Priest) on Sep 11, 2001 at 04:17 UTC

    update: Look at my second offering, it's better

    Here's an inefficient subroutine that I've used in the past. Probably time for an overhaul
    Output when called with $string, 2, 2:
    4 : (hello)
    2 : (world)
    2 : (hi)

    sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($file, $ind1, $len); unless ( defined($patterns{$pat}) ) { $patterns{$pat} = 0; my $index = 0; do { $index = index($file, $pat, $index); unless ($index < 0) { $index += length($pat); $patterns{$pat}++; } } while ($index >= 0); } } } # We then dump all patterns that do not occur min_num times foreach my $key (keys %patterns) { delete $patterns{$key} if ($patterns{$key} < $min_num); } # We then go through the patterns by order and remove those # that are invalidated by better patterns foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { my $check = 0; $patterns{$key} = 0; my $index; do { $index = index($file, $key, 0); unless ($index < 0) { $check = 1; $patterns{$key}++; substr($file, $index, length($key)) = "\000"; } } while ($index >= 0); delete $patterns{$key} if ($patterns{$key} < $min_num); } foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf("%3d : (%s)\n", $patterns{$key}, $pat); } }
      Hi Lemming,

      Im a little confused. As posted your code goes into an infinte loop. When I s/$file/\$string/g I get the output as you said we would (impressive) However if minlen is 0 it goes into an infinite loop!

      Also when I try the string:'hellohiothellobrakerakerashash' I only get one of the many words contained, and a couple that arent words.

      2 : (hello) 2 : (aker) 2 : (ash)
      I would expect any of the following:
      hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era # I get this using substr counts: ak,ake,aker,akera,as,ash,el,ell,ello,er,era, he,hel,hell,hello,ke,ker,kera,ll,llo,lo, ra,rak,rake,raker,rakera,sh
      So my guess is that the above results are coincidental or am I missing something? Yves

      --
      You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

        I would expect any of the following: hello,hi,othello,brake,rake,raker,rash,ash,hash,ohio, the,lob,bra,hell,era

        I don't see how you could expect some of those strings, because some only appear once in the string (e.g. "othello", "ohio"), so you really couldn't call them a "pattern" unless you're matching against a dictionary file.

        My solution near the top of this thread sort of assumes that the string is a contiguous series of patterns (one of the original constraints was "String contains nothing but patterns"), so it only finds "hello" from your test string, but if you change this line:

        # From this if (/\G(.{2,})(?=.*?\1)/g) { # To this if (/\G.*?(.{2,})(?=.*?\1)/g) {
        Then it does better and finds "ash", "rake", and "hello" from your test string, which is about as good as it gets, I believe.

      Ok. Here's a better version. While I haven't benchmarked it, my feeling are that it's a hog, but I bullet proofed several areas. It's less than a hog than my earlier post. I'm posting the new version for an easier compare.

      #!/usr/bin/perl # string, min_len of pattern, min_num of patterns use strict; use warnings; my $string = "bookhelloworldhellohellohihellohiworldhihelloworldhihe +llobookpenbookpenworld"; get_pattern($string, 2, 2); exit; sub get_pattern { my ($string, $min_len, $min_num) = @_; my $str_len = length($string); my $srch_max = int($str_len/2); my %patterns; # First we find all patterns that are up to 1/2 the length of the stri +ng print "length : $str_len\n"; my %tmp_hash; foreach my $len ($min_len..$srch_max) { my $eol = $str_len - $len; foreach my $ind1 (0..$eol) { my $pat = substr($string, $ind1, $len); unless ( defined($tmp_hash{$pat}) ) { $tmp_hash{$pat} = 0; $tmp_hash{$pat}++ while ($string =~ /\Q$pat\E/g); $patterns{$pat} = $tmp_hash{$pat} if ($tmp_hash{$pat} >= $min_ +num); } } } undef %tmp_hash; print "Patterns: ", scalar (keys %patterns), "\n"; # We then go through the patterns by order and remove those # that are invalidated by better patterns # Longer strings that occur more often are considered better my $mod_str = $string; foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) } keys %patterns) { my $tstr = $mod_str; # We null out any area with pattern and count $patterns{$key} = ($tstr =~ s/\Q$key\E/\000/g); if ($patterns{$key} >= $min_num) { # If it hits threshold we keep $mod_str = $tstr; } else { # If not we toss pattern delete $patterns{$key}; } } print "Valid : ", scalar (keys %patterns), "\n"; # We finally print results foreach my $key (sort { $patterns{$b} * (length($b)-1) <=> $patterns{$a} * (length($a)-1) or length($b) <=> length($a) or $a cmp $b } keys %patterns) { (my $pat = $key) =~ s/\n/\\n/g; printf "%3d: (%s)\n", $patterns{$key}, $pat; } }
Re: Patter Finding
by japhy (Canon) on Sep 11, 2001 at 05:09 UTC
    This works... but probably only on account of villany. I think for something to be considered a "pattern", it has to show up TWICE, not just once.
    $_ = "helloworldhellohellohihellohiworld"; m{ (?{ local ($c1, $c2, $c3) = (0,0,0) }) ^ (.+?) (?: \1 (?{ local $c1 = $c1 + 1 }) )* (.+?) (?: (?: \1 (?{ local $c1 = $c1 + 1 }) ) | (?: \2 (?{ local $c2 = $c2 + 1 }) ) )* (.+?) (?: (?: \1 (?{ local $c1 = $c1 + 1 }) ) | (?: \2 (?{ local $c2 = $c2 + 1 }) ) | (?: \3 (?{ local $c3 = $c3 + 1 }) ) )* (?(?{ $c1 && $c2 && $c3 })|(?!)) $ }x and print "<$1> <$2> <$3>\n";

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Pattern Finding
by runrig (Abbot) on Sep 11, 2001 at 03:08 UTC
    Untested, and not very thought out:
    $_ = "hellohiworldhellohiworldhellohiworldhellohiworld"; if (/^((.+?)\2+)$/) { print "[$1][$2]\n"; }
    Update: Hmm, this does not work, and I'm not sure why (regex bug? or bad code?) :-(

    Ok, got the above code to work at least the way I was expecting, but still probably not what you want.

    <update>Though here's something that seems to do what you want:

    use warnings; use strict; $_ = "helloworldhellohellohihellohiworld"; my %pttrns; PTTRN: { if (/\G(.{2,})(?=.*?\1)/g) { $pttrns{$1}++; SKIP: { my $again; for my $pat (keys %pttrns) { if (substr($_, pos, length $pat) eq $pat) { pos() += length($pat); redo SKIP; } } } redo PTTRN; } } print "[$_]\n" for keys %pttrns; ~/tmp >./tst [world] [hello] [hi]
Re: Patter Finding
by nardo (Friar) on Sep 11, 2001 at 06:21 UTC
    The pattern /^(.+?)\1*(.+?)(?:\1|\2)*(.+?)(?:\1|\2|\3)+$/ will find the three patterns and put them in $1, $2, and $3. You can use a function to build this regex:
    sub findpattern { my $data = shift; my $num = shift; my $re; my $i; my @retval; $re = '^'; for($i = 1; $i <= $num; $i++) { $re .= '(.+?)(?:'.join('|', map("\\$_", 1..$i)).')'; if($i == $num) { $re .= '+'; } else { $re .= '*'; } } $re .= '$'; if($data =~ /$re/) { for($i = 1; $i <= $num; $i++) { push(@retval, ${$i}); } return @retval; } return undef; }
    If passed the data and the number of patterns it will return a list of the patterns or undef if unable to find them (at least one pattern must be repeated more than once).
Re: Patter Finding
by synapse0 (Pilgrim) on Sep 11, 2001 at 04:07 UTC
    I would say this is a daunting task at the very best, probably not at all suited for perl. Humans can pick up patterns like that with no problem (providing they know the language the patterns are written in), but how does a computer "know" of the word hello? what differentiates "hello" from "hell"? Why is "oh" not part of the pattern??
    There needs to be a much stricter rulebase for this to even consider trying it in perl I would imagine, possibly a known dictionary, or small set of specific patterns to look for..
    Anyway, my 2 cents..
    -Syn0
And Now For Something Completely Different... Re: Pattern Finding
by Zaxo (Archbishop) on Sep 11, 2001 at 17:29 UTC

    I'd like to show a failed attempt at this interesting but ill-posed problem. I say ill-posed because the expected output is not deducible from the input alone, but expects patterns which form English words.

    The LZx family of compression algorithms depend on finding and cataloging repeated substrings. I had the notion to use a modified LZW compression routine to find a list of candidate patterns. Only the dictionary is built, and I generate no compressed stream (take that, Unisys!). The dictionary keeps frequencies rather than unique numeric identifiers.

    After preliminaries,

    #!/usr/bin/perl -w -- use strict; # Usage: ./Pattern [n [string]]
    I shift in arguments or set to a default, then clear out control characters. $depth provides for multiple scanning of the string, part of why this approach is flawed. As a stream-oriented algorithm, LZW is does not predict frequent substrings at the beginning, and is greedy about tacking extra characters onto a candidate. We'll see these effects later in the output listing.
    my $depth = shift || 2; my $string = shift || "helloworldhellohellohihellohiworld"; $string =~ s/[\000-\037\0177]/ /g; # strip non-printable
    Per LZW, we prime the dictionary with our alphabet. We set up a current working string, $j, then scan the input string one character ($_) at a time. If $j.$_ has been seen, we go on to the next character. If it has not, we increment $j's count, add $j.$_ to the dictionary, and reset $j to $_.
    # seed dictionary with printable characters... my %dict = map {(chr$_ => 0)} '32'..'126'; # and populate it from the input string for (1..$depth) { my $j = ''; for (split //, $string) { my $tmp = $j . $_; $j = defined $dict{$tmp} ? $tmp : do{ $dict{$j}++; $dict{$tmp}=0; $_}; } }
    Print the collected substrings, filtering out the single characters. I make a crude attempt to sort by desirability of a pattern, accounting for both frequency and length. A Data::Dumper spill of the dictionary can be uncommented for closer study.
    # the following sort routine is arbitrary, # chosen because it behaves the way I want... sort of. { local $, = " "; local $\ = "\n"; print sort { $dict{$a}*length($a)**2 <=> $dict{$b}*length($b)**2 #|| } grep {length$_>1} keys %dict; } #use Data::Dumper; #print Dumper(\%dict); __END__
    Run without arguments, the program prints:
    ow ohi llo worl dh ldh orl rld hellohi hi hellow lohi iwo ihe ell ld h +e ll lo iw rl oh ih el or wo hel wor loh hell helloh hello

    Clearly, the limited view of the data taken by a stream oriented algorithm is not good enough to recognize the two occurances of 'world' disjoint to 'hello''s four. The lookaside and capture facilities of perl's regex engine are superior for this task.

    Props to jepri, who motivated me to look at the LZ clan a while back. I didn't use his code for this; these warts are all mine. I originally was going to try a similar trick in a cryptanalytic tookit, but this exercise has showed me that I need to find a better idea. I think I'll find several in the other replies here.

    ++artist for a stimulating question to think about.

    After Compline,
    Zaxo

Re: Pattern Finding
by ariels (Curate) on Sep 11, 2001 at 15:10 UTC
    (In perl -de 17:)
    DB<6> $x = 'helloworldhellohellohihellohiworldhi' DB<7> print "$1\n" while $x =~ /(\w+)(?=.*\1.*\1)/g hello o l hello h l l o hi h l

    The single letters appear, of course, since they're repeated more than once. We can fix that by requiring \1 to be at least 2 characters long...

    DB<8> print "$1\n" while $x =~ /(\w{2,})(?=.*\1.*\1)/g hello hello hi
    Of course it still finds `world', which appears 4 times.

    To autogenerate the regexp:

    sub repeat_finder_re { my $n = shift; '(\w{2,})(?=' . ('.*\\1'x($n-1)) . ')' }
Re: Pattern Finding
by t13 (Novice) on Sep 11, 2001 at 10:15 UTC
    Is it just me or is your problem statement not well defined? e.g. I see the following patterns: hello, world, hellohi.

    If you say nay, then you open yourself up to the possibilites of some kind of short sequence popping up. The previous sentence had "you" and "yourself" for example. Which is which?

    t.

Re: Pattern Finding
by tachyon (Chancellor) on Sep 12, 2001 at 09:16 UTC

    Here is my effort. It finds all the patterns and also does a quick dictionary lookup for real words. If you don't have a dictionary text file you can select from a wide variety at the National Puzzlers' League -- Word Lists

    my $str ='helloworldhellohellohihellohiworld'; my %hash; # grab all the substrings 2+ chars and count occurences in a hash for my $i ( 0 ..(length($str) -1) ) { $hash{ substr($str, $i, $_) }++ for 2.. (length($str) - $i); } # sort on occurences and then in alphabetical order # only select elements that occur >1 times using grep my @order = sort { $hash{$b} <=> $hash{$a} || $a cmp $b } grep { $hash{$_} > 1 } keys %hash; print "\nPatterns found:\n\n"; print "$hash{$_} occurrences of $_ \n" for @order; # now grab dictionary file into a hash # you can save memory using Search::Dict open DICT, "c:/windows/desktop/dict.txt" or die $!; while ($word = <DICT>) { chomp $word; $dict{$word}++; } @words = grep { defined $dict{$_} } @order; print "\nReal words found in dictionary:\n\n"; print "$hash{$_} occurrences of $_ \n" for @words; # remove substrings of larger words @words = sort { length $b <=> length $a } @words; for $i ( 0 .. $#words - 1 ) { for $j ( $i + 1 .. $#words ) { $hash{$words[$j]} = 0 if $words[$i] =~ m/\Q$words[$j]/ and $ha +sh{$words[$i]} == $hash{$words[$j]}; } } # regenerate sort order grepping out unwanted substrings (set occurenc +es to zero above) @words = sort { $hash{$b} <=> $hash{$a} || $a cmp $b } grep { $hash{$_} } @words; print "\nBest Matches:\n\n"; print "$hash{$_} occurrences of $_ \n" for @words; __END__ # sample output Patterns found: 4 occurrences of el 4 occurrences of ell 4 occurrences of ello 4 occurrences of he 4 occurrences of hel 4 occurrences of hell 4 occurrences of hello 4 occurrences of ll 4 occurrences of llo 4 occurrences of lo 3 occurrences of elloh 3 occurrences of helloh 3 occurrences of lloh 3 occurrences of loh 3 occurrences of oh 2 occurrences of ellohi 2 occurrences of hellohi 2 occurrences of hi 2 occurrences of ld 2 occurrences of llohi 2 occurrences of lohi 2 occurrences of ohi 2 occurrences of or 2 occurrences of orl 2 occurrences of orld 2 occurrences of rl 2 occurrences of rld 2 occurrences of wo 2 occurrences of wor 2 occurrences of worl 2 occurrences of world Real words found in dictionary: 4 occurrences of el 4 occurrences of ell 4 occurrences of he 4 occurrences of hell 4 occurrences of hello 4 occurrences of lo 3 occurrences of oh 2 occurrences of hi 2 occurrences of or 2 occurrences of wo 2 occurrences of world Best Matches: 4 occurrences of hello 3 occurrences of oh 2 occurrences of hi 2 occurrences of world

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Pattern Finding
by jackdied (Monk) on Sep 12, 2001 at 21:06 UTC
    This little guy will do want you want.
    The string can't have spaces, or you'll have to use a
    different seperator rather than a space. If you wanted it
    to be bullet proof, you would have to use arrays and do the
    regexp compares on every element.

    WARNING: If there is a pattern that is never repeated, this will go into an infinite loop

    It is not commented, but short enough to be self explanitory (?)
    #!/usr/bin/perl
    use strict;
    use Data::Dumper;
    
    my $orig_str = 'helloworldhellohellohihellohiworld';
    
    my @patterns = ();
    
    my $str = $orig_str;
    
    while (length($str)) {
        my $len = 1;
        my $token = substr($str, 0, $len);
        my $test_p = substr($str, $len);
        while ($test_p =~ /$token/) {
            $len++;
            $token = substr($str, 0, $len);
            $test_p = substr($str, $len);
        }
        $token = substr($str, 0, $len - 1);
        push @patterns, $token;
        $token =~ s/ //g;
        $str =~ s/$token/ /g;
        $str =~ s/^ *//;
    }
    
    print Dumper(\@patterns);
    
Re: Pattern Finding
by artist (Parson) on Sep 13, 2001 at 01:34 UTC
    Thanks Monks, for your excellent responses.

    I am glad to see 'amazing ideas. I am in the process of running your code against different patterns.
    Meanwhile test your code against:

    String: bookhelloworldhellohellohihellohiworldhihelloworldhihellobookpenbookpenworld
    
    Number of Patterns:5
    ==========================================================
    Deduce patterns:world pen, book, hi, hello
    
    Some more rules to identify patterns.
    1.  I am not looking specifically for dictionary words. 
    2.  One Pattern cannot be part of another pattern.
    3.  There could be multiple answers.
    4.  A Pattern may  show up only once.
    5.  A Pattern may contain single character only
    6.  A Pattern may contain space also.
    
    In case of multiple answer there could be 'techniques' we can apply to obtain the possible best
    such as: minimum sum of length of patterns.

    Thanks,
    Artist.

      Well I came up with a solution but for bizarre reasons decided to obfu it, so I posted it as Pattern Matching Obfu, you will have to change the string $S as appropriate to your requireements. Your clarification of the constraints on the problem lead to some interesting angles, some that I suspect are unintended. Most especially that irrelevent of the short pattern solution there are likely to be very many long patterns, each of which _ONLY_ match once.

      My algorthm, in a rather humourous fashion found the following solutions, amongst many others, that meet your critera in a very short amount of time (the | is the seperator between sub patterns):

      bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihelloboo|kpenbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihell|oworldhihellobookp|enbo +okpenworld bookhelloworldhe|llohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihello|hiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihello|hiworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldhel|lohellohihelloh|iworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldhell|ohellohihelloh|iworldhihello|worldhihellobook|penbo +okpenworld
      Yves
      --
      You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

      Have you considered what your 'rules' imply:

      3. There could be multiple answers.
      4. A Pattern may show up only once.
      5. A Pattern may contain single character only

      Every single char is by definition a pattern. So to is every combination of substrings. There will be quite a few answers. The number will be given by:

      l + (l-1) + (l-2) + ..... ( l - (l - 1) ) + ( l - l )
      where l = length of the string.

      This is (l+(l**2))/2 for each and every string under your rules. BTW my substring/dictionary and best match hack returns this:

      Best Matches: 6 occurrences of hello 4 occurrences of hi 4 occurrences of world 3 occurrences of book 3 occurrences of oh 2 occurrences of low 2 occurrences of pen C:\>

      Your rules are not specific enough to formulate an answer. How do you define what is part of a pattern. Is this "Igohellohellohellohi" a string that contains 3 'hello' or 4 'hi' or 6 'l'..... Perhaps concluding this was the real task?

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Here is something that I believe almost satisfies your requirements (just needs a bit more work which I'm not ready to do at the moment, and its not thoroughly tested). It doesn't do very well without a min and max length for each pattern, so maybe if this was wrapped in a sub which adjusted the min and max to various sizes, and evaluated the results on each pass by some heuristic, it could do fairly well with all of the requirements (and that'll have to wait 'till later):
      use warnings; use strict; # Min and max length for each pattern my $min = 2; my $max = 8; # Number of patterns my $num = shift; # Generate pattern to capture words my $words = join ('', map { "(.{$min,$max})" . "(?:" . join( '|', map("\\$_", 1..$_)) . ")*" } 1..$num); $_="bookhelloworldhellohellohihellohiworldhihelloworldhihellobookpenbo +okpenworld"; if (my @pats = /^$words$/) { for my $pat (@pats) { print "[$pat]\n"; } } stan:~/tmp >./tst 5 [book] [hello] [world] [hi] [pen]
      Update: Greatly simplified. Wondering if I'm doing someone's homework. Noticed that its very similar to nardo's approach, but cleaner, I think, and slightly different behavior due to the newest problem definition. Great minds think alike :)
        Hi, This is one of the classic problem in AI.

        The problem I posted, is actaully an exercise on segmentation section of the OpenLab on http://www.a-i.com.(You will need to register) I have extended it to some other critera such as 'spaces allowed' to meet more general problems. I tried runrig's solution and it doesn't work when number of patterns is 6, for the condition that one pattern cannot be part of another pattern.

        I am trying to solve this problem myself also, what I am looking for is good design to begin with.

        Artist.
        (My computer doesn't keep the login for more than one page, Please let me know if you know the soltuion).