in reply to Pattern Finding

Thanks Monks, for your excellent responses.

I am glad to see 'amazing ideas. I am in the process of running your code against different patterns.
Meanwhile test your code against:

String: bookhelloworldhellohellohihellohiworldhihelloworldhihellobookpenbookpenworld

Number of Patterns:5
Deduce patterns:world pen, book, hi, hello
Some more rules to identify patterns.
1.  I am not looking specifically for dictionary words. 
2.  One Pattern cannot be part of another pattern.
3.  There could be multiple answers.
4.  A Pattern may  show up only once.
5.  A Pattern may contain single character only
6.  A Pattern may contain space also.
In case of multiple answer there could be 'techniques' we can apply to obtain the possible best
such as: minimum sum of length of patterns.


Replies are listed 'Best First'.
Re: Re: Pattern Finding
by demerphq (Chancellor) on Sep 17, 2001 at 03:19 UTC
    Well I came up with a solution but for bizarre reasons decided to obfu it, so I posted it as Pattern Matching Obfu, you will have to change the string $S as appropriate to your requireements. Your clarification of the constraints on the problem lead to some interesting angles, some that I suspect are unintended. Most especially that irrelevent of the short pattern solution there are likely to be very many long patterns, each of which _ONLY_ match once.

    My algorthm, in a rather humourous fashion found the following solutions, amongst many others, that meet your critera in a very short amount of time (the | is the seperator between sub patterns):

    bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihelloboo|kpenbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldh|ellohellohihell|ohiworldhihell|oworldhihellobookp|enbo +okpenworld bookhelloworldhe|llohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihell|ohiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihello|hiworldhihello|worldhihellobookp|enbo +okpenworld bookhelloworldhel|lohellohihello|hiworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldhel|lohellohihelloh|iworldhihello|worldhihellobook|penbo +okpenworld bookhelloworldhell|ohellohihelloh|iworldhihello|worldhihellobook|penbo +okpenworld
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
Re: Re: Pattern Finding
by tachyon (Chancellor) on Sep 24, 2001 at 21:12 UTC

    Have you considered what your 'rules' imply:

    3. There could be multiple answers.
    4. A Pattern may show up only once.
    5. A Pattern may contain single character only

    Every single char is by definition a pattern. So to is every combination of substrings. There will be quite a few answers. The number will be given by:

    l + (l-1) + (l-2) + ..... ( l - (l - 1) ) + ( l - l )
    where l = length of the string.

    This is (l+(l**2))/2 for each and every string under your rules. BTW my substring/dictionary and best match hack returns this:

    Best Matches: 6 occurrences of hello 4 occurrences of hi 4 occurrences of world 3 occurrences of book 3 occurrences of oh 2 occurrences of low 2 occurrences of pen C:\>

    Your rules are not specific enough to formulate an answer. How do you define what is part of a pattern. Is this "Igohellohellohellohi" a string that contains 3 'hello' or 4 'hi' or 6 'l'..... Perhaps concluding this was the real task?




Re: Re: Pattern Finding
by runrig (Abbot) on Sep 13, 2001 at 03:20 UTC
    Here is something that I believe almost satisfies your requirements (just needs a bit more work which I'm not ready to do at the moment, and its not thoroughly tested). It doesn't do very well without a min and max length for each pattern, so maybe if this was wrapped in a sub which adjusted the min and max to various sizes, and evaluated the results on each pass by some heuristic, it could do fairly well with all of the requirements (and that'll have to wait 'till later):
    use warnings; use strict; # Min and max length for each pattern my $min = 2; my $max = 8; # Number of patterns my $num = shift; # Generate pattern to capture words my $words = join ('', map { "(.{$min,$max})" . "(?:" . join( '|', map("\\$_", 1..$_)) . ")*" } 1..$num); $_="bookhelloworldhellohellohihellohiworldhihelloworldhihellobookpenbo +okpenworld"; if (my @pats = /^$words$/) { for my $pat (@pats) { print "[$pat]\n"; } } stan:~/tmp >./tst 5 [book] [hello] [world] [hi] [pen]
    Update: Greatly simplified. Wondering if I'm doing someone's homework. Noticed that its very similar to nardo's approach, but cleaner, I think, and slightly different behavior due to the newest problem definition. Great minds think alike :)
      Hi, This is one of the classic problem in AI.

      The problem I posted, is actaully an exercise on segmentation section of the OpenLab on will need to register) I have extended it to some other critera such as 'spaces allowed' to meet more general problems. I tried runrig's solution and it doesn't work when number of patterns is 6, for the condition that one pattern cannot be part of another pattern.

      I am trying to solve this problem myself also, what I am looking for is good design to begin with.

      (My computer doesn't keep the login for more than one page, Please let me know if you know the soltuion).

        for the condition that one pattern cannot be part of another pattern

        This is the toughest condition, and so I don't think it can be done with a regex, at least not with perl's regex engine (hope someone can prove me wrong :). At every stage of capuring a pattern, you'd have to be able to fail if the longer of the current pattern and each of all past patterns doesn't contain the other. Here's some psuedo perl regex code which, if it worked would accomplish this (hope you get the idea), but I'm using things in the wrong way, the regex engine isn't re-entrant, it uses "$1" instead of "\1" (and in a symbolic reference sort of way), etc, but I though it was interesting nonetheless. It would go right after each pattern caputure in my solution:

        join('', map { "(?{(length($$i)>length($$_))$$i !~ /$$_/ | $$_ !~ /$$i/})" } 1..$_)