http://www.perlmonks.org?node_id=11104709

stevena has asked for the wisdom of the Perl Monks concerning the following question:

Given a lexicon with long words and short words (happening to be length 9 and up, and 8 and down, respectively), I am looking to enumerate all minimal extensions of a short by a long; that is, with no intervening word (either long or short) that extends the short and is extended by the long. (Extensions could also be called strict superstrings.)

I.e., if   'w > x' means word w extends word x, I want to find all pairs (L, S), such that:

* L > S

* there's no W with L > W > S,

where L, S, W are restricted to lexicon words.

For example, (NUMERICAL, NUMERIC) is desired, but not (SOCIOBIOLOGIST, OLOGIST) because SOCIOBIOLOGIST > BIOLOGIST > OLOGIST.

I am looking for transparent code conducive to my adding less interesting constraints. (Fine if you guide me without writing the code. Also happy to see code.)

Not homework or work -- solely hobby-related. Thanks!

  • Comment on minimal superstrings/maximal substrings

Replies are listed 'Best First'.
Re: minimal superstrings/maximal substrings
by jcb (Parson) on Aug 20, 2019 at 01:25 UTC

    If I read this correctly, you do not want (SOCIOBIOLOGIST, OLOGIST) but do want (SOCIOBIOLOGIST, BIOLOGIST) and (BIOLOGIST, OLOGIST).

    This leads to a simple iterative solution, where you first find all L > S tuples with the longest L and shortest S, then "split" those tuples for cases where some W exists such that L > W > S until you have no more tuples to split. Each split replaces one entry L > S in the list with a pair of entries L > W and W > S.

    I am unsure, but suspect that you may be able to get better performance by applying scalar reverse to each string and working with "mirrored" strings, turning suffix matches (hard) into prefix matches (easy).

    Now that I think about it, sorting those "mirrored" strings should put all possible L > S tuples into neat contiguous groups, with the shortest S at the beginning, longest L at the end, and any W in the middle, if you use the default cmp for the sort.

    Does the entire lexicon fit into memory or do you need to sort them on disk?

    Edit to add: At least this stage is simple enough that I would suggest solving this with the shell sort command and awk if you have those tools. I firmly believe that every monk here should learn AWK and use it for these types of simple hobby tasks — doing so will make you a better Perl programmer. (The GNU Awk manual is positively "light" reading compared to the enormous library of POD that comes with Perl.)

      Does the entire lexicon fit into memory or do you need to sort them on disk?
      It most probably does. Word lists and dictionaries have at most a few hundreds of thousands of entries, and that's quite small compared to computer memory nowadays, probably even on Raspberry PI nanocomputers.

      On your edit: yes, I agree that AWK is a great tool that I have used quite a lot in the past, but, personally, I have almost stopped using AWK when I started to know Perl well enough to make it a more powerful and more flexible replacement.

Re: minimal superstrings/maximal substrings
by AnomalousMonk (Archbishop) on Aug 20, 2019 at 01:01 UTC
Re: minimal superstrings/maximal substrings
by tybalt89 (Monsignor) on Aug 20, 2019 at 10:21 UTC

    Straight from the problem description. Probably O(N**3) or greater.

    #!/usr/bin/perl # https://perlmonks.org/?node_id=11104709 use strict; use warnings; use List::Util qw( none ); use Path::Tiny; my @words = # get some wor +ds grep /^fun/, # for shorter +testing grep /^[a-z]+$/, path('/usr/share/dict/words')->lines({chomp => 1}); my @longs = grep length $_ >= 9, @words; # all legal lo +ng words for my $short ( grep length $_ <= 8, @words ) # all legal sh +ort words { for my $long ( grep /$short/, @longs ) # all long sup +erstrings { if( none { # test for wor +d between length $_ > length $short and length $_ < length $long and $long =~ $_} @words ) { print "$long $short\n"; } } }

    Outputs:

    funicular fun funniness fun functional function functionaries function functionary function functioned function functioning function functions function fundamental fund fundraiser fund fundraising fund funereally funereal fungicidal fungi fungicide fungi fungibles fungible funkiness funk funneling funnel

      Or use regex to find if there is a word between.

      #!/usr/bin/perl # https://perlmonks.org/?node_id=11104709 use strict; use warnings; use List::Util qw( none ); use Path::Tiny; my @words = # get some wor +ds grep /^fun/, # for shorter +testing grep /^[a-z]+$/, path('/usr/share/dict/words')->lines({chomp => 1}); my @longs = grep length $_ >= 9, @words; # all legal lo +ng words my $string = join ' ', sort { length $a <=> length $b } @words; for my $short ( grep length $_ <= 8, @words ) # all legal sh +ort words { for my $long ( grep /$short/, @longs ) # all long sup +erstrings { if( $string !~ # test for wor +d between /\b$short\b.*?\b(\w*$short\w*)\b.*?\b(?=$long\b)\w*\1\w*\b/ ) { print "$long $short\n"; } } }
Re: minimal superstrings/maximal substrings
by LanX (Saint) on Aug 22, 2019 at 12:57 UTC
    This creates the full graph before filtering to your length requirements
    use strict; use warnings; use Data::Dump qw/pp dd/; my @words = data(); my (%lower,%super); for my $word ( sort {length $a <=> length $b} @words ) { SUPER: for my $sub ( keys %lower ) { if ( -1 != index ($word,$sub) ) { for ( @{$super{$sub}} ) { next SUPER if -1 < index ($word,$_) } push @{$super{$sub}},$word; } } $lower{$word}=1; } #pp \%super; for my $key ( keys %super ) { if (length $key >8) { delete $super{$key}; next; } $super{$key} = [grep { length $_ > 8 } @{$super{$key}}]; } pp \%super; sub data { return "funk", "fungibles", "funneling", "fundraiser", "fungi", "functions", "fun", "fundamental", "functioned", "functional", "functioning", "funicular", "funnel", "functionaries", "functionary", "funkiness", "function", "fungicidal", "fungible", "fungicide", "fundraising", "funniness", "fund", "funereal", "funereally","fundraising-team", }

    { fun => ["funicular", "funniness"], function => [ "functions", "functioned", "functional", "functioning", "functionary", "functionaries", ], fund => ["fundraiser", "fundamental", "fundraising"], funereal => ["funereally"], fungi => ["fungicide", "fungicidal"], fungible => ["fungibles"], funk => ["funkiness"], funnel => ["funneling"], }

    Disclaimer: data mostly stole from tybalt89 :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: minimal superstrings/maximal substrings
by The Perlman (Scribe) on Aug 21, 2019 at 14:20 UTC
    On a side note, the mathematical construct you are describing is known as a "poset" = partially ordered set

    The best algorithm depends on your data's size, because a complex algorithm doesn't pay off when only being some seconds faster.

    My take on it is to first sort and partition your words in subsets of same length, because every "cover word" must have more letters.

    Start by putting all words of the first partition into @uncovered and test them against all words of the next partition.

    For each match you'll store into a hash $cover{$smaller}=$bigger and delete $smaller from @uncovered.

    Then you'll push the @current_partition to the remaining words in @uncovered and repeat with the next partition.

    This is a generic algorithm, your case with 9 up and 8 down can be easily adopted.

    - Ron
Re: minimal superstrings/maximal substrings
by stevena (Novice) on Aug 20, 2019 at 00:29 UTC
    Btw, I could post my partial code, but does only the L>S part and it contains some of my other constraints (which I suppose I could squelch for the purpose).
      It would be helpful if you posted more input data and resultant output data.
      Post your current code too using the posted examples and explain what is wrong.