Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Challenge: Generate a glob patterns from a word list

by choroba (Archbishop)
on May 03, 2021 at 22:41 UTC ( #11131998=perlquestion: print w/replies, xml ) Need Help??

choroba has asked for the wisdom of the Perl Monks concerning the following question:

(Inspired by this StackOverflow question.)

You are given a list of words. Find the shortest glob expression generating the words in the list. The glob expression should only use curly brackets and commas, no spaces, no wildcards.

e.g.

#! /usr/bin/perl use warnings; use strict; use Test::More; sub list2glob { ... } is list2glob('a', 'b'), '{a,b}'; is list2glob('ab', 'ac'), 'a{b,c}'; is list2glob('aXb', 'aYb'), 'a{X,Y}b'; is list2glob(qw( a1b3c a1b4c a2b3c a2b4c )), 'a{1,2}b{3,4}c'; is list2glob(qw( /ab/ef/ij/kl /ab/ef/ij/mn /ab/ef/ij /ab/gh/ij/kl /ab/gh/ij/mn /ab/gh/ij /cd/ef/ij/kl /cd/ef/ij/mn /cd/ef/ij /cd/gh/ij/kl /cd/gh/ij/mn /cd/gh/ij )), '/{ab,cd}/{ef,gh}/ij{/{kl,mn},}'; is list2glob(qw( abdel abdelmn abdelmo abdfgkl abdfgklmn abdfgklmo abdfghkl abdfghklmn abdfghklmo abdfgijkl abdfgijklmn abdfgijklmo acdel acdelmn acdelmo acdfgkl acdfgklmn acdfgklmo acdfghkl acdfghklmn acdfghklmo acdfgijkl acdfgijklmn acdfgijklmo )), 'a{b,c}d{e,fg{,h,ij}k}l{,m{n,o}}'; done_testing();

I've tried several times, but I can't get it right. It gets much more complex than it seems, or maybe I'm doing something wrong. Any ideas, solutions, insults?

Update: What do I mean by "shortest"? The curly brackets and commas don't count, just count the other characters.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Replies are listed 'Best First'.
Re: Challenge: Generate a glob patterns from a word list
by tybalt89 (Prior) on May 04, 2021 at 03:05 UTC

    It's an idea...

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11131998 use warnings; use Algorithm::Diff qw(traverse_sequences); use List::Util qw( first ); use Test::More; sub list2glob { @_ == 2 and return pair(@_); my @best; for my $i ( 0 .. $#_ - 1 ) { for my $j ( $i + 1 .. $#_ ) { my @rest = ( pair(@_[$i, $j]), map $_[$_], grep $_ != $i && $_ != $j, 0 .. $#_ ); my $try = list2glob(@rest); $best[length $try] = $try; } } return first { defined } @best; } sub pair { my $answer = my $left = my $right = ''; my @from = split //, shift; my @to = split //, shift; traverse_sequences( \@from, \@to, { MATCH => sub { length $left . $right and $answer .= "{$left,$right}"; $left = $right = ''; $answer .= $from[shift()] }, DISCARD_A => sub {$left .= $from[shift()]}, DISCARD_B => sub {$right .= $to[pop()]}, } ); length $left . $right and $answer .= "{$left,$right}"; return $answer; } is list2glob('a', 'b'), '{a,b}'; is list2glob('ab', 'ac'), 'a{b,c}'; is list2glob('aXb', 'aYb'), 'a{X,Y}b'; is list2glob(qw( a1b3c a1b4c a2b3c a2b4c )), 'a{1,2}b{3,4}c'; is list2glob(qw( /ab/ef/ij/kl /ab/ef/ij/mn /ab/ef/ij /cd/gh/ij/kl /cd/gh/ij/mn /cd/gh/ij )), '/{ab,cd}/{ef,gh}/ij{/{kl,mn},}'; done_testing();

    Outputs:

    ok 1 ok 2 ok 3 ok 4 ok 5 1..5

    See if you can figure out why I left out some tests :)

      > See if you can figure out why I left out some tests :)

      I let it run for 7 hours, but the last test was still in progress. In the end, I killed it, as the fan noise was waking me up from sleep.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Challenge: Generate a glob patterns from a word list
by Fletch (Chancellor) on May 04, 2021 at 02:51 UTC

    I'd start maybe looking at Regexp::Trie for inspiration; at least conceptually its similar just you'll be targetting glob semantics rather than regexp.

    Edit: Looking at the original question domain (mkpath -p on directory names) I wonder if you might couldn't cheat and work path segment by path segment (split into sections on qr{/} then kind of do a trie-ish thing on each segment). Don't know if that's applicable for the more general question.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      AFAIK are Trie algorithms effectively building a tree.

      But now we have a search graph with diamond structures.

      IOW or-clauses in a trie are nested and never chained.

      (Unless I'm missing something)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        No that's a good point. I started following my own suggestion and had just that problem. I'd not match the expected in that I'd get "too much" commonality duplicated and wound up with a{1b{3c,4c},2b{3c,4c}} rather than a{1,2}b{3,4}c. Now it "works" and produces the same names for 5/6 outputs (expanding those with glob before testing), but it's not what I'd call an optimal glob representation (in that I think there's too much duplicated).

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

        I think you are right, What we have here is a N-way diff (where N is the number of strings).
        My guess is that it is horribly explosive in N.

        A 2-way diff is bad enough that it takes caching to perform adequately. N-way would be much, much worse.

Re: Challenge: Generate a glob patterns from a word list
by Discipulus (Abbot) on May 04, 2021 at 07:28 UTC
    Hello choroba,

    I propose a non concurring idea perhaps interesting, infact my result is longer and the resulting glob will produce too much occurences (57600 instead of 24 ;). Maybe it can be used and reduced..

    > Find the shortest glob expression generating the words in the list.

    ..words in the list and only them!

    > ..insults?

    brainy!! :)

    use warnings; use strict; use Data::Dump; sub list2glob { my $table; my $table_len; foreach my $word ( sort {length $b <=> length $a} @_ ){ $table_len = length $word unless $table_len; push @$table, [ map{ length $word > $_ ? substr( $word,$_,1 ) +: '' }0..$table_len-1 ]; } # dd $table; my $glob; foreach my $col (0..$table_len-1){ print "\nColumn $col\n"; my %possible; foreach my $row ( @$table ){ $possible{ $row->[$col] }++; } print "occurences: ";dd %possible; if ( 1 == scalar keys %possible ){ $glob .= join'',keys %possible; } else { $glob .= '{'.( join ',', keys %possible ).'}'; } print "actual glob expression: $glob\n"; } my @all = glob $glob; print "\n\nRESULT: glob $glob produces ",scalar @all," elements\n" +; } list2glob(qw( abdel abdelmn abdelmo abdfgkl abdfgklmn abdfgklmo abdfghkl abdfghklmn abdfghklmo abdfgijkl abdfgijklmn abdfgijklmo acdel acdelmn acdelmo acdfgkl acdfgklmn acdfgklmo acdfghkl acdfghklmn acdfghklmo acdfgijkl acdfgijklmn __END__ Column 0 occurences: ("a", 24) actual glob expression: a Column 1 occurences: ("b", 12, "c", 12) actual glob expression: a{b,c} Column 2 occurences: ("d", 24) actual glob expression: a{b,c}d Column 3 occurences: ("e", 6, "f", 18) actual glob expression: a{b,c}d{e,f} Column 4 occurences: ("l", 6, "g", 18) actual glob expression: a{b,c}d{e,f}{l,g} Column 5 occurences: ("", 2, "k", 6, "m", 4, "h", 6, "i", 6) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i} Column 6 occurences: ("l", 6, "k", 6, "o", 2, "n", 2, "j", 6, "", 2) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,} Column 7 occurences: ("m", 4, "l", 6, "k", 6, "", 8) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,}{m,l,k, +} Column 8 occurences: ("", 10, "n", 2, "m", 4, "o", 2, "l", 6) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,}{m,l,k, +}{,n,m,o,l} Column 9 occurences: ("o", 2, "n", 2, "m", 4, "", 16) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,}{m,l,k, +}{,n,m,o,l}{o,n,m,} Column 10 occurences: ("", 20, "n", 2, "o", 2) actual glob expression: a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,}{m,l,k, +}{,n,m,o,l}{o,n,m,}{,n,o} RESULT: glob a{b,c}d{e,f}{l,g}{,k,m,h,i}{l,k,o,n,j,}{m,l,k,}{,n,m,o,l} +{o,n,m,}{,n,o} produces 57600 elements

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Challenge: Generate a glob patterns from a word list
by LanX (Cardinal) on May 05, 2021 at 18:36 UTC
    > Find the shortest glob expression ...

    So what's your metric for "shortest"?

    String length doesn't equal compression.

    use v5.12; use warnings; use Data::Dump; use Test::More; test ( [ <{b,{c,d}e}> ], [ <{b,ce,de}> ], ); test ( [ <{c,d}e> ], [ <{ce,de}> ], ); sub test { my @g = @_; for my $i (0..$#g) { for my $j ($i+1..$#g) { is_deeply($g[$i],$g[$j],"$i,$j"); } } ddx @g; } done_testing;

    ok 1 - 0,1 # challenge_choroba_globs.pl:26: (["b", "ce", "de"], ["b", "ce", "de"] +) ok 2 - 0,1 # challenge_choroba_globs.pl:26: (["ce", "de"], ["ce", "de"]) 1..2

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      OP updated. Don't count the curlies and commas.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        furthermore, in your example the elements from the "alphabet" always only appear once

        'a{b,c}d{e,fg{,h,ij}k}l{,m{n,o}}';

        I suppose that is not a given, but this might guaranty a unique solution.

        OTOH if characters can be repeated I'd bet that multiple solutions are possible.

        Please clarify.

        update

        for instance all these glob expressions are equivalent ( I removed the variations where {,a} and {a,} were swapped)

        use v5.12; use warnings; use Data::Dump; use Test::More; test ( [ <{a,aa}> ], [ <{,a}{a}> ], [ <{a}{,a}> ], [ <a{,a}> ], [ <{,a}a> ], ); sub test { my @g = @_; my (%h1,%h2); for my $i (0..$#g) { for my $j ($i+1..$#g) { is_deeply( [sort @{$g[$i]}], [sort @{$g[$j]}], "$i,$j") or ddx $g[$i],$g[$j]; } } } done_testing;

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        > OP updated. Don't count the curlies and commas.

        you should also add that curlies and commas should be minimal as a second criteria.

        for instance a , {a} and {{a}} are all equivalent.

        best you provide a sub metric

        edit

        sub cmp_glob($g1,$g2) which returns -1, 0 or 1 like cmp does.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: Challenge: Generate a glob patterns from a word list
by tybalt89 (Prior) on May 10, 2021 at 14:39 UTC

    Here's my sixth try at it. It runs in two modes, fast & exhaustive. Fast mode only works with the two longest prefix/suffix at each stage. It is possible that fast mode will not produce the shortest solution. Notice that as set up it doesn't run an exhaustive test on the 24 word test case, but it completes all the rest in about 150 seconds on my machine.

    It basically does a trie on both ends (prefix or suffix) until things are reduced to one word.

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11131998 use warnings; use List::Util qw( uniq first ); $| = 1; my $braces = qr/(\{(?:(?1))\}|[^{}\n])/; # {} must be balanced my $leftduplicate = qr/^($braces+).*\n\1/m; my $rightduplicate = qr/($braces+)\n(?:.*\n)*.*\1$/m; my %cache; my $fast; sub list2glob # find the glob for a set of w +ords { local $_ = join "\n", my @words = uniq sort @_; @words > 1 or return $words[0]; $cache{$_} and return $cache{$_}; my (@best, @found); /$leftduplicate(?{ push @{ $found[length $1] }, "l $1" })(*FAIL)/; /$rightduplicate(?{ push @{ $found[length $1] }, "r $1" })(*FAIL)/; @found or return $cache{$_} = # join all words in brace group '{' . join(',', map s/^\{($braces*)\}$/$1/r, @words) . '}'; my $count = 0; for ( reverse grep defined, @found ) # process duplicate ends { for ( uniq @$_ ) # a duplicate left or right end (trie it !) { my ($side, $same) = split; my ($regex, $left, $right) = $side eq 'l' ? (qr/^\Q$same\E/, $same, '') : (qr/\Q$same\E$/, '', $same); my $answer = list2glob( (grep !/$regex/, @words ), $left . list2glob( map { /$regex/ ? "$`$'" : () } @words ) . $ +right); $best[$answer =~ tr/{},//c] = $answer; } $fast and ++$count >= 2 and last; # do only two longest duplicates + in fast } return $cache{$_} = first {defined} @best; # shortest solution } sub runtest # run, show, and validate the glob { %cache = (); my @words = uniq sort @_; print $fast ? "( fast) " : "(exhaustive) "; my $answer = list2glob(@words) // '*NONE*'; print "$answer\n"; my @glob = uniq sort glob $answer; "@glob" eq "@words" or die "\n** FAILED **\n got: @glob\nwant: @word +s\n"; } while( <DATA> ) { s/#.*//; # remove comments my @words = split or next; # ignore blank lines print "@words =>\n"; $fast = 1; runtest( @words ); $fast = 0; @words <= 20 and # FIXME skip big exhaustive tests runtest( @words ); print "\n"; } print "SUCCESS !!\n"; __DATA__ aVb aWb aXXb aYb aZb # debugging test case +s abcdXegh abcdYfgh ac ad bc bd ad ae af bd be bf cd ce cf a1b3c a1b4c a2b3c a2b4c a1b5c a2b5c fee fie foe fum one two three four five six seven eight nine ten a bx cy dx ey fx g hx i jy aa a aa aaa aaaa aaaaa anestingtest a b # tests from https://perlmonks.org/?node_id= +11131998 ab ac aXb aYb a1b3c a1b4c a2b3c a2b4c /ab/ef/ij/kl /ab/ef/ij/mn /ab/ef/ij /ab/gh/ij/kl /ab/gh/ij/mn /ab/gh/i +j /cd/ef/ij/kl /cd/ef/ij/mn /cd/ef/ij /cd/gh/ij/kl /cd/gh/ij/mn /cd/g +h/ij abdel abdelmn abdelmo abdfgkl abdfgklmn abdfgklmo abdfghkl abdfghklmn +abdfghklmo abdfgijkl abdfgijklmn abdfgijklmo acdel acdelmn acdelmo ac +dfgkl acdfgklmn acdfgklmo acdfghkl acdfghklmn acdfghklmo acdfgijkl ac +dfgijklmn acdfgijklmo

    Outputs:

    aVb aWb aXXb aYb aZb => ( fast) a{V,W,XX,Y,Z}b (exhaustive) a{V,W,XX,Y,Z}b abcdXegh abcdYfgh => ( fast) abcd{Xe,Yf}gh (exhaustive) abcd{Xe,Yf}gh ac ad bc bd => ( fast) {a,b}{c,d} (exhaustive) {a,b}{c,d} ad ae af bd be bf cd ce cf => ( fast) {a,b,c}{d,e,f} (exhaustive) {a,b,c}{d,e,f} a1b3c a1b4c a2b3c a2b4c a1b5c a2b5c => ( fast) a{1,2}b{3,4,5}c (exhaustive) a{1,2}b{3,4,5}c fee fie foe fum => ( fast) f{um,{e,i,o}e} (exhaustive) f{um,{e,i,o}e} one two three four five six seven eight nine ten => ( fast) {eight,four,six,two,{fiv,thre,{ni,o}n}e,{sev,t}en} (exhaustive) {eight,four,six,two,{fiv,thre,{ni,o}n}e,{sev,t}en} a bx cy dx ey fx g hx i jy aa => ( fast) {g,i,{,a}a,{b,d,f,h}x,{c,e,j}y} (exhaustive) {g,i,{,a}a,{b,d,f,h}x,{c,e,j}y} a aa aaa aaaa aaaaa anestingtest => ( fast) a{,a{,a{,{,a}a}},nestingtest} (exhaustive) a{,nestingtest,{,{,{,a}a}a}a} a b => ( fast) {a,b} (exhaustive) {a,b} ab ac => ( fast) a{b,c} (exhaustive) a{b,c} aXb aYb => ( fast) a{X,Y}b (exhaustive) a{X,Y}b a1b3c a1b4c a2b3c a2b4c => ( fast) a{1,2}b{3,4}c (exhaustive) a{1,2}b{3,4}c /ab/ef/ij/kl /ab/ef/ij/mn /ab/ef/ij /ab/gh/ij/kl /ab/gh/ij/mn /ab/gh/i +j /cd/ef/ij/kl /cd/ef/ij/mn /cd/ef/ij /cd/gh/ij/kl /cd/gh/ij/mn /cd/g +h/ij => ( fast) /{ab,cd}/{ef,gh}/ij{,/{kl,mn}} (exhaustive) /{ab,cd}/{ef,gh}/ij{,/{kl,mn}} abdel abdelmn abdelmo abdfgkl abdfgklmn abdfgklmo abdfghkl abdfghklmn +abdfghklmo abdfgijkl abdfgijklmn abdfgijklmo acdel acdelmn acdelmo ac +dfgkl acdfgklmn acdfgklmo acdfghkl acdfghklmn acdfghklmo acdfgijkl ac +dfgijklmn acdfgijklmo => ( fast) a{b,c}d{e,fg{,h,ij}k}l{,m{n,o}} SUCCESS !!

    Note: I changed from your testing scheme because some tests generate different solutions but both are valid.

Re: Challenge: Generate a glob patterns from a word list
by jo37 (Friar) on May 06, 2021 at 16:52 UTC

    Somehow this challenge reminds me of finding a minimal normal form for a Boolean function. There are algorithms for this task, e.g. Quine McCluskey algorithm or Petrick's method. Though this challenge does not deal with a Boolean algebra, the ideas therein might give some inspiration.
    Just a thought.

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
      Yes and no.

      You can translate a glob into a Boolean term.

      Question is if you need the full power of Boolean expressions.

      Because Quine McCluskey is NP complete.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11131998]
Approved by GrandFather
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (6)
As of 2021-05-11 05:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (113 votes). Check out past polls.

    Notices?