Re: Generate unique ids of maximum length

Another idea probably worth mentioning: embrace tab completion. Let's suppose we can use tab completion on your input, for instance consider this line:

Lenoc3_duallayer_1
^^   ^ ^         ^
[download]

You must type the marked characters, but the others are optional, they can be completed by pressing tab. Therefore in the shortening process always keep the mandatory characters, but leave out as many optional as needed starting from the right end.

limit  shortened
   10  Lenoc3_du1
       ...
    6  Len3d1
    5  Le3d1
    4  not possible
[download]

This is achievable by parsing your input into a suffix tree and operating on that.

L -> XP_ -> 0 -> ...
         -> 1 -> ...
  -> enoc -> 3_ -> carina_ -> ...
                -> duallayer_ -> ...
          -> 5_ -> carina -> ...
                -> duallayer -> ...
[download]

The first characters of the tree node strings are the mandatory ones, the others are optional. This tree structure seems similar to ikegami's code, however I haven't understood that fully yet, so I'm not sure what's the difference. For me the benefit of this approach seems to be this: you don't need to use the concept of word. Another pro for this algorithm is that you don't need to number your shortened identifiers. (But numbering your input lines (in the base of your input character set) is not bad because that produces the shortest possible unique ids.)

I have some proof of concept code, but take it with a grain of salt: it's unnecessarily complex, employs dirty hacks etc.

use strict;
use warnings;
use Data::Dump qw( dd );

sub _collapse {
  my $tree = shift;
  my ( $stree, $append );

  if ( ref $tree ) {
    my @keys = keys %$tree;
    if ( @keys == 1 and $keys[0] ne '' ) {
      ( $stree, $append ) = _collapse( $tree->{ $keys[0] } );
      return $stree, defined $append ? $keys[0] . $append : $keys[0];
    }
    else {
      for (@keys) {
        ( my $ref, $append ) = _collapse( $tree->{$_} );
        $stree->{ defined $append ? $_ . $append : $_ } = $ref;
      }
      return $stree;
    }
  }
  return;
}

sub collapse {
  my $ctree = shift;
  my ( $stree, $append ) = _collapse($ctree);
  if ( defined $append ) {
    return { $append => $stree };
  }
  else {
    return $stree;
  }
}

sub shorten {
  my $stree = shift;
  my $limit = shift;

  if ( ref $stree ) {
    while ( my ( $k, $v ) = each %$stree ) {
      local our @parts = @parts;
      push @parts, $k if $k ne '';
      if ( $k eq '' ) {
        if ( @parts > $limit ) {
          print "!\n";
          next;
        }
        my $remaining = $limit - @parts;
        my $shortened = '';
        for ( 0 .. $#parts ) {
          $shortened .= substr $parts[$_], 0, 1;
          my $str = substr $parts[$_], 1, $remaining;
          $shortened .= $str;
          last if ( ( $remaining -= length $str ) < 0 );
        }
        print $shortened, "\t", join( '', @parts ), "\n";
      }
      shorten( $v, $limit );
    }
  }
}

my $ctree = {};

while (<DATA>) {
  chomp;
  my $ref = $ctree;
  for ( split // ) {
    no warnings 'void';
    $ref->{$_}->{''}; # looks like a decent autovivification bug ;-)
    $ref = $ref->{$_};
  }
  $ref->{''} = undef;
}

#dd $ctree;
my $stree = collapse($ctree);
#dd $stree;
shorten( $stree => 5 );

__DATA__
A2990_duallayer_1
A2990_duallayer_2
A2990_duallayer_3
A2990_duallayer_4
A2990_duallayer_5
A2990_duallayer_6
A2990_duallayer_7
A2990_duallayer_8
A2990_duallayer_9
A2990_duallayer_10
LXP_01
LXP_02
LXP_03
LXP_04
LXP_05
LXP_06
LXP_07
LXP_08
LXP_09
LXP_10
LXP_11
LXP_12
LXP_13
LXP_14
LXP_15
LXP_16
LXP_17
LXP_18
Normal_1
Normal_2
Normal_3
Normal_4
Normal_5
Normal_6
Lenoc3_carina_A
Lenoc3_carina_B
Lenoc3_carina_C
Lenoc3_duallayer_1
Lenoc3_duallayer_2
Lenoc3_duallayer_3
Lenoc5_carina_1
Lenoc5_carina_2
Lenoc5_carina_3
Lenoc5_duallayer_1
Lenoc5_duallayer_2
Lenoc5_duallayer_3
[download]

Some explanation: I've started by parsing the input into a character-level suffix tree. This tree uses hashes, the keys are the substrings, the values are hashrefs of what can follow. A possible ending position of the string is marked by "" => undef. Then collapsed it to a substring-level suffix tree and descended into that to shorten the ids.

Sorry for the late post, yesterday I haven't got enough time to write this.

p.s.: Am I right this is guaranteed to produce unique ids? I'm not sure, but it seems good.

update: answered my own question: Yes, the uniqueness of the shortened ids is guaranteed, no matter what set of optional characters are left out.

Comment on Re: Generate unique ids of maximum length Select or Download Code

Replies are listed 'Best First'.
Re^2: Generate unique ids of maximum length by ikegami (Patriarch) on Apr 13, 2010 at 20:52 UTC
This tree structure seems similar to ikegami's code, however I haven't understood that fully yet, so I'm not sure what's the difference. My tree is the same as your $ctree: `A -- ... / \ X -- P -- ... \ / L \ e -- n -- o -- c -- ...` [download] except numbers are considered atomic in mine. I don't bother collapsing into your $stree form: `A -- ... / \ XP -- ... \ / L \ enoc -- ...` [download] You must type the marked characters, but the others are optional The difference with mine is that I made more character mandatory. The rational is that the OP wanted to the result to resemble the original as much as possible. Your mandatory characters: Ambiguous characters. `Lenoc3_duallayer_1 -> Le3d1 ^^ ^ ^ ^` [download] My mandatory characters: Ambiguous characters. A sequence of 0+ uppercase letters followed by a sequence of 0+ lowercase letters preceding an ambiguous lowercase letter. First and second unambiguous character of a text sequence Nonletters (digits, underscores) `Lenoc3_duallayer_1 -> Len3_du_1 ^^^ ^^^^ ^^` [download]	[reply] [d/l] [select]
Re^3: Generate unique ids of maximum length by rubasov (Friar) on Apr 13, 2010 at 22:22 UTC
Thanks for the explanation, I've got it. (Then consider my node as an explanation to ikegami's node.) It's funny how many different ways people interpret similarity/resemblance, because that was exactly the reason why I chose to keep all the optional characters if the id already fit in the char limit. That way my code always keeps the under-limit ids identical (== more similar). Of course in other cases that's not the optimal choice. I also thought of (but not implemented) a more generic way to decide which character to drop from the original id: provide the user a filter callback in which s?he can rate the characters (or substrings) considered, then drop the ones with the lowest rating (still from right to left). For example: `[_ ] => 3, [A-Z] => 2, [a-z]=> 1, anything else => 0` And this is why I've collapsed the char-level suffix tree to the substring-level: to ease the access to substrings for the purpose of rating. And also because the structure of the tree in the substring-level form cannot interfere with the selection of (non-)ambiguous characters (as in choroba's remark above if I get it right). Cheers	[reply] [d/l]
Re^4: Generate unique ids of maximum length by ikegami (Patriarch) on Apr 13, 2010 at 22:28 UTC
And this is why I've collapsed the char-level suffix tree to the substring-level: I've always done that too, for exactly the reason you mentioned. I just don't create a tree from the collapsed sequences. I just keep the currently relevant collapsed sequence in a scalar (was called `$flux`, now called `$unsplit`). I contemplated returning each item as an alternating list of required and optional components (as follows), but I wanted to keep the code a short as possible. `( ... [ 'Le', 'noc', '3', '_', 'd', 'uallayer_', '3' ], [ 'Le', 'noc', '5', '_', 'c', 'arina_', '1' ], ... )` [download] Update: Added last para and accompanying illuatration.	[reply] [d/l] [select]


Syntactic Confectionery Delight
	PerlMonks