http://www.perlmonks.org?node_id=834569


in reply to Generate unique ids of maximum length

Another idea probably worth mentioning: embrace tab completion. Suppose tab completion is available for your input, and consider this line:

    Lenoc3_duallayer_1
    ^^   ^ ^         ^

You must type the marked characters; the others are optional and can be completed by pressing tab. So when shortening, always keep the mandatory characters and leave out as many optional ones as needed, starting from the right end.

    limit  shortened
    10     Lenoc3_du1
    ...
    6      Len3d1
    5      Le3d1
    4      not possible
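The rule can be sketched in a few lines of Perl. This is only an illustration: shorten_chunks is a hypothetical helper, and it assumes the id has already been split by hand into chunks that each begin with one mandatory character followed by its optional ones.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): each chunk starts with a
# mandatory character; the rest of the chunk is optional. Optional characters
# are kept from the left, so the ones dropped are the rightmost.
sub shorten_chunks {
    my ( $limit, @chunks ) = @_;
    my $spare = $limit - @chunks;    # room left once every mandatory char is kept
    return if $spare < 0;            # more mandatory characters than the limit allows
    my $short = '';
    for my $chunk (@chunks) {
        my $optional = substr $chunk, 1, $spare;    # as many optional chars as still fit
        $short .= substr( $chunk, 0, 1 ) . $optional;
        $spare -= length $optional;
    }
    return $short;
}

# Lenoc3_duallayer_1 split at the marked characters
my @chunks = ( 'L', 'enoc', '3_', 'duallayer_', '1' );
for my $limit ( 10, 6, 5, 4 ) {
    my $short = shorten_chunks( $limit, @chunks );
    printf "%2d  %s\n", $limit, defined $short ? $short : 'not possible';
}
```

For limits 10, 6, 5 and 4 this prints Lenoc3_du1, Len3d1, Le3d1 and "not possible", matching the table.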

This is achievable by parsing your input into a prefix tree (trie) and operating on that.

    L -> XP_  -> 0 -> ...
              -> 1 -> ...
      -> enoc -> 3_ -> carina_    -> ...
                    -> duallayer_ -> ...
              -> 5_ -> carina_    -> ...
                    -> duallayer_ -> ...

The first character of each tree node string is mandatory; the others are optional. This tree structure seems similar to ikegami's code; however, I haven't fully understood that yet, so I'm not sure what the difference is. For me the benefit of this approach is that you don't need the concept of a word. Another advantage of this algorithm is that you don't need to number your shortened identifiers. (But numbering your input lines in the base of your input character set is not a bad idea either, because that produces the shortest possible unique ids.)
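For comparison, the numbering alternative from the parenthesis could look roughly like this. It is a sketch only: to_base is a hypothetical helper, and the 63-character alphabet is an assumption about which characters are allowed in an id.

```perl
use strict;
use warnings;

# Hypothetical helper: write a non-negative integer using @alphabet as digits.
sub to_base {
    my ( $n, @alphabet ) = @_;
    my $base = @alphabet;
    my $out  = '';
    do {
        $out = $alphabet[ $n % $base ] . $out;    # least significant digit first
        $n   = int( $n / $base );
    } while ( $n > 0 );
    return $out;
}

# Assumed character set: digits, upper case, lower case, underscore (63 symbols)
my @alphabet = ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z', '_' );

# Line numbers 0 .. 62 fit in one character, 63 .. 3968 in two, and so on.
print to_base( $_, @alphabet ), "\n" for 0, 62, 63;
```

The ids are as short as they can be, but they no longer resemble the original names, which is the trade-off against the tab-completion idea.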

I have some proof of concept code, but take it with a grain of salt: it's unnecessarily complex, employs dirty hacks etc.

    use strict;
    use warnings;
    use Data::Dump qw( dd );

    # Merge chains of single-child nodes, turning the character-level tree
    # into a substring-level tree.
    sub _collapse {
        my $tree = shift;
        my ( $stree, $append );
        if ( ref $tree ) {
            my @keys = keys %$tree;
            if ( @keys == 1 and $keys[0] ne '' ) {
                # only one way to continue: pass the character up to the parent
                ( $stree, $append ) = _collapse( $tree->{ $keys[0] } );
                return $stree, defined $append ? $keys[0] . $append : $keys[0];
            }
            else {
                # branching point or end marker: start new node strings here
                for (@keys) {
                    ( my $ref, $append ) = _collapse( $tree->{$_} );
                    $stree->{ defined $append ? $_ . $append : $_ } = $ref;
                }
                return $stree;
            }
        }
        return;
    }

    sub collapse {
        my $ctree = shift;
        my ( $stree, $append ) = _collapse($ctree);
        if ( defined $append ) {
            return { $append => $stree };
        }
        else {
            return $stree;
        }
    }

    # Walk the collapsed tree and print "shortened<TAB>original" for every id,
    # or "!" when the mandatory characters alone exceed the limit.
    sub shorten {
        my $stree = shift;
        my $limit = shift;
        if ( ref $stree ) {
            while ( my ( $k, $v ) = each %$stree ) {
                local our @parts = @parts;    # node strings on the path so far
                push @parts, $k if $k ne '';
                if ( $k eq '' ) {             # end marker: a complete id
                    if ( @parts > $limit ) {
                        print "!\n";
                        next;
                    }
                    my $remaining = $limit - @parts;
                    my $shortened = '';
                    for ( 0 .. $#parts ) {
                        $shortened .= substr $parts[$_], 0, 1;    # mandatory
                        my $str = substr $parts[$_], 1, $remaining;    # optional
                        $shortened .= $str;
                        last if ( ( $remaining -= length $str ) < 0 );
                    }
                    print $shortened, "\t", join( '', @parts ), "\n";
                }
                shorten( $v, $limit );
            }
        }
    }

    # Build the character-level tree; '' => undef marks the end of an id.
    my $ctree = {};
    while (<DATA>) {
        chomp;
        my $ref = $ctree;
        for ( split // ) {
            no warnings 'void';
            $ref->{$_}->{''};    # looks like a decent autovivification bug ;-)
            $ref = $ref->{$_};
        }
        $ref->{''} = undef;
    }

    #dd $ctree;
    my $stree = collapse($ctree);
    #dd $stree;
    shorten( $stree => 5 );

    __DATA__
    A2990_duallayer_1
    A2990_duallayer_2
    A2990_duallayer_3
    A2990_duallayer_4
    A2990_duallayer_5
    A2990_duallayer_6
    A2990_duallayer_7
    A2990_duallayer_8
    A2990_duallayer_9
    A2990_duallayer_10
    LXP_01
    LXP_02
    LXP_03
    LXP_04
    LXP_05
    LXP_06
    LXP_07
    LXP_08
    LXP_09
    LXP_10
    LXP_11
    LXP_12
    LXP_13
    LXP_14
    LXP_15
    LXP_16
    LXP_17
    LXP_18
    Normal_1
    Normal_2
    Normal_3
    Normal_4
    Normal_5
    Normal_6
    Lenoc3_carina_A
    Lenoc3_carina_B
    Lenoc3_carina_C
    Lenoc3_duallayer_1
    Lenoc3_duallayer_2
    Lenoc3_duallayer_3
    Lenoc5_carina_1
    Lenoc5_carina_2
    Lenoc5_carina_3
    Lenoc5_duallayer_1
    Lenoc5_duallayer_2
    Lenoc5_duallayer_3

Some explanation: I started by parsing the input into a character-level prefix tree (trie). The tree uses hashes: the keys are the substrings and the values are hashrefs of what can follow. A possible ending position of a string is marked by "" => undef. I then collapsed it into a substring-level tree and descended into that to shorten the ids.
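To make the data structure concrete, here is a smaller sketch of the same idea. The names build_trie and parts_for are illustrative, not taken from the code above; the sketch assumes Perl 5.10 or later (for //=) and that the id passed to parts_for was inserted into the tree. Instead of building the collapsed tree, it reads the node strings for a single id directly off the character-level tree.

```perl
use strict;
use warnings;

# Character-level tree out of nested hashes; '' => undef marks a possible end.
sub build_trie {
    my @ids  = @_;
    my $trie = {};
    for my $id (@ids) {
        my $node = $trie;
        for my $char ( split //, $id ) {
            $node->{$char} //= {};
            $node = $node->{$char};
        }
        $node->{''} = undef;
    }
    return $trie;
}

# Node strings for one id: a new string starts at the first character and
# after every node that offers more than one continuation.
sub parts_for {
    my ( $trie, $id ) = @_;
    my @parts;
    my $node     = $trie;
    my $new_part = 1;
    for my $char ( split //, $id ) {
        if ($new_part) { push @parts, $char }
        else           { $parts[-1] .= $char }
        $node     = $node->{$char};          # assumes $id is present in the tree
        $new_part = keys(%$node) > 1;        # branching (or end marker plus more)
    }
    return @parts;
}

my $trie = build_trie(
    qw( LXP_01 Lenoc3_carina_A Lenoc3_duallayer_1 Lenoc3_duallayer_2 Lenoc5_carina_1 )
);
print join( '|', parts_for( $trie, 'Lenoc3_duallayer_1' ) ), "\n";
```

This prints L|enoc|3_|duallayer_|1; the first character of each piece is one of the mandatory characters.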

Sorry for the late post; I didn't have enough time to write this yesterday.

p.s.: Am I right that this is guaranteed to produce unique ids? I'm not sure, but it seems good.

update: answered my own question: Yes, the uniqueness of the shortened ids is guaranteed, no matter what set of optional characters are left out.