1: =pod
2:
3: @longest = pinyan_UCS($string)
4:
5: This function returns the set of the longest substrings
6: in a given string. It seems rather efficient, even though
7: it calls C<index()> quite a bit. I found that using a hash
8: to figure if I'd seen a character had adverse effects.
9:
10: =cut
11:
12: sub pinyan_UCS {
13: my $str = shift;
14: my $len = length $str;
15: my ($diff,$biggest) = (0,0);
16: my ($jump,@ahead,@matches);
17:
18: for (my $i = 0; $i < $len; ) {
19: my $match = [ $i, $len ];
20: if ($len - $i >= $biggest) {
21: for (my $k = $i; $k < $match->[1]; $k++) {
22: $ahead[$k] ||= index($str, substr($str,$k,1), $k+1);
23: if ($ahead[$k] != -1 and $match->[1] > $ahead[$k]) {
24: $match->[1] = $ahead[$k];
25: $jump = $k;
26: }
27: }
28:
29: $diff = $match->[1] - $match->[0];
30:
31: if ($diff > $biggest) { ($biggest,@matches) = ($diff,$match) }
32: elsif ($diff == $biggest) { push @matches, $match; }
33: }
34: else { last }
35:
36: $i = ++$jump;
37: }
38:
39: return map substr($str, $_->[0], $_->[1] - $_->[0]), @matches;
40: }
Re: Unique-Character Substring
by salvadors (Pilgrim) on Jan 20, 2001 at 19:59 UTC
|
sub tony_UCS {
local $_ = reverse shift;
my $longest = my $tohere = "";
while (my $char = chop $_) {
my $idx = index($tohere, $char);
if ($idx != -1) {
$longest = $tohere if (length $tohere > length $longest);
$tohere = substr($tohere, $idx + 1) . $char;
} else {
$tohere .= $char;
}
}
$longest = $tohere if (length $tohere > length $longest);
return $longest;
}
This is about 3 times faster on a short test string, and quickly goes up to 5 times faster for slightly longer strings...
Tony | [reply] [Watch: Dir/Any] [d/l] |
|
My turn to offer corrections. :-)
First of all index is a O(n) operation. If you build a
hash first and use that it is still logarithmic (not sure
if study does that) but you lose.
Also if there are multiple substrings of the maximum
length you return only one of them. Try the string
"hello hello" out. Jeff finds "lo he" and "o hel".
You only find "lo he".
| [reply] [Watch: Dir/Any] |
|
sub tony_UCS {
local $_ = reverse shift;
my %longest; my $longest = 0; my $tohere = "";
while (my $char = chop $_) {
my $idx = index($tohere, $char);
if ($idx != -1) {
push @{$longest{$longest = length $tohere}}, $tohere
if (length $tohere >= $longest);
$tohere = substr($tohere, $idx + 1) . $char;
} else {
$tohere .= $char;
}
}
push @{$longest{$longest = length $tohere}}, $tohere
if (length $tohere >= $longest);
return @{$longest{$longest}};
}
It's still considerably faster than the original, even if not O(n). :)
Tony | [reply] [Watch: Dir/Any] [d/l] |
|
Wow... very nice. I see exactly what you did. You reversed the domain of the index() -- instead of seeing where 'x' occurs next in the string, you see if it's already occurred in the substring.
Damn. Nice.
japhy --
Perl and Regex Hacker
| [reply] [Watch: Dir/Any] |
Re (tilly) 1: Unique-Character Substring
by tilly (Archbishop) on Jan 20, 2001 at 14:16 UTC
|
I found your description rather confusing. You are not
interested in substrings, rather in substrings that do not
find repeated characters. (I take it that UCS means Unique
Character Substrings?) For comparison here is another
approach based on recursive splits of the string rather
than repeated calls to index. I have no idea how it
compares for efficiency.
sub tilly_UCS {
my $str = shift;
# Try all cuts. Those that don't fall in 2 are repeats
foreach my $char (split //, $str) {
my @cut = split /\Q$char\E/, $str, -1;
if (2 != @cut) {
my @rejoined = map "$cut[$_-1]$char$cut[$_]", 1..$#cut;
@rejoined = sort {length $b <=> $a} @rejoined;
my @unique = tilly_UCS(shift @rejoined);
foreach my $str (@rejoined) {
if (length($str) < length($unique[0])) {
last; # Avoid useless work, cannot improve
}
my @found = tilly_UCS($str);
if (length($found[0]) > length($unique[0])) {
@unique = @found;
}
elsif (length($found[0]) == length($unique[0])) {
push @unique, @found;
}
}
return @unique;
}
}
# No repeats
return $str;
}
BTW those that don't know what the third argument to
shift did should look it up. You will avoid a commmon
misunderstanding that can lead to bugs. | [reply] [Watch: Dir/Any] [d/l] |
|
This doesn't do the same as the original. When fed "The quick brown fox jumps over the lazy dog" it returns "quick brown quick brown quick brown quick brown" rather than the "quick brown" of the original.
I also found it strange that you return @unique under one condition, and $str under another...
And I assume you meant the third argument to split rather than shift?
Tony
| [reply] [Watch: Dir/Any] |
|
Oops, on both counts. I meant split, not shift.
As for the different returns, in one case I verified that
it was a unique string and so return the original, in the
other I verified it was not and returned the substrings in
it. However I don't keep context about locations in the
original string, so I cannot tell whether a duplicate is
because I found the same string through two paths. In your
example through cutting on 'h' then 'e' I get to ' quick
brown fox jumped ov' no matter which way I split on
'h' first.
That is fixable, but not easily.
| [reply] [Watch: Dir/Any] |
Re: Unique-Character Substring
by dchetlin (Friar) on Jan 21, 2001 at 09:08 UTC
|
Well, heck, if we're going to do this re-posting in a different
medium thing, here's what I had on FWP. Without benchmarking,
I put my money on it being faster.
#!/usr/bin/perl -w
use strict;
local $" = ',';
while (<DATA>) {
chomp;
my @chars = split //;
my $index = -1;
my %seen = ();
my $answer = [""];
while (++$index < @chars) {
if (exists $seen{$chars[$index]}) {
answer_set(\%seen, $answer);
$index = $seen{$chars[$index]};
%seen = ();
} else {
$seen{$chars[$index]} = $index;
}
}
answer_set(\%seen, $answer);
print "$_: @$answer\n";
}
sub answer_set {
my $seen = shift;
goto 'JUMP_' . ((keys %$seen <=> length $_[0][0]) + 1);
JUMP_0:
return;
JUMP_2:
$_[0] = [];
JUMP_1:
push @{$_[0]},
join "", sort { $seen->{$a} <=> $seen->{$b} } keys %$seen;
return;
}
__DATA__
mississippi
abcbef
adbcbef
that
abcdefa
abcdabc
testing
this test
abcdefg
-dlc | [reply] [Watch: Dir/Any] [d/l] |
|
Without benchmarking, I put my money on it being faster.
I'd hold on to your money, if I were you. On a short string mine is about 4 times faster. On a longer string it's about 10 times faster...
Tony
| [reply] [Watch: Dir/Any] |
|
|