Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

suffix arrays

by Anonymous Monk
on Jul 18, 2003 at 20:13 UTC ( #275740=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I was recently reading an article in DDJ(April 2001) about suffix arrays. As described in the article, the procedure in C is to 1) fill an array with pointers to every position in a string, 2) sort the array, and 3) search the array for long phrases. This is an extremely fast way to find the longest duplicated substrings. The crux of the technique is to create pointers into a string, which in C is really an array, and avoids the problem of copying or moving lots of substrings in memory.

My question is, how could this be duplicated in Perl? Is there a technique for creating references to parts inside a scalar? I know there is a module for this, but it is simply a frontend to a c program. Can this be done strictly in Perl without using a pile of memory?

I took a shot at doing this, based on the DDJ code, but it is slow as bejeezus. Please Perl Monks, grant me your wisdom!!

#!/usr/bin/perl -w use strict; my $input; { local $/=undef; $input=<>; } my @arr=(0 .. (length($input)-2)); my @arr2=sort { makestr($a) cmp makestr($b) } @arr; my $maxlen=0; my $maxindex=0; for(my $i=0; $i<(length($input)-2); $i++) { my $temp=checklen($arr2[$i], $arr2[$i+1], \$input); if ($temp > $maxlen) { $maxlen=$temp; $maxindex=$i; } } print "maxlen:$maxlen index:$maxindex string:",substr($input,0,$maxlen +),"\n"; sub makestr{ return substr($input, $_[0], (length($input)-$_[0])); } sub checklen{ my $aa= $_[0]; my $bb=$_[1]; my $input=$_[2]; my $counter=0; my $len=length($$input)-1; while(($aa < $len)&& ($bb < $len)) { last unless (substr($$input, $aa++,1) eq substr($$input, $bb++,1)) +; $counter++; } return $counter; }

Replies are listed 'Best First'.
Re: suffix arrays
by Zaxo (Archbishop) on Jul 18, 2003 at 20:59 UTC

    A reference to substr's return value has the properties you want,

    my $foo = "A string to work on."; my @substrs = map {\substr $foo, $_} 0 .. length($foo) - 1; my @sorted = sort { $$a cmp $$b } @substrs;
    That the reference is to the original string is shown by:
    $ perl -e '$foo = "A short string\n"; $bar = \substr $foo, 0, 1; $$bar + = "The"; print $foo' The short string $

    After Compline,
    Zaxo

      Unfortunately, as of 5.8.0, that still doesn't work properly.

      perl> print join "\n", map{ \substr 'The quick brown fox', $_ } 0 .. +19 LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4)

      As you can see, each time you take an Lvalue from substr, it re-uses the same address. So you end up with and array of pointers to the last char in the string.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        Hmmm... In 5.8.1-RC2,

        perl -e'my $foo = "A string"; my @foo = map{\substr $foo, $_} 0..7; pr +int map {$$_,$/} @foo;' A string string string tring ring ing ng g $
        5.8.1 does more of what I mean!

        After Compline,
        Zaxo

Re: suffix arrays
by BrowserUk (Pope) on Jul 18, 2003 at 21:27 UTC

    You can get away with using the Lvalues from substr in the array as suggested by Zaxo above, but you have to jump through a few hoops to prevent the problem I alluded to.

    Basically, there seems to be a single slot allocated for all the Lvalues generated by any given statement that includes a substr. That's a clumsy way to put it, but I can't think of a better way to say it.

    To work around the limitation, it is necessary to use a 'different' substr each time, hence the need for the eval in this code. The downside is that the eval markly reduces the efficiency. The algorithm works fine though and this implementation shoudl be a bit quicker than yours.

    #! perl -slw use strict; sub ATOa{ map{ eval '\substr( $_[0], $_ )' } 0 .. length( $_[0] ); } my @Aoa = sort{ $$a cmp $$b } ATOa $ARGV[0]; #print $$_ for @Aoa; my $max = ''; for my $i ( 0 .. ($#Aoa-1) ) { my $p = 0; ++$p while substr( ${$Aoa[ $i ]}, 0, $p ) eq substr( ${$Aoa[ $i+1 ]}, 0, $p ); $max = substr( ${$Aoa[ $i ]}, 0, $p-1 ) if length $max < $p; } print "The longest common substring in\n'$ARGV[0]'\n is '$max'"; __DATA__ P:\test>test "the infernal invention was the source or eternal facinat +ion for the qualified professional and enthusiastic amateur alike" The longest common substring in 'the infernal invention was the source of eternal facination for the q +ualified professional and enthusiastic amateur alike' is 'ernal ' P:\test>test "Four and twenty virgins came down from Inverness. And wh +en they went back again there were four and twenty less" The longest common substring in 'Four and twenty virgins came down from Inverness. And when they went +back again there were four and twenty less' is 'our and twenty '

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      The downside is that the eval markly reduces the efficiency.

      So, just do one eval() . . .

      sub ATOa { my $code = '(' . join( ',', map { "\\substr(\$_[0], $_)" } 0 .. length( +$_[0] ) ) . ')'; return eval $code; }
      :-)

      -sauoq
      "My two cents aren't worth a dime.";
      

      You can improve the performance slightly by only looking for matching strings larger than the current maximum.

      my $p = length($max);

      - Cees

Re: suffix arrays
by sauoq (Abbot) on Jul 18, 2003 at 20:54 UTC

    I'm a little confused about what you are asking for. You want to search a set of strings and try to find common suffixes? Your implementation doesn't seem to do that.

    Your implementation also isn't very efficiently written. Your sort of @arr, for instance, could benefit from a Schwartzian Transform.

    Looking into my crystal ball... I have a feeling hashes will play a part in your answer.

    Update: (After some research at ddj.com...) Oh, I see. You are talking about doing substring searches using suffix trees... Frankly, I'm not sure why you would bother to do that in Perl where index() and rindex() essentially do it for you. :-)

    Update: To be fair, there are some nifty things you can do with suffix trees that aren't as easy as using index(). That doesn't necessarily mean that using suffix trees is the best way to do them in Perl though. Do you have a specific application in mind or were you just hoping to play with suffix trees in Perl?

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: suffix arrays
by jonadab (Parson) on Jul 19, 2003 at 14:45 UTC
    I took a shot at doing this, based on the DDJ code

    This will generally not work very well. You asked for wisdom, so here is some: Perl is not C. What is fast in C may be slow in Perl, and vice versa, but even more importantly, many "algorithms" designed for C are designed to take detailed control over minutia. In many cases they are not so much algorithms as implementations, or perhaps (as in this case) something of each.

    I generally have misgivings when I see people trying to do line-for-line, function-for-function adaptations from C to Perl. It's fundamentally a wrong approach for anything more complex than Hello World.

    I have even stronger misgivings about taking code that uses pointers in C and adapting it to use references in Perl. While it's true that Perl's references are conceptually analogous to C's pointers, it is NOT true that what is done with pointers in C should usually be done with references in Perl. Occasionally, yes, but not usually and certainly not always -- and probably not in this case. A lot of the things that are done in C with pointers are fiddly low-level details that in Perl are best left to the optimizer to sort out. Once you start down the path of trying to do a line-for-line translation, you will end up using hashes for structs, using references for pointers, implementing a linked-list structure with a $member{next} reference in each anonymous hash, instead of just using a list for crying out loud. This is an extreme example, but it demonstrates the point: structure-for-structure translations from C to Perl can be very suboptimal, both in terms of performance and programmer time.

    All of that is to say, perhaps you would get further by asking what is the best or fastest way to find the longest repeated substrings of a string in Perl. I personally don't know the answer to this question, but someone on Perlmonks might, and you're more likely to get a satisfactory answer to that one than the question that you posted, IMO.


    Quidquid latine dictum sit altum viditur.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://275740]
Approved by Enlil
Front-paged by Limbic~Region
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2020-11-24 09:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?