Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Common Substrings

by Anonymous Monk
on Nov 15, 2005 at 10:14 UTC ( #508516=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

The following returns the longest substring that is common to two given strings and starts at the beginning of each, and the remaining characters in each of the arguments.

sub common { if ( $_[0] eq $_[1] ) { return( $_[0], "", "" ); } else { use bytes; my ( $pos, $len0, $len1 ) = ( 0, length($_[0]), length($_[1]) ); # find the offset of the first byte that differs while ( vec($_[0], $pos, 8) == vec($_[1], $pos, 8) ) { $pos++; last unless ( $pos < $len0 && $pos < $len1 ); } return( unpack("a${pos}", $_[0]), unpack("x${pos}a*", $_[0]), unpack("x${pos}a*", $_[1]), ); } }

Obviously, this looks more like C than Perl. Is there any other way to achieve these (and similar) results without using bytes that is efficient?

Thanks!

Comment on Common Substrings
Download Code
Re: Common Substrings
by Corion (Pope) on Nov 15, 2005 at 10:23 UTC

    How about

    sub common { if ( $_[0] eq $_[1] ) { return( $_[0], "", "" ); } else { use bytes; my $x = $_[0] ^ $_[1]; $x =~ m!^(\x00*)!; my $pos = length $1; my ( $len0, $len1 ) = ( length($_[0])-$pos, length($_[1])-$pos ) +; return( unpack("a${pos}", $_[0]), unpack("x${pos}a*", $_[0]), unpack("x${pos}a*", $_[1]), ); } }

    That way you scan the two strings, but you do it in C instead of doing it in Perl.

      Nice use of xor! And should be better than my loop.

      Still, my concern is breaking multibyte characters.

Re: Common Substrings
by salva (Monsignor) on Nov 15, 2005 at 10:27 UTC
    you should use substr instead of vec to extract characters from a string:
    sub common { my ($a, $b)=@_; return ($a, '', '') if $a eq $b; for (my $i=0;;$i++) { my $c=substr($a, $i, 1); my $d=substr($b, $i, 1); return (substr($a, 0, $i), $c, $d) if $c ne $d; } }
Re: Common Substrings
by salva (Monsignor) on Nov 15, 2005 at 10:48 UTC
    and this may be the most efficient way to do it in perl:
    sub common { my ($a, $b) = @_; my $min = 0; my $max = length $a < length $b ? length $a : length $b; while ($min < $max) { my $h = ($max - $min + 1) >> 1; my $c = substr($a, $min, $h); my $d = substr($b, $min, $h); if ($c eq $d) { $min += $h; } else { $max = $min + $h - 1; } } return (substr($a, 0, $min), substr($a, $min), substr($b, $min)); }

      Thanks!

      It took me a bit to understand that you were using a bisection method. But I agree that it should be reasonably efficient. I just have a tendency to avoid substr() since it always feels like a very expensive call.

      The strings that I am spliting are paths, and because my sub then looks for the place of a slash, which should be 8 bits, I think that it is safe for me to use a binary method.

        I just have a tendency to avoid substr() since it always feels like a very expensive call.

        substr is actually quite cheap as it doesn't copy the char data from the the original string, it just makes an alias to it.

        The strings that I am spliting are paths, and because my sub then looks for the place of a slash, which should be 8 bits

        I don't think so, on UTF8 strings byte offsets and char offsets can be different!

Re: Common Substrings
by blazar (Canon) on Nov 15, 2005 at 11:46 UTC
    Well, here's a regexp based approach. Of course if you know in advance that a certain charachter or substring won't ever be in your data, you can avoid the $sep machinery:
    sub common { my $sep=''; $sep |= $_ for map /:+/g, @_; $sep="<$sep:>"; local $_=join $sep, @_; /^(.*)(.*)$sep\1(.*)/; }

    Update: I hadn't noticed that you were focusing on efficiency. And from that POV I'm quite confident that this solution won't score very well. But as far as your concern about your code looking like C goes, this is certainly more perlish. Incidentally I would have used myself an approach like Corion's one but probably matching on /[^\0]/, using pos and substr instead of unpack.

Re: Common Substrings
by jonadab (Parson) on Nov 15, 2005 at 12:32 UTC

    The perlish way is to just use LCSS; and let somebody else worry about how it works exactly. As far as the "without using bytes", why does that matter? Is this some kind of contrived CS undergrad assignment or something? Whether the solution uses bytes or not is an irrelevant detail of the implementation, something you would not, in any normal situation, ever care about.

    Then you qualify your question with "that is efficient", but only a C programmer would ask that question without concrete evidence that the working implementation creates performance problems. Premature optimization is a root of all kinds of evil (for which some have strayed... and pierced themselves through with many sorrows). Get it working, get it working right, and only then worry about efficiency if it becomes a problem.

Re: Common Substrings
by Roy Johnson (Monsignor) on Nov 15, 2005 at 13:51 UTC
    Update: I basically duplicated blazar's code. Deleted to save your time.

    Caution: Contents may have been coded under pressure.
Re: Common Substrings
by Moron (Curate) on Nov 15, 2005 at 14:04 UTC
    A.k.a. "how to handle a non-nested concurrent loop"...
    sub common{ my @x = split( "", shift()); my $result = ''; my $x = shift @x; for my $y ( split( "", shift()) ) { ( $x eq $y ) or last; $result .= $y; $x = shift @x or last; } return $result; }

    -M

    Free your mind

Re: Common Substrings
by Anonymous Monk on Nov 15, 2005 at 15:09 UTC

    After some benchmarking...

    I will use the xor approach (fast!). I stand corrected about substr(), it is much faster than unpack, so I will also use it. Last but not least, using pos() will help as well (even if the tests were ambiguous between length() and pos(), to my mind the code gains in expressiveness).

    Thanks to you all!

    For completeness, I post my sub below. It is the backend of another that creates relative paths for a fixlinks util.

    After reading the chapter on unicode of the camel book, I conclude that this sub will do the rigth thing whatever locale, since Perl strings are either latin1 or utf8 encoded, and these are both ascii transparent, the requirement to look for a slash after deciding $pos does the trick. However, it is a shame that it cannot be generalized to a common substring function like the one I presented above.

    sub SLASH() { 47 } sub _common_path { if ( $_[0] eq $_[1] ) { return( $_[0], # all path components are common "", # nothing remains of first "", # nothing remains of second ); } else { use bytes; my ( $len0, $len1 ) = ( length($_[0]), length($_[1]) ); # find the offset of the first byte that differs my $pos = $_[0] ^ $_[1]; $pos =~ m/[^\x00]/g; $pos = pos($pos) - 1; # if some bytes are common but the last one wasn't the separator # we must decide which path components are common if ( $pos > 0 && vec($_[0], ($pos - 1), 8) != SLASH ) { # check if first path is just longer than the second if ( $pos == $len1 && vec($_[0], $pos, 8) == SLASH ) { $pos++; return( substr($_[0], 0, $pos), # common path with slash substr($_[0], $pos), # extra in first "", # nothing remains of second ); } # check if second path is just longer than the first if ( $pos == $len0 && vec($_[1], $pos, 8) == SLASH ) { $pos++; return( substr($_[1], 0, $pos), # common path with slash "", # nothing remains of first substr($_[1], $pos), # extra in second ); } # otherwise, rewind until last common path component while ( $pos > 0 ) { $pos--; if ( vec($_[0], $pos, 8) == SLASH ) { $pos++; # and keep the common slash last; } } } return( substr($_[0], 0, $pos), # common path components (with slash) substr($_[0], $pos), # extra in first substr($_[1], $pos), # extra in second ); } }
      The scalars returned by substr inside a block where use bytes holds, never have the utf8 flag set. For instance:
      $ perl -de 1 ... DB<43> $a="\x{1234}/foo" DB<44> x ord substr $a, 0, 1 0 4660 DB<45> sub bsubstr { use bytes; substr $_[0], $_[1], $_[2] } DB<46> x ord bsubstr $a, 0, 1 0 225 DB<47> x ord bsubstr $a, 1, 1 0 136 DB<48> x ord bsubstr $a, 2, 1 0 180 DB<49> x ord bsubstr $a, 3, 1 0 47

        Does that mean that I simply need no bytes; before returning the values?

Re: Common Substrings
by murugu (Curate) on Nov 15, 2005 at 15:15 UTC

    Hi,

    Here is a brute force try....

    sub common { my ($a,$b)=@_; my ($long,$short); if (length ($a) <= length ($b)){ ($long,$short) = ($b,$a); } else { ($long,$short) = ($a,$b); } my $i=0; my $common=""; while ($i<length($short)){ my $str = substr($short,$i,1); my $t=$i; $i++; next unless ($long=~/$str/); while ($long=~/$str/) { $t++; last unless (my $uni = substr($short,$t,1)); last unless ($long=~m/$str$uni/); $str.=$uni; } $common = $str if (length($common)<length($str)); } return $common; }

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

Re: Common Substrings
by robin (Chaplain) on Nov 15, 2005 at 21:15 UTC
    This code;
    sub common_path { my ($p, $q) = ("$_[0]/", "$_[1]/"); (my $pos = ($p ^ $q)) =~ /[^\0]/g; my $i = 1 + rindex($p, "/", pos($pos) - 2); return ( substr($p, 0, $i), substr($p, $i, -1), substr($q, $i, -1) ); }
    behaves the same as the last one you posted. It's more efficient as well as simpler (according to my benchmarks).

    I don't think there's any good reason to use bytes for this. Were you doing it for efficiency reasons?

      Thank you, that is what I was looking for!

      It only needs a test for equality at the beginning to behave the same. Otherwise there is no match and pos() will return undef, which will generate the relevant warning and a zero $i. (A test on the match could also resolve this.)

      Plus, doing it this way makes a separate routine to find the common path redundant for the purposes of generating relative paths, as there are no byte/character issues.

      In fact, the use bytes was only needed to determine $len0 and $len1, and should have been enclosed: do{use bytes;length $string;} is recommended in the camel book.

      Some comments on what I meant by efficient...

      Needless to say that my program will spend most of its life busy with filesystem interaction, and there isn't much that can be done in this area. In fact, any improvement on the relative path generation may not even be noticed when doing a recursive: fixlinks /

      My first implementation used split() in a similar way to the answer given above by Moron. But I wasn't happy with the need to change the structure of the arguments data; i.e. if I receive scalars and compute another scalar, there must be a very good reason to transform the arguments into lists!

      In my opinion this is a reasonable concern, that is only about taking care in the solution to a given problem. (Maybe I should state that I have never been a C programmer.) Anyway, this got me thinking about the way coreutils/lib/canonicalize.c does things, and took me to a use bytes solution.

      If I wasn't under the wrong impression that substr() was expensive, I could have written a solution like the method by salva, and after time might have remembered bisection. (The irony being that one is generating a list of numbers that converge to the desired position.)

      Still, the xor approach by Corion is the best. As you say there is no good reason to use bytes, it is just a case (for me, very instructional) when a binary operation helps processing strings.

      Regards to All!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://508516]
Approved by Corion
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (8)
As of 2014-08-22 11:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (156 votes), past polls