Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Selecting the difference between two strings

by qazwart (Scribe)
on Sep 26, 2006 at 21:01 UTC ( [id://575022]=perlquestion: print w/replies, xml ) Need Help??

qazwart has asked for the wisdom of the Perl Monks concerning the following question:

I have two strings:

    //depot/Efp/Celox/CELOX-3.5.0/CeloxStart.sln
    //depot/Efp/Celox/MAIN/CeloxStart.sln
What I want to do is remove the part from both the beginning and end of this these two strings that match, and only leave the part that doesn't match.

In other words, I want to remove <font//depot/Efp/Celox from the front of the two strings and /CeloxStart.sln from the rear of these two strings. Leaving string #1 as CELOX-3.5.0 and the second string as MAIN Now, I know I can parse the strings one character at a time, but I was hoping for a method of mashing the two strings together in one or two steps -- maybe using regular expressions or maybe ANDing or ORing the two strings together, but how?

Any ideas of an efficient way of doing this? I have thousands of these records to process, and I want to do it as efficiently as possible.

  • Comment on Selecting the difference between two strings

Replies are listed 'Best First'.
Re: Selecting the difference between two strings
by GrandFather (Saint) on Sep 26, 2006 at 22:26 UTC

    Not And or Or, but Xor (^):

    use strict; use warnings; my $str1 = '//depot/Efp/Celox/CELOX-3.5.0/CeloxStart.sln'; my $str2 = '//depot/Efp/Celox/MAIN/CeloxStart.sln'; my $str1r = $str1; my $str2r = $str2; # Find and strip common prefix my $match = ($str1 ^ $str2) =~ /^(\x00*)/; substr $str1r, 0, $+[1], '' if $match; substr $str2r, 0, $+[1], '' if $match; # Find and strip common suffix $match = (reverse ($str1) ^ reverse ($str2)) =~ /^(\x00*)/; substr $str1r, -$+[1], length ($str1r), '' if $match; substr $str2r, -$+[1], length ($str2r), '' if $match; print "$str1r\n$str2r\n";

    Prints:

    CELOX-3.5.0 MAIN

    Update: process both strings


    DWIM is Perl's answer to Gödel
      There we go!

      I figured out that it was XOR on my way home, but I wasn't quite sure how to apply it. Plus, I forgot about the reverse function which is what I need in order to strip the end of the string. Instead, I was going to pad left the strings to make them the same length.

      Thank you for your answer. I'll try it out when I get to work tomorrow.

Re: Selecting the difference between two strings
by hgolden (Pilgrim) on Sep 26, 2006 at 21:06 UTC
    Hey

    I'm sure that someone will come up with a better way of doing this, but my first thought is to split those strings by "/" into arrays, and then use List::Compare to find the intersections, which you can then easily remove.

    Hays

    Update: Or much better, split then use Algorithim::Diff.

    Update question inspired by the response below: Are the comparisons just pairwise, or are you doing many at a time? The array solution above works for an arbitrary number, but if the comparisons are just two at a time, the below code is much slicker.

Re: Selecting the difference between two strings
by duff (Parson) on Sep 26, 2006 at 21:17 UTC

    Here's the first thing I thought of:

    #!/usr/bin/perl my $s1 = "//depot/Efp/Celox/CELOX-3.5.0/CeloxStart.sln"; my $s2 = "//depot/Efp/Celox/MAIN/CeloxStart.sln"; @s1 = split "/", $s1; @s2 = split "/", $s2; my $i = 0; $i++ while $s1[$i] eq $s2[$i]; # find end of common begin +ning my $j = -1; $j-- while $s1[$j] eq $s2[$j]; # find beginning of common + end print join("/", @s1[$i..(@s1+$j)]), "\n"; print join("/", @s2[$i..(@s2+$j)]), "\n";
    There are probably better ways :-)

    update: Though I have to wonder why you're worrying about efficiency. You only have to do this for "thousands of records". Is that really "thousands of records per second" or something where efficienct execution really matters? Or are you just prematurely optimizing?

      This problem has to do with the Perforce version control system, and finding the branching structure in the source archive. I am parsing the set of file merges and branches and finding the branching structure of the source archive. There could easily be tens of thousands or hundreds of thousands of these records that have to be parsed.

      The truth is I can take a few shortcuts to make this task easier. For example, it is common practice in Perforce to put branch names in all caps, so I could really search for the first instance of m</[A-Z_]+/> and find what I'm looking for. Or, I can simply assume that the fourth set of characters between the slashes is the name of the branch because that's where it is in my particular source archive.

      However, I look at this and say These are strings, I want to manipulate them, and that's what you use Perl for. There's gotta be a way to simply say "Take these two strings and snip the matching parts". Maybe not quite that simple, but you get the idea.

      I'm really just looking for a Perl solution to this problem. Something that will make Python programmers weep and shell scripters green with envy.

        I don't know much about Perforce, but from what you've said, there are probably easier ways to do this than what you're asking for.

        If all of the information is of the format that you show in your original post, you could pretty easily form an indented list showing the branches.

        For simplicity, let's call the text between slashes units. Now, sort the list ASCIbetically, and do a very simple loop that preserves the entry above. By comparing the current entry to the above entry, you could determine the number of tabs to print before the current entry. I.E if they have no units in common, we would print no spaces. If they have three units in common, but are both longer than three units, then we insert three tabs. If they have three units in common, but the above entry is only three units long, then we need four tabs.

        Anyway, you can fit the particulars to what you want. To me, a Perl solution is one that's elegant in Perl, and not one that takes advantage of Perl's string abilities.

        Hays

        I am parsing the set of file merges and branches and finding the branching structure of the source archive. There could easily be tens of thousands or hundreds of thousands of these records that have to be parsed.

        Well, how complicated is the branching structure, really? Is it something as simple as this?

        - branch1 - /- branch2 -\ /-- branch3 --\ *** ---- branch4 ----[same sub-structure for all branches] \-- branch5 --/ \- branch6 -/ \ ... /
        Or is it more complicated? Do some branches only contain one or another subset of the overall structure (branch FOO contains everything, but branch BAR only contains component X)? Or are there additional versioning branches at lower levels along some paths (branch FOO contains components X, Y and Z, but there are two versions of Y within FOO)?

        If it's really as simple as my crude diagram, you only need to look for common strings from one direction (left-to-right, or "top-down"). If it's not that simple, you need to be a little more explicit about what you want to derive from the overall structure. What sort of representation for the (more complicated) branching structure would be desirable?

        (Not in the sense of "tell us what you want so we can do it for you", but rather "make sure you really know what you want so you work on solving the right problems".)

Re: Selecting the difference between two strings
by jwkrahn (Abbot) on Sep 27, 2006 at 03:26 UTC
    $ perl -le' my $string1 = q[//depot/Efp/Celox/CELOX-3.5.0/CeloxStart.sln]; my $string2 = q[//depot/Efp/Celox/MAIN/CeloxStart.sln]; my $pre = length( ( ( $string1 ^ $string2 ) =~ /^(\0*)/ )[ 0 ] ); my $post = length( ( ( reverse( $string1 ) ^ reverse( $string2 ) ) =~ +/^(\0*)/ )[ 0 ] ); + print for substr( $string1, $pre, -$post ), substr( $string2, $pre, -$ +post ) ' CELOX-3.5.0 MAIN
      Thanks for the reply. This is what I was looking for.

      I have to XOR the strings. This puts it all together. I'll take a closer look tomorrow at work.

Re: Selecting the difference between two strings
by Fletch (Bishop) on Sep 27, 2006 at 01:59 UTC

    I think "longest common substring" in supersearch may turn up some useful pointers.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://575022]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2025-06-18 17:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.