Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

searching for strings

by steph_bow (Pilgrim)
on Aug 06, 2007 at 09:12 UTC ( #630779=perlquestion: print w/ replies, xml ) Need Help??
steph_bow has asked for the wisdom of the Perl Monks concerning the following question:


Dear Monks,

I am a looking for strings which are similar.
The differences between these strings concern only the last character or the last number.

Furthermore, the differences are + 1 / - 1 for the number

And the differences are + 1 / -1 for the place of the letters in the alphabet
What are the tools to use ?
For example, here is my start file
AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50

Here is the output file I would like :
AAA30;AAA31 BBC5;BBC6 SHT12H;SHT12G DAL33B;DAL33A BBC49;BBC50 AAA31;AAA30 DAL33A;DAL33B BBC6;BBC5 SHT12G;SHT12H BBC50;BBC49

Update :

1) Thanks a lot for your answers, that's very nice of you

2) for the names whose last characters are numbers, for example, AAA30, I would like to look for the AAA29 and the AAA31, and for the NB6, I would like to search for NB5 and NB7.
There are only at maximum two numbers that may change

3)for the names whose last characters are letters, they end with one or two letters

4)if there is only one letter at the end, then I look for the letters next to the one in the order of alphabet for example, if it is AU33H, I would like to look for AU33G and AU33I

5)if there are two letters at the end, then it is only the first one that will change and it is also a letter that will be very close in the order of alphabet for ex : DLH4KF : I would like to look for DLH4JF and DLH4LF

6)there are no negative numbers

7)when a string ends by Z or 99, there is no need to look AA, or 100, ...

8)the length of the string of the couple looked for must be the same (it excludes cases such as the couple ABF99 / ABF100)

Comment on searching for strings
Select or Download Code
Re: searching for strings
by moritz (Cardinal) on Aug 06, 2007 at 10:12 UTC
    If the prefix (ie everything except the last character) is uniq, you could strip off the last character, and use the remaning string as a hash key.

    In the hash you store a list of matching strings.

    When you have created the hash, you should iterate through it and check if the +1/-1 condition holds.

Re: searching for strings
by shmem (Canon) on Aug 06, 2007 at 10:45 UTC
    What are the tools to use ?
    substr, ord, chr (or pack and unpack), hashes (see perldata), pre/post in/de-crement (for strings, only increment works), integer/string comparison (== and eq) (see perlop).

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: searching for strings
by BrowserUk (Pope) on Aug 06, 2007 at 11:01 UTC

    Definitely one of those things that appears on the surface to be trivial, but turns out to be much harder:

    Here's one way

    #! perl -slw use strict; use Data::Dump qw[ pp ]; my %hash; while( <DATA> ) { chomp; my( $pre, $temp ) = m[(^.+?)(\d+|\D)$]; $temp = $pre . ++$temp; if( exists $hash{ $_ } ) { push @{ $hash{ $_ } }, $_; } elsif( exists $hash{ $temp } ) { push @{ $hash { $temp } }, $_; } else { $hash{ $temp } = [ $_ ]; $hash{ $_ } = [ $_ ]; } } print join ';', @{ $_ } for grep{ @{ $_ } == 2 } values %hash; __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50

    Output:

    C:\test>junk DAL33B;DAL33A AAA30;AAA31 BBC49;BBC50 BBC5;BBC6 SHT12H;SHT12G

    If you need the doubled-up output, just print them twice with the elements reversed.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I was trying to think of more edge cases and I think I have found a few steph_bow needs to weigh in on.

      For instance, what if a string is ABC0? Should ABC1 and ABC-1 be checked?

      Also, what if the string is XYZ? Should XYY and XYAA be checked?

      Is the alphabet circular - IOW, should AAA look at AAZ and AAB?

      Cheers - L~R

      Corrected XYZ -> XYAA rather than XYZA
        I have found a few steph_bow needs to weigh in on.

        Agreed. That said, the test set provided is amazingly complete given how concise it is.

        There is nothing accidental about either the choice of sample, or its ordering. A lot of thought went into its provision.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: searching for strings
by RMGir (Prior) on Aug 06, 2007 at 11:02 UTC
    Your problem _sounds_ complicated, but it's pretty simple, really.

    First off, the 2 matching strings have to be the same length, so you can use the length function to check that.

    Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match.

    Finally, you can use split to split your strings using a sequence of digits as the delimiter, like my @fields=split /(\d+)/,$str; Using the /(\d+)/, the delimiter will get captured too, and now you can make sure that the string fields are the same, and the number field differs only by one.

    Once you've got all that bundled into a comparison function, all you have to do is scan the list for matches for each string, and pair them up...

    I think you can get the rest of the way to your answer from here...


    Mike
      Next, if you use substr to look at the last character of 2 strings, the first n-1 characters have to be equal and the last characters have to differ by 1 when compared using ord, or the strings aren't a match.

      That doesn't work for 'BBC49' and 'BBC50'.

      Also, the process you are describing is O(N2). Feasible, but impractical for large lists.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      RMGir,
      First off, the 2 matching strings have to be the same length, so you can use the length function to check that.

      I think the problem is more complicated than you think. As BrowserUk points out - the OP has indicated 'BBC49' and 'BBC50' is a match. Even if you corrected for that, I think your first discriminating factor would be wrong. Consider ABC99 and ABC100.

      Cheers - L~R

Re: searching for strings
by Anonymous Monk on Aug 06, 2007 at 11:10 UTC
    You can sort the input and then compare each string with its neighbors instead of using a hash.
      That would not work as ABC9 and ABC10 would not necessarily be next to each other.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: searching for strings
by oha (Friar) on Aug 06, 2007 at 11:30 UTC
    you can use the magic of ++:
    my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my $last = ""; foreach my $v ( sort {$a cmp $b} @d) { my $l1 = $last; my $l2 = chop $l1; print "$last,$v\n" if $v eq $l1.++$l2; $last = $v; }

      Tried that one also. It doesn't work either.

      C:\test>perl my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my $last = ""; foreach my $v ( sort {$a cmp $b} @d) { my $l1 = $last; my $l2 = chop $l1; print "$last,$v\n" if $v eq $l1.++$l2; $last = $v; } ^Z AAA30,AAA31 DAL33A,DAL33B SHT12G,SHT12H

      BBC5;BBC6 and BBC49;BBC50 are missing. Working out why is interesting.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        right sorry! cauz i can't expect the sorting of BBC5 and BBC6 without BBC50 beetween.
        the following works:
        my @d = qw(AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC +50); my %seen; map { print "$seen{$_},$_\n" if $seen{$_}; } map { my $v = $_; my $l = chop $v; $seen{$v.++$l}=$_; } @d;
        and nicer, imho :)

        Oha

Re: searching for strings
by shoness (Friar) on Aug 06, 2007 at 12:06 UTC
    What do you do when the pattern you search for is at a boundary, taking either increment or decrement out of range? In your example above, you match near 'AAA30'. Do you only look for 'AAA30' and 'AAA31'? Do you also look for 'AAA29'? If so, the hash-key idea won't work, because you're not only changing the last character. You could do something like this for each of pattern:
    my ($pat, $char) = $pattern =~ m/(.*)(.)$/; my $char_up = $char; my $char_down = $char; ++$char_up unless ($char =~ m/[9Z]/i); --$char_down unless ($char =~ m/[0A]/i); my $search = qr/$pat$char|$pat$char_up|$pat$char_down/; print $search;
    How many patterns do you have anyway? How many lines do you search?
      the -- decrement operator is not magical for non-digits.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Sorry! How strange of Perl to be asymmetric! Why is "--" not as magical as "++"? This will work instead...
        $char_up =~ tr/A-Z0-9/B-ZZ1-99/; $char_down =~ tr/A-Z0-9/AA-Y00-8/;
Re: searching for strings
by Anno (Deacon) on Aug 06, 2007 at 12:51 UTC
    This is a rather complex task. What have you tried so far?

    The basic tools would be

  • a pattern that splits a line into a head (AAA BBC DAL33) and a tail (the final number or single character used for pairing). This could be used: my( $head, $tail) = /^(.+?)(\d+|[[:alpha:]])$/; though your description allows for alternatives.
  • One or more hashes to organize the data in a way that the pairings can easily be extracted.
  • Here is one way to do it:

    my (%num, %alpha); while ( <DATA> ) { chomp; my ( $head, $tail) = /^(.+?)(\d+|[[:alpha:]])$/; if ( $tail =~ /^\d+$/ ) { $num{ $head}->{ $tail} = $_; } else { $alpha{ $head}->{ ord $tail} = $_; } } my @pairs = ( map( extract_pairs( $_) => values %num), map( extract_pairs( $_) => values %alpha), ); print "$_\n" for @pairs; exit; sub extract_pairs { my $h = shift; my @pairs; for ( keys %$h ) { for my $partner ( $_ - 1, $_ + 1 ) { if ( exists $h->{ $partner} ) { push @pairs, "$h->{ $_};$h->{ $partner}"; delete @$h{ $_, $partner}; } } } @pairs; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50
    This code will ignore singletons for which a pairing partner cannot be found. If there are more than two consecutive pairs for a single head, say if you had BBC7 besides BBC5 and BBC6 the bevavior is undefined. It will pick some pair(s) and may ignore others.

    Anno

Re: searching for strings
by Zielony (Acolyte) on Aug 06, 2007 at 13:41 UTC
    #!/usr/bin/perl use strict; sub simstr { $_ = shift; if (/([a-z])$/i) { my $ch = $1; $ch++; $ch =~ s/.*(.)$/$1/; substr ($_, -1) = $ch; } elsif (/\d+$/) { $_++; } $_; } for (<DATA>) { chomp; print "$_;" . simstr ($_) . "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50
      Consider these two data elements: CBG99 and CBG100. When they are placed into your code, the match is CBG99;CBH00 Very fun problem here! :-) -P
        #!/usr/bin/perl use strict; sub simstr { $_ = shift; /([a-z]|\d+)$/i; my $ch = $1; $ch++; $ch =~ s/.*(.)$/$1/ if ($ch =~ /[a-z]/i); substr ($_, -(length $1)) = $ch; $_; } for (<DATA>) { chomp; print "$_;" . simstr ($_) . "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 CBG99 WXYZ
        Now it works and isn't even a bit funny. ;-)
Re: searching for strings
by ww (Bishop) on Aug 06, 2007 at 15:34 UTC

    The code below is probably inelegant and perhaps even wrongheaded
    but I'm posting, in part, TIMTOWTDI; in part, to illustrate some of the well-taken issues with the likes of AAA29 vs AAA30 (above), and in part, because even though this is NOT a solution, $work calls:

    #!usr/bin/perl use strict; use warnings; use vars qw (@fields $field $seen); while (<DATA>) { chomp $_; push @fields, $_; } for $field(@fields) { $seen = $field; our $base_seen = substr($seen, 0, -1); # all but last char of $se +en our $root = chop($seen); # get last char of $seen + test1($root, $base_seen); } #####subs sub test1 { # Excluding the last char, test whether the f +ields match for $field(@fields) { our $base_field = substr($field, 0, -1); # all but last char of +$field if ($main::base_seen eq $base_field) { test2($main::root, $main::base_seen, $field); } } } sub test2 { # bases matched, now test last char for a matc +h +/- 1 my $root = shift(@_); my $t1 = (shift(@_) . $root); my $test1 = shift(@_); if ( $t1 eq $test1 ) { # Returning, t1 is identical to test1 (haven't figured out how t +o skip this comparison entirely) return; } else { # print "\nIn else of test2, ready to cmp: $t1, $test1\n"; my $rev_last_t1 = reverse($t1); # reverse, for e +ase of m// my $rev_last_test1 = reverse($test1); #test "last" char which is now the initial char in $rev_last_t1 +via reverse my $last_rlt1 = ord($rev_last_t1 =~ /(.)/); # numify alphas + for +/- below my $last_test1 = ord($rev_last_test1 =~ /(.)/); if ( $last_rlt1 == ( ($last_test1++) || ($last_test1--) ) ) { print "same base and adjacent terminal chars: $t1;$test1\n"; } } } # end # one value added to OP's data: __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 AAA37 AAA29

    OUTPUT:

    same base and adjacent terminal chars: AAA30;AAA31 same base and adjacent terminal chars: AAA30;AAA37 # False positive same base and adjacent terminal chars: BBC5;BBC6 same base and adjacent terminal chars: SHT12H;SHT12G same base and adjacent terminal chars: DAL33B;DAL33A same base and adjacent terminal chars: AAA31;AAA30 same base and adjacent terminal chars: AAA31;AAA37 # False positive same base and adjacent terminal chars: DAL33A;DAL33B same base and adjacent terminal chars: BBC6;BBC5 same base and adjacent terminal chars: SHT12G;SHT12H same base and adjacent terminal chars: AAA37;AAA30 # False positive same base and adjacent terminal chars: AAA37;AAA31 # False positive

    NOTE1: Fails on AAA29 vs AAA30 and on BBC49 vs BBC50 for (the intended?) reading of OP's definition; if OP had specified "digit" instead of "number" this would not be a failure, but, as is, makes the problem challenging
    NOTE2: Fails on AAA37 vs (AAA30 or AAA31) by any reading; this is a glitch in the code /me offers above.
    NOTE3: Use any one of several mechanisms, including hashes, to eliminate dupes such as "AAA30;AAA31" and "AAA31;AAA30")

Re: searching for strings
by CountZero (Bishop) on Aug 06, 2007 at 15:45 UTC
    As a partial answer to your problem, here are two subroutines which give you the previous and next values:
    use strict; my $plus; my $minus; while (<DATA>) { chomp; my $next = next_seq($_); my $previous = previous_seq($_); print "$_, $next, $previous\n"; } sub next_seq { my $pattern = shift; chomp $pattern; if ($pattern =~ m/(.*?)(\d+)$/) { return $1 . ($2 + 1); } else { $pattern =~ m/(.*)(.)/; return $1 . chr(ord($2) + 1); } } sub previous_seq { my $pattern = shift; chomp $pattern; if ($pattern =~ m/(.*?)(\d+)$/) { return $1 . ($2 - 1); } else { $pattern =~ m/(.*)(.)/; return $1 . chr(ord($2) - 1); } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50 ABCZ
    Warning: I'm not sure it will work with unicode-strings and anyhow it will only take into account the last character (unless the pattern ends in a one or more digits). So @ precedes A and [ follows Z.

    Now all you have to do is go through the list and search for items which match the next or previous values.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: searching for strings
by GrandFather (Cardinal) on Aug 06, 2007 at 19:01 UTC

    Yet another way to do it:

    use strict; use warnings; my %data; my @reverse; while (<DATA>) { chomp; my ($prefix, $suffix) = /^([A-Z0-9]+?)(\d+|.)$/; if (exists $data{$prefix}) { my $mapFunc = $suffix =~ /[A-Z]/ ? sub {ord $_[0]} : sub {$_[0 +]}; for (@{$data{$prefix}}) { next if abs ($mapFunc->($suffix) - $mapFunc->($_)) != 1; print "$prefix$_;$prefix$suffix\n"; push @reverse, "$prefix$suffix;$prefix$_\n"; } } push @{$data{$prefix}}, $suffix; } print @reverse; __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 DAL33A BBC6 SHT12G BBC50

    Prints:

    AAA30;AAA31 DAL33B;DAL33A BBC5;BBC6 SHT12H;SHT12G BBC49;BBC50 AAA31;AAA30 DAL33A;DAL33B BBC6;BBC5 SHT12G;SHT12H BBC50;BBC49

    DWIM is Perl's answer to Gödel
Re: searching for strings
by wind (Priest) on Aug 06, 2007 at 20:56 UTC
    There are two things that make this much easier. One, is to iterate over the data twice, the first time to set up your relationships, the second time to report their existance. And two, rely solely on incrementing +1, as it is the easier of the relationships to calculate because of the special magic of ++ on strings.

    You still must define the characteristics of your boundary conditions though. Ideally, we want the increment and decrement relationships to be one to one. What is the increment of "Bar999"? If it is "Bar1000", then your relationship is no longer 1 to 1 as there is also "Bar0999".

    Anyway, here is an implementation without the boundary conditions fully defined. I've added two data entries for lines that do not match:
    use Fcntl qw(SEEK_SET); use strict; # Calculate Relationships. # - Rely on increment, as it's the easier of the two to calculate. my %decrement; my %increment; # Synonymous with existance. my $start_of_DATA = tell DATA; while (<DATA>) { chomp; my $item = $_; # Note concerning parsing # - This regex requires that a prefix exist. if ($item =~ m{ (\w+) ( (?<!\d)\d+ | (?<![A-Z])[A-Z]+ ) \z}x) { my ($prefix, $suffix) = ($1, $2); # Note: this is the primary spot where there might be changes +in rules. # - What happens when the character ends in 'Z'? Currently # That would translate to 'AA'. # - What happens when number is 999? Currently that would tra +nslate # to '1000'. # Fix the rules here, and everything else will translate. (my $suffix_next = $suffix)++; my $item_next = $prefix . $suffix_next; $increment{$item} = $item_next; $decrement{$item_next} = $item; } else { die "Invalid data: $_"; } } # Reparse DATA seek(DATA, $start_of_DATA, SEEK_SET); while (<DATA>) { chomp; print; if ($increment{ $decrement{ $_ } }) { print ";$decrement{$_}"; } elsif ($increment{ $increment{ $_ } }) { print ";$increment{$_}"; } print "\n"; } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50
    And the output is:
    >perl scratch.pl AAA30;AAA31 BBC5;BBC6 SHT12H;SHT12G DAL33B;DAL33A BBC49;BBC50 AAA31;AAA30 BBC8 BBC3 DAL33A;DAL33B BBC6;BBC5 SHT12G;SHT12H BBC50;BBC49
    - Miller

      steph_bow asked "could you tell me what is the signification of (?<!\d) in your code ?"

      The purpose of the negative look behind assertions in my regular expression was to ensure that there was a key included in the data value. This was not strictly part of your stated requirement, but it seemed implied and therefore should be enforced.

      If this didn't matter, than we could simply rely on non-greedy matching in order to separate the key from the suffix.

      while (<DATA>) { chomp; if (m{\A (.*?) (\d+ | [A-Z]+) \z}x) { printf "Key = %-6s - Suffix = %s\n", $1, $2; } else { die "Invalid data: $_"; } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50

      However, what about the case of a value of '123456' or 'ABCDEF'? Currently those two values would validate to a key of the empty string "", and a suffix of the entire string. To avoid this, we start by changing the key matching from * (0 or more) to + (1 or more). However, this will just eat up 1 character of our suffix. We therefore add negative look behind assertions to the suffix matching in order to ensure that the suffix is matched on a boundary.

      This might have been a little obsessive, but it's always a good idea to validate your data so that you are absolutely sure that the rules of your logic are being followed.

      - Miller

Re: searching for strings
by injunjoel (Priest) on Aug 07, 2007 at 19:35 UTC
    Late as usual I see...
    Here is my suggestion
    #!/usr/bin/perl use strict; { local %_; while(<DATA>){ chomp; if(/(\d+|[A-Z])$/){ $_{substr($_,0,(length($_)-length($1)))}->{$1} = $_; } } for(sort keys %_){ if(scalar(keys %{$_{$_}}) % 2 == 0){ local @_ = sort{$a <=> $b || ord($a) <=> ord($b)}keys %{$_ +{$_}}; for(my $i = 0; $i < scalar(@_); $i+=2){ if( ($_[$i] =~ /\d$/ && abs($_[$i] - $_[$i+1]) == 1) | +| (abs(ord($_[$i]) - ord($_[$i+1])) == 1) ){ print $_{$_}->{$_[$i]}.";".$_{$_}->{$_[$i+1]}."\n" + ; print $_{$_}->{$_[$i+1]}.";".$_{$_}->{$_[$i]}."\n" + ; } } } } } __DATA__ AAA30 BBC5 SHT12H DAL33B BBC49 AAA31 BBC8 BBC3 DAL33A BBC6 SHT12G BBC50
    Output
    AAA30;AAA31 AAA31;AAA30 BBC49;BBC50 BBC50;BBC49 DAL33A;DAL33B DAL33B;DAL33A SHT12G;SHT12H SHT12H;SHT12G
    Not exactly the OP's output but close...
    Just a thought :)

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://630779]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2014-09-16 10:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (4 votes), past polls