Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Best method to eliminate substrings from array

by catemp (Initiate)
on Jun 26, 2019 at 16:56 UTC ( [id://11101960]=perlquestion: print w/replies, xml ) Need Help??

catemp has asked for the wisdom of the Perl Monks concerning the following question:

I have an array that contains over 1 million strings of varying sizes from 10 characters to 250 characters. Each string is a set of two or more values separated by pipes ("|"). I need to eliminate any strings that are substrings of other strings within the array. For example, if the array contains ("A|B|C", "A|B|C|D|E"), then "A|B|C" should be dropped and "A|B|C|D|E" should be kept. I tried using "any" from List::MoreUtils, but it either kept everything or removed everything. Nothing I tried within the BLOCK worked: push(@arrCompletedChains, $strChain) if any { index($_, $strChain) < 0 } @arrWorkingCompletedChains;

Here is the code I am currently using, but it takes an extremely long time to run:

for my $strChain (@arrWorkingCompletedChains) { my $found = false; foreach (@arrWorkingCompletedChains) { if ($strChain ne $_ && index($_, $strChain) >= 0) { $found = true; last; } } if (!$found) { push(@arrCompletedChains, $strChain); } }

Any suggestions on how to improve the speed of this code would be greatly appreciated.

Replies are listed 'Best First'.
Re: Best method to eliminate substrings from array
by choroba (Cardinal) on Jun 26, 2019 at 20:09 UTC
    If the number of all possible parts isn't too high, you can assign numbers to the parts and use vectors where each positions in the vector says whether the corresponding part is present or not. I used 8 bits to represent each part, but 2 bits should be enough if you need to reduce space (1 bit isn't enough, we need 4 different values, as will be explained shortly).

    For example, A001|B002 has already been seen, the assigned numbers are 1 for A001 and 2 for A002. We then read B002|A001|C003. C003 will be assigned the number 3, so the stored vector is 011 and the new one is 111. If we use 2 instead of 1 in the stored vectors, we can just bitwise or the two numbers and see what should be done: "22" | "111" = 331, where 1 means "present in the new only" and 3 means "present in both" (2 would be "present in stored only"). If there are 3's only, we've already seen exactly the same combination of parts. If there's no 1, the new combination was contained in a stored one, if there's no 2, the new combination contains the old one.

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $last_part = 1; my (%part, %store); while (my $line = <DATA>) { chomp $line; my @parts = split /\|/, $line; ! exists $part{$_} and $part{$_} = $last_part++ for @parts; my $string = 0 x ($last_part - 1); substr $string, $part{$_}, 1, 1 for @parts; my %back = reverse %part; my $store = 1; for my $seen (keys %store) { my $result = "$seen" | "$string"; if ($result =~ /^3+$/ # Same as old. || $result !~ /1/ # Contained in old. ) { $store = 0; } elsif ($result !~ /2/) { # Contains old. delete $store{$seen}; } } undef $store{$string =~ tr/1/2/r} if $store; } say 'Kept: '; my %back = reverse %part; for my $stored (keys %store) { say join '|', map substr($stored, $_, 1) ? $back{$_} : (), 1 .. length $stored; } __DATA__ A001|B002 C003|A001|B002 B002|A001 C003|D004|A001 E005|F006 D004|C003

    Update: Switched to bitwise string or from +, so Math::BigInt is not needed.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Best method to eliminate substrings from array
by kcott (Archbishop) on Jun 27, 2019 at 08:32 UTC

    G'day catemp,

    Welcome to the Monastery.

    After looking through the thread, I couldn't see any specifics about how you wanted the output ordered. Here's three solutions with different output orders; they become progressively more complicated and, accordingly, will take longer to run.

    I started with the data you posted below. I then added a few more lines for test purposes:

    6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8

    appears three times: two adjacent; one separated.

    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788

    is a new substring of others starting with 6Y0248 but is separated from them; it's the new first record.

    Here's the code containing all three solutions:

    #!/usr/bin/env perl use strict; use warnings; my @in_data = qw{ 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 3419308|3514531|3525716|3557019|3586192|3635776|3783741 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152 }; print "Input data:\n", '-' x 40, "\n"; print "$_\n" for @in_data; { my @out_data; print '-' x 40, "\nOutput data (unordered):\n", '-' x 40, "\n"; my $long_string; for my $string (reverse sort @in_data) { if (defined $long_string) { next if index($long_string, $string) == 0; } $long_string = $string; print "$string\n"; } } { my @out_data; print '-' x 40, "\nOutput data (ordered by first field):\n", '-' x + 40, "\n"; my $long_string; for my $string (reverse sort @in_data) { if (defined $long_string) { next if index($long_string, $string) == 0; } $long_string = $string; push @out_data, $string; } print "$_\n" for sort @out_data; } { my @out_data; print '-' x 40, "\nOutput data (original order):\n", '-' x 40, "\n +"; my $long_string; my $pos = 0; for my $pos_string (sort { $b->[1] cmp $a->[1] } map [ $pos++, $_ +], @in_data) { if (defined $long_string) { next if index($long_string, $pos_string->[1]) == 0; } $long_string = $pos_string->[1]; push @out_data, $pos_string; } print "$_->[1]\n" for sort { $a->[0] <=> $b->[0] } @out_data; } print '-' x 40, "\n";

    Note the use of anonymous blocks to limit the scope of identically named variables (e.g. @out_data and $long_string).

    Here's the output:

    Input data: ---------------------------------------- 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 3419308|3514531|3525716|3557019|3586192|3635776|3783741 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152 ---------------------------------------- Output data (unordered): ---------------------------------------- R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721 3419308|3514531|3525716|3557019|3586192|3635776|3783741 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 ---------------------------------------- Output data (ordered by first field): ---------------------------------------- 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 3419308|3514531|3525716|3557019|3586192|3635776|3783741 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 ---------------------------------------- Output data (original order): ---------------------------------------- 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 3419308|3514531|3525716|3557019|3586192|3635776|3783741 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8 R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F785 +4|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934 ----------------------------------------

    — Ken

Re: Best method to eliminate substrings from array
by Paladin (Vicar) on Jun 26, 2019 at 18:27 UTC
    How about something like this (this does depend on them being sorted shortest to longest):
    my @data = sort { length($a) <=> length($b) } qw( 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|27578 +03 3419308|3514531|3525716|3557019|3586192|3635776|3783741 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152|8R0721 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|13369 +34 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|275780 +3 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|13369 +34 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1 +F7854|8W1152 ); my %uniq; foreach my $elem (@data) { my @parts = split /\|/, $elem; foreach my $p (0 .. $#parts) { my $e = join '|', @parts[0..$p]; delete $uniq{$e} if exists $uniq{$e}; } $uniq{$elem} = 1; } print "$_\n" for keys %uniq;

      Some comments:

      • In your code here, an input part number group item is treated as a subset of another group only if it is anchored at the left end of the larger group. E.g., the items  7K3377|3H5788 8W1152 4P0489|2757803 added to the list of test input data will not be excluded from output, but, of course,  2N0472|6N8595 2N0472 will be.

        In the OPed code, the if-block
            if ($strChain ne $_ && index($_, $strChain) >= 0) { $found = true;  last; }
        implies that a part number group is a subset if it is found anywhere (per the  >= comparison) in the larger group (and is not identical to the larger group).

      • Additionally, the OPed code implies that duplicated items in the input appear unchanged in the output (if they are not part of any larger group), e.g.,  123 ... 123 in the input would appear as  123 ... 123 in the output. In your code, these items would be made unique.
      • Also, the OPed code would produce output in the same order as the input items (less subsets), although this implied requirement seems less imperative than the others. Because it's taken directly from a hash, your code will produce output in random order.


      Give a man a fish:  <%-{-{-{-<

        Here OP says that the list of part numbers are to be treated as sets/subsets, so while the original code matches sub-strings, OP says later that is incorrect. My code treats the long strings as ordered sets, which seems to be what the OP wanted. If the OP really wants to treat the list of parts as a non-ordered set, it's easy enough to add a sort to the join line.

        OP also says here they are sorting the original list anyways, so the input order seems to be irrelevant.

        I'm not quite sure what you mean by the duplicated items part. Essentially what my code does is break each line (set), into individual part numbers (elements), then checks if for each prefix of elements, does that one already exist in the final result, and if it does, remove it from the final result, as this current line will supersede it. So if the current line was "A|B|A|B|C", it first checks if "A" is in the result; If so, remove it. Then checks "A|B", then "A|B|A", etc. until finally adding the entire line "A|B|A|B|C" to the final result. If later in the file, the line "A|B|A|B|C|N" is found, at that point, the "A|B|A|B|C" would get removed.

Re: Best method to eliminate substrings from array
by jwkrahn (Abbot) on Jun 26, 2019 at 18:53 UTC
    push(@arrCompletedChains, $strChain);

    As it says in perlsyn:

    If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don't do that.

    UNTESTED, but you could see if this is faster:

    use List::Util 'first'; # sort largest first my @sorted = sort { length $b <=> length $a } @arrWorkingCompletedChai +ns; my %to_delete; # remove shortest first while ( my $next = pop @sorted ) { if ( first { index( $_, $next ) >= 0 } @sorted ) { ++$to_delete{ $next }; } | @arrWorkingCompletedChains = grep !$to_delete{ $_ }, @arrWorkingComple +tedChains;
Re: Best method to eliminate substrings from array (updated)
by AnomalousMonk (Archbishop) on Jun 27, 2019 at 02:14 UTC

    Here's an approach based on index. This is presented in a unit-test format (see How to ask better questions using Test::More and sample data); if you have a test case that's problematic, please let me know it using the test-case syntax detailed in the code below (see the  @Tests array). I haven't tried this code with an input list of a million 250-character part number groups, but I think it should fit into a laptop-ish memory footprint! I've made no attempt to Benchmark this; you're welcome to do so, and if you do, please let us know the results!

    File toss_included_substrings_1.pl:

    Update: Note on Output Order. The output order of the  no_subgroups() function is similar to that for List::MoreUtils::uniq() (also in recent versions of List::Util): identical to the input order except that "subgroup" items are removed from the stream. The one question I have concerns identical repeats of part numbers or part number groups. The | My current code allows repeated identical items to pass through unmolested (update: the OPed code suggests this specification). In the notation of the test code:
        [ [ qw(foo foo) ] => [ qw(foo foo) ] ]
    But whether an item can be a subgroup of itself is unclear to me. Two other possibilities for handling such repeats would be:
        [ [ qw(foo foo) ] => [ qw(foo) ] ] Repeats made unique a la uniq();
        [ [ qw(foo foo) ] => [ ] ] Repeats removed because they are subgroups of themselves.
    Either of these latter two behaviors can be achieved simply by changing the position and number of calls to  uniq() within the  no_subgroups() function. If you're interested in this, I can post further info.


    Give a man a fish:  <%-{-{-{-<

Re: Best method to eliminate substrings from array
by AnomalousMonk (Archbishop) on Jun 26, 2019 at 17:13 UTC
    Each string is a set of two or more values separated by pipes ("|"). ... eliminate any strings that are substrings of other strings ...

    What's a "value"? Can you have a string 'AA|BB|CC? If so, is 'A|B' a valid substring of it? (This would be true if testing with index.) Basically, I don't understand the concept of "substring."


    Give a man a fish:  <%-{-{-{-<

      The values are actually part numbers so they might be something like 124182 or 3718317. The strings are lists of parts that are related to each other and when I generate the lists I might get a string of five parts and then another string containing those same five parts plus one more. I would want to only keep the string with the six parts because the five part string is essentially a substring of the six part string. Instead of substring you could say subset.
Re: Best method to eliminate substrings from array
by AnomalousMonk (Archbishop) on Jun 26, 2019 at 18:35 UTC
Re: Best method to eliminate substrings from array
by choroba (Cardinal) on Jun 26, 2019 at 18:05 UTC
    How many parts are there in the file?

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Best method to eliminate substrings from array
by Anonymous Monk on Jun 26, 2019 at 17:17 UTC
    Do all the values overlap with one big set or can there be disjoint sets? Are they always ordered? Can there be duplicates?

      I am currently eliminating duplicates @arrWorkingCompletedChains = uniq @arrWorkingCompletedChains; and sorting the list largest to smallest @arrWorkingCompletedChains = sort { length($b) <=> length($a) } @arrWorkingCompletedChains;

      What I need to do is eliminate any overlapping string sets from the working array so the end result can be stored in our database

        you didn't answer one of the questions... more sample data plz

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11101960]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-04-13 21:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found