Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

CPAN Module to determing overlap of 2 lists?

by wazat (Monk)
on Aug 10, 2020 at 21:27 UTC ( #11120564=perlquestion: print w/replies, xml ) Need Help??

wazat has asked for the wisdom of the Perl Monks concerning the following question:

I want to determine the overlap between the end of one list and the beginning of another. I have a quick brute force solution, but prefer to use an existing module if one is available. I didn't find one when I searched metacpan, but that doesn't mean it doesn't exist.

# example @list1 = qw(a b c); @list2 = qw(b c d); # a b c # b c d # 1 2 list_overlap( \@list1, \@list2 ); # returns 2 @list1 = qw(a b c d b); @list2 = qw(b c d); # a b c d b # b c d # 1 list_overlap( \@list1, \@list2 ); # returns 1

I don't know if this problem has a name.

Intended use is to combine two bash_history files captured on different dates into a single bash_history file that doesn't duplicate the overlap between the files. (Yes, if I set HISTFILESIZE to -1, then bash would maintain an unlimited history file.

__edit__ As pointed out, I should also specify that I want the largest overlap.

Replies are listed 'Best First'.
Re: CPAN Module to determing overlap of 2 lists?
by choroba (Archbishop) on Aug 10, 2020 at 22:20 UTC
    When multiple solutions are possible, which one do you prefer? E.g.
    @list1 = qw( a b c d c ); @list2 = qw( c d c x ); # or qw( c d c x );
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Good point, my problem statement is loose.

      I want the maximum.

      @list1 = qw( a b c d c ); @list2 = qw( c d c x ); # 3
Re: CPAN Module to determing overlap of 2 lists?
by tybalt89 (Prior) on Aug 11, 2020 at 01:30 UTC
    #!/usr/bin/perl use strict; # use warnings; my $file1 = <<END; one two three four five six END my $file2 = <<END; two three four five six seven END my $marker = '***MARKER***'; # something not in either string my $combine = "$file1$marker$file2" =~ s/(.*)\K\Q$marker\E\1//sr; print $combine;


    one two three four five six seven

      I hadn't thought much about a regex solution.

      To ensure the match start is a complete line, requires a small tweak.

      my $combine = "$file1$marker$file2" =~ s/(?:\A|\n)(.*)\K\Q$marker\E\1/ +/sr;
        3 suggestions
        • you don't need complete lines to make it work, but anchoring to line start might prove to be faster
        • I'd include characters below ASCII 8 to the "marker" to play safe, see also discussion surrounding the similar $;
        • you might be interested to check with re "debug" , how the backtracking of the .* submatch performs. I'd guess you prefer it to grow from right to left instead of shrinking from left to right. I know the regex engine can do this depending on the anchors.
        I haven't checked the last point since performance might not be your biggest issue.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: CPAN Module to determing overlap of 2 lists?
by kcott (Bishop) on Aug 11, 2020 at 07:58 UTC

    G'day wazat,

    Here's an implementation of &list_overlap. You could add it to a module if that would be useful to you.

    #!/usr/bin/env perl use strict; use warnings; use constant { REF => 0, DATA => 1, EXP => 2, NAME => 3, JOIN => $;, }; use Test::More; my @ref = qw{a b c d c}; my @same = qw{a b c d c}; my @long = qw{c d c x y z}; my @short = qw{c x y z}; my @none = qw{x y z}; my @tests = ( [ [@ref], [@same], 5, 'Same' ], [ [@ref], [@long], 3, 'Long' ], [ [@ref], [@short], 1, 'Short' ], [ [@ref], [@none], 0, 'None' ], ); plan tests => 0+@tests; for my $test (@tests) { my $got = list_overlap($test->[REF], $test->[DATA]); is($got, $test->[EXP], $test->[NAME]); } sub list_overlap { my ($ref, $data) = @_; my $got = 0; my ($ref_len, $data_len) = (0+@$ref, 0+@$data); my $start = $ref_len > $data_len ? $ref_len - $data_len : 0; for my $i ($start .. $ref_len - 1) { if (join(JOIN, @{$ref}[$i .. $#$ref]) eq join(JOIN, @{$data}[0 .. $#$ref - $i]) ) { $got = 1 + $#$ref - $i; last; } } return $got; }

    I find $; useful for this type of join because it's rarely used elsewhere. Pick something else if that's not appropriate.


    1..4 ok 1 - Same ok 2 - Long ok 3 - Short ok 4 - None

    — Ken

      That works.

      It feels a bit resource intensive given the repeated temporary strings created in the joins.

        I've generally found that Perl's functions that operate on strings (join, index, etc.) are typically very fast. You can compare with other solutions using Benchmark; you can profile to find hotspots (e.g. Devel::NYTProf).

        I can see a number of micro-optimisations you could make: replace $ref_len - 1 with $#$ref and replace 1 + $#$ref with $ref_len. That may gain you absolutely nothing but, I suppose, the code would be a little neater.

        You could do a single join. Then repeatedly chop the front off (index to locate JOIN then substr); again, that may gain you nothing but perhaps worth a try.

        I, and no doubt others, would be interested in the results if you do decide to benchmark this against your current "brute force solution", a regex solution, and anything else you may have come up with.

        — Ken

Re: CPAN Module to determing overlap of 2 lists?
by LanX (Cardinal) on Aug 10, 2020 at 23:23 UTC
    Hint: you could just slurp the two files in a variable each and let index do the heavy lifting... :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Yes, index could be more efficient.

      I'd need some logic to adjust the size of the window I'm trying to match, but it could work.

      I might also look at rindex

Re: CPAN Module to determing overlap of 2 lists?
by Cristoforo (Curate) on Aug 10, 2020 at 23:35 UTC

      I'm not sure.

      Looking at Set::Scalar, I assume the set elements are unique (i.e. (a,b) insert a is still (a,b)), and that the elements are unordered (i.e. (a,b) = (b,a)).

      I my case the elements are ordered and duplicates are allowed.

      Have I misunderstood Set::Scalar ?

Re: CPAN Module to determing overlap of 2 lists?
by perlfan (Priest) on Aug 11, 2020 at 02:51 UTC
    I don't know if this problem has a name.

    It's a constrained variation of the longest common subsequence problem.

    Anything you do can be easily defeated unless you set some parameters since it's not merely as simple as the trivial example you describe above. For example, what if on two consecutive days you issue the same exact commands. You'd end up with 100% overlap. What is it you're trying to do because your problem description doesn't seem to match up realistically with your contrived overlap example.

    And if the files are differentiated by date already, what's the point of even trying to match them up like this?

    Update: Using your contrived example where there is true overlap and no repeated elements out of order, then you could iterate over both files in order, each line becomes a key in a Tie::Hash::Indexed tie'd hash (++'d to maintain count). Then all the keys with values of 2 can be taken as the area of overlap. But as I said above, this will not work if there are any repeated lines above or below. So it won't work for you. I do think your problem is ill defined and ambigous. Even your second example has more than one solution based on what you want.

      Hmmm, seems more like a constrained variation of Longest common substring problem. Confusion probably due to looseness of my description.

      I agree that the combined files aren't guaranteed to represent the true sequence of commands. The example you provided is one case of aliasing that demonstrates this. Even more problematic is that multiple interactive shells will write to the same history file on exit.

      The dates of all the commands in the history file are not known, only the upper bound. The file is a rolling window of the entered commands. If fewer commands are entered than the hist file size, the oldest commands are are eliminated from the file.

Re: CPAN Module to determing overlap of 2 lists?
by Anonymous Monk on Aug 11, 2020 at 10:14 UTC
    Yes, if I set HISTFILESIZE to -1, then bash would maintain an unlimited history file

    I wish I'd known that, I just keep making the number bigger!

    The following doesn't help with the current problem, but: this will be a trivial problem in the future if you add HISTTIMEFORMAT='%Y%m%d-%H%M%S ' to ~/.bashrc, or to /etc/bash.bashrc. Then start a new terminal session and issue a few commands, then issue history | tail and you'll see what I mean. You'll have to do something like history | sort -k 2 | grep 'what you want to see', as somehow things are still out of sequence. I think that happens to me because I generally have 3 or more terminal windows open, doing things in each, and the history file gets written (I think) when the bash session ends,

      I never looked at HISTTIMEFORMAT.

      If I set HISTTIMEFORMAT, bash captures the time in the history file as what appears to be epoch seconds. This would definitely speed things up.

      #1597175957 echo $$ #1597175963 echo $HOME #1597175967 ls -la #1597176017 ls #1597176017 ls #1597176018 ls

      As you mention, the history file gets written when the shell exits. I too usually have multiple shells at the same time. I guess if we need to separate different sessions we could set HISTFILE to a distinct file name in .bashrc

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11120564]
Approved by Paladin
Front-paged by haukex
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2020-09-22 08:15 GMT
Find Nodes?
    Voting Booth?
    If at first I donít succeed, I Ö

    Results (128 votes). Check out past polls.