Checking for new files

by New Novice (Sexton)
Enlightened Ones,

I downloaded a (huge) number of webpages from the internet using a Perl routine some time ago. Now I want to check whether there are any files that I missed or that are new.

It is actually more complicated than this (I first have to retrieve information from a database, open the webpages, and then extract bits and pieces from the resulting pages). What it boils down to, however, is that I have a list of the files on my computer and can generate a list of the webpages. I can link the two by a common piece of information (there is an ID number on each webpage, which I use to construct the file name). Now the question is: how do I create a third list containing all the webpages that my new search of the database returned but that I haven't downloaded yet (i.e., that are not in the list of files)? In other words, how can I compare two lists to find the elements that are contained in one but not the other?

Re: Checking for new files
by Corion (Patriarch) on Jan 28, 2005 at 09:30 UTC
    perldoc -q difference
    How do I compute the difference of two arrays? How do I compute the intersection of two arrays?

    Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

    @union = @intersection = @difference = ();
    %count = ();
    foreach $element (@array1, @array2) { $count{$element}++ }
    foreach $element (keys %count) {
        push @union, $element;
        push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

    Note that this is the *symmetric difference*, that is, all elements in either A or in B but not in both. Think of it as an xor operation.
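
    To make the xor behaviour concrete, here is a minimal sketch with made-up IDs (the values and the variable names @local and @remote are assumptions for illustration, not taken from the original question). It counts occurrences the same way as the FAQ code and keeps the elements that were seen exactly once:

```perl
use strict;
use warnings;

# Hypothetical sample IDs; each list is assumed to contain no duplicates.
my @local  = qw(101 102 103);
my @remote = qw(102 103 104);

# Count how often each ID occurs across both lists.
my %count;
$count{$_}++ for @local, @remote;

# IDs seen exactly once are in one list but not in the other.
my @symmetric = sort grep { $count{$_} == 1 } keys %count;

print "@symmetric\n";   # prints "101 104"
```

    Both 101 (only local) and 104 (only remote) show up, so the symmetric difference alone does not tell you which side an element came from.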

    The example computes the symmetric difference, but most likely you will only be interested in the pages that are new on the web and missing in your local copy, so you will want to modify the check as follows so it only gives you the locally missing items:

    use strict;
    my @local  = get_local_ids();
    my @remote = get_remote_ids();
    my %have_local;
    foreach my $element (@local) { $have_local{$element}++ }
    foreach my $id (@remote) {
        next if $have_local{$id};   # already downloaded
        retrieve($id);
        $have_local{$id}++;         # don't fetch the same ID twice
    }
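
    Since the original question links files to pages through an ID embedded in the file name, the lookup hash can be built from the file names directly. The following is only a sketch under stated assumptions: the "page_<ID>.html" naming scheme and the helpers id_from_filename and missing_ids are hypothetical, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical naming scheme: the page with ID 42 is saved as "page_42.html".
sub id_from_filename {
    my ($file) = @_;
    return $file =~ /^page_(\d+)\.html$/ ? $1 : undef;
}

# Return the remote IDs that have no matching local file, in remote order.
sub missing_ids {
    my ($local_files, $remote_ids) = @_;
    my %have_local;
    for my $file (@$local_files) {
        my $id = id_from_filename($file);
        $have_local{$id} = 1 if defined $id;   # ignore unrelated files
    }
    # The post-increment also drops IDs that repeat in the remote list.
    return grep { !$have_local{$_}++ } @$remote_ids;
}

my @files   = qw(page_1.html page_2.html notes.txt);
my @remote  = qw(1 2 3 4 3);
my @missing = missing_ids(\@files, \@remote);
print "@missing\n";   # prints "3 4"
```

    In a real run the file list would presumably come from glob or readdir on the download directory, and the regular expression would have to match whatever naming scheme the download routine actually used.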
Re: Checking for new files
by ambrus (Abbot) on Jan 28, 2005 at 10:21 UTC

    Sort the two lists and compare with comm:

    $ cat a
    one
    two
    three
    four
    five
    six
    seven
    $ cat b
    two
    four
    five
    six
    seven
    $ sort a > a.sorted
    $ sort b > b.sorted
    $ comm -23 a.sorted b.sorted
    one
    three
    $
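
    If you would rather stay inside Perl, the same sort-and-walk idea can be sketched as below. This is an illustration only, not how comm itself is implemented: the sub name only_in_first is made up, and it uses Perl's plain string comparison, whereas comm follows the locale's collation order.

```perl
use strict;
use warnings;

# Return the elements of the first list that do not occur in the second,
# walking both lists in sorted order, similar in spirit to `comm -23`.
sub only_in_first {
    my ($a_list, $b_list) = @_;
    my @first  = sort @$a_list;
    my @second = sort @$b_list;
    my @only;
    my ($i, $j) = (0, 0);
    while ($i < @first) {
        if ($j >= @second) {
            push @only, $first[$i++];   # second list exhausted
            next;
        }
        my $cmp = $first[$i] cmp $second[$j];
        if    ($cmp < 0) { push @only, $first[$i++] }   # only in first
        elsif ($cmp > 0) { $j++ }                       # only in second, skip
        else             { $i++; $j++ }                 # in both
    }
    return @only;
}

my @only = only_in_first(
    [qw(one two three four five six seven)],
    [qw(two four five six seven)],
);
print "$_\n" for @only;   # prints "one" and "three", one per line
```

    Unlike the hash approach, this needs both lists sorted, but it never holds a lookup table in memory, which is the same trade-off comm makes.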

    Update 2009 sep 2.

    See Re^2: Joining two files on common field for a list of other nodes where Unix textutils are suggested for merging files.

      Use zsh! :)

      $ comm -23 <(sort a) <(sort b)

