in reply to Re^3: Using filepath method to identify an .html page
in thread Using filepath method to identify an .html page

The Perl code will produce the same hash for "abc.html" as for "bca.html".

In any case, the likelihood of a hash collision for any non-trivial website is substantial. Assuming the hash values are spread evenly over 10,000 possible values, hashing 100 files gives you about a 40% chance of at least one collision.

If you hash 220 files, the likelihood is about 90%.
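
Those figures follow from the birthday problem. Here is a minimal sketch of the arithmetic, assuming 10,000 equally likely hash values; the sub name collision_probability is made up for this example:

```perl
use strict;
use warnings;

# Probability that at least two of $n items share a value when each item
# gets one of $buckets values uniformly at random (birthday problem).
# Assumption: the hash spreads its inputs evenly, which real hashes only
# approximate.
sub collision_probability {
    my ($n, $buckets) = @_;
    return 1 if $n > $buckets;    # pigeonhole: a collision is certain

    # Chance that every item lands on a value nobody has used yet.
    my $p_unique = 1;
    $p_unique *= ($buckets - $_) / $buckets for 0 .. $n - 1;

    return 1 - $p_unique;
}

printf "%3d files: %.0f%%\n", $_, 100 * collision_probability($_, 10_000)
    for 100, 220;
```

Running it for 100 and 220 files should print percentages close to the 40% and 90% quoted above. A hash that distributes its inputs unevenly would do worse.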

Re^5: Using filepath method to identify an .html page
by blue_cowdawg (Monsignor) on Jan 22, 2013 at 19:34 UTC
        The perl code will produce the same hash for "abc.html" as for "bca.html"

    Which underscores the point I made earlier about adding collision detection and rehashing logic to whatever algorithm you use. One workaround I've seen:

    | # handwaving here...
    | my $url = "ABC.html";        # example input
    | my @i = split(//, $url);     # put each letter in its own bin
    | my $j = 0;                   # Initialize our
    | my $k = 1;                   # hashing increment values
    | my @m = ();                  # workspace
    | foreach my $n (@i) {
    |     my $q = ord($n);         # ASCII for character
    |     $k += $j;                # Increment our hash weight (1, 2, 4, 8, ...)
    |     $k %= 10000;             # keep the weight small for long URLs
    |     $q *= $k;                # scale by the weight; merely adding it
    |                              # would leave the total order-independent
    |     $j = $k;                 # store that.
    |     push @m, $q;             # save the weighted value
    | }
    | my $hashval = 0;             # initialize our hash value
    | # Generate that
    | $hashval = ($hashval + $_) % 10000 for @m;
    Using that method, ABC.html and CBA.html now have different values because each letter's value is scaled by a weight that grows from left to right.
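
For the collision detection and rehashing logic mentioned above, here is one possible sketch rather than a definitive implementation. It assumes 10,000 slots (the modulus used in this thread) and resolves clashes by linear probing. The names weighted_hash, assign_slot and %slot_owner, and the multiplier 31, are all made up for the example:

```perl
use strict;
use warnings;

my $SLOTS = 10_000;    # number of available hash values (assumption)
my %slot_owner;        # slot number => URL currently stored there

# Position-sensitive hash: multiplying the running total before adding each
# character makes the result depend on character order.
sub weighted_hash {
    my ($url) = @_;
    my $hash = 0;
    $hash = ($hash * 31 + ord($_)) % $SLOTS for split //, $url;
    return $hash;
}

# Return the slot for $url. If the preferred slot already belongs to a
# different URL (a collision), step to the next slot until a free one or
# the URL's own slot turns up.
sub assign_slot {
    my ($url) = @_;
    my $slot = weighted_hash($url);
    for (1 .. $SLOTS) {
        if (!exists $slot_owner{$slot}) {
            $slot_owner{$slot} = $url;    # free slot: claim it
            return $slot;
        }
        return $slot if $slot_owner{$slot} eq $url;    # seen before
        $slot = ($slot + 1) % $SLOTS;                  # collision: probe on
    }
    die "hash table is full\n";
}

print "$_ => ", assign_slot($_), "\n" for qw(abc.html bca.html abc.html);
```

Linear probing keeps the sketch short. The price is that a slot number can no longer be recomputed from the URL alone, so the %slot_owner table has to be kept around (or persisted) for lookups.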


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg