http://www.perlmonks.org?node_id=566897


in reply to Re: Quantitative Change instead of Boolean
in thread Quantitative Change instead of Boolean

I've taken a look at Digest::Nilsimsa. It is a cool thing. However, I've found a problem.
for my $d ( 30 .. 36 ) { my $this = $nil->text2digest( ( 'a' x $d ) . 'b' ) ; my $that = $nil->text2digest( ( 'a' x ( $d + 1 ) ) ) ; print nilcomp( $this , $that ) ; } sub nilcomp { my $diff = 0 ; my $diff2 = 0 ; my @this = split /|/ , shift ; my @that = split /|/ , shift ; for my $a ( 0 .. scalar(@this)-1 ) { $diff++ if $this[$a] ne $that[$a] ; my $is = hex $this[$a] ; my $at = hex $that[$a] ; if ( $is != $at ) { $diff2 += abs $is - $at ; } } return ( join "" , @this) . qq(\n) . ( join "" , @that) . qq(\n) . $diff . qq( characters different\n) . ( abs $diff2 ) . qq( bits different\n\n); }
gives you
000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different
If all I was trying to use this on were 35-character data sets, that'd be cool, but I'm trying to run this on whole web pages. I pull out all markup and whitespace, I'll still be in the headers by the time the 35th character rolled around. I love it in theory, but in practice, the data's too big for the module. So, I could do this:  $output =~ s[(.{35})][$nilsimsa->text2digest($1)]ge ; or something of the sort, but that seems ... goofy. But it does point out that taking length $output and comparing it to last time should indicate a small change, if they're only a small number of characters apart.

.sig goes here