PerlMonks  

Re^2: Quantitative Change instead of Boolean

by titivillus (Beadle)
on Aug 11, 2006 at 17:40 UTC (#566897)


in reply to Re: Quantitative Change instead of Boolean
in thread Quantitative Change instead of Boolean

I've taken a look at Digest::Nilsimsa. It is a cool thing. However, I've found a problem.

    use strict;
    use warnings;
    use Digest::Nilsimsa;

    my $nil = Digest::Nilsimsa->new;

    for my $d ( 30 .. 36 ) {
        my $this = $nil->text2digest( ( 'a' x $d ) . 'b' );
        my $that = $nil->text2digest( 'a' x ( $d + 1 ) );
        print nilcomp( $this, $that );
    }

    # Compare two hex digests character by character.
    sub nilcomp {
        my $diff  = 0;    # number of hex digits that differ
        my $diff2 = 0;    # sum of absolute differences between hex digits
                          # (labelled "bits" in the output, but not a true bit count)
        my @this = split //, shift;
        my @that = split //, shift;
        for my $i ( 0 .. $#this ) {
            $diff++ if $this[$i] ne $that[$i];
            $diff2 += abs( hex( $this[$i] ) - hex( $that[$i] ) );
        }
        return join( '', @this ) . "\n"
             . join( '', @that ) . "\n"
             . "$diff characters different\n"
             . "$diff2 bits different\n\n";
    }
gives you
    000000000000900000010021000008105000080010000004000c400000000008
    0000000000009000000000200000080040000000000000040008400000000000
    8 characters different
    25 bits different

    000000000000900000010021000008105000080010000004000c400000000008
    0000000000009000000000200000080040000000000000040008400000000000
    8 characters different
    25 bits different

    000000000000900000010021000008105000080010000004000c400000000008
    0000000000009000000000200000080040000000000000040008400000000000
    8 characters different
    25 bits different

    000000000000900000010021000008105000080010000004000c400000000008
    0000000000009000000000200000080040000000000000040008400000000000
    8 characters different
    25 bits different

    000000000000900000010021000008105000080010000004000c400000000008
    0000000000009000000000200000080040000000000000040008400000000000
    8 characters different
    25 bits different

    0000000000009000000000200000080040000000000000040008400000000000
    0000000000009000000000200000080040000000000000040008400000000000
    0 characters different
    0 bits different

    0000000000009000000000200000080040000000000000040008400000000000
    0000000000009000000000200000080040000000000000040008400000000000
    0 characters different
    0 bits different
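
A side note on the "bits different" figure: nilcomp adds up the numeric differences between hex digits, so 8 versus 0 counts as 8 even though only one bit changed. If an actual bit count is wanted, something like the following might do. This is only a sketch; hex_hamming is a name made up here, not part of Digest::Nilsimsa, and it assumes both digests are hex strings of the same length.

```perl
use strict;
use warnings;

# Count the bits that differ between two equal-length hex digest strings.
# (hex_hamming is a hypothetical helper, not a Digest::Nilsimsa method.)
sub hex_hamming {
    my ( $this, $that ) = @_;
    die "digests differ in length\n" if length $this != length $that;

    my $bits = 0;
    for my $i ( 0 .. length($this) - 1 ) {
        # XOR the two 4-bit values; each set bit in the result is one differing bit.
        my $xor    = hex( substr $this, $i, 1 ) ^ hex( substr $that, $i, 1 );
        my $binary = sprintf '%b', $xor;
        $bits += ( $binary =~ tr/1// );    # count the '1' characters
    }
    return $bits;
}
```

Fed the first pair of digests above, this returns 8 rather than 25, since each of the eight differing hex digits differs in exactly one bit.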
If all I were trying to use this on were 35-character data sets, that'd be cool, but I'm trying to run this on whole web pages. Even if I pull out all markup and whitespace, I'll still be in the headers by the time the 35th character rolls around. I love it in theory, but in practice, the data's too big for the module.

So, I could do this:

    $output =~ s[(.{35})][$nilsimsa->text2digest($1)]ge ;

or something of the sort, but that seems ... goofy. But it does point out that taking length $output and comparing it to last time's value should indicate a small change, if the two lengths are only a small number of characters apart.
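
For what it's worth, the chunk-at-a-time idea could be sketched roughly like this. The names chunk_digests and changed_chunks are invented for illustration, and the digest routine is passed in as a code reference so the sketch does not depend on any particular module.

```perl
use strict;
use warnings;

# Split $text into chunks of $size characters and digest each one.
# $digest_of is a caller-supplied code reference (a hypothetical parameter).
sub chunk_digests {
    my ( $text, $size, $digest_of ) = @_;
    my @chunks = unpack "(a$size)*", $text;    # last chunk may be shorter
    return map { $digest_of->($_) } @chunks;
}

# Count chunk positions whose digests differ between two runs.
# A chunk present in only one of the two lists counts as changed.
sub changed_chunks {
    my ( $old, $new ) = @_;                    # array references of digests
    my $longest = @$old > @$new ? @$old : @$new;
    my $changed = 0;
    for my $i ( 0 .. $longest - 1 ) {
        $changed++
            if !defined $old->[$i]
            || !defined $new->[$i]
            || $old->[$i] ne $new->[$i];
    }
    return $changed;
}
```

With the module, the callback would presumably be sub { $nilsimsa->text2digest( $_[0] ) } and $size would be 35. One caveat: a single character inserted early in the page shifts every later chunk, so a position-by-position comparison like this would report far more change than actually happened.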

.sig goes here

