Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re: Quantitative Change instead of Boolean

by Eimi Metamorphoumai (Deacon)
on Jun 30, 2006 at 15:26 UTC ( #558597=note: print w/replies, xml ) Need Help??


in reply to Quantitative Change instead of Boolean

I'd suggest you look at Digest::Nilsimsa, which should be a pretty good starting point. It creates something like a hash, but set up so that "similar" text produces similar hashes.
  • Comment on Re: Quantitative Change instead of Boolean

Replies are listed 'Best First'.
Re^2: Quantitative Change instead of Boolean
by titivillus (Scribe) on Aug 11, 2006 at 17:40 UTC
    I've taken a look at Digest::Nilsimsa. It is a cool thing. However, I've found a problem.
    for my $d ( 30 .. 36 ) { my $this = $nil->text2digest( ( 'a' x $d ) . 'b' ) ; my $that = $nil->text2digest( ( 'a' x ( $d + 1 ) ) ) ; print nilcomp( $this , $that ) ; } sub nilcomp { my $diff = 0 ; my $diff2 = 0 ; my @this = split /|/ , shift ; my @that = split /|/ , shift ; for my $a ( 0 .. scalar(@this)-1 ) { $diff++ if $this[$a] ne $that[$a] ; my $is = hex $this[$a] ; my $at = hex $that[$a] ; if ( $is != $at ) { $diff2 += abs $is - $at ; } } return ( join "" , @this) . qq(\n) . ( join "" , @that) . qq(\n) . $diff . qq( characters different\n) . ( abs $diff2 ) . qq( bits different\n\n); }
    gives you
    000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 000000000000900000010021000008105000080010000004000c400000000008 0000000000009000000000200000080040000000000000040008400000000000 8 characters different 25 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different 0000000000009000000000200000080040000000000000040008400000000000 0000000000009000000000200000080040000000000000040008400000000000 0 characters different 0 bits different
    If all I was trying to use this on were 35-character data sets, that'd be cool, but I'm trying to run this on whole web pages. I pull out all markup and whitespace, I'll still be in the headers by the time the 35th character rolled around. I love it in theory, but in practice, the data's too big for the module. So, I could do this:  $output =~ s[(.{35})][$nilsimsa->text2digest($1)]ge ; or something of the sort, but that seems ... goofy. But it does point out that taking length $output and comparing it to last time should indicate a small change, if they're only a small number of characters apart.

    .sig goes here

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://558597]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2020-10-21 14:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (217 votes). Check out past polls.

    Notices?