http://www.perlmonks.org?node_id=913756


in reply to Binary value of string

$sum = 0; $sum += $_ for unpack 'C*', "This is a string to change to binary"; print $sum;   # prints 3325

Or

print unpack '%32C*', "This is a string to change to binary";   # prints 3325

But the result will be modulo 2**32, assuming it ever gets that high, which won't happen until your strings are at least 16 million bytes long (2**32 / 255 is roughly 16.8 million).

If that was a concern, on a perl with 64-bit ints you can do:

print unpack '%64C*', "This is a string to change to binary";   # prints 3325

Which will handle strings of at least 72057594037927936 bytes (2**56), which should be enough to be going on with :)
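
For anyone who wants to check the numbers above, here is a minimal sketch (the variable names are mine, not from the post). It compares an explicit sum with the %32 checksum, uses an 8-bit checksum to make the modulo visible on a short string, and works out the shortest all-0xFF string that could wrap a 32-bit sum.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my $str = "This is a string to change to binary";

# Explicit sum of the byte values.
my $manual = sum unpack 'C*', $str;

# The same sum computed inside unpack, reduced modulo 2**32.
my $sum32 = unpack '%32C*', $str;

# An 8-bit checksum wraps at 256, so the modulo shows up even on a short string.
my $sum8 = unpack '%8C*', $str;

# Worst case is a string of 0xFF bytes: the sum first reaches 2**32 at this length.
my $min_wrap_len = int( 2**32 / 255 ) + 1;

print "$manual $sum32 $sum8 $min_wrap_len\n";    # prints: 3325 3325 253 16843010
```

The 8-bit result is just 3325 % 256, which is the same reduction %32 would apply once the total passed 2**32.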


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Binary value of string
by Yary (Pilgrim) on Jul 11, 2011 at 19:35 UTC

    I like seeing answers that use both "ord" and "unpack". It brings up a question: is it a binary buffer or a buffer of text? The two methods can give different answers with Unicode. Unpack with "C" will extract values byte by byte; ord always works character by character, and those characters could be multi-byte.

    If you use "W" with unpack, then it will behave the same as ord.
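
A small sketch of the difference, assuming the text is held as a decoded character string and the "binary buffer" is its UTF-8 encoding (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);
use List::Util qw(sum);

# A character string containing U+00E9 and U+20AC.
my $text = "caf\x{e9} \x{20ac}5";

# The same text as UTF-8 octets.
my $bytes = encode( 'UTF-8', $text );

my $by_ord = sum map { ord $_ } split //, $text;    # one value per character
my $by_W   = sum unpack 'W*', $text;                # one value per character
my $by_C   = sum unpack 'C*', $bytes;               # one value per octet

print "ord: $by_ord  W: $by_W  C: $by_C\n";         # prints: ord: 8980  W: 8980  C: 1275
```

The "W" and ord sums agree because both see seven characters. The "C" sum sees ten octets, so it only matches when every character is ASCII.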

      If you use "W" with unpack, then it will behave the same as ord.

      Is there any point in checksumming using unicode ordinals?

      Sum-the-bytes checksums are pretty useless -- you can perform any transpositions, shuffle or reverse the entire string and detect nothing -- that's why CRCs, Adler etc. were invented.
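
To make the transposition point concrete, here is a rough sketch that reverses the string from the original question. The byte sum does not change, while a position-sensitive checksum does. The Adler-32 below is a hand-rolled pure-Perl version written for this illustration (byte_sum and adler32_pp are hypothetical helper names, not anything from the thread).

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Plain sum of the byte values, as in the posts above.
sub byte_sum { return sum( 0, unpack 'C*', $_[0] ) }

# Adler-32: the second accumulator sums the running totals, so byte order matters.
sub adler32_pp {
    my ($data) = @_;
    my ( $s1, $s2 ) = ( 1, 0 );
    for my $byte ( unpack 'C*', $data ) {
        $s1 = ( $s1 + $byte ) % 65521;    # running sum of the bytes
        $s2 = ( $s2 + $s1 ) % 65521;      # running sum of the running sums
    }
    return $s2 * 65536 + $s1;
}

my $orig     = "This is a string to change to binary";
my $reversed = reverse $orig;             # scalar context: reverses the characters

printf "sum:   %d vs %d\n",     byte_sum($orig),   byte_sum($reversed);
printf "adler: %08x vs %08x\n", adler32_pp($orig), adler32_pp($reversed);
```

Both byte sums come out as 3325, while the two Adler-32 values differ.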

      The only (scant) merit of sum-the-bytes is that it is very fast. What would be achieved by slowing that to a crawl by forcing it to pick its way through the technical abortion that is multi-byte character encodings? You certainly aren't going to gain any greater guarantee of integrity.

      My gut feel is that, as there are so many different "unicode standard" encodings out there in the wild, the chances of getting false positives from undetected transmission errors using sum-the-ordinals values are far higher than using sum-the-bytes values.



        My gut feel is that, as there are so many different "unicode standard" encodings out there in the wild, the chances of getting false positives from undetected transmission errors using sum-the-ordinals values are far higher than using sum-the-bytes values.

        I don't understand this. I can't think of any issue that would affect

        sum unpack 'W*', decode 'UTF-8', $utf8

        that wouldn't also affect

        sum unpack 'C*', $utf8