PerlMonks  

Re: The Unicode Bug with Transliteration or Substitution

by graff (Chancellor)
on May 04, 2014 at 03:15 UTC


in reply to The Unicode Bug with Transliteration or Substitution

Regarding the different behavior of some machines (given that they all have Perl 5.8.3), I'm sorry that I don't have explicit details that would be relevant, but looking over some email from 2004 just now, I noticed that Dan Kogai was releasing updates to the Encode module independently of perl releases. I know there were some subtle but notable bugs in earlier Encode releases.

I wonder if your various systems with 5.8.3 might have different versions of Encode.
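A quick way to compare this across machines is to have each perl report the Encode release it actually loads. This is only a minimal sketch; the variable names are made up for illustration, and the path printed depends on each machine's @INC.

```perl
use strict;
use warnings;
use Encode ();

# Report the interpreter version and the Encode release it actually loaded.
my $perl_version   = $];
my $encode_version = Encode->VERSION;
my $encode_path    = $INC{'Encode.pm'};   # which file on disk was used

print "perl $perl_version, Encode $encode_version ($encode_path)\n";
```

Running the same script on every box and diffing the output should show whether the Encode versions (or install locations) differ.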

Re^2: The Unicode Bug with Transliteration or Substitution
by choroba (Cardinal) on May 04, 2014 at 19:59 UTC
    Thank you. Some of the machines did indeed have a different version of the Encode module. There are still some, though, that have the same version but produce different results. The only difference I can see is that one of them is 32-bit, while the second one is 64-bit (though its Perl is still 32-bit).

    Update: I ran the process under strace on both machines. One of the many differences I noticed was the size of the read buffer: on the 32-bit machine, read(3, ...) is called with a buffer size of 32768, while on the 64-bit machine the size is 65536. There might be a problem if a multibyte character is split between two consecutive buffers. It would also explain why the output does not differ when the input is processed line by line (no line is longer than 32768 bytes). It still doesn't explain why substitution fixes the problem, though.
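To make the suspected hazard concrete, here is a small sketch of what happens when a multibyte character straddles a buffer boundary and each buffer is decoded on its own. It is an illustration of the hypothesis only, not a reproduction of the actual bug: the two-byte split point and the variable names are invented for the example, and a normal PerlIO read would not decode buffers separately like this.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "a", e-acute, "b" encoded as UTF-8 is four bytes: 61 C3 A9 62.
my $bytes  = encode('UTF-8', "a\x{e9}b");
# Pretend the buffer boundary falls between C3 and A9.
my @chunks = (substr($bytes, 0, 2), substr($bytes, 2));

# Naive: decode every buffer independently. Each half of the split
# character is malformed on its own and gets replaced.
my $naive = join '', map { decode('UTF-8', $_) } @chunks;

# Careful: keep the undecoded tail around. With FB_QUIET, decode()
# stops at the first problem and leaves the rest in $pending.
my ($pending, $careful) = ('', '');
for my $chunk (@chunks) {
    $pending .= $chunk;
    $careful .= decode('UTF-8', $pending, Encode::FB_QUIET());
}

for my $pair ([naive => $naive], [careful => $careful]) {
    my ($label, $string) = @$pair;
    print "$label: ", join(' ', map { sprintf 'U+%04X', ord } split //, $string), "\n";
}
```

The second loop follows the buffered-decoding idiom from the Encode documentation: the incomplete lead byte stays in $pending until the next chunk supplies the rest of the character.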

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      I would have expected that if a file larger than the low-level input buffer size has been slurped, then the contents of multiple, consecutive read (3) calls have simply been concatenated without further ado into a single string buffer, prior to whatever processing comes next in the script. Given that the perl version and Encode version are the same, differences in cpu "native word" size and read buffer size should have no impact. (Rather, if the word/buffer size had any impact, it should affect other behaviors on slurped files, not just tr/ / /s vs. s/ +/ /g.)
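      For reference, on a correctly decoded string the two operations should be interchangeable. A minimal sketch, using a made-up sample string with a couple of non-ASCII characters:

```perl
use strict;
use warnings;

# Sample text with runs of spaces and some non-ASCII characters.
my $text = "caf\x{e9}   au \x{263A}    lait";

# Squeeze runs of spaces with transliteration...
(my $squeezed = $text) =~ tr/ / /s;

# ...and collapse them with substitution.
(my $substituted = $text) =~ s/ +/ /g;

print $squeezed eq $substituted ? "identical\n" : "different\n";
```

      If a perl reports "different" here, the problem lies in the interpreter or the string's internal state rather than in how the file was read.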

      So, when you compared your two machines that were both 32-bit 5.8.3 with the same version of Encode, but 32-bit vs. 64-bit cpus (smaller and larger read buffers), which one had the strange behavior with tr/ / /s going crazy?

      Did the differences in Encode versions on other machines show any relation to the strange behavior? (Were you able to look at the release notes of the later Encode version(s) to see if anything relevant was fixed?)

        Thanks for the help. I discovered another possible source of the problems. The Perl binary on all the machines has the same md5sum, which means the newer 64-bit Linux runs the old 32-bit Perl binary that was compiled on a different machine and system. I'd guess it means all bets are off: there's no way to fix it other than recompiling Perl from source. I was able to do that, and the new 5.8.3 behaves correctly. Now I just need to convince the managers that we need to switch.
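        A rough sketch of the kind of check involved: print the build details of the running perl together with the MD5 of its binary, so the output can be compared between machines. It assumes $^X holds a path that can be opened directly, which is usually true on Linux but not guaranteed; the %build hash is just an illustrative name.

```perl
use strict;
use warnings;
use Config;
use Digest::MD5;

# Build-time facts about the running interpreter.
my %build = (
    archname => $Config{archname},   # architecture perl was built for
    ptrsize  => $Config{ptrsize},    # 4 on a 32-bit build, 8 on 64-bit
    osvers   => $Config{osvers},     # OS version of the build machine
);

# Checksum of the perl binary itself (assumes $^X is an openable path).
open my $fh, '<', $^X or die "cannot open $^X: $!";
binmode $fh;
my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
close $fh;

print "$_: $build{$_}\n" for sort keys %build;
print "md5($^X): $md5\n";
```

        Identical checksums together with an osvers that does not match the running system would point to a binary copied from another machine.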
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
