
Re^2: Removing Non-Ascii chars from text file

by Anonymous Monk
on Nov 19, 2012 at 08:34 UTC ( #1004482=note )

in reply to Re: Removing Non-Ascii chars from text file
in thread Removing Non-Ascii chars from text file

^\x20-\x7E is not ASCII; it covers only the printable characters. Real ASCII is ^\x00-\x7F. With the narrower range, it will strip out newlines and other control characters that are part of the ASCII table!
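To make the difference concrete, here is a minimal sketch comparing the two negated classes. The sample string and the variable names are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample: printable text, a tab, a newline,
# and one non-ASCII character (code point 0xE9).
my $text = "caf\x{E9}\tau lait\n";

# Delete everything outside 32..126: the tab and newline go too.
(my $printable_only = $text) =~ s/[^\x20-\x7E]//g;

# Delete everything outside 0..127: control characters survive.
(my $full_ascii = $text) =~ s/[^\x00-\x7F]//g;
```

With this input, $printable_only ends up as "cafau lait", while $full_ascii keeps the tab and the trailing newline and loses only the 0xE9 character.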

Re^3: Removing Non-Ascii chars from text file
by jdporter (Canon) on Nov 21, 2012 at 14:45 UTC

    Correct. ASCII "includes definitions for 128 characters: 33 are non-printing control characters... and 95 printable characters..."
    See this scanned copy of the original "American Standard Code for Information Interchange (ASCII)" from 1963, the 5th page in particular. This definition is also enshrined in Internet RFC 20.
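The 33 + 95 split quoted above can be checked with a few lines of Perl. This is only a sketch: it treats code points 32 through 126 as the printable set (space included), which is the same range discussed in this thread.

```perl
use strict;
use warnings;

# All 128 ASCII code points, 0 through 127.
my @ascii = map { chr($_) } 0 .. 127;

# grep in scalar context returns the number of matches.
my $printable = grep { /[\x20-\x7E]/ } @ascii;
my $control   = @ascii - $printable;

printf "%d printable, %d control\n", $printable, $control;
```

The control count is made up of code points 0 through 31 plus DEL (127).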

Re^3: Removing Non-Ascii chars from text file
by Anonymous Monk on Nov 21, 2012 at 08:48 UTC

    <c>^\x20-\x7E</c> This is not ASCII

    Sure it is: 32 through 126 are all ASCII (the negated class matches precisely the characters that aren't 32 through 126).
