PERL UNIX and strangeness and char conversion

by Anonymous Monk
on Jan 18, 2021 at 16:26 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to replace <tags> in a document. I noticed something interesting:

if I do 'more FOO.txt' I see:

<93> abcde <94>
if I view the same file with ptked or gedi, I see:
\x{93} abcde \x{94}
SOOO, is the 'more' filter correct or is the editor written in PERL correct?

I want to convert the stupidness to plain ASCII(?) punctuation marks so the grammar is correct, it looks better, and I can search with a perl/ptk program I am writing. Has someone already written a program to do this?

I have already figured out how to convert the EOL chars to "\n".

Re: PERL UNIX and strangeness and char conversion
by hippo (Bishop) on Jan 18, 2021 at 16:35 UTC
Re: PERL UNIX and strangeness and char conversion
by ikegami (Patriarch) on Jan 19, 2021 at 12:18 UTC

    The file doesn't contain <93> or \x{93}; it contains a byte with a value of 0x93. This is garbage (where found) when UTF-8 is expected. The two programs simply handle this garbage differently.


    What you have is probably a file encoded using cp1252 rather than UTF-8. «“» and «”» are encoded as the bytes 0x93 and 0x94 respectively when using cp1252.

    Convert the file's encoding from cp1252 to UTF-8 before using it with tools expecting UTF-8.

    iconv -f cp1252 -t UTF-8 file.cp1252 >file.utf8
Re: PERL UNIX and strangeness and char conversion
by tybalt89 (Monsignor) on Jan 18, 2021 at 19:54 UTC

    "more" and those editors are all correct. They are just showing a different representation of the same unprintable character.

    perl -pe 'tr/\x80-\xff/?/' <oldfile >convertedfile

    There. I've converted all that strangeness to a plain ASCII ? just like you asked.

      OK! I THINK I GET IT

      perl -pe 'tr/\x93-\x94/"/' <INFILE >OUTFILE
      Inserts the "
      perl -pe 'tr/\x97/,/' <INFILE >OUTFILE
      Inserts the ,
      perl -pe 'tr/\x91-\x92/'/' <INFILE >OUTFILE
      Croaks.

      If I insert "\" to escape the second ' IT STILL CROAKS!
      What is special about \x91 and \x92 in the above statement?
      At least I made progress this time...

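        Inside single quotes, the bash shell does not process backslash escapes, so the inner ' ends the quoted string early and a "\" in front of it does not help. Try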

        perl -pe 'tr/\x91-\x92/\x27/' <INFILE >OUTFILE
        instead.

Re: PERL UNIX and strangeness and char conversion
by jcb (Parson) on Jan 18, 2021 at 23:59 UTC
    if I do 'more FOO.txt' I see:
    <93> abcde <94>
    if I view the same file with ptked or gedi, I see:
    \x{93} abcde \x{94}

    That is the hint that your "<tags>" are not tags at all. The more program displays bytes with values greater than 127 as "<xx>" where "xx" is two hex digits.

    As other monks mentioned on a previous question, this is very likely to be Microsoft "smart quotes" garbage if those are in the proper positions in the text, although standards declare the 0x93 and 0x94 codepoints to be in the C1 group of control characters. (So of course Microsoft would use them as graphic characters...) You probably want the old "demoronizer" tool; it was written specifically to fix this type of stupidity.

Re: PERL UNIX and strangeness and char conversion
by Anonymous Monk on Jan 18, 2021 at 20:31 UTC
    Apps vary considerably in how they display "non-ASCII characters" and, separately, in whether they understand UTF-x (Unicode). Both are probably telling you, each in its own way, that they see bytes with the most significant bit set (values 0x80 and above).
Re: PERL UNIX and strangeness and char conversion
by betmatt (Scribe) on Jan 20, 2021 at 10:52 UTC
    This is a plea from somebody that maybe needs to know more Computer Science. Can you please rewrite this question giving a bit of background. I feel that I should understand the question, but I can't make head or tail of it. A bit of background would go a long way for the more novice browser.
