PerlMonks  

Slightly OT - Matching a strange character

by NovMonk (Chaplain)
on May 19, 2004 at 15:33 UTC ( #354650=perlquestion )

NovMonk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Esteemed Monks,

I'm processing a text file that has a funny character in it. I'd like to convert this character to something less funny, but I'm not sure how to match it because I don't know what it is. A further problem is, when I try to paste in my problem line here, it converts to "Don't know" so I can't even figure out how to show it to you. In my file the 't looks like a Chinese ideogram or something.

I realize this isn't strictly Perl, but I am using the information in a Perl script. Has anyone seen anything like this, or can you point me somewhere I might look to translate this character? I've checked control character type lists in my battered Perl by Example and tried googling, but I'm not sure what to call what I'm looking for.

It might help to know that the file in question is a MS Word file saved as a tab delimited text file.

Thanks as always for the help.

Pax vobiscum,

NovMonk

Re: Slightly OT - Matching a strange character
by EdwardG (Vicar) on May 19, 2004 at 15:43 UTC
    I'm processing a text file that has a funny character in it. I'd like to convert this character to something less funny
    % perl -lne "s/Charlie Chaplin/George Bush/g;print" < myfile

     

Re: Slightly OT - Matching a strange character
by waswas-fng (Curate) on May 19, 2004 at 15:47 UTC
    On Unix you can use od -c <filename> to dump every character, with the non-printing ones shown as escapes. The output looks something like this:
    % head -2 /etc/hosts
    ##
    # Host Database
    % head -2 /etc/hosts | od -c
    0000000   #   #  \n   #       H   o   s   t       D   a   t   a   b   a
    0000020   s   e  \n
    0000023


    -Waswas
      Right -- though a hex dump tends to be more useful in cases like this. And of course, if the odd byte happens to be near the end of a large file, this approach can be tedious...

      Here's my favorite -- it prints a histogram of byte values in a data stream or file:

      #!/usr/bin/perl
      use strict;
      my @ch;
      while (<>) {
          $ch[ord()]++ for ( split( // ));
      }
      printf "%6d %.2x\n", $ch[$_], $_ for ( grep {$ch[$_]} 0..$#ch );
      And it would be easy to make a couple additions/alterations so that it tabulates utf8 characters instead of bytes.
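One way the character-based variant of the byte histogram above might look is sketched below. It assumes the text has already been decoded into Perl characters; the sub names `char_histogram` and `print_histogram` are made up for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count characters (not bytes) in a string that has already been decoded.
# Returns a hash ref mapping code point => number of occurrences.
sub char_histogram {
    my ($text) = @_;
    my %count;
    $count{ ord($_) }++ for split //, $text;
    return \%count;
}

# Print the histogram as "count U+XXXX", sorted by code point.
sub print_histogram {
    my ($count) = @_;
    printf "%6d U+%04X\n", $count->{$_}, $_
        for sort { $a <=> $b } keys %$count;
}

# Demo on an in-memory string; "\x{2019}" is RIGHT SINGLE QUOTATION MARK.
my $sample = "Don\x{2019}t know\n";
print_histogram( char_histogram($sample) );
```

To run this over a UTF-8 file rather than a literal string, the file could be opened with a decoding layer, for example `open my $fh, '<:encoding(UTF-8)', $path`, and each line passed to `char_histogram`.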

      Or just cat -v?
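For a large file, where scanning a full dump is tedious, a rough Perl sketch in the same spirit as cat -v is to report only the bytes that fall outside printable ASCII. The sub name `odd_bytes` is hypothetical, and the definition of "odd" used here (anything other than printable ASCII, tab, CR, or LF) is an assumption that may need adjusting for your data.

```perl
use strict;
use warnings;

# Return a list of [offset, byte value] pairs for every byte in $line
# that is neither printable ASCII nor tab / carriage return / newline.
sub odd_bytes {
    my ($line) = @_;
    my @found;
    while ( $line =~ /([^\x20-\x7E\t\r\n])/g ) {
        # pos() points just past the match, so step back one byte.
        push @found, [ pos($line) - 1, ord($1) ];
    }
    return @found;
}

# Demo on an in-memory line containing the byte \222.
for my $hit ( odd_bytes("Don\222t know\n") ) {
    printf "offset %d: \\%03o (0x%02X)\n", $hit->[0], $hit->[1], $hit->[1];
}
# prints: offset 3: \222 (0x92)
```

When reading from a file handle, the same sub can be called inside a `while (<$fh>)` loop, with `$.` printed alongside each hit to give the line number.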

Re: Slightly OT - Matching a strange character
by NovMonk (Chaplain) on May 19, 2004 at 16:06 UTC
    Thanks EdwardG for the pun, and Waswas for the direction. The character in question is \222, and it is now neutralized. As are a couple of others I didn't see which had also sneaked in uninvited. Thanks for your help.

    Pax vobiscum,

    NovMonk

    Update: Here's my code that worked:

    while (<FILE>){
        s/\222/'/g;
        s/\226/-/g;
        <do more stuff>;
    }

    \222 is the octal escape for the character in question; I found it in an escaped-character table once I used Waswas's idea to see what it really was. Thanks again everyone.
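For context, octal \222 is byte 0x92 and \226 is byte 0x96. In Windows-1252 these are the right single quotation mark (U+2019) and the en dash (U+2013). Assuming the Word export really is Windows-1252 (likely for a Western Windows install, but not confirmed in the thread), a more general sketch would decode the bytes with the core Encode module and then fold the "smart" punctuation to ASCII. The sub name `asciify_cp1252` is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input bytes are Windows-1252 (cp1252).
# Decode them to characters, then map curly quotes and dashes to ASCII.
sub asciify_cp1252 {
    my ($bytes) = @_;
    my $text = decode( 'cp1252', $bytes );
    # U+2018/U+2019 -> ', U+201C/U+201D -> ", U+2013/U+2014 -> -
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}\x{2014}/''""\-\-/;
    return $text;
}

print asciify_cp1252("Don\222t know \226 maybe\n");
# prints: Don't know - maybe
```

This catches the other curly quotes and the em dash as well, which may help with characters that sneak in later.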

      May I recommend
      tr/\222\226/'-/;
      instead of your two s///'s?

      The PerlMonk tr/// Advocate
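A small side-by-side sketch of the two approaches, to show they give the same result on this kind of input. The sub names are made up for illustration. tr/// maps characters one-for-one in a single pass, and a hyphen at the end of the replacement list is taken literally rather than as a range.

```perl
use strict;
use warnings;

# Two substitutions: one pass over the string per pattern.
sub fix_with_s {
    my ($line) = @_;
    $line =~ s/\222/'/g;
    $line =~ s/\226/-/g;
    return $line;
}

# One transliteration: a single pass, mapping \222 -> ' and \226 -> -.
sub fix_with_tr {
    my ($line) = @_;
    $line =~ tr/\222\226/'-/;
    return $line;
}

my $line = "Don\222t know \226 really\n";
print fix_with_tr($line);
# prints: Don't know - really
```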
        You may indeed. You are a God. Or well, ok, only a Bishop. Regardless, thanks!

        NovMonk

Re: Slightly OT - Matching a strange character
by TomDLux (Vicar) on May 19, 2004 at 16:15 UTC

    Open the file in emacs, cursor over to the character and use C-x =.

    Oh, ok, a Perl solution. Making a few assumptions about the variable storing the string, and the text in the region of the funny character ...

    use utf8;
    $text_before = "abc";
    $text_after  = "def";
    $text =~ s/($text_before).($text_after)/${1}X${2}/;

    --
    TTTATCGGTCGTTATATAGATGTTTGCA
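The same known-context idea can also be used to identify the character rather than replace it. The sketch below captures whatever single character sits between the two known strings and reports its code. `mystery_code_point` is a hypothetical helper, and the surrounding text ("Don" and "t") is taken from the example line in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code of the single character sitting
# between two known pieces of text, or undef if the context is not found.
# \Q...\E quotes the context so regex metacharacters in it are literal.
sub mystery_code_point {
    my ( $text, $before, $after ) = @_;
    return $text =~ /\Q$before\E(.)\Q$after\E/s ? ord($1) : undef;
}

my $cp = mystery_code_point( "Don\222t know", "Don", "t" );
printf "octal \\%o, hex 0x%02X\n", $cp, $cp if defined $cp;
# prints: octal \222, hex 0x92
```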

Re: Slightly OT - Matching a strange character
by Mr. Muskrat (Canon) on May 19, 2004 at 23:52 UTC

    I'm processing a text file that has a funny character in it. I'd like to convert this character to something less funny

    Is it a ☺? Those are hilarious!

    s/☺/☠/g;

    Update: It seems that I should read the replies before making a joke because EdwardG beat me to it!
