unknown character in between text

by soumyapanda (Acolyte)
on Sep 17, 2011 at 09:29 UTC ( [id://926525]=perlquestion )

soumyapanda has asked for the wisdom of the Perl Monks concerning the following question:

original text

Some Text, § 8

after processing the data and printing in text file

Some Text, Â§ 8

Hi, when I read, process and print the above data to a text file, I get an extra character like Â which is not present in the original data. I tried replacing the character with a space using a regex, but that failed. Can you please help me find out what the character is (or whether it is a space, I am not sure)?

Replies are listed 'Best First'.
Re: unknown character in between text
by choroba (Cardinal) on Sep 17, 2011 at 09:42 UTC
    Do you use warnings? Have you set UTF-8 encoding for both your input and your output?
Re: unknown character in between text
by Anonymous Monk on Sep 17, 2011 at 09:50 UTC
Re: unknown character in between text
by ikegami (Patriarch) on Sep 18, 2011 at 02:10 UTC
    The thread is heading in the wrong direction. It's fixing a symptom rather than the problem. The problem is surely a lack of encoding and/or decoding inputs and outputs as choroba pointed out. If we had more to go on, we could help you.
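A minimal sketch of what decoding the input and encoding the output can look like. The original code was never posted, so the helper name `copy_utf8` and both file paths are assumptions made for illustration; the `:encoding(UTF-8)` layers are standard PerlIO.

```perl
use strict;
use warnings;

# Hypothetical helper: copy text from one file to another, decoding the
# input and encoding the output explicitly as UTF-8.
sub copy_utf8 {
    my ($in_path, $out_path) = @_;

    open my $in,  '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        # $line holds decoded characters here, so "§" is one character
        # (U+00A7) and is written back as the same two bytes, C2 A7.
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

Called as `copy_utf8('input.txt', 'output.txt')`, this should leave "Some Text, § 8" unchanged, provided the input really is UTF-8.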
Re: unknown character in between text
by pvaldes (Chaplain) on Sep 17, 2011 at 11:53 UTC
    use utf8;

    hexadecimal, unicode code point, name:

    §: a7c2, U+00A7, Section sign

    Â: 82c3, U+00C2, Latin Capital Letter A with circumflex

      The docs say about use utf8:
      Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

      It does not magically transform your arbitrarily encoded input into equally arbitrarily encoded output.

      What we see here is Unicoded input (two byte wide), containing the bytes (in hex) C2 and A7. Upon output, somehow the unicode-ness of the characters was lost (we can only guess how that happened, as no code is provided) and we now see two characters, one being Â (which is ASCII hex C2) and the other being § (which is ASCII hex A7).

      use utf8 is unlikely to solve this problem.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Nitpick.

        There is no ASCII above 127 (hex 7F). What you have is Latin-1.
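The effect discussed in this sub-thread can be reproduced with the core Encode module. This is only a sketch of one plausible cause (input bytes never decoded, output encoded as UTF-8); the poster's actual code is unknown, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of "§" (U+00A7), as read from a file without decoding.
my $raw = "\xC2\xA7";

# Undecoded, Perl sees two characters, U+00C2 and U+00A7 (Latin-1 "Â" and
# "§"). Encoding them as UTF-8 on output writes four bytes: "Â§".
my $double_encoded = encode('UTF-8', $raw);

# Decoded first, the two bytes become the single character U+00A7, and
# encoding it again gives back the original two bytes.
my $char       = decode('UTF-8', $raw);
my $round_trip = encode('UTF-8', $char);

printf "undecoded: %d chars, output bytes %s\n",
    length($raw), join ' ', map { sprintf '%02X', ord } split //, $double_encoded;
printf "decoded:   %d char,  output bytes %s\n",
    length($char), join ' ', map { sprintf '%02X', ord } split //, $round_trip;
# prints:
# undecoded: 2 chars, output bytes C3 82 C2 A7
# decoded:   1 char,  output bytes C2 A7
```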

Re: unknown character in between text
by DanielSpaniel (Scribe) on Sep 17, 2011 at 12:42 UTC

    If you wanted to replace the offending character with its HTML entity, without using modules, you could maybe do something like this, which should work in many cases, (or use the commented line instead to replace the character with a space):

    for (my $x = 0; $x < length($string); $x++) {
        if (ord(substr($string, $x, 1)) > 127) {
            substr($string, $x, 1) = '&#' . ord(substr($string, $x, 1)) . ';';
            # substr($string, $x, 1) = ' ';   # or use this line instead
        }
    }

    ... or, if you just wanted to know what a character is meant to be, then you could do something like this:

    for (my $x = 0; $x < length($string); $x++) {
        print ord(substr($string, $x, 1)), "\t", substr($string, $x, 1), "\n";
    }

    Hope that helps, although all the modules and tools mentioned above are useful methods too. (and I'm sure some guru could likely condense the code above into a single line).

      eeew :p

      s/([^\000-\200])/'&#'.ord($1).';'/ge

      s/([^\000-\200])/sprintf '&#x%X;', ord $1/ge

        :-) That looks tidier! Haven't tested it, but I get the gist ... I knew someone would be able to condense into a handful of bytes! But why 200 in the character class? I'm sure there must be a good reason, but I think I am more familiar with just seeing \000-\177
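On the two ranges: the escapes are octal, so \177 is 127 (the last ASCII code) and \200 is 128. A class ending at \200 therefore leaves the character with code 128 untouched, while one ending at \177 catches everything outside 7-bit ASCII. A small sketch using the \177 form; the helper name `escape_non_ascii` is an assumption for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside 7-bit ASCII with a
# decimal HTML entity, using the \000-\177 range.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\000-\177])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# The octal escapes as decimal numbers.
printf "\\177 = %d, \\200 = %d\n", oct('177'), oct('200');

# The garbled text from the question, written as explicit code points.
print escape_non_ascii("Some Text, \x{C2}\x{A7} 8"), "\n";
# prints:
# \177 = 127, \200 = 128
# Some Text, &#194;&#167; 8
```

As noted earlier in the thread, this only hides the symptom; decoding the input properly removes the stray character at the source.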
