AssFace has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that I have mentioned on here before. It pulls down EBay feedback and also news headlines and then superimposes that text over images from recent news articles.

I have noticed that sometimes the text that it writes out doesn't have the white background that the other text does (it uses Image::Magick's Annotate call, which sets the background color on the text).
It will also occasionally just not have any text at all.

I've narrowed the problem down to "special" characters that are in the text that it is pulling off of its various sources.

I don't know what these characters are - the script is just scraping a web page, and I don't see the characters there - but in the files that it populates with the text, if I view them over ssh with the less command, some examples that I see are:

and more importantly:

Now the characters up there are what show up when viewed via less in my ssh connection - but they aren't really there like that. It looks like less misinterprets the characters and displays that instead, not knowing what else to do.
The German text seems to be okay and when it gets put into the images as text, it will still show up as whatever letter it is supposed to be - usually with an umlaut or accent, or whatever.

But what I'm most confused about is the last one - what looks like control-M-control-M at the beginning of a line - that is the text that usually then shows up without a background (like Image::Magick is somehow breaking on that text).

I have code in place to strip out various characters - but I'm not sure how to strip these out, since I don't even know what they are - looking at them in my terminal doesn't offer hope.
I know that I can add a regex to yank out anything that isn't a letter/number - but then there are the punctuation marks and whatnot.

Any help/suggestions would be great - thanks!

There are some odd things afoot now, in the Villa Straylight.

Replies are listed 'Best First'.
Re: Stripping out special characters
by Util (Priest) on May 11, 2003 at 21:36 UTC

    Tools to assist, and thoughts:

    Generate a file of all 256 characters, for viewing in your viewer of choice, to match up something like <FC> to its actual character code.

    perl -e 'print map chr,0..255' > chars.all

    object dump: dump a file in hex format. You could copy mystery characters into a file, and run this on the file.

    od -t x1 chars.all

    Dump a scalar in hex format. I use this when writing programs that decode binary files.

    sub hex_it { return join ' ', map {sprintf '%2.2x', $_} unpack('C*', $_[0]); }

    Data::Dumper's Useqq can be set to 1, causing dumps to be encoded like you would write them in a Perl double-quoted string; just right for developing a regex.

use Data::Dumper; $Data::Dumper::Useqq = 1; print Dumper $weird;

    Many of the strange characters are probably coming from people pasting text direct from MS Word, which is infamous for causing these kinds of problems. Rather than just removing the characters, you may want to paste them into Word to see what they really mean, and write your regex to translate to the nearest equivalent. For example,

    tr{\x93\x94}{""}; # Translate MS Word SmartQuotes into regular quotes.

Control-M is also known as "\r", Carriage Return, or just CR. Control-J is also known as "\n", Line Feed, or just LF. The names are left over from the old teletype days. Different systems use different characters (sometimes more than one) to end a line of text; this is called the "newline" for that system. Unix uses LF, while Windows uses CRLF. When you view Windows text on a Unix system, you see the CR that is left over after your viewer interprets the LF.