Are you sure? What problems are you having? Here's the snippet from the code that translates smart-quotes:

    $s =~ s/\x93/"/g;
    $s =~ s/\x94/"/g;

And here's how I've modified the core demoronise sub:

    sub de_cp1252 {
        my( $self, $s ) = @_;

        # Map incompatible CP-1252 characters
        $s =~ s/\x82/,/g;
        $s =~ s-\x83-<em>f</em>-g;
        $s =~ s/\x84/,,/g;
        $s =~ s/\x85/.../g;
        $s =~ s/\x88/^/g;
        $s =~ s-\x89- /-g;
        $s =~ s/\x8B/</g;
        $s =~ s/\x8C/Oe/g;
        $s =~ s/\x91/'/g;
        $s =~ s/\x92/'/g;
        $s =~ s/\x93/"/g;
        $s =~ s/\x94/"/g;
        $s =~ s/\x95/*/g;
        $s =~ s/\x96/-/g;
        $s =~ s/\x97/--/g;
        $s =~ s-\x98-<sup>~</sup>-g;
        $s =~ s-\x99-<sup>TM</sup>-g;
        $s =~ s/\x9B/>/g;
        $s =~ s/\x9C/oe/g;

        # Now check for any remaining untranslated characters.
        # Both tests use the same character class, so a tab (\x09) is never reported.
        if ($s =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
            for( my $i = 0; $i < length($s); $i++ ) {
                my $c = substr($s, $i, 1);
                if ($c =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
                    printf(STDERR "warning--untranslated character 0x%02X in input line %s\n",
                        unpack('C', $c), $s );
                }
            }
        }

        return $s;
    }

I didn't really care about the other stuff (such as bad HTML or Unicode); I just wanted to translate the known misplaced CP-1252 characters into something reasonable.
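
If you only need the plain-text punctuation and not the HTML replacements, the same idea can be written as a lookup table applied in a single pass. This is only a minimal sketch, not part of demoroniser: the names %CP1252_TO_ASCII and strip_smart_punctuation are made up for illustration, and the table covers just a subset of the mappings in the sub above.

```perl
use strict;
use warnings;

# Hypothetical lookup table: CP-1252 byte => ASCII stand-in.
# Only a subset of the mappings from de_cp1252 is listed here.
my %CP1252_TO_ASCII = (
    "\x82" => ',',
    "\x84" => ',,',
    "\x85" => '...',
    "\x91" => "'",
    "\x92" => "'",
    "\x93" => '"',
    "\x94" => '"',
    "\x96" => '-',
    "\x97" => '--',
);

# Replace every mapped byte in one pass; bytes in \x80-\x9F that
# have no entry in the table are left untouched.
sub strip_smart_punctuation {
    my ($s) = @_;
    $s =~ s/([\x80-\x9F])/exists $CP1252_TO_ASCII{$1} ? $CP1252_TO_ASCII{$1} : $1/ge;
    return $s;
}

# prints: "Hello" - it's fine...
print strip_smart_punctuation("\x93Hello\x94 \x96 it\x92s fine\x85"), "\n";
```

A table keeps the mappings in one place, so adding or changing a replacement means editing a hash entry rather than another s/// line.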


Re^4: Reg Ex to strip MS smart quotes
by freddo411 (Chaplain) on Aug 19, 2005 at 20:35 UTC
    Bingo. That snippet is perfect.

    Interestingly, I had already found demoronizer but kept looking, because I thought it only worked on HTML and only output HTML entities.

    Thanks again.

    Nothing is too wonderful to be true
    -- Michael Faraday