PerlMonks  

Re: strip html tags and special characters in perl while inserting the text in to database.

by bart (Canon)
on Jun 10, 2007 at 12:09 UTC (#620293)


in reply to strip html tags and special characters in perl while inserting the text in to database.

That looks like UTF-8 to me, misrepresented as Latin-1. Treat it as UTF-8 before processing it any further. It'll suddenly look a lot more manageable.

If you're using perl 5.8, then use the module Encode (the function decode, that would be my first guess) that comes with it.
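As a minimal sketch of that suggestion, assuming the text arrives as raw UTF-8 bytes (the three bytes below are an illustrative input, not the original poster's data), `decode` from the core Encode module turns the bytes into a Perl character string:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Illustrative input: three raw bytes as they might arrive from a file or form.
my $bytes = "\xE2\x80\x9D";

# Interpret the bytes as UTF-8 and get back a character string.
my $text = decode( 'UTF-8', $bytes );

# Three bytes go in, one character comes out.
printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($text), ord($text);
```

Once the data is decoded, regular expressions and string functions operate on whole characters rather than on the individual bytes of a multi-byte sequence.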

And don't just strip accented characters; you'd be replacing one kind of error with another. I have no objection to dumbing down special quote-like characters to plain double quotes and apostrophes. (I'm guessing " " actually represents a nbsp, which you can safely replace with a plain space.)
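A rough sketch of that kind of "dumbing down", assuming the text has already been decoded to characters. The helper name `plain_punctuation` and the choice of which characters to map are assumptions made for illustration; accented letters are deliberately left alone:

```perl
use strict;
use warnings;

# Hypothetical helper: map typographic punctuation to ASCII and
# leave every other character (including accented letters) untouched.
sub plain_punctuation {
    my ($text) = @_;
    $text =~ tr/\x{201C}\x{201D}/"/;    # curly double quotes -> "
    $text =~ tr/\x{2018}\x{2019}/'/;    # curly single quotes -> '
    $text =~ tr/\x{A0}/ /;              # no-break space -> plain space
    return $text;
}

# Curly-quoted word with an accented e, followed by a no-break space.
my $in  = "\x{201C}caf\x{E9}\x{201D}\x{A0}menu";
my $out = plain_punctuation($in);

binmode( STDOUT, ':encoding(UTF-8)' );
print "$out\n";
```

The accented letter survives the conversion; only the quotes and the no-break space are replaced.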


Re^2: strip html tags and special characters in perl while inserting the text in to database.
by valavanp (Curate) on Jun 10, 2007 at 17:12 UTC
    Hi monks, first of all, thanks for the suggestions. I tried the following code, based on the links you provided:

        use strict;
        use warnings;
        use Encode qw( _utf8_on );

        my $resume = "”";
        print $resume, "\n";
        _utf8_on($resume);
        print $resume, "\n";

    When I execute the above code, it gives me the same output in both print statements. I want the corresponding special character for the $resume variable. Please correct me if I am wrong in the above code. Thanks.
      I want the corresponding special character for the $resume variable.

      I don't understand what that means. Can you explain more carefully what you really want? Also, can you please try to be more clear about what is being assigned as the value of $resume?

      It actually seems that you are assigning a three-byte value:  "\xE2\x80\x9D" -- this happens to be interpretable as the utf8 encoding for the unicode character U+201D "RIGHT DOUBLE QUOTATION MARK". Do you want to replace this with the ASCII double-quote character?

          my $resume = "\x{201D}";
          print "$resume\n";
          $resume =~ s/\x{201d}/"/g;
          print "$resume\n";
      (updated to make sure the s/// applies to the value of $resume)
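The snippet above starts from a character string. If the value really is the three raw bytes described earlier, the substitution only matches after decoding. A minimal sketch of that difference, assuming UTF-8 input (the variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw( decode );

# Assumed input: the undecoded UTF-8 bytes for U+201D.
my $raw = "\xE2\x80\x9D";

# On the raw bytes the pattern finds nothing, because the string
# holds three separate byte values rather than one character.
my $undecoded = $raw;
my $hits_raw  = ( $undecoded =~ s/\x{201D}/"/g ) || 0;

# After decoding, the same substitution matches once.
my $decoded      = decode( 'UTF-8', $raw );
my $hits_decoded = ( $decoded =~ s/\x{201D}/"/g ) || 0;

print "matches on raw bytes: $hits_raw\n";
print "matches after decode: $hits_decoded\n";
print "result: $decoded\n";
```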

      To do that sort of replacement in a "general" sense (i.e. replace all "wide-character" versions of punctuation marks with ASCII versions of same wherever possible), you probably want Text::Unidecode:

          #!/usr/bin/perl
          use strict;
          use Text::Unidecode;

          my $resume = "\x{201d}";
          print unidecode( $resume ), "\n";
