PerlMonks  

strip html tags and special characters in perl while inserting the text in to database.

by valavanp (Curate)
on Jun 09, 2007 at 08:12 UTC (#620160=perlquestion)
valavanp has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I am inserting the following text into the database.
<p>text comes here </p><p>text within after in  attorney’s fees,  Inc’s principle</p>
It contains HTML tags and some special characters, and I don't want to insert those. Is there any way of doing this? My approach would be regular expressions, or is there a module that does it? When using regular expressions I don't know the exact special character to match. Can anyone suggest a way of doing this? Thanks in advance.

Re: strip html tags and special characters in perl while inserting the text in to database.
by naikonta (Curate) on Jun 09, 2007 at 08:37 UTC
    Yes, HTML::Strip and Strip HTML. I see you've been here for a while, so I suppose you know about the Search box, and/or the Q&A section. What didn't you get from them? What have you tried?

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re: strip html tags and special characters in perl while inserting the text in to database.
by blazar (Canon) on Jun 09, 2007 at 16:07 UTC
    When using regular expressions I don't know the exact special character to match.

    (naikonta already told you 'bout HTML tags.) This all depends on how special the characters you want to exclude really are. What do you want? ASCII only? And if so, then what's to become of the excluded characters? Are you just stripping them? Do you want to encode them some way?
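
One way to "encode them some way" rather than strip them is to turn each non-ASCII character into an HTML numeric character reference. This is only a sketch: it assumes the string has already been decoded into Perl characters, and the sub name `to_ncr` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with &#NNNN;
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

print to_ncr("Inc\x{2019}s principle"), "\n";   # Inc&#8217;s principle
```

Nothing is lost this way, so the original character can still be recovered later if ASCII-only storage turns out to be the wrong choice.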

Re: strip html tags and special characters in perl while inserting the text in to database.
by graff (Chancellor) on Jun 09, 2007 at 16:26 UTC
    You could try this:
    $_ = "<p>text comes here </p><p>text within after in  attorney’s fees,  Inc’s principle</p>";
    s/<[^>]+>//g;          # strip HTML tags
    s/[^[:ascii:]]+//g;    # strip runs of non-ASCII characters
    print;
    Have you read anything yet about perl regular expressions (perlretut and perlre)? You should -- the docs are really quite good.
Re: strip html tags and special characters in perl while inserting the text in to database.
by bart (Canon) on Jun 10, 2007 at 12:09 UTC
    That looks like UTF8 to me, misrepresented as Latin-1. Treat it as UTF-8, before processing it any further. It'll suddenly look a lot more manageable.

    If you're using perl 5.8, then use the module Encode (the function decode, that would be my first guess) that comes with it.

    And don't just strip accented characters: you'd be replacing one kind of error with another. I have no objection to dumbing down special quote-like characters to plain double quotes and apostrophes. (I'm guessing "Â " actually represents a nbsp, which you can safely replace with a plain space.)
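
A minimal sketch of that approach, assuming the data arrives as raw UTF-8 bytes. The sub name `clean_text` and the particular set of replacements are illustrative choices, not something taken from the thread; `decode` comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes, strip tags, and dumb down
# a few typographic characters to their ASCII look-alikes.
sub clean_text {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl characters
    $text =~ s/<[^>]+>//g;                # naive tag removal
    $text =~ s/\x{00A0}/ /g;              # no-break space -> plain space
    $text =~ tr/\x{2018}\x{2019}/'/;      # curly single quotes -> apostrophe
    $text =~ tr/\x{201C}\x{201D}/"/;      # curly double quotes -> double quote
    return $text;
}

# Raw bytes: "attorney", U+2019 as UTF-8, "s", U+00A0 as UTF-8, "fees"
my $raw = "<p>attorney\xE2\x80\x99s\xC2\xA0fees</p>";
print clean_text($raw), "\n";             # attorney's fees
```

Decoding first matters: once the bytes are real characters, each replacement targets one known code point instead of a guess at which stray bytes to delete.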

      Hi monks, first of all, thanks for the suggestions. I tried the following code, based on the links you provided:
      use strict;
      use warnings;
      use Encode qw( _utf8_on );
      my $resume = "”";
      print $resume, "\n";
      _utf8_on($resume);
      print $resume, "\n";
      When I execute the above code, it gives me the same output in both print statements. I want the corresponding special character for the $resume variable. Please correct me if I am wrong in the above code. Thanks.
        I want the corresponding special character for the $resume variable.

        I don't understand what that means. Can you explain more carefully what you really want? Also, can you please try to be more clear about what is being assigned as the value of $resume?

        It actually seems that you are assigning a three-byte value:  "\xE2\x80\x9D" -- this happens to be interpretable as the utf8 encoding for the unicode character U+201D "RIGHT DOUBLE QUOTATION MARK". Do you want to replace this with the ASCII double-quote character?

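        binmode STDOUT, ':encoding(UTF-8)';  # avoid "Wide character in print" warnings
        my $resume = "\x{201D}";             # RIGHT DOUBLE QUOTATION MARK
        print "$resume\n";
        $resume =~ s/\x{201d}/"/g;           # replace it with the ASCII double quote
        print "$resume\n";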
        (updated to make sure the s/// applies to the value of $resume)
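
To illustrate the point about the three-byte value, the following sketch compares the string before and after decoding. It assumes the source text is UTF-8 and uses `decode` from the core Encode module rather than the internal `_utf8_on` flag-flipping function; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE2\x80\x9D";              # three octets, as read from a file or form
my $chars = decode('UTF-8', $bytes);     # one character, U+201D

printf "octets: %d\n", length $bytes;        # octets: 3
printf "chars:  %d\n", length $chars;        # chars:  1
printf "code point: U+%04X\n", ord $chars;   # code point: U+201D
```

Once the value is a single character, a substitution such as `s/\x{201d}/"/g` has something to match; applied to the undecoded octets it would match nothing.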

        To do that sort of replacement in a "general" sense (i.e. replace all "wide-character" versions of punctuation marks with ASCII versions of same wherever possible), you probably want Text::Unidecode:

        #!/usr/bin/perl
        use strict;
        use Text::Unidecode;
        my $resume = "\x{201d}";
        print unidecode( $resume ), "\n";
