how to calculate the length for other language content in PERL?

by vasanthgk91 (Acolyte)
on May 15, 2013 at 06:53 UTC (#1033620)
vasanthgk91 has asked for the wisdom of the Perl Monks concerning the following question:

my $title="சிவகாசி அருகே வெடி விபத்து : 2 பேர் உயிரிழப்பு";
use Encode::Guess;
my $enc = guess_encoding($title, qw/euc-jp shiftjis 7bit-jis/);
ref($enc) or die "Can't guess: $enc"; # trap error this way
$title= $enc->decode($title);
my $title_length=length($title);

The output I get is wrong: 268

Re: how to calculate the length for other language content in PERL?
by thomas895 (Friar) on May 15, 2013 at 07:06 UTC

    That looks like HTML character codes. Your method just counts how many characters are in that string, which seems to be 268.

    The fix is simple. Install HTML::Entities, follow the examples given (like the one in the synopsis), and go from there.

    HTH

    ~Thomas~ 
    "Excuse me for butting in, but I'm interrupt-driven..."
      Those are not HTML entities. That is Tamil-language content; when I paste it into a textarea, it looks like that.
        sorry vasanth!

        Yes, I know that. I meant your string with all the ampersands -- those are HTML Entities. They represent those characters, but in order to get the actual characters, you need to decode that first.
        This is where HTML::Entities comes in. Feed your string through decode_entities, and then do your thing with guess_encoding from Encode::Guess, if that is still needed. The output of decode_entities($title) is the string whose length you should measure.

        ~Thomas~ 
        "Excuse me for butting in, but I'm interrupt-driven..."
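
The decode_entities suggestion in the reply above can be sketched roughly as follows. This is only an illustration, assuming HTML::Entities is installed from CPAN; the sample string is a made-up stand-in (one word of the title written as numeric character references), not the exact data from the question.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Assumed sample input: the word "அருகே" as five numeric character
# references, each of the form '&#nnnn;' (7 characters).
my $title = '&#2949;&#2992;&#3009;&#2965;&#3015;';

# Replace each reference with the character it stands for.
my $decoded = decode_entities($title);

print length($title),   "\n";   # 35: counts '&', '#', digits and ';'
print length($decoded), "\n";   # 5: one per decoded character
```

Measuring the raw string counts the markup itself, which is how a short title can report a length in the hundreds.
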
Re: how to calculate the length for other language content in PERL?
by kcott (Abbot) on May 16, 2013 at 03:07 UTC

    G'day vasanthgk91,

    When posting Unicode in <code> tags, you'll get HTML character entity references. This is one instance where you should mark up your code in <pre> tags.

    When your source contains Unicode characters, you should use the utf8 pragma.

    Example not using utf8:

    $ perl -Mstrict -Mwarnings -E '
        my $text = q{சிவகாசி அருகே வெடி விபத்து : 2 பேர் உயிரிழப்பு};
        say length $text;                                            
    '
    120
    

    Example using utf8:

    $ perl -Mstrict -Mwarnings -E '
        use utf8;                                                    
        my $text = q{சிவகாசி அருகே வெடி விபத்து : 2 பேர் உயிரிழப்பு};
        say length $text;
    '
    46
    

    Having no idea what constitutes a character in the Tamil language, I'll leave it to you to tell me if 46 is the right answer.

    Update: Actually, that does add up. 7(spaces) + 1(:) + 1(2) + 37 * 7(length '&#nnnn;') = 268, cf. "output I get wrong==268"

    $ perl -E 'say(7+1+1+37*7)'
    268

    -- Ken

      Thank you.
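
On the open question of what counts as a "character": the sketch below compares three possible lengths for one word of the title. It assumes the script file is saved as UTF-8 and that user-perceived characters are approximated by Unicode extended grapheme clusters (the `\X` regex escape); whether that matches what a Tamil reader would count is an assumption, not something the thread confirms.

```perl
use strict;
use warnings;
use utf8;                                   # source text below is literal UTF-8
use Encode qw(encode);

my $word = 'அருகே';                          # one word from the title

my $bytes      = length encode('UTF-8', $word);  # octets in the UTF-8 encoding
my $codepoints = length $word;                   # what length() reports under "use utf8"
my $graphemes  = () = $word =~ /\X/g;            # count of grapheme clusters

print "bytes=$bytes codepoints=$codepoints graphemes=$graphemes\n";
# bytes=15 codepoints=5 graphemes=3
```

The vowel signs attach to the preceding consonant, so five code points collapse into three clusters here; which of the three numbers is "the length" depends on what the count is for (storage, truncation, or display).
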
Re: how to calculate the length for other language content in PERL?
by ambrus (Abbot) on May 21, 2013 at 07:41 UTC

    If your CGI gets ampersand escapes submitted by a browser, you're doing something wrong. The browser uses those only as a fallback when it cannot encode characters in a field in the encoding requested, and if you get that fallback you will not be able to decode the original text completely because ampersands themselves are not encoded.

    The solution is to ask the browser to submit the form with the values encoded in UTF-8. There are two ways to do this, and either one has to be applied when you serve the form that the browser will then submit. The better way is to send the HTML page containing the form itself encoded in UTF-8 and to tell the browser so, with either the Content-Type header or the corresponding meta http-equiv element. The worse way is to set the accept-charset attribute on each form element. Note that the CGI might not get any feedback from the browser on which encoding it actually used, so the best you can do is to always request UTF-8 for a form.
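
A minimal sketch of the "better way" described above, under stated assumptions: `form_page` and `decode_field` are hypothetical helper names, the field name `title` is invented, and the percent-decoding is done by hand only to keep the example free of extra modules (a real CGI library would normally handle that step).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: serve the form as UTF-8 and say so, both in the
# Content-Type header and in the matching meta http-equiv element.
sub form_page {
    return join '',
        "Content-Type: text/html; charset=utf-8\r\n\r\n",
        qq{<!DOCTYPE html>\n<html><head>\n},
        qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n},
        qq{</head><body><form method="post">\n},
        qq{<textarea name="title"></textarea><input type="submit">\n},
        qq{</form></body></html>\n};
}

# Hypothetical helper: turn one percent-encoded field value into a
# character string, assuming the browser submitted it as UTF-8.
sub decode_field {
    my ($raw) = @_;
    $raw =~ tr/+/ /;                            # "+" stands for a space in form data
    $raw =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;  # undo percent-encoding, giving bytes
    return decode('UTF-8', $raw);               # bytes to characters
}

# Three Tamil code points as a browser would send them in UTF-8.
my $submitted = '%E0%AE%85%E0%AE%B0%E0%AF%81';
print length(decode_field($submitted)), "\n";   # 3
```

With the page served this way, the submitted value arrives as UTF-8 bytes rather than ampersand escapes, and length() on the decoded string counts characters.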
