
Re^2: Perl detect utf8, iso-8859-1 encoding

by swiftlet (Acolyte)
on Jul 24, 2020 at 10:06 UTC

in reply to Re: Perl detect utf8, iso-8859-1 encoding
in thread Perl detect utf8, iso-8859-1 encoding

Actually, I shouldn't have said it "crashed"; it just can't detect the encoding when there are two "ö" in a row. Here is the code to reproduce the problem. One "ö" is fine, two with a space between them ("ö ö") is fine, and "öñ" (%F6%F1) is fine, but "öö" is not:

    use utf8;
    use Text::Unaccent;
    use Encode::Detect::Detector;

    ## my $author = "Sch%F6ttl";
    my $author = "Sch%F6%F6ttl";
    $author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
    my $encoding = Encode::Detect::Detector::detect($author);
    print "encoding: $encoding: $author <br>\n";
    if($encoding){
        $author = unac_string($encoding, $author);
        print "after unac: $author<br>\n";
    }

Re^3: Perl detect utf8, iso-8859-1 encoding
by jeffenstein (Friar) on Jul 24, 2020 at 14:32 UTC

    I'm guessing the bug is in Text::Unaccent, but it's directly using the iconv C library, so I can't easily say for sure.

    However, maybe this can work:

    use strict;
    use feature qw(unicode_strings say);
    use Unicode::Normalize 'NFD';

    my $author = "Sch\x{f6}\x{f6}ttl";
    $author = NFD $author;
    $author =~ s/\p{Combining_Diacritical_Marks}//g;
    say $author;

    This doesn't include any decode() or encode() of the incoming/outgoing strings. Also, I think it can break in cases where there are multiple combining characters.
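
For anyone reusing the snippet above, here is a minimal sketch of the same idea with the decode() and encode() steps added. It assumes the charset is already known, and `strip_accents` is a hypothetical helper name, not part of any module. It also swaps the block property for `\p{Mn}` (nonspacing marks), which matches combining marks outside the U+0300 to U+036F block as well as several in a row. Note that `\p{Mn}` would also strip marks in non-Latin scripts, so treat it as a starting point only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: takes raw bytes plus a charset name the caller already knows.
sub strip_accents {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);   # bytes -> Perl characters
    $text = NFD($text);                    # e.g. U+00F6 becomes "o" followed by U+0308
    $text =~ s/\p{Mn}+//g;                 # drop all nonspacing marks, including runs of them
    return encode('UTF-8', $text);         # characters -> UTF-8 bytes for output
}

print strip_accents("Sch\xF6\xF6ttl", 'iso-8859-1'), "\n";
print strip_accents("Sch\xC3\xB6\xC3\xB6ttl", 'UTF-8'), "\n";
```

Both calls are given the same name, once as ISO-8859-1 bytes and once as UTF-8 bytes, so the two printed lines should match.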

      Thanks for your code. Sorry, I didn't explain the problem clearly enough.

      The input could be encoded in iso-8859-1 (\x{f6}\x{f6}) or maybe in utf-8 (\x{c3}\x{b6}), so I have to find out the charset first.

      Encode::Detect::Detector is what I am using to find out whether the string's charset is utf-8 or iso-8859-1.

      The logic is like:

      $charset = Encode::Detect::Detector::detect($input);
      if ($charset eq 'UTF-8') {
          # do NFC ...
      } elsif ($charset eq 'iso-8859-1') {
          # do NFD ...
      }

      In my case the actual call is Text::Unaccent's unac_string($charset, $str). Text::Unaccent works well when Detector finds the correct charset; it fails when Detector fails, of course, because then there is no charset to pass in.

      Encode::Detect::Detector normally works well, but it fails if the input is \x{f6}\x{f6}.
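
If the only two candidates really are UTF-8 and ISO-8859-1, one common workaround (not what Encode::Detect::Detector does, and not a fix for it) is to skip statistical detection: try a strict UTF-8 decode and fall back to ISO-8859-1 when that fails. A minimal sketch using only the core Encode module follows; `guess_and_decode` is a hypothetical helper name. The assumption to keep in mind is that ISO-8859-1 text which happens to form valid UTF-8 byte sequences would be classified as UTF-8, and pure ASCII is reported as UTF-8 too.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (charset_name, decoded_text).
# Assumes the input is either UTF-8 or ISO-8859-1 and nothing else.
sub guess_and_decode {
    my ($bytes) = @_;
    # Strict decode: dies on malformed UTF-8. LEAVE_SRC keeps $bytes intact for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ('UTF-8', $text) if defined $text;
    # Every byte is a valid ISO-8859-1 character, so this decode cannot fail.
    return ('iso-8859-1', decode('iso-8859-1', $bytes));
}

for my $raw ('Sch%F6%F6ttl', 'Sch%C3%B6%C3%B6ttl') {
    # Percent-decode into raw bytes, accepting hex digits only.
    (my $bytes = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    my ($charset, $text) = guess_and_decode($bytes);
    printf "%s => %s (%d characters)\n", $raw, $charset, length $text;
}
```

The byte pair \x{f6}\x{f6} is rejected by the strict UTF-8 decode because 0xF6 is never a valid UTF-8 lead byte, so that input takes the ISO-8859-1 branch. The returned charset name can then be handed to whatever step needs it.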

        Ah, oops. My fault there. You did explain it well, but I wasn't paying enough attention. I'll leave it as is, just in case someone finds it useful one day.

Re^3: Perl detect utf8, iso-8859-1 encoding
by Anonymous Monk on Jul 24, 2020 at 16:32 UTC
    Well, that would be a bug ... please go to CPAN and report it.
