PerlMonks  

Perl detect utf8, iso-8859-1 encoding

by swiftlet (Acolyte)
on Jul 24, 2020 at 09:05 UTC

swiftlet has asked for the wisdom of the Perl Monks concerning the following question:

Currently I am using Encode::Detect::Detector to detect the encoding of input containing accented characters, and then Text::Unaccent to normalize it.

Encode::Detect::Detector normally works well:

author=Schöttl, encoding: UTF-8, author Schottl
author=Sch%F6ttl, encoding: windows-1252, author Schottl

But the detector may crash if a user enters something like Sch%F6%F6ttl = Schööttl (two "ö"):

author=Sch%F6%F6ttl, encoding: , author

Is there a better solution, or another module, that can detect the encoding when users enter something like "öö"?

I have accept-charset="utf8" on the form, but I can't stop search engines from using old iso-8859-1 links.

Replies are listed 'Best First'.
Re: Perl detect utf8, iso-8859-1 encoding
by haukex (Archbishop) on Jul 24, 2020 at 13:54 UTC

    If you're just trying to tell the difference between those two encodings, then note that a lot of text encoded with Latin1 is not valid UTF-8, so simply attempting to decode it as UTF-8 will already give you a very good hint. I demonstrated this with some code (plus heuristics) in this node.

Re: Perl detect utf8, iso-8859-1 encoding
by Your Mother (Archbishop) on Jul 24, 2020 at 19:24 UTC

    Since no one has said it: detecting for encoding is always broken and should only ever be used as a last resort. If you have any way to either be sure of the encoding or enforce it up front, use it. If at all possible, look further up the chain for a way to do it correctly. Detection is incorrect, even if useful when there are no other options.

      I have charset="UTF-8" in the "Content-Type" header and in the "<meta>" tag, plus accept-charset="UTF-8" on the form. Is there anything else I can do to enforce it?

      Since our old iso-8859-1 links may have been saved by users or indexed by search engines, I can't find a way to keep the search backward-compatible without detection. Any suggestions?

        That sounds like a good first link in the chain. If you are receiving all your input through forms from UTF-8 declared pages, then you are only receiving UTF-8 data and you can treat it that way, ignoring all other encodings and definitely have no need to guess. If that’s all true and you’re having problems, probably you are not decoding correctly from the first step to the next processing steps. We’d need more info about the full processing chain to guide you there. This is overwhelming—and overkill because most basic processing chains don’t need to consider most of it—but it is the Rosetta Stone for the issues: Why does modern Perl avoid UTF-8 by default?
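A minimal sketch of the "decode at the first step" chain described above, assuming the input really is UTF-8 as the form declares. The helper name param_to_characters is hypothetical and not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a percent-encoded query value into a character
# string, assuming the client sent UTF-8.
sub param_to_characters {
    my ($raw) = @_;
    # Percent-decoding yields raw bytes, not characters.
    (my $octets = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    # Bytes -> characters; dies if the bytes are not valid UTF-8.
    return decode('UTF-8', $octets, Encode::FB_CROAK);
}

my $author = param_to_characters('Sch%C3%B6ttl');

# All processing happens on characters: length() counts characters here.
my $length = length $author;    # 7

# Encode only when the data leaves the program.
my $out = encode('UTF-8', $author);
```

The point is only the order of operations: bytes in, decode once, work on characters, encode once on the way out.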

Re: Perl detect utf8, iso-8859-1 encoding
by Corion (Patriarch) on Jul 24, 2020 at 09:10 UTC

    Can you show us some short, self-contained example Perl code using Encode::Detect::Detector that replicates the behaviour? This helps us see how the Perl code crashes and maybe we find a good way around this.

      Actually I shouldn't call it "crashed"; it just can't detect the encoding if there are two "ö". Here is the code to reproduce the problem. One "ö" is fine, a space between the two ("ö ö") is fine, and "öñ" (%F6%F1) is fine, but not "öö".

      use utf8;
      use Text::Unaccent;
      use Encode::Detect::Detector;

      ## my $author = "Sch%F6ttl";
      my $author = "Sch%F6%F6ttl";
      $author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
      my $encoding = Encode::Detect::Detector::detect($author);
      print "encoding: $encoding: $author <br>\n";
      if($encoding){
          $author = unac_string($encoding, $author);
          print "after unac: $author<br>\n";
      }

        I'm guessing the bug is in Text::Unaccent, but it's directly using the iconv C library, so I can't easily say for sure.

        However, maybe this can work:

        use strict;
        use feature qw(unicode_strings say);
        use Unicode::Normalize 'NFD';

        my $author = "Sch\x{f6}\x{f6}ttl";
        $author = NFD $author;
        $author =~ s/\p{Combining_Diacritical_Marks}//g;
        say $author;

        This doesn't include any decode() or encode() of the incoming/outgoing strings. Also, I think that this can break in cases where there are multiple combining characters.
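A possible variant of the snippet above with the decode() and encode() steps added, assuming the input arrives as UTF-8 bytes. The helper name strip_marks is made up for this example. It matches \p{Mn} (nonspacing marks) with a + quantifier, so several marks after one base letter are removed together. Note that letters without a canonical decomposition, such as "ø", are left untouched by this approach.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: remove accents from a character string.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "ö" becomes "o" followed by U+0308
    $decomposed =~ s/\p{Mn}+//g;    # drop all nonspacing marks after a base letter
    return NFC($decomposed);
}

my $octets = "Sch\xC3\xB6\xC3\xB6ttl";    # UTF-8 bytes for "Schööttl"
my $author = decode('UTF-8', $octets);    # bytes -> characters
my $plain  = strip_marks($author);        # "Schoottl"
my $out    = encode('UTF-8', $plain);     # characters -> bytes for output
```

\p{Mn} is broader than the Combining Diacritical Marks block, so it also strips marks in other scripts; whether that is wanted depends on the data.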

        Well, that would be a bug ... please go to CPAN and report it.
Re: Perl detect utf8, iso-8859-1 encoding
by choroba (Cardinal) on Jul 24, 2020 at 20:13 UTC
    I have never needed to guess the encoding, but I've noticed Encode::Guess mentioned several times here. Have you tried it? Does it have the same problems as the Encode::Detect::Detector?
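For anyone who wants to try it, here is a small sketch of calling Encode::Guess with Latin-1 added to its default suspects. The helper name guess_name is invented for this example. As far as I understand the module, guess_encoding returns an encoding object on success and a plain error string otherwise, including when more than one suspect fits, so valid UTF-8 input is likely to come back as ambiguous because the same bytes also decode as Latin-1.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: return the guessed encoding name, or undef when
# Encode::Guess reports an error or cannot decide between suspects.
sub guess_name {
    my ($octets) = @_;
    # 'latin1' is checked in addition to the module's default suspects.
    my $guess = guess_encoding($octets, 'latin1');
    # An object means one suspect matched; a plain string is a message.
    return ref $guess ? $guess->name : undef;
}

my $latin1 = guess_name("Sch\xF6\xF6ttl");    # bytes that are not valid UTF-8
my $utf8   = guess_name("Sch\xC3\xB6ttl");    # bytes that fit UTF-8 and Latin-1
```

So this module does not remove the need for a tie-break rule such as "prefer UTF-8 when it validates".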

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Perl detect utf8, iso-8859-1 encoding
by bliako (Abbot) on Jul 25, 2020 at 08:21 UTC

    can statistical analysis of your input at the N-byte level help you? Specific to your texts?

    This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection. This process is not foolproof because it depends on statistical data.
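To make "analysis at the N-byte level" concrete, here is a rough sketch that counts overlapping N-byte sequences in a byte string. The helper name ngram_counts is hypothetical, and a real detector would compare such counts against reference tables built from known texts, which this sketch does not include.

```perl
use strict;
use warnings;

# Hypothetical helper: count overlapping N-byte sequences in a byte string.
sub ngram_counts {
    my ($octets, $n) = @_;
    my %count;
    # The range is empty when the string is shorter than $n.
    for my $i (0 .. length($octets) - $n) {
        $count{ substr($octets, $i, $n) }++;
    }
    return \%count;
}

my $counts = ngram_counts("Sch\xF6\xF6ttl", 2);
my $pairs  = scalar keys %$counts;    # number of distinct byte pairs
```

With inputs as short as an author name there are very few n-grams to compare, which may be part of why detection on such strings is fragile.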

    additionally as others have said, you can take advantage of this:

    One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common that web sites in UTF-8 containing the name of the German city München were shown as MÃ¼nchen, due to the code deciding it was an ISO-8859 encoding before even testing to see if it was UTF-8.
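The München case from the quotation can be reproduced in a few lines. This is only an illustration of the effect, using Encode from the Perl core:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $city = "M\x{FC}nchen";              # "München" as characters
my $utf8 = encode('UTF-8', $city);      # "ü" becomes the two bytes C3 BC

# Wrong: treating UTF-8 bytes as ISO-8859-1 turns each byte into a character.
my $wrong = decode('ISO-8859-1', $utf8);    # "M\x{C3}\x{BC}nchen", shown as "MÃ¼nchen"

# Right: run the strict UTF-8 test first.
my $right = decode('UTF-8', $utf8, Encode::FB_CROAK | Encode::LEAVE_SRC);
```

The Latin-1 decode never fails, which is why it has to be the fallback and not the first attempt.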

    both quotations from https://en.wikipedia.org/wiki/Charset_detection

    Edit: the node "n-dimensional statistical analysis of DNA sequences (or text, or ...)" can help you with n-dimensional *sparse* histograms; otherwise CPAN may be of help. Math::Histogram is not sparse, but for N=4 it will be OK.

    bw, bliako

Re: Perl detect utf8, iso-8859-1 encoding
by jcb (Parson) on Jul 25, 2020 at 00:02 UTC

    Fundamentally, you cannot reliably detect encodings. You can guess UTF-8 if the input is valid UTF-8, but that is still a guess at best.

    The problem is that pre-Unicode encodings actually made full use of the available 256 codepoints in an octet. UTF-8 must use those same 256 codepoints (and the lower 128 are ASCII), so all valid UTF-8 is also valid in other encodings. There is no general solution to this problem, although you might be able to make some headway with either a dictionary of valid names, or some rules for recognizing "plausible" names — that is, names that use only characters used in names from one language, since mixed-language names are highly unlikely.

    For the special case of deciding whether the input is UTF-8 as requested or ISO-Latin-1 due to following an outdated link, you can probably make good progress by simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not. This is not exactly correct, but is probably a fair heuristic.

      simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not

      Thanks! This is a good idea, but how could I find out if the input is a valid utf-8 or not? Both utf8::valid and utf8::is_utf8 are not working well in my examples
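A short illustration of why utf8::is_utf8 does not help here: it reports Perl's internal string flag, not whether a byte string happens to contain valid UTF-8. The variable names are only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_bytes  = "Sch\xC3\xB6ttl";                      # valid UTF-8, but still a plain byte string
my $from_latin1 = decode('ISO-8859-1', "Sch\xF6ttl");    # characters decoded from Latin-1 bytes

my $flag_bytes = utf8::is_utf8($utf8_bytes)  ? 1 : 0;    # 0
my $flag_chars = utf8::is_utf8($from_latin1) ? 1 : 0;    # 1
```

The first string is the one a validity check should accept, yet the flag is off; the second never was UTF-8 input, yet the flag is on.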

        To check whether data are valid UTF-8 is rather straightforward. Here's the example, slightly modified from the synopsis of Encode:
        use Encode qw(decode encode);
        $characters = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

        This code will die if the data are invalid, so you would wrap it in the exception handler of your choice. Plain eval and Try::Tiny seem to be popular.
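Wrapped in a plain eval, and combined with the "assume ISO-Latin-1 otherwise" heuristic from earlier in the thread, this could look roughly as follows. The function name decode_utf8_or_latin1 is made up for this sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
# otherwise fall back to ISO-8859-1 (which accepts any byte sequence).
sub decode_utf8_or_latin1 {
    my ($octets) = @_;
    my $characters = eval {
        # LEAVE_SRC keeps $octets intact so the fallback can reuse it.
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $characters if defined $characters;
    return decode('ISO-8859-1', $octets);
}

my $from_utf8   = decode_utf8_or_latin1("Sch\xC3\xB6ttl");    # valid UTF-8 input
my $from_latin1 = decode_utf8_or_latin1("Sch\xF6\xF6ttl");    # old Latin-1 link
```

Both calls return character strings, so everything after this point can ignore which encoding the request used.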

        BTW: as jcb already indicated, chances are excellent that if data pass as UTF-8, they actually are UTF-8. All bytes of multibyte characters in valid UTF-8 strings are in the range \x80 to \xFF, and in particular the bytes 2-4 are in the range \x80-\xBF. You just can't build readable text from characters in that range in any of the ISO-8859-* encodings, and about half of that range are "unprintable" control characters from ISO/IEC 6429.
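The byte ranges are easy to see by dumping the UTF-8 encoding of a couple of characters. The helper utf8_hex below is hypothetical and exists only for this illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 bytes of a character string.
sub utf8_hex {
    my ($characters) = @_;
    my $bytes = encode('UTF-8', $characters);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $o_umlaut = utf8_hex("\x{F6}");      # "ö" -> C3 B6
my $euro     = utf8_hex("\x{20AC}");    # "€" -> E2 82 AC
```

Every byte after the first falls in \x80-\xBF. Read as ISO-8859-1, "C3 B6" would be "Ã" followed by "¶", which is not something that shows up in a real author name.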

Re: Perl detect utf8, iso-8859-1 encoding
by ikegami (Patriarch) on Jul 25, 2020 at 08:25 UTC
Re: Perl detect utf8, iso-8859-1 encoding
by swiftlet (Acolyte) on Jul 25, 2020 at 09:35 UTC

    I am afraid I do not have the luxury to discard all non-utf8 input, but I can simplify the code:

    if the input is not detected as utf8, just treat it as iso-8859-1

    use Text::Unaccent;
    use Encode::Detect::Detector;

    # my $author = "Sch%F6%E5ttl";
    # my $author = "Sch%C3%A9ttl";
    # my $author = "Sch%C3%B6ttl";
    # my $author = "Sch%F6%F6ttl";
    # my $author = "Sch%F6 %F4ttl";
    my $author = "teoria elasticit%E0";
    $author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
    my $encoding = Encode::Detect::Detector::detect($author);
    if($encoding !~ m#utf-8#i){
        $encoding = "iso-8859-1";
    }
    if($encoding){
        $author = unac_string($encoding, $author);
        print "after unac: $author<br>\n";
    }

    Seems like it's working better, any potential problem?
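Two small things stand out in the code above. First, when detect() returns nothing, the !~ test still succeeds and sets iso-8859-1, so the later if($encoding) guard is always true. Second, the pattern [a-zA-Z0-9] also matches non-hex letters, so input like "%zz" is turned into a NUL byte, because hex('zz') returns 0. A sketch of a stricter percent-decoding step, with the hypothetical helper name percent_decode:

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode a query value into raw bytes,
# touching only sequences that are real hex escapes.
sub percent_decode {
    my ($raw) = @_;
    (my $octets = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $octets;
}

my $decoded = percent_decode('Sch%F6%F6ttl');    # "Sch\xF6\xF6ttl"
my $kept    = percent_decode('100%zz');          # unchanged, "zz" is not hex
```

In a real application the framework or CGI module normally does this decoding already, so this only matters if the query string is parsed by hand.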
