
Check UTF8

by jai_dgl (Beadle)
on Oct 22, 2008 at 16:27 UTC (#718794, perlquestion)
jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a text file in which some lines are in UTF-8 format.
I need to read the file line by line, find each line that is in UTF-8 format, and delete the entire line.
The file looks like the following:
Top/Adult/World/Polska/Galerie/Mniejszości_seksualne/Geje
Top/Adult/World/Polska/Galerie/SomeDir/Geje
Top/Adult/Mniejszości_seksualne/Polska/Galerie/Geje
I need to remove lines 1 and 3 and preserve line 2.

Replies are listed 'Best First'.
Re: Check UTF8
by ikegami (Pope) on Oct 22, 2008 at 16:40 UTC

    Beware of what you ask for. The following script removes every line that only contains valid UTF-8.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Print only the lines that fail to decode as strict UTF-8.
    while (<>) {
        print if !eval { decode('UTF-8', $_, Encode::FB_CROAK); 1 };
    }


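    • From a file: perl script.pl infile > outfile
    • From STDIN: perl script.pl < infile > outfile
    • In-place: perl -i.bak script.pl file

    (script.pl is a placeholder for whatever name the script was saved under.)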
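    To make the warning above concrete: plain ASCII is also valid UTF-8, so a check for "valid UTF-8" alone would drop all three of the sample lines, including the one that should be kept. The following is only a sketch of one possible refinement, assuming the goal is to drop lines that are valid UTF-8 and also contain at least one non-ASCII byte. The helper names is_valid_utf8 and is_non_ascii_utf8 are made up for illustration, and the sample lines are written as raw bytes (\xC5\x9B is the UTF-8 encoding of U+015B, the Polish letter in the question).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
# LEAVE_SRC asks Encode not to modify its input while checking.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: true only for valid UTF-8 that uses a byte above 0x7F.
sub is_non_ascii_utf8 {
    my ($bytes) = @_;
    return ( $bytes =~ /[^\x00-\x7F]/ && is_valid_utf8($bytes) ) ? 1 : 0;
}

# Sample lines as raw bytes, modelled on the question.
my @lines = (
    "Top/Adult/World/Polska/Galerie/Mniejszo\xC5\x9Bci_seksualne/Geje\n",
    "Top/Adult/World/Polska/Galerie/SomeDir/Geje\n",
    "Top/Adult/Mniejszo\xC5\x9Bci_seksualne/Polska/Galerie/Geje\n",
);

# Keep a line unless it is valid UTF-8 containing non-ASCII characters.
print grep { !is_non_ascii_utf8($_) } @lines;
```

    With this variant, lines containing bytes that are not valid UTF-8 (for example Latin-1 text) are kept as well. Whether that is desirable depends on what the file really contains.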

    A better solution might be to convert the lines to another encoding.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Decode input as UTF-8 and encode output as Latin-1.
    binmode(STDIN,  ':encoding(UTF-8)');
    binmode(STDOUT, ':encoding(iso-latin-1)');

    print while <>;

    Same usage as the original program.

Re: Check UTF8
by halley (Prior) on Oct 22, 2008 at 16:32 UTC
    What have you tried so far? What did you expect to happen?

    I am only able to guess at your problem, as you gave very little detail. Do you want to keep lines that only use ASCII? Do you want to keep lines that are not UTF-8 but are valid Latin-1 or ISO-2022-JP or some other encoding?

    If it really is a matter of ASCII or non-ASCII UTF-8, just reject a line if it includes any character above chr(127). Other encodings will present a bit more challenge.
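    A minimal sketch of that ASCII-only test, assuming the file is read as raw bytes (no decoding layer), so that "above chr(127)" means any byte outside the range 0x00 to 0x7F. The helper name is_ascii_only and the sample data are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains only 7-bit ASCII bytes.
sub is_ascii_only {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Sample lines as raw bytes; \xC5\x9B is a two-byte UTF-8 sequence.
my @lines = (
    "Top/Adult/World/Polska/Galerie/Mniejszo\xC5\x9Bci_seksualne/Geje\n",
    "Top/Adult/World/Polska/Galerie/SomeDir/Geje\n",
);

# Keep only the pure-ASCII lines.
print grep { is_ascii_only($_) } @lines;
```

    As a command-line filter, the same test could be written as: perl -ne 'print unless /[^\x00-\x7F]/' infile > outfile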

    [ e d @ h a l l e y . c c ]

Re: Check UTF8
by JavaFan (Canon) on Oct 22, 2008 at 19:59 UTC
    There isn't enough information to write a program that does so. Files are just streams of bytes. And while many byte streams can be determined not to be valid UTF-8, the reverse isn't true. For instance, if you have a line in the file with the bytes E2 A1 B9, is that a line with the three characters LATIN SMALL LETTER A WITH CIRCUMFLEX, INVERTED EXCLAMATION MARK, SUPERSCRIPT ONE (â¡¹ in Latin-1), or the single character BRAILLE PATTERN DOTS-14567 (⡹ in UTF-8)? And it may be something different in one of the hundreds of other encodings that are out there.

    So, while you sometimes can determine that a line *isn't* UTF-8 (because not every byte sequence is valid UTF-8), you can never be sure a byte sequence is UTF-8 without additional information.
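    The ambiguity can be seen by decoding the same three bytes both ways. This is only an illustrative sketch: the helper name char_names is made up here, while Encode's decode and charnames::viacode are the standard functions used to turn bytes into characters and characters into their Unicode names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );
use charnames ();

# Hypothetical helper: decode $bytes using $encoding and return the
# Unicode name of each resulting character.
sub char_names {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return map { charnames::viacode(ord $_) } split //, $chars;
}

my $bytes = "\xE2\xA1\xB9";

# The same bytes give three characters in one encoding and one in the other.
for my $encoding ('iso-8859-1', 'UTF-8') {
    my @names = char_names($encoding, $bytes);
    printf "%-10s %d character(s): %s\n",
        $encoding, scalar @names, join(', ', @names);
}
```

    Both decodings succeed without error, which is why the bytes alone cannot tell you which interpretation was intended.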

      True. So tell me: why on earth does the Unicode standard recommend against putting a BOM at the start of a UTF-8 file? Those guys must really like ambiguous data and the quandary it creates for software developers.
