in reply to Encoding TermExtract


I was bored, so I installed the library and ran your example.

The input file appears to be a properly encoded UTF-8 file. (I say "appears" because I'm no expert in Unicode, and I only looked over the first handful of characters in the file.)

$ hexdump -C in.txt | head -1
00000000  e6 88 96 e5 95 8f 20 57  41 4b 55 4d 4f 4e 20 31  |...... WAKUMON 1|
$ e6 88 96
e6 88 96 : U+6216: CJK UNIFIED IDEOGRAPH-6216
$ e5 95 8f
e5 95 8f : U+554f: CJK UNIFIED IDEOGRAPH-554F

The output file, though, definitely looks weird. It, too, appears to be properly encoded UTF-8, but the characters look like mojibake: the byte sequences are valid, yet none of the characters come from the CJK range that appears to dominate the input file.

$ hexdump -C out.txt | head -1
00000000  c3 a4 c2 b8 2c 20 35 35  35 2e 37 30 38 35 35 36  |...., 555.708556|
$ c3 a4
c3 a4 : U+e4: LATIN SMALL LETTER A WITH DIAERESIS
$ c2 b8
c2 b8 : U+b8: CEDILLA
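For what it's worth, c3 a4 c2 b8 is exactly what you get when the raw bytes e4 b8 are treated as two Latin-1 characters and then encoded to UTF-8 on the way out. A minimal sketch of that effect follows. It assumes the data reaches the output undecoded, which I have not traced inside TermExtract, and `hex_bytes` is a hypothetical helper used only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', unpack '(H2)*', $_[0] }

# Raw bytes that were never decoded: Perl sees two characters,
# U+00E4 and U+00B8, rather than part of a CJK character.
my $undecoded = "\xe4\xb8";

# Encoding on output (as an :encoding(UTF-8) layer would do) turns each
# of those "characters" into its own two-byte UTF-8 sequence.
my $written = encode('UTF-8', $undecoded);

print hex_bytes($written), "\n";    # c3 a4 c2 b8
```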

I tried removing the ":encoding(UTF-8)" bit from your open statement and (as expected) the garbage is different. In this case the encoding is broken outright, but it looks "closer": the data looks like fragments of UTF-8 sequences whose leading bytes fall in the CJK range:

$ hexdump -C out2.txt | head -2
00000000  e4 b8 2c 20 35 35 35 2e  37 30 38 35 35 36 37 30  |.., 555.70855670|
00000010  39 33 36 0a e7 ac 31 34  ef bc 32 30 30 38 ef bc  |936...14..2008..|
$ e4 b8 2c
e4 b8 2c : U+4e2c: CJK UNIFIED IDEOGRAPH-4E2C ERROR! (turn on dbg prt stmt)
$ e7 ac 31
e7 ac 31 : U+7b31: CJK UNIFIED IDEOGRAPH-7B31 ERROR! (turn on dbg prt stmt)
$ ef bc 32
ef bc 32 : U+ff32: FULLWIDTH LATIN CAPITAL LETTER R ERROR! (turn on dbg prt stmt)

Above, the first two bytes of each UTF-8 sequence look like they're from the CJK character set, but the third byte is outside the range allowed for a continuation byte (0x80-0xBF): it's plain ASCII (2c, 31, 32). It looks like one of the steps in the bowels of the TermExtract software is truncating the characters, due to encoding/decoding errors (likely) and/or mixing up byte lengths and character lengths (possibly?).
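To illustrate the byte-length vs. character-length mix-up, here is a small sketch. It is not TermExtract's code: the sample character U+4E00 is an arbitrary stand-in (any character encoded as e4 b8 xx would do), the two-byte width is my assumption about what a GB-style cutter might expect, and `hex_bytes` is a hypothetical display helper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', unpack '(H2)*', $_[0] }

# U+4E00 (e4 b8 80) followed by ASCII, held as undecoded UTF-8 bytes.
my $bytes = "\xe4\xb8\x80, 555";

# Byte semantics: code that assumes a 2-byte character and then skips
# one unit ahead silently drops the third byte of the sequence.
my $cut_bytes = substr($bytes, 0, 2) . substr($bytes, 3);
print hex_bytes($cut_bytes), "\n";                  # e4 b8 2c 20 35 35 35

# Character semantics: after decoding, substr counts characters,
# so taking one character keeps all three bytes together.
my $chars = decode('UTF-8', $bytes);
my $first = substr($chars, 0, 1);
print hex_bytes(encode('UTF-8', $first)), "\n";     # e4 b8 80
```

The first line printed matches the start of out2.txt above, which is why truncation at the byte level seems like a plausible culprit.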

I did a *quick* review of the TermExtract source and found no encode/decode calls anywhere in the code. Several functions open a file and read the data, but without any encoding/decoding. The most disturbing thing I found is a pair of functions (cut_GB in ChainesPlainTextGB, and cut_utf8 in ChainesPlainTextUC) that operate on the data as raw bytes.

I'm not going to dig any deeper than this, but it may be possible to modify the module to handle encoding/decoding properly. If I were going to try, I'd examine all the locations where cut_utf8(), cut_GB() and substr() are called (there aren't all that many). If the encoding were handled correctly, you wouldn't need the hackery in cut_GB() and cut_utf8() to determine character lengths, nor would you (presumably) need to use anything other than 1 for the size in substr(). There may be other functions you'd need to review, but I can't think of any at the moment.
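As a rough idea of what "handled correctly" could look like, here is a sketch that decodes at the input boundary and then walks the text one character at a time with a size of 1 in substr(). The function name `first_line_chars` is made up for illustration and is not part of TermExtract; the sample bytes are the first 16 bytes of in.txt from the dump above, read through an in-memory filehandle so the example stands alone.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line through a decoding layer and
# return its characters, one per list element.
sub first_line_chars {
    my ($bytes) = @_;

    # The :encoding(UTF-8) layer decodes bytes into characters on read.
    open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
    my $line = <$in>;
    close $in;
    chomp $line;

    # With decoded text, every character has length 1 for substr().
    return map { substr $line, $_, 1 } 0 .. length($line) - 1;
}

my @chars = first_line_chars("\xe6\x88\x96\xe5\x95\x8f WAKUMON 1\n");
printf "%d characters, first is U+%04X\n", scalar @chars, ord $chars[0];
# prints: 12 characters, first is U+6216
```

Writing the results back out through an output handle opened with the same layer would then encode each character exactly once.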


When your only tool is a hammer, all problems look like your thumb.

Re^2: Encoding TermExtract
by IB2017 (Pilgrim) on Jul 21, 2020 at 19:43 UTC

    Thank you for taking the time to look at this. I reviewed the source too, and I was surprised that there is no encoding operation to be seen anywhere in the scripts. I will try to work on this bit.