PerlMonks  

Re^2: HTML::Parser, file, print to Terminal

by victor_charlie (Novice)
on Jul 13, 2010 at 14:03 UTC


in reply to Re: HTML::Parser, file, print to Terminal
in thread HTML::Parser, file, print to Terminal

Suppose I compose a short line of text in gedit (a Linux equivalent of Notepad):

Nüne istá baßt alongnöw.

and I use the simple code:

    #!/usr/bin/perl -w
    # legaget.pl
    use strict;

    my $filename = "ligatext.txt";
    # $! holds the OS error message; the original had $1, a regex capture variable.
    open FILE, "<", $filename or die $!;
    while( my $line = <FILE> ) {
        print $line;
    }
    close(FILE);

My terminal does in fact show those ligatures. The trouble starts when I grab a webpage.html off the web. I might add that the page is not W3.org compliant: it has no meta line declaring encoding='iso-xxxx-x'. The browser shows the ligatures, and the saved copy of that webpage shows them in gedit, but the same code above prints the < ? > symbol in their place on the terminal.

I might add I have fought this same thing with MSWord files, as MS puts the Unicode country code in the first byte of their Word.doc format as a hex.

Yes, I've read the binmode description for STDOUT, but it doesn't show an example. Can you give me a short snippet to try? Maybe: open a file, read a line at a time, and print it to the terminal?

Replies are listed 'Best First'.
Re^3: HTML::Parser, file, print to Terminal
by moritz (Cardinal) on Jul 13, 2010 at 14:19 UTC
      hexdump -C etest.txt
      00000000  57 65 72 20 42 61 72 62  61 72 61 20 6c 69 76 65  |Wer Barbara live|
      00000010  20 65 72 6c 65 62 65 6e  20 6d c3 b6 63 68 74 65  | erleben m..chte|
      00000020  2c 20 68 61 74 20 69 6e  20 4d c3 bc 6e 63 68 65  |, hat in M..nche|
      00000030  6e 20 69 6d 6d 65 72 20  77 69 65 64 65 72 20 64  |n immer wieder d|
      00000040  69 65 20 47 65 6c 65 67  65 6e 68 65 69 74 2c 20  |ie Gelegenheit, |
      00000050  73 69 65 20 73 69 6e 67  65 6e 20 7a 75 20 68 c3  |sie singen zu h.|
      00000060  b6 72 65 6e 2e 20 42 65  73 6f 6e 64 65 72 65 20  |.ren. Besondere |
      00000070  41 75 66 74 72 69 74 74  65 20 77 65 72 64 65 20  |Auftritte werde |
      00000080  69 63 68 20 61 62 20 73  6f 66 6f 72 74 20 69 6d  |ich ab sofort im|
      00000090  20 41 6e 73 63 68 6c 75  c3 9f 20 61 6e 20 64 69  | Anschlu.. an di|
      000000a0  65 20 45 6e 67 65 6c 77  6f 72 74 65 20 61 6e 6b  |e Engelworte ank|
      000000b0  c3 bc 6e 64 69 67 65 6e  2e 0a 0a                 |..ndigen...|
      000000bb

      The above is a cut-n-paste from the webpage.html -- both the html and the txt show the same missing characters.

      I did read several things about UTF-8. I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.

        The snippet you show is encoded in UTF-8.
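To see why those bytes point to UTF-8, here is a minimal sketch that decodes the bytes for one word from the second row of the hexdump (6d c3 b6 63 68 74 65) under both candidate encodings. The variable names are only illustrative, and it assumes the core Encode module is available.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes from the hexdump: "m", 0xC3 0xB6, "chte".
my $bytes = "m\xc3\xb6chte";

# Read as UTF-8, the pair C3 B6 is a single character, U+00F6 (o with umlaut).
# Copies ("$bytes") are passed so the original byte string is left untouched.
my $as_utf8 = decode('UTF-8', "$bytes");

# Read as Latin-1, every byte is its own character, so C3 and B6 stay separate.
my $as_latin1 = decode('ISO-8859-1', "$bytes");

printf "as UTF-8:   %d characters, second is U+%04X\n",
    length($as_utf8), ord(substr($as_utf8, 1, 1));
printf "as Latin-1: %d characters, second is U+%04X\n",
    length($as_latin1), ord(substr($as_latin1, 1, 1));
```

Read as UTF-8 the word has six characters and makes sense as German text. Read as Latin-1 it has seven, with two stray characters where the umlaut should be.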

        Next step: determine the encoding of the file in which umlauts display correctly on your terminal.
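One rough way to take that step is to test whether a file's bytes form valid UTF-8. The sketch below is a heuristic, not a real detector: guess_encoding, looks_like_utf8 and has_non_ascii are hypothetical helper names, and anything that fails the strict check is only assumed to be a single-byte encoding such as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if any byte lies outside 7-bit ASCII.
sub has_non_ascii {
    my ($bytes) = @_;
    return $bytes =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: true if the bytes decode as strict UTF-8.
# FB_CROAK makes decode die on malformed input; $bytes is a private copy,
# so it does not matter that decode may modify it.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: a coarse guess, good enough for a first look.
sub guess_encoding {
    my ($bytes) = @_;
    return 'ASCII' unless has_non_ascii($bytes);
    return looks_like_utf8($bytes) ? 'UTF-8' : 'not UTF-8 (possibly Latin-1)';
}

# Usage: perl guess.pl etest.txt   (the script name is an assumption)
if (@ARGV) {
    my $file = $ARGV[0];
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the raw bytes
    close $fh;
    print "$file: ", guess_encoding($bytes), "\n";
}
```

Running it on both the hand-made file and the saved web page shows whether the two files really use different encodings.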

        Or even better: configure a clean UTF-8 environment.

        I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.

        I'm confused indeed. If you don't create a file, it doesn't exist, neither with ASCII nor with UTF-8.

        Speaking of confusion, I think you are trying to achieve too much in one step. For example, the title of your question mentions HTML::Parser, which doesn't appear in the posting at all.

        So, small steps:

        • Make sure you know which encoding your terminal understands. There's no point in proceeding before you have done this step.
        • Find out what encoding your source files use. It seems to be UTF-8.
        • In your perl scripts, decode everything coming from the outside (except when a module does it for you), and encode everything going out. Add use utf8; and write your program files in UTF-8.
        • If something doesn't work, find out which of the previous steps you violated.
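As a minimal sketch of the decode-on-input, encode-on-output step, and of the "open a file, read a line at a time, print to the terminal" snippet asked for earlier: print_file is a hypothetical helper, and the choice of UTF-8 for both the file and the terminal is an assumption that the first two steps have to confirm.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # only matters if the script itself contains non-ASCII literals

# Hypothetical helper: read $filename as $in_enc and write it to $out_fh
# encoded as $out_enc. Call it once per output handle, because each
# binmode call pushes another encoding layer.
sub print_file {
    my ($filename, $in_enc, $out_fh, $out_enc) = @_;

    # Decode on the way in: the layer turns file bytes into characters.
    open my $in, "<:encoding($in_enc)", $filename
        or die "Cannot open $filename: $!";

    # Encode on the way out: characters become bytes the terminal expects.
    binmode $out_fh, ":encoding($out_enc)";

    while (my $line = <$in>) {
        print {$out_fh} $line;
    }
    close $in;
}

# Usage: perl legaget.pl etest.txt
# Assumes a UTF-8 file and a UTF-8 terminal; change either name if not.
print_file($ARGV[0], 'UTF-8', \*STDOUT, 'UTF-8') if @ARGV;
```

If the terminal turns out to expect Latin-1, passing 'ISO-8859-1' as the last argument should be the only change needed.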
