PerlMonks  

New Section Suggestion: Tip of the Day

by princepawn (Parson)
on Aug 21, 2001 at 01:25 UTC (#106382=monkdiscuss)

XML uses Unicode in UTF-8 format to define all of its character data.

-- "Data Munging with Perl" by Dave Cross

<joke flavor=sarcastic type=ribbing>I bet you didn't know that. I bet you are a happier, wiser person now that you do.</joke>

It would be nice if I could post small things that I didn't know before and that will probably help others too.

I don't think this is quite a meditation, but it is certainly useful.

Re: New Section Suggestion: Tip of the Day
by Beatnik (Parson) on Aug 21, 2001 at 01:40 UTC
    perl.org has their daily-tips mailing list (which seems to be not so daily).

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: New Section Suggestion: Tip of the Day
by mirod (Canon) on Aug 21, 2001 at 02:00 UTC

    Well, actually this is not quite the whole story (sorry davorg and princepawn ;--).

    • XML does _not_ specify the encoding of the characters in a document,
    • it strongly encourages the use of UTF-8 or UTF-16 (which are 2 ways of encoding Unicode characters); in fact XML parsers are only required to recognize those 2 encodings,
    • if the encoding is _not_ UTF-8 or UTF-16 then the XML declaration must specify the encoding of the document, which hopefully the parser will understand,
    • XML::Parser only understands UTF-8, UTF-16 and ISO-8859-1 (latin-1, the encoding commonly used in Western Europe),
    • US-ASCII (the non-accented characters: everything under 127, control characters excepted) is a subset of UTF-8, which means that if you only have to deal with US/English XML data you don't have to bother about it (for now),
    • XML::Encodings adds support for a whole lot of common encodings (I think the only one really missing is one of the Chinese encodings),
    • XML::Parser converts all characters to UTF-8 before passing them to the calling application,
    • the cleanest way to go back from UTF-8 to whatever encoding your system likes is to use the Text::Iconv module, provided your system has the iconv library installed,
    • a dirty (but sometimes useful) hack is to use the original_string method to get the... original string (pre-UTF-8 conversion), but then you will have to parse start and end tags yourself to extract tag names and attributes,
    • if you are converting your XML to HTML you might also want to have a look at HTML::Entities.
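
The encoding points above can be checked with a few lines of code. This is only a rough sketch, written in Python with its standard library standing in for XML::Parser and Text::Iconv (an assumption made for illustration, not what the modules above do internally). The helper name `parse_word` and the sample word are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_word(xml_bytes):
    """Parse a tiny XML document given as raw bytes; return the root element's text."""
    return ET.fromstring(xml_bytes).text


# Pure US-ASCII text has exactly the same bytes in ASCII and in UTF-8.
ascii_text = "plain text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# An accented character is where the encodings diverge.
word = "caf\u00e9"                         # "cafe" with an e-acute
utf8_bytes = word.encode("utf-8")          # the e-acute takes two bytes
latin1_bytes = word.encode("iso-8859-1")   # the e-acute takes one byte

# A latin-1 document has to announce its encoding in the XML declaration.
latin1_doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?><word>'
              + latin1_bytes + b'</word>')
# A UTF-8 document needs no declaration: UTF-8 is assumed by default.
utf8_doc = b'<word>' + utf8_bytes + b'</word>'

parsed_latin1 = parse_word(latin1_doc)
parsed_utf8 = parse_word(utf8_doc)

# Latin-1 bytes with no declaration are read as (broken) UTF-8 and rejected.
undeclared_doc = b'<word>' + latin1_bytes + b'</word>'
try:
    parse_word(undeclared_doc)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Going back from the parser's Unicode output to latin-1 bytes,
# which is the job Text::Iconv does on the Perl side.
back_to_latin1 = parsed_latin1.encode("iso-8859-1")
```

Both documents come out of the parser as the same Unicode string, whatever encoding they went in with, so any conversion back to a local encoding is a separate, explicit step afterwards.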

    One last bit of info: UTF-8 support is now pretty good in Perl, but you will have to wait for 5.8 to get UTF-8 hash keys (important for attribute names) and full regexp support.

Re: New Section Suggestion: Tip of the Day
by LD2 (Curate) on Aug 21, 2001 at 02:08 UTC
    It's not a bad idea, but it has been briefly brought up before here and here. This may be messy, but I sort of liked the idea of just creating a node under Meditations of Perl Tips or whatnot. That way, this doesn't create any new work for vroom; it's simple and easy. If a new section is created down the line, that node can be dissected and I'm sure the tips can be added then.
Re: New Section Suggestion: Tip of the Day
by little (Curate) on Aug 21, 2001 at 11:40 UTC
    Hey, just kiddin':
    Reading the replies, I suggest you'd better call it "prejudice of the day", though. (grin)

    Have a nice day
    All decision is left to your taste
