PerlMonks  

New Section Suggestion: Tip of the Day

by princepawn (Parson)
on Aug 21, 2001 at 01:25 UTC (node #106382, monkdiscuss)

XML uses Unicode in UTF-8 format to define all of its character data.

-- "Data Munging with Perl" by Dave Cross

<joke flavor=sarcastic type=ribbing>I bet you didn't know that. I bet you are a happier, wiser person now that you do.</joke>

It would be nice if I could post small things that I didn't know before that will probably help others too.

I don't think this is quite a meditation, but it is certainly useful.

Re: New Section Suggestion: Tip of the Day
by Beatnik (Parson) on Aug 21, 2001 at 01:40 UTC
    perl.org has their daily-tips mailing list (which seems to be not so daily).

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: New Section Suggestion: Tip of the Day
by mirod (Canon) on Aug 21, 2001 at 02:00 UTC

    Well, actually this is not quite the whole story (sorry davorg and princepawn ;--).

    • XML does _not_ specify the encoding of the characters in a document,
    • it strongly encourages the use of UTF-8 or UTF-16 (which are 2 ways of encoding Unicode characters); in fact XML parsers are only required to recognize those 2 encodings,
    • if the encoding is _not_ UTF-8 or UTF-16, the XML declaration must specify the encoding of the document, which hopefully the parser will understand,
    • XML::Parser only understands UTF-8, UTF-16 and ISO-8859-1 (latin-1, the encoding commonly used in Western Europe),
    • US-ASCII (unaccented characters, i.e. all characters below 128 other than control characters) is a subset of UTF-8, which means that if you only have to deal with US/English XML data you don't have to bother about it (for now),
    • XML::Encoding adds support for a whole lot of common encodings (I think the only one really missing is one of the Chinese encodings),
    • XML::Parser converts all characters to UTF-8 before passing them to the calling application,
    • the cleanest way to go back from UTF-8 to whatever encoding your system likes is to use the Text::Iconv module, provided your system has the iconv library installed,
    • a dirty (but sometimes useful) hack is to use the original_string method to get the... original string (pre-UTF-8 conversion), but then you will have to parse start and end tags to extract tag names and attributes,
    • if you are converting your XML to HTML you might also want to have a look at HTML::Entities.
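
    As a rough illustration of the UTF-8 points in the list above, here is a minimal sketch. It uses the core Encode module (bundled with Perl from 5.8 on) instead of the Text::Iconv module named in the list, purely so that it runs without extra installs. The sample string and the utf8_to_latin1 helper are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample text: "cafe" with an accented e (U+00E9).
my $text = "caf\x{e9}";

# The same characters as UTF-8 bytes, the form a parser would hand back.
my $utf8_bytes = encode('UTF-8', $text);

# Illustrative helper: decode UTF-8 bytes to characters,
# then re-encode those characters as ISO-8859-1 (Latin-1) bytes.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $bytes));
}

my $latin1_bytes = utf8_to_latin1($utf8_bytes);

printf "UTF-8:   %d bytes\n", length $utf8_bytes;    # 5: the accented e takes two bytes
printf "Latin-1: %d bytes\n", length $latin1_bytes;  # 4: the accented e takes one byte

# Pure US-ASCII text is byte-for-byte identical once encoded as UTF-8.
my $ascii = "plain ASCII";
print "ASCII is unchanged in UTF-8\n" if encode('UTF-8', $ascii) eq $ascii;
```

    Characters that have no Latin-1 equivalent cannot survive the second step, so a real conversion would need to decide what to do with them (for example, HTML entities when the target is HTML).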

    One last note: UTF-8 support is now pretty good in Perl, but you will have to wait for 5.8 to get UTF-8 hash keys (important for attribute names) and full regexp support.
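
    To show why UTF-8 hash keys matter for attribute names, here is a small sketch that assumes Perl 5.8 or later and the core Encode module. The attribute name and value are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical attribute name "prenom" with an accented e, arriving as UTF-8 bytes.
my $name_bytes = "pr\xC3\xA9nom";

# Decode the bytes into a 6-character string before using it as a key.
my $name = decode('UTF-8', $name_bytes);

my %attributes = ($name => 'some value');

# The same name written directly as characters (U+00E9 for the accented e).
my $lookup = "pr\x{e9}nom";

# Keys are compared character by character, so the lookup succeeds.
print exists $attributes{$lookup} ? "found\n" : "missing\n";
```

    If the raw UTF-8 bytes were used as the key without decoding, the 7-byte key would not match the 6-character name, which is the kind of mismatch proper UTF-8 key support is meant to avoid.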

Re: New Section Suggestion: Tip of the Day
by LD2 (Curate) on Aug 21, 2001 at 02:08 UTC
    It's not a bad idea, but it has been briefly brought up before here and here. This may be messy, but I sort of liked the idea of just creating a node under Meditations called Perl Tips or whatnot. That way, this doesn't create any new work for vroom; it's simple and easy. If a new section is created down the line, that node can be dissected and I'm sure the tips can be added then.
Re: New Section Suggestion: Tip of the Day
by little (Curate) on Aug 21, 2001 at 11:40 UTC
    Hey, just kiddin:
    Reading the replies, I suggest you'd better call it "prejudice of the day", though. *grin*

    Have a nice day
    All decision is left to your taste
