PerlMonks  

How to parse Chinese Characters using XML-Simple?

by Anonymous Monk
on Sep 18, 2006 at 10:04 UTC ( #573518=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This question was transcribed from CPAN Forum by grantm

I am a Perl beginner. I want to write and read some Chinese characters in an XML configuration file, but I failed. Can anyone tell me how to do it? Thanks!

Re: How to parse Chinese Characters using XML-Simple?
by grantm (Parson) on Sep 18, 2006 at 10:06 UTC

    You haven't given us much to work with, so hopefully the following examples are enough to get you started:

    Here's a sample XML file which I've called 'greetings.xml' (my 'Chinese' characters came via babelfish so they probably don't make sense):

    <?xml version='1.0' encoding='utf-8'?>
    <doc>
      <para lang="en">Hello World</para>
      <para lang="zh">你好世界</para>
    </doc>

    Here's a short CGI script which reads the file using XML::Simple and outputs some of the data in an HTML page:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use XML::Simple qw(:strict);

    my $filename = 'greetings.xml';  # full path probably required

    my $xs = XML::Simple->new(
        ForceArray => [ 'para' ],
        KeyAttr    => { para => 'lang' },
    );

    my $doc = $xs->xml_in($filename);

    binmode STDOUT, ':utf8';

    print <<"EOF";
    Content-type: text/html; charset=utf-8

    <html>
      <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8" />
        <title>Test Page</title>
      </head>
      <body>
        <h1>$doc->{para}->{zh}->{content}</h1>
      </body>
    </html>
    EOF

    Of course your script does not need to be a CGI script and does not need to generate HTML. I chose to do it this way so that the output could be viewed in a web browser. If you just run the script from the command line, you may not be able to read all the characters in your terminal window; it depends on whether your terminal is set up to handle UTF-8 data.

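    The call to binmode(STDOUT, ':utf8') adds a UTF-8 layer to the output filehandle, so the character strings you print are written out as UTF-8 bytes. It also prevents the "Wide character in print" warning that Perl issues when such characters are printed to a handle with no encoding layer. You might want to select a different output encoding if whatever reads your output expects something other than UTF-8.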
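
    To make the effect of an output layer more concrete, here is a minimal sketch (separate from the scripts in this reply) that uses the core Encode module to count how many bytes the same four characters occupy in two encodings. The byte_length helper is a name made up for this illustration, and Big5 is only one example of an alternative encoding; pick whatever the software reading your output expects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The same four characters used in greetings.xml, written as code points
my $text = "\x{4f60}\x{597d}\x{4e16}\x{754c}";

# Hypothetical helper: Perl strings hold characters, and an output layer
# turns them into bytes. encode() does that conversion explicitly, so the
# resulting byte counts can be compared.
sub byte_length {
    my ($encoding, $string) = @_;
    return length encode($encoding, $string);
}

printf "characters:  %d\n", length $text;
printf "UTF-8 bytes: %d\n", byte_length('UTF-8', $text);
printf "Big5 bytes:  %d\n", byte_length('big5', $text);

# Add the layer before printing anything outside the ASCII range
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

    The :encoding(...) form of the layer accepts any encoding name that Encode knows about, so switching the output encoding should be a one-line change.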

    Here's an example of writing XML:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use XML::Simple qw(:strict);
    use Data::Dumper;

    my $filename = 'greeting.xml';

    my $xs = XML::Simple->new(
        ForceArray => [ 'para' ],
        KeyAttr    => { para => 'lang' },
        RootName   => 'doc',
    );

    my $data = {
        'para' => {
            'en' => { content => 'Hello World' },
            'zh' => { content => "\x{4f60}\x{597d}\x{4e16}\x{754c}" },
        },
    };

    binmode STDOUT, ':utf8';

    my $decl = "<?xml version='1.0' encoding='utf-8' standalone='yes'?>";

    print $xs->xml_out($data, XMLDecl => $decl);

    In this example, I used the \x{abcd} format to specify Unicode characters. You could type the Chinese characters in directly if your editor supports UTF-8 and you include use utf8; at the top of your script.

    use utf8;
    
    my $data = {
        'para' => {
            'en' => { content => 'Hello World' },
            'zh' => { content => "你好世界" },
        },
    };
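
    If you want to convince yourself that the two spellings really produce the same string, a small check along these lines should do it. This is only a sketch: it assumes the script file itself is saved as UTF-8, and the variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is UTF-8 encoded

# The greeting typed in directly, and the same greeting as code-point escapes
my $literal = "你好世界";
my $escaped = "\x{4f60}\x{597d}\x{4e16}\x{754c}";

binmode STDOUT, ':encoding(UTF-8)';

printf "length of literal: %d\n", length $literal;
print "literal and escaped forms are ",
    ($literal eq $escaped ? "identical" : "different"), "\n";
```

    Without the use utf8; line, perl would treat the typed-in literal as 12 separate bytes rather than 4 characters, and the comparison would fail.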
    

      Where was this question and reply 1 week ago when I had to figure out how to process huge XML files in 7 languages (including Chinese and Japanese)?!?

      But, yeah, it was all about the binmode ':utf8' line. Once that was set, everything else started behaving properly and I looked like I knew what I was doing.
