PerlMonks  

Convert HTML symbols to equivalent Unicode

by jai_dgl (Beadle)
on Apr 14, 2009 at 09:47 UTC ( #757360=perlquestion )
jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I need to parse some HTML files and have to write an XML
output file. In some cases I get an XML parser error.

Example:
The symbol ® ( REGISTERED SIGN ) need to be convert to its equivalent unicode U00AE
Is there any module to convert all the special characters into
their Unicode equivalents?
Note:
I don't want Decimal Equivalent or HTML entities as this XML file should be parsed in JSON.

Re: Convert HTML symbols to equivalent Unicode
by Anonymous Monk on Apr 14, 2009 at 09:58 UTC
      If you want 00AE, you want UCS2, not UTF8
Re: Convert HTML symbols to equivalent Unicode
by grantm (Parson) on Apr 14, 2009 at 10:13 UTC
      Hey
      I used HTML::Entities; it converts the ® symbol to
      &reg;, which is not parsed in XML.
      I need Exactly Unicode equivalent as U00AE.
      Is there a way to get it?

        I need Exactly Unicode equivalent as U00AE.

        That doesn't make any sense. Please speak in terms of HTML entities, Unicode characters, U+xxxx notation and perhaps UTF-8 encoding.

        • Are you asking how to get character U+00AE from "&reg;"?

          decode_entities will get the character from &reg;.

        • Are you asking how to get the string "U+00AE" from character U+00AE?

          ord will get 0xAE from the character.

          sprintf can be used to format 0xAE as hex.

        For example,

        >perl -MHTML::Entities=decode_entities -e"printf qq{U+%04X\n}, ord(decode_entities('&reg;'))"
        U+00AE

        But then again, you also mentioned JSON. Whatever JSON module you use will handle serializing the character U+00AE as "\u00AE" or similar, so all you need is decode_entities.
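
        The steps above can be put together in a short script. This is only a sketch, assuming HTML::Entities is installed and using the core JSON::PP module as a stand-in for "whatever JSON module"; the hash key `symbol` is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use JSON::PP;

# Entity -> character: one character, code point U+00AE.
my $char = decode_entities('&reg;');

# Character -> "U+xxxx" notation, via ord and sprintf.
my $label = sprintf 'U+%04X', ord $char;

# Character -> JSON. The ascii option asks JSON::PP to emit
# non-ASCII characters as \u escapes instead of raw characters.
my $json = JSON::PP->new->ascii->encode({ symbol => $char });

print "$label\n";
print "$json\n";
```

        The first line printed is U+00AE. The second is a JSON object whose string value is a \u escape for the same character, so no entity or manual conversion is involved once decode_entities has run.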

Re: Convert HTML symbols to equivalent Unicode
by dorward (Curate) on Apr 14, 2009 at 11:35 UTC

    Your question is very confusing.

    I need to parse some HTML files and have to write an XML output file. In some cases I get an XML parser error.

    • What are you using to parse the HTML?
    • What are you using to write the XML?
    • What "XML parser error" do you get?
    • Show us some code

    The symbol ® ( REGISTERED SIGN ) need to be convert to its equivalent unicode U00AE

    Is "®" the result you get after parsing the HTML? i.e. the character and not an ampersand followed by an identifier and then a semicolon? If so, then the parsing is working fine, and it is the output you need to worry about. (I mention this because PerlMonks takes HTML input, so you might have wanted to say &reg; rather than ®.)

    Any XML library should output something appropriate when given ® as input. Either it will output an entity (which should be absolutely fine, since you should be parsing XML only with an XML parser, which can handle such things) or it will output the character in whatever character encoding is being used. In that case you just need to make sure you are outputting UTF-8 (or whichever Unicode encoding you want); how you do that depends on which XML library you are using.

    I don't want Decimal Equivalent or HTML entities

    Named HTML entities could screw things up in XML, but the decimal entity should not.

    as this XML file should be parsed in JSON.

    This doesn't make sense. XML is a data format. JSON is a completely different data format.

    You can't parse anything in JSON.

    You could store an XML document as a string inside a JSON object, but that shouldn't prevent you from using entities — you would parse the JSON to extract the string of XML, then put that string in an XML parser to extract the data from it.

    What is the problem you are really trying to solve? You don't seem to have provided enough detail here.

      Thanks for your reply,
      I spider some web pages and pull some information from them.

      Then I use HTML::Entities::encode_entities() function to convert the special characters

      When ® is passed to it, it is converted to &reg;
      When HTML::Entities::encode_numeric() is used, it gives the numeric entity for ®

      My final idea is to write an XML file with its Unicode equivalent (U+00AE).

      Thanks

        I spider some web pages and pull some information from them.

        At this point you will either have the results of parsing HTML or you will have raw HTML.

        When ® is passed to it, it is converted to &reg;

        This suggests you are dealing with parsed HTML, with any entities already converted to real characters.

        In this case, there are two approaches you could take.

        1. Use a module designed for writing XML
        2. Use a generic template module

        If you use a module designed for writing XML then you can just pass the ® and the module will output either a raw ® or a numeric entity for it. Any XML parser that reads this XML should be able to cope with either, so it doesn't matter which you end up with and you don't need to worry about it.

        If you use a generic templating language, then you have to deal with using a raw ® or converting it with HTML::Entities::encode_numeric() or a similar module. Basically, you have to do all the things that a proper XML module would do for you.

        It sounds like you are using a generic templating language, but it is almost certainly better to use a real XML module.

        Afterthought: You might also be messing about with raw strings in the middle of your code. That way lies madness.
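
        To make concrete what "all the things that a proper XML module would do for you" means, here is a minimal sketch of escaping text by hand with core Perl only. The helper name `xml_escape_text` and the element name are hypothetical, and decimal character references are just one valid choice (hexadecimal ones are equally valid XML). It covers text content only, not attribute values or characters that XML forbids, which is part of why a real XML module remains the better option.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a character string for use as XML text content.
sub xml_escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # Replace every non-ASCII character with a decimal character reference.
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

my $escaped = xml_escape_text("Acme\x{AE} & Sons");
print "<name>$escaped</name>\n";    # prints <name>Acme&#174; &amp; Sons</name>
```

        Because the output is pure ASCII, it is valid regardless of which encoding the XML declaration names, and any conforming XML parser turns &#174; back into U+00AE.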

        Hello,
        I want numeric entities, but the code below displays à. I need, for example, ©. How can I get this output?

        my $encoded = HTML::Entities::encode_numeric($val, '^\n\x20-\x25\x27-\x7e');

        Thanks, Umesh
Re: Convert HTML symbols to equivalent Unicode
by almut (Canon) on Apr 14, 2009 at 11:55 UTC
    In some cases I get an XML parser error.

    Which module(s) are you using? What's the exact error? ("The symbol ® ( REGISTERED SIGN ) need to be convert to its equivalent unicode U00AE" doesn't look like the error message; at least Google doesn't find any exact match (even with grammar fixed), and as we don't know what module's source code to grep, ...)

    As it is, the problem is underspecified. In order to give advice on what to do, we'd need to know what input encoding you have, and what output encoding you need. The U+00AE you mention is just the unicode codepoint, i.e. a mere number, which would always need to be encoded somehow for transmission and storage, e.g. as UTF-8, UTF-16LE, HTML entities, whatever...  Similarly, we can only guess that your input is maybe in ISO-Latin-1 encoding.
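
    The difference between a code point and its encodings can be shown with the core Encode module. This is a sketch that assumes, as guessed above, that the input byte is ISO-Latin-1; the helper name `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character string and show the bytes as hex.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Assumption: the input is ISO-Latin-1, where byte 0xAE is the registered sign.
my $char = decode('ISO-8859-1', "\xAE");

printf "code point: U+%04X\n", ord $char;    # prints code point: U+00AE
for my $encoding ('UTF-8', 'UTF-16LE') {
    print "$encoding: ", hex_bytes($encoding, $char), "\n";
}
# prints UTF-8: C2 AE
# prints UTF-16LE: AE 00
```

    The code point stays 0xAE throughout; only the bytes written to disk or sent over the wire change with the chosen encoding.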
