PerlMonks  

Re^2: replace string

by sandy1028 (Sexton)
on May 18, 2009 at 09:21 UTC (#764607)


in reply to Re: replace string
in thread replace string

How do I convert XHTML entity characters to XML form?


Re^3: replace string
by Anonymous Monk on May 18, 2009 at 09:24 UTC
    No need, XHTML is XML
      How do I encode or decode &rsquo; to &#8217;?
        Buy an encoder?

        If you provide sample data you may get more specific guidance. Your description is a bit too vague for anyone to be certain what you have as input and what you want as output. There are many possibilities.

        In addition to the suggestions already given, you may find perlunifaq and Encode helpful. I suspect you don't need Encode for what you are trying to do, but these will give you terminology and context to help you understand about encodings in general and about perl's internal representation of strings, which may be what you are trying to manipulate.

        Perl regular expressions support escape sequences that allow you to specify fairly arbitrary values in your string, including Unicode code points.

        \033       octal char (example: ESC)
        \x1B       hex char (example: ESC)
        \x{263a}   long hex char (example: Unicode SMILEY)
        \cK        control char (example: VT)
        \N{name}   named Unicode character

        It may be that all you need to do is specify the correct characters in your RE, using one of the escapes (probably long hex char or named Unicode character, depending on your preference). But it is possible you will have to decode your input first.
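To illustrate the two escapes mentioned above, here is a minimal sketch. It assumes the input has already been decoded into Perl characters and that the goal is numeric character references; the sample text and the variable names are made up for the example.

```perl
use strict;
use warnings;
use charnames ':full';   # makes \N{...} names available on older perls

# Assumption: the curly quotes are real characters in the string,
# not entity text such as "&rsquo;".
my $text = "Officially called \x{201C}events,\x{201D} it\x{2019}s done";

(my $out = $text) =~ s/\x{201C}/&#8220;/g;              # long hex char escape
$out =~ s/\N{RIGHT DOUBLE QUOTATION MARK}/&#8221;/g;    # named Unicode character
$out =~ s/\x{2019}/&#8217;/g;

print "$out\n";   # Officially called &#8220;events,&#8221; it&#8217;s done
```

If the input is still raw bytes, these patterns will not match until the data has been decoded.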

        If you use Devel::Peek's Dump to dump your input data and post that, then you might get more specific advice.
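For reference, a small sketch of what such a dump could look like in practice. The two sample strings are hypothetical; they stand for the same apostrophe held once as raw UTF-8 bytes and once as a decoded character.

```perl
use strict;
use warnings;
use Devel::Peek;

# Two hypothetical inputs that look alike when printed to a UTF-8 terminal:
my $bytes = "\xE2\x80\x99";   # three raw bytes: the UTF-8 encoding of U+2019
my $chars = "\x{2019}";       # one character: U+2019 itself

# Dump writes its report to STDERR; compare the FLAGS and PV lines.
Dump($bytes);
Dump($chars);

printf "bytes length: %d, chars length: %d\n", length($bytes), length($chars);
```

The FLAGS line of the second dump should include UTF8, which is the detail that tells you whether a regex written with \x{2019} has any chance of matching.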

      The input is:
      <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
      The string should be converted to:
      <b></b>Officially called <>&#8220;events,&#8221;</a> as "never events"
      How can I convert only such characters?
        A better idea would be to provide a hex dump of the data, so we know what the actual bytes are:
        echo |hexdump
        00000000: 45 43 48 4F 20 69 73 20 - 6F 6E 2E 0D 0A |ECHO is on. |
        0000000d;
        or
        echo |od -tx1
        0000000 45 43 48 4f 20 69 73 20 6f 6e 2e 0d 0a
        0000015
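If neither tool is at hand, the same kind of dump can be produced from inside Perl. This is only a sketch; the sample string is an assumption about what the data might contain, and it expects a byte string rather than decoded characters.

```perl
use strict;
use warnings;

# Hypothetical sample: the entity text as it might appear in the input.
my $data = "&rsquo;\r\n";

# Format each byte as two hex digits, separated by spaces.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $data;

print "$hex\n";   # 26 72 73 71 75 6f 3b 0d 0a
```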
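        use strict;
        use warnings;

        # Single quotes, so the double quotes inside the sample need no escaping.
        my $input = '<b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"';
        print "input: $input\n";

        $input =~ s/&ldquo;/&#8220;/g;    # change the lines
        $input =~ s/&rdquo;/&#8221;/gi;
        $input =~ s/&rsquo;/&#8217;/gi;

        print "processed input: $input\n";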

        produces

        input: <b></b>Officially called <>&ldquo;events,&rdquo;</a> as "never events"
        processed input: <b></b>Officially called <>&#8220;events,&#8221;</a> as "never events"

        This was done using your REs and appears to provide the output you are looking for.

      In case anybody visits this node again, I have a guess what sandy1028 is talking about. HTML (and XHTML) define a set of named character entities such as &nbsp;. Generic XML parsers will not recognize these entities because they are application specific. So he or she needs, for whatever reason, to translate XHTML-specific named character entities to their corresponding numeric character entities for use in some non-HTML XML application.

      A few years ago I had to do this myself when reformatting some good old-fashioned HTML into something that could be used in an XSLT stylesheet. Man was that a pain.
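If that guess is right, the translation could be sketched with a lookup table, as below. The sub name `named_to_numeric` and the `%codepoint` table are hypothetical and deliberately tiny; a real conversion would need the full XHTML entity list.

```perl
use strict;
use warnings;

# Hypothetical, deliberately small table: entity name => code point.
my %codepoint = (
    nbsp  => 160,
    lsquo => 8216,
    rsquo => 8217,
    ldquo => 8220,
    rdquo => 8221,
);

sub named_to_numeric {
    my ($text) = @_;
    # Names not in the table (including XML's own amp, lt, gt, quot, apos)
    # are put back unchanged.
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}{
        exists $codepoint{$1} ? "&#$codepoint{$1};" : "&$1;"
    }ge;
    return $text;
}

print named_to_numeric('Officially called &ldquo;events,&rdquo; &amp; more'), "\n";
# Officially called &#8220;events,&#8221; &amp; more
```

Leaving unknown names alone keeps the five entities that every XML parser already understands intact.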
