
A problem with dash typography

by hsmyers (Canon)
on Sep 08, 2015 at 18:36 UTC

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

I have a great deal of text that has both endashes and emdashes (– and — respectively) within HTML files as plain text. Since my editor gladly converts this (nary a complaint) I usually don't pay any attention. However, I recently noticed a problem with the HTML::Entities encode_entities function, i.e.
encode_entities("How the Chimney–sweeper's cry,")
How the Chimney&acirc;&#128;&#147;sweeper's cry,
rather than:
How the Chimney&#8221;sweeper's cry,
Now that I've spotted the problem, I can easily do the necessary regex massage and have it go away, but I was wondering if anyone knows the necessary Unicode/UTF-8 incantation magic to avoid the problem in the first place (if in fact that is what it is)? Note that the emdash is translated to &acirc;&#128;&#148; instead of &#8222;. I have not checked the other typical HTML typographical elements as yet; these are so common that the problem surfaced fairly quickly.

Note: I leave the typos as written, but I really meant &#8212 and &#8211 *sigh*

Note: seems pertinent...


"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Re: A problem with dash typography
by 1nickt (Canon) on Sep 08, 2015 at 18:57 UTC

    Hm, the correct numeric entity code for the em-dash is &#8212; ... otherwise &mdash;

    It works for me with the correct character encoding in and out:

    [12:04][nick:~/monks]$ perl -Mstrict -Mutf8 -MHTML::Entities -E ' binmode STDOUT,":utf8";
    > say encode_entities("Chimney—sweeper");
    > say encode_entities("Chimney–sweeper");
    > say decode_entities("Chimney&mdash;sweeper");
    > say decode_entities("Chimney&ndash;sweeper");
    > '
    Chimney&mdash;sweeper
    Chimney&ndash;sweeper
    Chimney—sweeper
    Chimney–sweeper
    Hope this helps!

    Edit: Decoded characters may not display properly here ...

    The way forward always starts with a minimal test.
      Sorry about the typos...that aside I believe you have nailed the necessary magic with the ':utf8'...excepting in this case it is required before I read the file. Will see what happens, thanks!


      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."

        If you put:

        use utf8;
        at the top of the script, this tells Perl that your source code contains UTF-8-encoded Unicode characters.

        If you want to read and write UTF-8 on the standard streams, do this at the top of the script:

        binmode STDIN, ':utf8'; binmode STDOUT, ':utf8';
        Hope this helps!

        The way forward always starts with a minimal test.
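
For the case raised above, where the text comes from a file rather than STDIN, the same idea can be applied when the file is opened. What follows is only a minimal sketch: the helper name `encode_file` and the file name in the usage line are hypothetical, and it uses the stricter `:encoding(UTF-8)` layer in place of `:utf8`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not from the thread): read a UTF-8 file and
# return its text with HTML entities encoded.
sub encode_file {
    my ($path) = @_;

    # The encoding layer decodes UTF-8 bytes into characters on read,
    # so encode_entities sees one character per dash, not three bytes.
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    return encode_entities($text);
}

# Usage (file name is an assumption): perl encode_file.pl poem.html
print encode_file($ARGV[0]) if @ARGV;
```

With its default settings encode_entities replaces every non-ASCII character, so the result is plain ASCII and can be printed without setting a layer on STDOUT.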
Re: A problem with dash typography
by kcott (Archbishop) on Sep 09, 2015 at 02:57 UTC

    G'day hsmyers,

    When your source code contains UTF-8, you need to tell Perl about this. You do this with the utf8 pragma. See that documentation for more complete details on that (somewhat oversimplified) advice.

    Here's my test:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    #use utf8;
    use HTML::Entities qw{encode_entities};

    my $dash = 'DASH: "-"';
    my $emdash = 'EMDASH: "—"';
    my $endash = 'ENDASH: "–"';

    print encode_entities($_) for ($dash, $emdash, $endash);

    Output from this code:

    DASH: &quot;-&quot;
    EMDASH: &quot;&acirc;&#128;&#148;&quot;
    ENDASH: &quot;&acirc;&#128;&#147;&quot;

    Output after uncommenting "#use utf8;":

    DASH: &quot;-&quot;
    EMDASH: &quot;&mdash;&quot;
    ENDASH: &quot;&ndash;&quot;

    — Ken
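
The difference between the two runs of the test above comes down to bytes versus characters. Here is a small sketch of that, assuming the dash arrives as raw UTF-8 bytes (written with escapes so the example does not depend on how this source file is saved). The variable names are illustrative only, and Encode's `decode` stands in for whatever decoding step a real script would use.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes for an em dash (U+2014): E2 80 94.
my $bytes = "EMDASH: \xE2\x80\x94";

# Without decoding, each of the three bytes is encoded separately.
my $from_bytes = encode_entities($bytes);

# After decoding, the dash is a single character with a named entity.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = encode_entities($chars);

printf "%d bytes -> %s\n", length $bytes, $from_bytes;
printf "%d chars -> %s\n", length $chars, $from_chars;
```

This should print `11 bytes -> EMDASH: &acirc;&#128;&#148;` followed by `9 chars -> EMDASH: &mdash;`, matching the two outputs shown in the test.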

      I suspicioned as much, will fiddle with this...thanks!


      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."