
keeping diacritical marks in a string

by Foxpond Hollow (Sexton)
on Oct 08, 2009 at 03:59 UTC ( #799858=perlquestion )

Foxpond Hollow has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I come seeking a means of preserving diacritical marks in a string. The situation is that I am using LWP to access a website and copy certain parts of it into various strings. It's all bibliographic information for various books. The titles sometimes contain diacritical marks, ranging from your run-of-the-mill umlaut and accent grave to your more bizarre Russian characters that I don't know the names of.

I'm not looking for a way of stripping the diacritics out. In fact, that's the problem. When I copy the text into the string, it copies as basic ASCII. I need it preserved as-is, because I'm turning it right around and searching a database with it, and that database expects it to still have the diacritics, and finds no results if it doesn't.

I'm not too familiar with encoding schemes, so I'm not really sure what I should be looking for in terms of modules and approaches. Any help would be appreciated. Thanks.

UPDATE: Here's a link to the page I'm working with:

The page I am working with

Upon closer inspection, I realized it is not actually converting the characters to basic ASCII. It is just removing them entirely. So "Das europäische Volksmärchen" becomes "Das europische Volksmrchen", which is why the searches weren't working. It turns out the database doesn't actually care about the accent marks, but I do kinda still need the letters.

The weird thing is that according to the source for the page, it's UTF-8 and there is no encoding on the characters themselves (i.e., no &xxxx codes), but I thought UTF-8 could be converted back to basic ASCII as needed? Is this something I need to actually implement in the code to make happen?

The code that fetches the page with the title on it is the following:

    $HTML = HTTP::Request->new( GET => $MARC_page );
    $HTML = $user_agent->request($HTML);
    $HTML = $HTML->content;

So $MARC_page is the actual link (provided above) to the page I need, LWP fetches it and after a couple steps passes all of the content into the $HTML scalar. The code that fetches the title from $HTML is the following:

    if ($HTML =~ m{
            245\d{0,2}   # MARC code 245 followed by 0-2 indicators
            .*?          # followed by anything, ungreedy
            \|a\s        # followed by a pipe and the subfield
            (.*?)        # followed by the title,
                         # which can be anything, ungreedy
            (?:\||<)     # followed by a pipe and the next subfield
                         # or, if no subfield, an opening HTML tag bracket
        }xmgs) {
        $title = MARC::Field->new('245','','', 'a' => "$1");
    }
    else {
        $title = MARC::Field->new('245','','', 'a' => "field does not exist");
    }

I'm sure that didn't format nearly as well as I'd've liked, but hopefully it's still readable.

I'm using Perl 5.8.9. Sorry for not providing more info earlier, like I said, I wasn't even sure what info would be needed. Hopefully this will be more helpful. Thanks for any assistance.

UPDATE 2: Okay so the link I gave above doesn't work because that record has actually been deleted as part of routine maintenance. It's irrelevant to this, so don't worry about that. Here's a link to the same info that should still work:

I am hoping this one works, Melvyl has the most god awful URLs to work with that I've ever seen.

Replies are listed 'Best First'.
Re: keeping diacritical marks in a string
by FalseVinylShrub (Chaplain) on Oct 08, 2009 at 04:58 UTC

    Hi there

    An interesting question, but I think you will have to provide some more information:

    • What do the special characters look like in the original data?
    • Are they encoded with &#xxx; or in some encoding scheme?
    • Is there a header indicating the encoding?
    • What do the characters look like when viewed as hex?
    • What is the bit of code that is removing the characters?
    • What version of perl are you using?
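For the hex question in the list above, here is a minimal sketch of one way to look at the raw bytes. The helper name show_non_ascii and the sample string are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): render every byte outside
# printable ASCII as \xHH so you can see exactly what arrived.
sub show_non_ascii {
    my ($bytes) = @_;
    (my $shown = $bytes) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $shown;
}

# Sample data (an assumption): "ä" encoded as UTF-8 is the two bytes C3 A4.
my $raw = "europ\xC3\xA4ische";
print show_non_ascii($raw), "\n";
```

Running this on the undecoded LWP content, before any regex or MARC processing, should show whether the accented letters are still present at that stage.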

    I won't be online till tomorrow, but hopefully someone else will be able to help you if you provide that information.

    Cheers, F.V.S.

      I don't know if it alerts you when a post you've commented on is updated, so in case it doesn't, I've updated the post with the info you asked for. Note that the second update has the correct URL and you should ignore the URL in the first update.


        I can't see the obvious source of the problem. I think you need to dump out the result of the request before any processing and be sure exactly where the special characters are being lost. i.e. is it coming correctly out of LWP, is it the regex, could it be the MARC:: module, etc.

        As graff said it shouldn't be losing these characters, but there are a number of places where things can go wrong.

        It's all a bit complicated and I can't think of a good guide to it at the moment. On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem - normally, you would get a multi-byte utf-8 character treated as 2 or 3 characters if the encoding is not set correctly. So I suspect an error in some code somewhere - could it be that something is validating input and stripping out characters it doesn't think are "safe"...?
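To illustrate the point about one UTF-8 character being treated as two or three, here is a small sketch using the core Encode module. The sample word is an assumption chosen to match the title quoted in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "Märchen" as they would come off the wire (sample data).
my $bytes = "M\xC3\xA4rchen";

# The same text as a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "undecoded length: %d\n", length $bytes;   # "ä" counts as two bytes here
printf "decoded length:   %d\n", length $chars;   # "ä" counts as one character here
```

Note that neither form loses the letter; a wrong encoding setting normally produces extra characters, not missing ones.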

        Sorry I can't be of more help. Try to narrow it down to where they disappear and it will be solved eventually.

Re: keeping diacritical marks in a string
by graff (Chancellor) on Oct 08, 2009 at 05:00 UTC
    Some relevant sample data (or the web site url, if that's appropriate) would really help here, along with an actual code snippet that shows us what you are doing with the data.

    It matters what sort of character encoding the web site is using (some sort of latin-1? utf-8? something else?), and it also matters what your script is doing when opening file handles for input or output, making database connections, and using LWP methods. Oh, and it also matters what character encoding is being used in the database. (Is it the same or different compared to what is being used at the web site?)

    Lacking all those details, I don't think there's much we can say about your problem -- except that it sounds a bit implausible: if the web site content includes accented characters, I wouldn't expect a quiet conversion to "basic ASCII", unless your script is explicitly applying this sort of behavior somehow. I might expect warnings or errors or some sort of character-entity-reference stuff, if the data is ending up different from its original form.
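One way to check which encoding a response declares, and what the body looks like before and after decoding, is sketched below. It builds an HTTP::Response by hand so it runs without network access; the header and body are sample data, and the use of content_charset and decoded_content is a suggestion that does not appear elsewhere in this thread.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built response standing in for what LWP would return (sample data).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "Das europ\xC3\xA4ische Volksm\xC3\xA4rchen",
);

my $charset = $res->content_charset;   # charset declared for (or sniffed from) the body
my $raw     = $res->content;           # undecoded bytes
my $text    = $res->decoded_content;   # Perl character string

binmode STDOUT, ':encoding(UTF-8)';
printf "charset: %s\n", $charset;
printf "raw length: %d, decoded length: %d\n", length $raw, length $text;
print "$text\n";
```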

Re: keeping diacritical marks in a string
by Utilitarian (Vicar) on Oct 08, 2009 at 07:11 UTC
    Hi Foxpond,
    The characters are almost certainly encoded as HTML entities on the web site, i.e. é will be represented in the page as &eacute; or &#233;.

    In order to decode these you can call decode_entities($string) from the HTML::Entities module.

    This will change entities in $string to Unicode characters, which is the most likely encoding in the database. Check out the module's documentation at HTML::Entities
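A minimal sketch of that approach, assuming HTML::Entities is installed; the sample title is made up to mirror the one in the question and mixes a named and a numeric entity.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Sample input (an assumption): the same letter written as a named
# entity and as a numeric character reference.
my $string = 'Das europ&auml;ische Volksm&#228;rchen';

# In scalar context decode_entities returns the decoded copy and leaves
# $string alone; called in void context it would decode $string in place.
my $decoded = decode_entities($string);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```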

    That said, more info would be useful as this solution only deals with the default case.

Re: keeping diacritical marks in a string
by graff (Chancellor) on Oct 10, 2009 at 09:34 UTC
    Thanks for the update. I had no luck with the urls you posted, but I was able to go to the web page, put in a request that yielded accented characters in the output, and use the resulting url to push that request through LWP.

    Since I got different content from what you were getting, your regex didn't really apply for my data (and I guess your regex isn't related to the problem anyway, since it has nothing to do with accented letters). Anyway, here's some code that demonstrates how the non-ascii content works:

    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;
    use Encode;   # you need this module

    binmode STDOUT, ":utf8";

    my $ua  = LWP::UserAgent->new;
    my $url = "some_url_that_works_for_you";
    my $req = HTTP::Request->new( GET => $url );
    my $res = $ua->request( $req );
    my $txt = decode( 'utf-8', $res->content );   # decode "external" utf8 to "internal"

    my @accented = ( $txt =~ /(\w*?[^[:ascii:]]\w*)/g );
    if ( @accented ) {
        printf( "found %d words with non-ascii characters.\n", scalar @accented );
        my @alphanumerics = grep /^\w+$/, @accented;
        printf( "of those, %d words match ^\\w+\$:\n ", scalar @alphanumerics );
        print join( "\n ", @alphanumerics ), "\n";
        my @diacritic_marks = grep /\p{NonspacingMark}/, @accented;
        printf( "and %d used separate diacritic marks:\n ", scalar @diacritic_marks );
        print join( "\n ", @diacritic_marks ), "\n";
    }

    Having tried it myself, I learned that non-spacing diacritic marks (presented as separate characters, rather than being an intrinsic part of a letter -- e.g. the second character in "U+0061 U+0301" for á, rather than U+00E1) all fall into the category of things that match "\w".
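The observation about "\w" can be checked in isolation with a short sketch like the one below. The two sample strings are assumptions: the precomposed á (U+00E1) and the letter a followed by U+0301 COMBINING ACUTE ACCENT, both decoded from UTF-8 bytes so that Perl applies Unicode rules to them.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $precomposed = decode('UTF-8', "\xC3\xA1");    # á as one code point, U+00E1
my $decomposed  = decode('UTF-8', "a\xCC\x81");   # a + U+0301 combining acute

for my $s ($precomposed, $decomposed) {
    printf "%d code point(s), matches ^\\w+\$: %s, has nonspacing mark: %s\n",
        length $s,
        ( $s =~ /^\w+$/              ? 'yes' : 'no' ),
        ( $s =~ /\p{NonspacingMark}/ ? 'yes' : 'no' );
}
```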

    You might want to check out this little command-line tool I posted a while back -- it can really help with getting a handle on what kinds of unicode data you are really dealing with: tlu -- TransLiterate Unicode; check my home page for a few other unicode tools.

    (UPDATE: Forgot to mention -- I also noticed that the source data from the web site tended to use both the single-character "accented_letter" and the two-character "letter accent_mark" for the same thing -- that is, their unicode usage is inconsistent, and somewhat non-standard.)
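If the mixed single-character and two-character forms get in the way of comparisons or database searches, one possible remedy (not proposed in the thread itself) is to normalize every string to the same form first. A rough sketch with the core Unicode::Normalize module, using made-up sample strings:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{E1}";      # á as one code point
my $decomposed  = "a\x{0301}";   # a + combining acute accent

# The two spellings look the same on screen but compare as different strings.
my $equal_before = $precomposed eq $decomposed ? 'yes' : 'no';

# NFC composes "letter + mark" pairs into the single-character form where one exists.
my $equal_after = NFC($precomposed) eq NFC($decomposed) ? 'yes' : 'no';

print "equal before normalizing: $equal_before\n";
print "equal after NFC:          $equal_after\n";
```

Whether NFC or the decomposed form NFD is the better target depends on what the database at the other end expects.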

Node Type: perlquestion [id://799858]
Approved by graff