Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Writing International Phonetic Alphabet symbols to Excel?

by cypress (Beadle)
on Sep 24, 2009 at 18:54 UTC ( #797279=perlquestion: print w/ replies, xml ) Need Help??
cypress has asked for the wisdom of the Perl Monks concerning the following question:

I’m trying to extract International Phonetic Alphabet (IPA) symbols from html source code. Internet Explorer’s View > Source console displays the symbols as symbols (ie, not in any utf form); moreover, from this console I can copy these symbols and paste them into Notepad as unformatted Arial-font text without utf code taking their place.

However, when I ask Perl to extract such symbols from html source and write them to Excel via Spreadsheet::WriteExcel, I get junk. (Spreadsheet::WriteExcel’s default ‘write’ font is Arial. Having opened the resulting Excel file, it makes no difference what font – including IPA-specific fonts – I choose to display a given cell's contents: it’s still junk.)

Can you explain to me what’s going on? Is there a fix? I’m not familiar with utf-8 programming in Perl, though I suspect I’ll need to go there.

The website I’m trying to use is www.dictionary.com. Search on a word like ‘hello’ and click Show IPA. The returned stuff between the slashes – that’s the stuff I wish to extract and have Perl write to a spreadsheet.

Many thanks!

Comment on Writing International Phonetic Alphabet symbols to Excel?
Re: Writing International Phonetic Alphabet symbols to Excel?
by zwon (Monsignor) on Sep 24, 2009 at 19:04 UTC

    Try to open html file as follows:

    open my $fh, "<:utf8", "html_file_name" or die $!;
      I'd be (pleasantly) surprised if the problem was in reading the html rather than in writing to Excel. Let me give your suggestion a try....
Re: Writing International Phonetic Alphabet symbols to Excel?
by cypress (Beadle) on Sep 24, 2009 at 19:12 UTC

    Dropping 'binmode STDOUT;' into my code in various places doesn't seem to help either. (See the Spreadsheet::WriteExcel documentation note on UTF-8, which I may not be reading correctly anyway.)

    -Cypress

Re: Writing International Phonetic Alphabet symbols to Excel?
by Anonymous Monk on Sep 24, 2009 at 19:27 UTC
Re: Writing International Phonetic Alphabet symbols to Excel?
by rdfield (Priest) on Sep 24, 2009 at 19:34 UTC
    You're probably going to have to encode the symbols in Microsoft's own character set. I've done a lot of coding the other way round - extracting symbols from Excel and trying to put them in a Postgres database with UTF8 encoding. You need to find the matching symbol in, say, "Character Map", find the ord() of that, and change the value of the input character to that ord value.

    rdfield

      There is no "Microsoft's own character set" to worry about. Excel uses totally standard Unicode internally.

      I've done quite a lot recently writing Unicode text into Excel files with Perl on Solaris, and I've never had to worry about encodings at all.

        Looks like it was just the CSV files generated from Excel - working with the spreadsheet directly sounds like a better option.

        rdfield

        After doing some testing on this, it looks like MS uses latin1 (or something very close) internally.

        I used a slightly modified version of the example script from Spreadsheet::ParseExcel, replacing the "print" with a database insert.

        When processing some data from a spreadsheet into a PostgreSQL database, cells with symbols such as 0xae (ascii 92, the "registered" symbol, ®), I constantly came up against the database error:
        DBD::Pg::db do failed: ERROR: invalid byte sequence for encoding "UTF +8": 0xae
        After setting the client encoding to latin1 (keeping the database at UTF8):
        $dbh->do("set client_encoding to latin1");

        the data went in OK.

        If there is a different/better way to process this, I'd be interested to know.

        Update: there is a better way...
        #!/usr/bin/perl use warnings; use strict; use Spreadsheet::ParseExcel; my $parser = Spreadsheet::ParseExcel->new(); my $workbook = $parser->Parse('Book1.xls'); binmode(STDOUT, ":utf8"); foreach my $worksheet ( $workbook->worksheets() ) { my ( $row_min, $row_max ) = $worksheet->row_range(); my ( $col_min, $col_max ) = $worksheet->col_range(); foreach my $row ( 1 .. $row_max ) { foreach my $col ( $col_min .. $col_max ) { my $cell = $worksheet->get_cell( $row, $col ); next unless $cell; next unless defined($col_mapping{$col}); my $value = $cell->value(); utf8::upgrade($value); ... store_in_database($value); ... } } }

        rdfield

Re: Writing International Phonetic Alphabet symbols to Excel?
by moritz (Cardinal) on Sep 24, 2009 at 19:46 UTC

    Please show include this in your program:

    use Devel::Peek; Dump $string_you_want_to_write_to_excel_file;

    And show us the output. That might help us to help you.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Writing International Phonetic Alphabet symbols to Excel?
by Porculus (Hermit) on Sep 24, 2009 at 20:56 UTC

    Break the problem down! I suggest you forget about reading from HTML for now, and try to construct a minimal script that does nothing apart from write some IPA into an Excel file.

    You can take as your starting point something like "h\x{025b}\x{02c8}lo\x{028a}" (your phonetic "hello" as a Perl string). That will completely rule out any possibility that something is going wrong at the HTML stage.

    If you can get that working, you can easily plug it into your main program. And if you can't get it working, then it'll be small enough that you can simply post the entire script here, and we can try and help you figure out where it's going wrong.

Re: Writing International Phonetic Alphabet symbols to Excel?
by jmcnamara (Monsignor) on Sep 24, 2009 at 23:33 UTC

    Spreadsheet::WriteExcel can handle IPA characters if they are UTF-8 encoded. See the output of the following program which encodes the first few characters from the this IPA chart. (You will need perl 5.8 or later):
    #!/usr/bin/perl use strict; use warnings; use Spreadsheet::WriteExcel; my $workbook = Spreadsheet::WriteExcel->new('phonetic.xls'); my $worksheet = $workbook->add_worksheet(); my $row = 0; for my $code (0x0250 .. 0x0259) { $worksheet->write($row++, 0, chr $code ); }
    So the problem comes down to extracting the strings that you want and ensuring that they are encoded as UTF-8 when they get into your program.

    Hopefully, someone else in the thread can help you with that part. :-)

    --
    John.

      I have two problems actually.

      The first I described in my first post: The output isn't the IPA-encoded stuff I'm expecting. The suggestion that I find UTF-8 codes for each IPA symbol is good -- I can do that easily, and 'write' each symbol's UTF-8 hexidecimal code to Excel as needed. But as my html source doesn't give the UTF-8 hexidecimal, I'd need some method of encoding the html source's IPA symbols as UTF-8 hexidecimal before doing my Excel write.

      For this task, I might try to make sense of the post at http://www.dev411.com/blog/2006/09/29/perl-getting-a-unicode-characters-hex-codepoint

      The second problem is that any given web page's html source contains at least two pairings of what I'm calling 'surrounding strings' in the code below. Because regex is greedy, I now assume it's matching the 'htmlchunk' sitting between the final pairing found. I'd like to get the chunk between the first pairing.

      Have a look at the html source for http://dictionary.reference.com/browse/hello and you'll see what I mean. Scan for instances of "prondelim" and you'll see four (ie, two pairings).

      Luckily, dictionary.com's pronunciation delimiters for showing IPA (/ and /) differs from those used for showing spelled pronunciation ([ and ]). So I know my regex can distinguish between these similar-looking surrounding strings and that I'm matching the desired IPA-encoded source.

      However, 'hello' has just one dictionary entry.

      The word 'fly' for instance has four entries on the same webpage. Each entry offers it's own IPA-between-pronunciation-delimiters html. In this simple case, each pronunciation is exactly the same, so I don't really care whether I'm matching on the first, second, third, or fourth bit of IPA.

      But other words may have different pronunciations for different meanings. Again, I'd like to match on and extract merely the first -- not last -- pronunciation. So what do I do? Would a 'lookahead' regex be the proper way to proceed here? Reversing the entire html source and then reversing my regular expression (so as to greedily match the original first expression instance) seems an awful hack.

      Here's my failing code. Don't laugh if it looks newbie-ish -- it is. I really appreciate all of your help. You guys are much better at this stuff than I am. --Cypress

      use LWP::UserAgent; use HTTP::Request; use HTTP::Response; use Spreadsheet::WriteExcel; use FileHandle; use strict; # create useragent, open an excel workbook and sheet my $ua = LWP::UserAgent -> new; my $workbook = Spreadsheet::WriteExcel -> new ( "IPA.xls" ); my $sheet = $workbook -> add_worksheet ( ); $sheet -> set_column ( 0, 0, 100 ); # get html source and parse my $address = "http://dictionary.reference.com/browse/hello"; my $request = HTTP::Request -> new ( GET => $address ); my $response = $ua -> request ( $request ); my $htmlsource; my $writestring; if ( $response -> is_success ) { $htmlsource = $response -> content; $writestring = parse( $htmlsource ); } # write to spreadsheet, close excel $sheet -> write ( 0, 0, $writestring ); $workbook -> close ( ); sub parse { my $source = shift; my $htmlchunk; my $ipa; # select from html source the chunk of html which contains IPA- # encoded symbols # this chunk will still contain html tags that need to be removed # i'll find it between the *first* (but perhaps not last) pairing # of these two surrounding strings: # "prondelim">/</span><span class="pron" # /span><span class="prondelim" if ( $source =~ /"prondelim">\/<\/span><span class="pron"(.*?)\/sp +an><span class="prondelim"/ ) { $htmlchunk = $1; } # get rid of leading html tags, save the IPA or English bits # between '>' and '<', # and continue doing the same over the remaining chunk while ( $htmlchunk =~ /(.*?)>(.*?)<(.*)/ ) { $ipa = $ipa . $2; $htmlchunk = $3; } return $ipa; }

      Holy cow! Just noticed to whom I'm replying! How do I prostrate myself before you over the internet?

        Hmm. My paragraph above beginning 'The second problem...' could have been worded a bit better.

        What I meant is that a simple word like 'hello' has two pairings with the "prondelim" etc string. The first pairing (the one I'm actually matching) sets off the Show IPA html code. The second pairing sets off the Show Spelled Pronunciation html code.

        A word with multiple entries at dictionary.com will have multiple pairings, both for IPA and for spelled pronunciation. I don't want to match the greedy-final pairing -- I want to extract whatever IPA comes up first.

        As for the meaning of 'English bits' in my code comments, try the word 'of' at dictionary.com. I'll be matching on stuff like 'unstressed' and 'especially before consonants,' too.

        Having tried that version of your script myself, the problem that shows up in the excel file appears to be the result of "double encoding" into utf8. In other words, data that is already utf8 encoded gets treated as if it were plain-old single-byte Latin1, and gets encoded into utf8 again.

        There might be better ways to fix this besides the following, but the following will work (at least, it did for me):

        use Encode; # add this near the top, with the other "use" statements ... # write to spreadsheet, close excel $sheet -> write ( 0, 0, decode( 'utf8', $writestring )); # add the "de +code()" call

        You are very close, and well done for showing a detailed example.

        The main (Unicode) problem is that perl doesn't know that the strings that you are extracting from the Html source are UTF-8. You can either explicitly convert them, as graff shows, or better still use decoded_content() instead of content() in your LWP code:

        ... if ( $response -> is_success ) { $htmlsource = $response -> decoded_content(); $writestring = parse( $htmlsource ); } ...
        This will get you most of the way there if you view the output file. However, you will notice that the backquote-like (inflection?) character doesn't display in the default Arial font (the other Unicode characters do).

        The solution in this case is to switch to a full Unicode font in Excel such as 'Arial Unicode MS'

        ... my $arial_unicode = $workbook -> add_format(font => 'Arial Unicode + MS'); $sheet -> write ( 0, 0, $writestring, $arial_unicode ); ...

        --
        John.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://797279]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (9)
As of 2014-08-01 23:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Who would be the most fun to work for?















    Results (51 votes), past polls