Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: Extracting Chinese characters from Excel

by davies (Vicar)
on Mar 14, 2013 at 10:15 UTC ( #1023445=note: print w/ replies, xml ) Need Help??


in reply to Extracting Chinese characters from Excel

Never having used Unicode in any way, this looked like a good way for me to start teaching myself. In Excel, it’s certainly more complicated than I expected. The first thing I needed was a way of getting the characters into Excel. I was cursing the OP mildly for not doing this, but it turns out to be very difficult. AFAICT, the characters used are UTF16, while Win32:Ole doesn’t export anything beyond UTF8. Recording a VBA macro results in ActiveCell.FormulaR1C1 = "'??" which, needless to say, is utterly unhelpful. The reason I suspect UTF8 to be inadequate is that passing a single code from Perl results in two characters appearing in the cell. Extensive Gargling reveals nothing on getting more advanced Unicode from Perl to Excel.

I should perhaps point out here that my setup may be partly to blame. Never having had the slightest need to use any of these characters – and not even knowing their language – I have not installed lots of strange fonts in the hope that one of them might fit. But I don’t think that’s the case. I can copy & paste the characters without problems and I CAN enter them using the Insert | Symbol dialogue in Excel.

There is a way to get these characters in from VBA, using the ChrW function, but this isn’t available from the face of the spreadsheet. Wanting a SSCCE (http://sscce.org/), I went down the route shown in http://dailydoseofexcel.com/archives/2012/06/23/unicode-and-diacritic-characters/, getting the VBA in via Perl using the technique shown in RFC Tutorial: Adding and extracting VBA to and from Excel files.

At this stage, I had the data in, but getting it out again was the problem. The OP’s “hex_dump_cell” routine includes an error infelicity, in that “Value” should not be is not usually quoted, but the real issue is the one I mentioned in the first paragraph, namely that Win32:Ole doesn’t export anything beyond UTF8. This is the cause of the “?” characters that appeared in the recorded macro and in the OP’s output. The “?” character is 3F in hex, explaining the result of the unpack. So again, we need some VBA to get the characters out using the AscW function, since this isn’t available from the face of the spreadsheet.

It isn’t possible to combine the call to the extraction routine in the same cell as the characters to be decoded. Something within Excel converts these to “?” characters when I tried. Therefore, I have had to put formulae in the second row. Since I am not saving the spreadsheet, this should be no problem. The only thing needing care is that the formulae don’t overwrite a cell that will be decoded later. If this happens, things can get recursive very quickly.

use strict; use warnings; use Win32::OLE; my $xl = Win32::OLE->new('Excel.Application'); $xl->{Visible} = 1; my $wb = $xl->Workbooks->Add; if ($wb->Sheets->{Count} > 1) { for (2..$wb->Sheets->{Count}) { $wb->Sheets(2)->Delete; } } $xl->VBE->ActiveVBProject->VBComponents->Add(1); my $cm = $xl->VBE->ActiveVBProject->VBComponents(3)->CodeModule; my $line = int($cm->CountOfLines); $cm->InsertLines(++$line, "Function CHARW(code As Variant) As String") +; $cm->InsertLines(++$line, "'Use a Leading \"U\" or \"u\" to indicate U +nicode values"); $cm->InsertLines(++$line, " code = VBA.Replace(code, \"+\", \"\", 1, + 1, vbTextCompare)"); $cm->InsertLines(++$line, " If UCase(Left\$(code, 1)) = \"U\" Then c +ode = VBA.Replace(code, \"U\", \"&H\", 1, 1, vbTextCompare)"); $cm->InsertLines(++$line, " CHARW = ChrW(code)"); $cm->InsertLines(++$line, "End Function"); $cm->InsertLines(++$line, ""); $cm->InsertLines(++$line, "Function AscDec(char As Variant) As String" +); $cm->InsertLines(++$line, " Dim i As Long"); $cm->InsertLines(++$line, " For i = 1 To Len(char)"); $cm->InsertLines(++$line, " AscDec = AscDec & AscW(Mid(char, i, + 1))"); $cm->InsertLines(++$line, " If i < Len(char) Then AscDec = AscD +ec & \":\""); $cm->InsertLines(++$line, " Next i"); $cm->InsertLines(++$line, "End Function"); my $sht = $wb->Sheets(1); $sht->Cells(1, 1)->{Formula} = '=charw(1576)&charw(1744)&charw(1610)&c +harw(1580)&charw(1609)&charw(1709)'; $sht->Cells(1, 2)->{Formula} = '=charw(21271)&charw(20140)'; $sht->Cells(1, 3)->{Value} = 'Beijing'; $sht->Range("A1:C1")->Copy; $sht->Range("A1:C1")->PasteSpecial(-4163); #xlPasteValues $sht->Cells(2, 1)->{Formula} = "=ascdec(A1)"; $sht->Range("A2")->Copy ($sht->Range("B2:C2")); print $sht->Cells(2, 1)->{Value} . "\n"; print $sht->Cells(2, 2)->{Value} . "\n"; print $sht->Cells(2, 3)->{Value} . "\n"; $xl->{DisplayAlerts} = 0; $xl->Quit;

I think this mess is best summarised by Chesterton: http://www.gkc.org.uk/gkc/books/rolling.html.

Regards,

John Davies


Comment on Re: Extracting Chinese characters from Excel
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1023445]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (13)
As of 2014-09-16 15:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (33 votes), past polls