http://www.perlmonks.org?node_id=497331

graq has asked for the wisdom of the Perl Monks concerning the following question:

I am having problems parsing an 'XML' file.

Microsoft is moving it's Office software over to using XML, so they are now producing "Microsoft Excel XML spreadsheets" - and this is what I have to work on.

I have tried XML::Parser but it bombs out horribly. I have double checked the file, getting a colleague to download it onto the local server, but it still appears full of 'padding' at the end of the file. The error is:  not well-formed (invalid token) at line 1, column 181703, byte 181705 at PATH/XML/Parser.pm line 187

I toyed with merlyn's idea of using HTML::Parser, but the XML structure is too flat - the state just always being "Workbook Worksheet Table Row Cell Data", so I unable to identify the data.

Does anyone know of any other tools, or have any ideas on how to deal with this? Getting rid of the 'padding' seems tricky to me as it's a load of gibberish - but so is Microsoft's 'XML'.

(Apologies for not being able to supply the data - the only file I have is company sensitive)

Sample code piece 1:

use Data::Dumper; use XML::Parser; my $p = new XML::Parser( Style => 'Object', ErrorContext => 2, ); $p->parsefile('Spreadsheet.xls');

Sample code piece 2 (lifted from Merlyn's site):

use IO::File; use Data::Dumper; use HTML::Parser; my $count = 0; my $fh = new IO::File("/home/graq/spreadsheet.xls"); my @state; my $p = HTML::Parser->new ( xml_mode => 1, start_h => [sub { my ($tagname, $attr) = @_; push @state, $tagname; ## We are beginning state "@state" print '::'."@state".'::'."\n"; }, "tagname, attr"], text_h => [sub { my ($text) = @_; ## We see content within state "@state" }, "dtext"], end_h => [sub { my ($tagname) = @_; ## We are ending state "@state" pop @state; }, "tagname"], ); $p->parse_file($fh); $p->eof;

Replies are listed 'Best First'.
Re: Problems with Microsft's new Office 'XML'
by terra incognita (Pilgrim) on Oct 04, 2005 at 17:34 UTC
    You are referencing the "xls" file which is still binary. Change your filename to "spreadsheet.xml" and save the file as xml.

    Using your first script with an "xls" file causes the error you see, however when I use it on the same file saved as "xml" it does not report an error.

    Update

    Here is some code to dump the xml file.

    use strict; use warnings; use XML::Simple qw(:strict); use Data::Dumper; my $config = XMLin('./Book1.xml',forcearray => 1,keyattr => ['']); print Dumper($config);
      You are referencing the "xls" file which is still binary. Change your filename to "spreadsheet.xml" and save the file as xml.

      OK. This is meant to be an automated process, removing the human interaction - how do I do this with Perl?

        As always, you do this by using Win32::OLE to automate Excel.

        The steps are the usual six steps:

        1. Start the Excel Macro Recorder
        2. Perform the desired action(s) manually, in your case, save the file as XML
        3. Stop the Excel Macro Recorder
        4. Inspect the generated VB code
        5. Check that the generated VB code does what you want
        6. Translate the generated VB code to Perl / Win32::OLE invocations

        There are many examples on this website which should give you a good start on how to do the translation. It consists mostly of s!\.!->! applied to every line of the code, and you will have to exchange the global properties like "ActiveWorkSheet" by the local values you get from $excel->OpenFile().

Re: Problems with Microsft's new Office 'XML'
by graq (Curate) on Oct 10, 2005 at 08:37 UTC
    Just a quick followup to 'close' the question.

    The main problem with the new Excel XML is that it is still a binary. I haven't managed to read the XML binary with Spreadsheet::ParseExcel, but have worked around this by getting the client to supply a 'normal' Excel binary and then reading that.

    If anyone does manage to combine xml parsing with Excel binary, please do msg me :)

      I have no experience with M$ tools for transforming from xls to XML, but, if your requirement is not too general, you can write VBA macro which transforms and prints xls into _your_ form of xml file. And parsing will be easy on the next step... I hate the wheel reinvention, but I suppose that M$ XML could be similar to <some>binary_code</some>...
        The file (fetching and) processing takes place on a Solaris box; I'm not aware of any way of running VB script without using a Windows OS.

        As a total aside, I am beginning to dislike this problem, as it looks like I'm going to have to use SOAP to fetch the file now (and in some awful flat WSDL format) which is causing completely different problems - but I am betting that other clients will wish to supply (non binary-xml) Excel, so the current parsing solutuion is of use. Look out for my next question in SoPW!