Mega XSLT Batch job - best approach?

by ajt (Prior)
on Jan 22, 2002 at 16:23 UTC

ajt has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a mega XSLT batch processing job to perform. Somewhere in the bowels of the building sits an AIX box with Informix/SAP on it. Stored in there are ~1600 product descriptions. The SAP Business Connector web interface can spit the product descriptions out as a single nested XML file.

On a Linux or NT box, a simple Perl script parses this XML file with XML::Parser in stream mode and, using the nesting and some simple rules, generates a nested set of directories, scattering various descriptive text and XML files over the newly created directory tree.
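
Something like this sketch is what I mean by the stream-mode pass (the element names, attribute names, and the out/ root here are invented for illustration; the real file's structure differs):

    use strict;
    use File::Path qw( mkpath );
    use XML::Parser;

    my @path;    # stack of open elements, mirroring the XML nesting

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ( $expat, $elem, %attrs ) = @_;
                push @path, $attrs{id} || $elem;
                # hypothetical rule: each <product> becomes a directory
                mkpath( join '/', 'out', @path ) if $elem eq 'product';
            },
            End  => sub { pop @path },
            Char => sub {
                my ( $expat, $text ) = @_;
                # append $text to a descriptive/control file in the
                # directory named by @path...
            },
        },
    );
    $parser->parsefile('products.xml');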

I now traverse the directory tree, find all the text files, and create appropriate HTML pages from them. That bit is easy; the problem I face is running XSLT on ~1600 XML files to get HTML.
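
The traversal itself is simple enough with core File::Find; something like this (the out/ root is again a placeholder):

    use strict;
    use File::Find;

    my ( @text_files, @xml_files );
    find(
        sub {
            push @text_files, $File::Find::name if /\.txt$/;
            push @xml_files,  $File::Find::name if /\.xml$/;
        },
        'out',    # root of the tree built in pass 1
    );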

I did a simple benchmark on NT of Instant Saxon 6.5, Xalan 1.3 (C++) and XML::LibXSLT, and found that the Java startup makes Saxon massively slower to run than Xalan or the LibXSLT solution, assuming that the XSLT job is small and simple. Xalan is twice as fast as LibXSLT when both are called via a system call, but when run in-line, LibXSLT is much faster.

Given that I have ~1600 XML files to transform to HTML, I can do this one of two ways: build a list and pass the files one at a time to Xalan (it was faster than either Saxon or LibXSLT when called externally), or use LibXSLT from within the script that finds them (which should be the fastest method given simple transformations).
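
For reference, the external route would look something like the sketch below (the Xalan command-line flags vary between builds, and product.xsl is a placeholder name):

    for my $xml (@xml_files) {
        ( my $html = $xml ) =~ s/\.xml$/.html/;
        system( 'Xalan', '-o', $html, $xml, 'product.xsl' ) == 0
            or warn "Xalan failed on $xml: $?\n";
    }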

I'm not worried about raw speed, since it will run in batch mode, but I would like it to finish in hours rather than days.

I would also not like a pure Perl solution to fail from a leak of some sort; ~1600 XSLT calls in one process is a lot, and I'd rather not have to do it in several runs.

In summary I plan to:

  • Parse one big XML file into a series of directories and smaller XML and control text files - PASS 1
  • Convert the text files into HTML pages (the indexes of each folder) - PASS 2
  • Run XSLT on the ~1600 XML files, to generate HTML - PASS 3

I'll do this several times per language and, assuming all works, probably once per week as the product database underneath changes.

I know this is very brute-force. Are there better approaches to the problem, given that I'm not allowed to use DBI to get the data directly from the underlying database?

Hints, tips and suggestions warmly accepted.

As ever, my humble thanks in advance...

Replies are listed 'Best First'.
Re (tilly) 1: Mega XSLT Batch job - best approach?
by tilly (Archbishop) on Jan 22, 2002 at 18:23 UTC
    Given what you describe, I would plan on winding up, one way or another, with LibXSLT inlined in Perl.

    But before coding, my next question is how many parallel processes you can profitably run. This is a question of whether you are I/O-bound or CPU-bound. If CPU-bound, it is generally not worthwhile to run more processes than you have CPUs. If I/O-bound, it depends on your hardware and on what fraction of the time is spent on CPU. The last time I tested an I/O-bound job, 7 processes worked best for me. YMMV.

    Next try a run of 50-100 pages with LibXSLT, and see if you have a serious memory leak. If memory usage stays flat, then I wouldn't worry about it. If it is clearly leaking but doesn't wind up at a worrying level, note that. If it leaks unacceptably, figure out how many you can do in one "batch".
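
    A crude way to watch for a leak is to print resident memory as you go. This sketch is Linux-specific (it reads /proc/self/status), and transform() stands in for whatever inline LibXSLT call you end up with:

        sub rss_kb {    # resident memory of this process, in kB
            open my $fh, '<', '/proc/self/status' or return 0;
            while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)/ }
            return 0;
        }

        for my $i ( 0 .. 99 ) {
            transform( $xml_files[$i] );    # hypothetical worker
            printf "%3d files: %d kB\n", $i + 1, rss_kb() if $i % 10 == 9;
        }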

    Now you have cases:

    1. Only worthwhile to run one process, and there is no leak. Then just inline it.
    2. Only worthwhile to run one process, but there is a leak you care about. Then write a script that can take an input file listing 50-100 file names and do those files. In your main script launch batches in system calls. (The idea of the batch is to amortize startup costs.)
    3. Worthwhile to run many processes. Figure out what to run, then run them in parallel, either using Parallel::ForkManager (may well be Linux specific - the NT emulation of fork is not great) or using IPC::Open3 directly as I did at Run commands in parallel.
    Two gotchas to think about. One is how you will handle errors. The other is that in any sort of "gather together, send batches off" logic it is very easy to say, "OK, when I have a full batch, send it off", but forget that when you run out of new jobs, you need to run the remainder as a final, partial batch.
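
    A minimal sketch of the batched, parallel case with Parallel::ForkManager (transform_batch() is a hypothetical worker sub; note how the splice naturally handles the final partial batch):

        use strict;
        use Parallel::ForkManager;

        my $pm = Parallel::ForkManager->new(7);    # tune to your CPU/IO mix

        while ( my @batch = splice @xml_files, 0, 50 ) {
            $pm->start and next;         # parent: go gather the next batch
            transform_batch(@batch);     # child: do up to 50 files, then exit
            $pm->finish;
        }
        $pm->wait_all_children;          # don't forget the stragglers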

    Good luck, and tell us how it went.

Re: Mega XSLT Batch job - best approach?
by Matts (Deacon) on Jan 22, 2002 at 21:29 UTC
    Well, I hope I'm qualified to answer ;-)

    1600 XSLT calls in one process is nothing. I know people using XML::LibXSLT in AxKit with MaxRequestsPerChild much higher than that, and no leaks. And it should take almost no time at all. Assuming your XML files are no larger than about 4KB, I would guess about 0.2 seconds per file (assuming you cache stylesheets), so about 5 minutes for the whole lot. I'm really not sure what you're concerned about.

    Though maybe I've got your XSLT complexity way out of whack. I've seen some XSLT transforms take over a second once or twice (though it's rare). So you may be looking at about 25 minutes.

    The key though for speed is to cache stylesheets.
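
    Caching simply means parsing the stylesheet once, outside the loop; something like this sketch (file names assumed):

        use strict;
        use XML::LibXML;
        use XML::LibXSLT;

        my $parser = XML::LibXML->new;
        my $xslt   = XML::LibXSLT->new;

        # parse the stylesheet once -- this is the cache
        my $style = $xslt->parse_stylesheet(
            $parser->parse_file('product.xsl') );

        for my $xml (@xml_files) {
            my $result = $style->transform( $parser->parse_file($xml) );
            ( my $html = $xml ) =~ s/\.xml$/.html/;
            $style->output_file( $result, $html );
        }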

    One alternative, which I'm hoping Barrie Slaymaker will come on here and describe, is to use a SAX ByRecord machine to process the XML files directly using SAX: no 3-step approach like you've described, just a single simple one-liner.

    Matt.

Re: Mega XSLT Batch job - best approach?
by stefan k (Curate) on Jan 22, 2002 at 18:18 UTC
    Java startup on Saxon makes it massively slower to run

    Hmm, have you tried processing the large nested XML file in one go? I found the Java startup very slow too, but after that Saxon runs fine. You can always create a new output file from Saxon by using something like

    <xsl:template match="page">
      <saxon:assign name="ofile">
        <xsl:value-of select="@outputfile" />
      </saxon:assign>
      <saxon:output file="{$ofile}" method="html" indent="yes"
                    doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"
                    doctype-system="http://www.w3.org/TR/html401"
                    encoding="iso-8859-1">
        <!-- ... process contents here -->
      </saxon:output>
    </xsl:template>
    and so on. This assumes you have declared the saxon namespace and the ofile variable. You should also check the correct syntax if you're using a newer Saxon version; this snippet is one I use with Saxon 5.5.1, from a time when official W3C XSLT did not support multiple output files.

    I build my whole site from one large XML file (about 200k) and it processes in some 10 seconds on an Athlon 1.4 with 512MB running Linux, including the Java startup.

    Yes, I know this is not a Perlish solution. Maybe you should look at http://www.javajunkies.org/?

    Regards... Stefan

use XML::SAX::Machines qw( ByRecord ) Was: Mega XSLT Batch job - best approach?
by Anonymous Monk on Jan 22, 2002 at 23:09 UTC

    As Matt mentioned, XML::SAX::ByRecord from the XML::SAX::Machines distribution might be useful here. Make sure you get at least XML-SAX-Machines-0.31; I fixed a bug in X::S::ByRecord to get this example working <:-/>.

    ByRecord is designed for handling record-oriented XML files one record at a time. It splits the document apart into individual documents, one per record, and runs them through a pipeline of SAX processors, merging the resulting subdocuments back into the body of the output document. Everything that's not a record is passed through verbatim. This should make things a bit easier on the old memory banks, reduce time to first output, and make it possible to use simpler stylesheets.

    Here's a recipe that might get you started. It copies only the <state> records through to the output (the StateML file I fed it has several different record types). Feel free to email me and/or the perl-xml list if you have questions.

    use XML::SAX::Machines 0.31;
    use XML::SAX::Machines qw( Pipeline ByRecord Tap );
    use XML::Filter::XSLT;

    my $f = XML::Filter::XSLT->new( Source => { ByteStream => \*DATA } );

    Pipeline( ByRecord( $f ), \*STDOUT )->parse_uri( $ARGV[0] );

    ## "in-place upgrades" until some new releases hit CPAN ;)
    use IO::Handle;    ## XML::LibXML needs this to read from DATA

    ## and this makes XML::Filter::XSLT machine compliant
    sub XML::Filter::XSLT::LibXSLT::set_handler {
        my $self = shift;
        $self->{Handler} = shift;
        $self->{Parser}->set_handler( $self->{Handler} )
            if $self->{Parser};
    }

    __END__
    <xslt:transform
        version="1.0"
        xmlns:xslt="http://www.w3.org/1999/XSL/Transform"
    >
        <xslt:template match="state">
            <xslt:copy-of select="."/>
        </xslt:template>
    </xslt:transform>
