Converting text to XML; Millions of records.

MikeEndo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to generate a script that would convert a text file containing several million records into an XML (MARCXML) file. I have a rough idea how to do this through shell scripting but given the size of the file required to parse I thought it might be best to run using Perl?

The basic text record is as follows:

*** DOCUMENT BOUNDARY ***
.000. |aam  0c --> This can be ignored
.001. |aa1292700
.003. |aSIRSI
.299.   |aSymphonies, no.7/Vaughan Williams
.702.   |aThomson, Bryden,|b1928-1991|cConductor
.702.   |aBott, Catherine|b1952|cSoprano
.702.   |aLondon Symphony Chorus
.702.   |aLondon Symphony Orchestra
.315.   |aS
.021.   |aND 7382902
.301.   |a83'31"
.551.   |aSt Jude's Kilburn London
.260.   |c1989.06.21/22
.509.   |a1989 Original recording (P) date
.971.   |ade
.976.   |aND
.087.   |a1CD0027302
.087.   |a1CD0043184
.001. |aCKEY1292700 --> This can be ignored
*** DOCUMENT BOUNDARY ***

This is then converted to XML as follows:

<record>
<controlfield tag="001">aa1292700</controlfield>
<controlfield tag="003">aSIRSI</controlfield>
<datafield tag="299" ind1=" " ind2=" ">
<subfield code="a">Symphonies, no.7/Vaughan Williams</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">Bott, Catherine</subfield>
<subfield code="b">1952</subfield>
<subfield code="c">Soprano</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">London Symphony Chorus</subfield>
</datafield>
<datafield tag="702" ind1="" ind2="">
<subfield code="a">London Symphony Orchestra</subfield>
</datafield>
<datafield tag="315" ind1="" ind2="">
<subfield code="a">S</subfield>
</datafield>
<datafield tag="021" ind1="" ind2="">
<subfield code="a">ND 7382902</subfield>
</datafield>
<datafield tag="301" ind1="" ind2="">
<subfield code="a">83'31"</subfield>
</datafield>
<datafield tag="551" ind1="" ind2="">
<subfield code="a">St Jude's Kilburn London</subfield>
</datafield>
<datafield tag="260" ind1="" ind2="">
<subfield code="c">1989.06.21/22</subfield>
</datafield>
<datafield tag="509" ind1="" ind2="">
<subfield code="a">1989 Original recording (P) date</subfield>
</datafield>
<datafield tag="971" ind1="" ind2="">
<subfield code="a">de</subfield>
</datafield>
<datafield tag="976" ind1="" ind2="">
<subfield code="a">ND</subfield>
</datafield>
<datafield tag="087" ind1="" ind2="">
<subfield code="a">1CD0027302</subfield>
</datafield>
<datafield tag="087" ind1="" ind2="">
<subfield code="a">1CD0043184</subfield>
</datafield>
</record>

Note that numbers 001 to 009 are controlfields (only 001 and 003 in the records), whilst all other numbers are datafields.
Subfield codes (within datafields) are introduced by a pipe followed by the code letter (a, b, c). For example:
.702. |aThomson, Bryden,|b1928-1991|cConductor
becomes:

<datafield tag="702" ind1="" ind2="">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
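The rules above can be sketched for a single input line as follows. This is a minimal illustration in Python (the same logic carries over to Perl), not a finished converter. The name `field_to_xml` and the regular expression are assumptions made for the example. It also assumes that control fields keep everything after the pipe, as the sample output does, and that a trailing comma on a subfield value is dropped. The trailing `.001. |aCKEY...` line is not handled here and would need its own rule.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches lines such as ".702.   |aThomson, Bryden,|b1928-1991|cConductor".
# The pattern is an assumption based on the sample record in the question.
FIELD_RE = re.compile(r"^\.(\d{3})\.\s*\|(.*)$")


def field_to_xml(line):
    """Return a MARCXML fragment for one field line, or None if it is skipped."""
    match = FIELD_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None  # boundary lines, blank lines, anything unexpected
    tag, body = match.groups()
    if tag == "000":
        return None  # the question says the .000. line can be ignored
    if int(tag) < 10:
        # Control field (001 to 009): keep everything after the pipe,
        # which is what the sample output shows.
        return '<controlfield tag="%s">%s</controlfield>' % (tag, escape(body))
    parts = ['<datafield tag="%s" ind1="" ind2="">' % tag]
    for chunk in body.split("|"):
        if not chunk:
            continue
        # First character is the subfield code, the rest is the value.
        code, value = chunk[0], chunk[1:].strip().rstrip(",")
        parts.append("<subfield code=%s>%s</subfield>" % (quoteattr(code), escape(value)))
    parts.append("</datafield>")
    return "\n".join(parts)
```

Calling `field_to_xml` on the `.702.` line for Thomson, Bryden yields the three-subfield datafield shown above. `escape` takes care of `&`, `<` and `>` in the values, which matters once real catalogue data is involved.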

The records run sequentially, i.e.

record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
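Because the records simply follow one another, the file can be read one line at a time and split at the boundary lines, so that only one record is in memory at any moment. A minimal sketch in Python, assuming the boundary text is exactly as shown; the name `iter_records` is made up for the example:

```python
BOUNDARY = "*** DOCUMENT BOUNDARY ***"


def iter_records(handle):
    """Yield each record as a list of lines, reading the input line by line."""
    record = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.strip() == BOUNDARY:
            if record:  # skip empty stretches between consecutive boundaries
                yield record
            record = []
        elif line.strip():
            record.append(line)
    if record:  # input that does not end with a boundary line
        yield record
```

Any iterable of lines works as `handle`, including an open file object, so a file of several million records is never loaded as a whole.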

I need a routine that would convert the flat file into the XML file using the rules above. Each record may have a varying number of datafields, and a varying number of subfields per datafield.

Any initial ideas would be greatly appreciated.
Thanks
MikeE


Replies are listed 'Best First'.
Re: Converting text to XML; Millions of records. by dHarry (Abbot) on Jul 07, 2009 at 08:14 UTC
"I have a rough idea how to do this through shell scripting but given the size of the file required to parse I thought it might be best to run using Perl?"

Why? Not that I would like to discourage using Perl, of course ;) If you have a rough idea how to do it with a shell script, surely you can do it in Perl. Give it a try and ask for help if you get stuck. You might want to take a look at XML::XMLWriter.
Re^2: Converting text to XML; Millions of records. by mzedeler (Pilgrim) on Jul 07, 2009 at 18:47 UTC
First off, I agree with roboticus. The proposed XML schema is obscure and doesn't add any value. If you have any say, please try changing it. Also, I'd suggest trying the marc2xml-tools, but in case they aren't suitable, read on...

As far as I can see, `XML::Writer` isn't streaming its output, which means that you'll be buffering a data structure representing the entire XML document in memory. This doesn't really sound like a useful approach, given the expected output size, unless the output is generated in chunks (the records described in the question). For this purpose, it seems that `XML::Writer` wants to insert processing instructions, which makes chunk generation unfeasible without nasty hacks.

For generating chunks, I'd suggest `XML::Generator` - a wonderfully simple and flexible module. If the chunk approach is undesirable, I'd look for a module that can serialize SAX events.
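The idea of serializing SAX events can be illustrated with Python's standard `xml.sax.saxutils.XMLGenerator`, which writes each event to the output handle as soon as it is received. This is only a sketch of the approach, not a Perl recommendation. The function name `write_records` and the shape of a record (a list of `(tag, payload)` pairs, where a string payload means a control field and a list of `(code, value)` pairs means a data field) are assumptions made for the example. The `collection` root element is also an assumption: the question only shows `<record>`, but a file holding many records needs a single root to be well-formed.

```python
from xml.sax.saxutils import XMLGenerator


def write_records(records, out):
    """Write records to `out` one at a time, emitting XML as SAX-style events."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("collection", {})  # assumed root element
    for fields in records:
        gen.characters("\n")
        gen.startElement("record", {})
        for tag, payload in fields:
            if isinstance(payload, str):
                # Control field: a single text value.
                gen.startElement("controlfield", {"tag": tag})
                gen.characters(payload)
                gen.endElement("controlfield")
            else:
                # Data field: a list of (code, value) subfield pairs.
                gen.startElement("datafield", {"tag": tag, "ind1": "", "ind2": ""})
                for code, value in payload:
                    gen.startElement("subfield", {"code": code})
                    gen.characters(value)  # special characters are escaped here
                    gen.endElement("subfield")
                gen.endElement("datafield")
        gen.endElement("record")
    gen.characters("\n")
    gen.endElement("collection")
    gen.endDocument()
```

If `records` is a generator that parses the input lazily, memory use stays at roughly one record regardless of how many records the file contains.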
Re: Converting text to XML; Millions of records. by roboticus (Chancellor) on Jul 07, 2009 at 16:01 UTC
MikeEndo:

(Note: I realize that you may have no control over the project requirements. I also realize that the example you gave may be a toy example. But I feel I must comment anyway...)

As I see it, XML is bloated and ugly. However, it's useful because it allows you to make your data descriptive and easier to parse and use in new ways. So I suggest that you change your schema, if possible. I don't really see how

<datafield tag="702" ind1="" ind2="">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>

is any more descriptive than the original file. I feel you would be better served by giving descriptive tags to your data. Perhaps something like:

<conductor>
<Name>
<Last>Thomson</Last>
<First>Bryden</First>
</Name>
<Born>1928</Born>
<Died>1991</Died>
</conductor>

In my job, I frequently have to reverse engineer file formats, and I would much rather reverse engineer the first file format than the XML version, unless the tags were meaningful. Without meaningful field names, the XML just adds visual clutter that makes patterns in the data harder to detect.

Just my $0.02.

...roboticus
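To make the suggestion concrete, here is a rough Python sketch of how the subfields of one 702 field could be mapped to descriptive elements. Everything in it is hypothetical: the function name `describe_contributor`, the fallback element name `contributor`, and the guesses that subfield `a` is "Last, First", `b` is "born-died" and `c` is the role. None of these are established by the original question.

```python
import xml.etree.ElementTree as ET


def describe_contributor(subfields):
    """Build a descriptive element from a dict of 702 subfields (hypothetical mapping)."""
    # Assumption: the role in subfield c is a single word usable as an element name.
    role = subfields.get("c", "contributor").lower()
    elem = ET.Element(role)
    name = ET.SubElement(elem, "Name")
    last, comma, first = subfields.get("a", "").partition(",")
    if comma:
        ET.SubElement(name, "Last").text = last.strip()
        ET.SubElement(name, "First").text = first.strip()
    else:
        name.text = last.strip()  # e.g. an ensemble name with no comma
    born, _, died = subfields.get("b", "").partition("-")
    if born:
        ET.SubElement(elem, "Born").text = born
    if died:
        ET.SubElement(elem, "Died").text = died
    return ET.tostring(elem, encoding="unicode")
```

Entries such as "London Symphony Chorus", which have neither a comma nor dates, show why a mapping like this needs per-field knowledge that the generic datafield/subfield layout does not require.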
Re^2: Converting text to XML; Millions of records. by superfrink (Curate) on Jul 07, 2009 at 18:35 UTC
++ I once had to wrap data in XML and send it to another company for processing. The XML was almost entirely composed of tags that contained strings of bytes where position 1 meant something and positions 2 through 8 meant something, etc. I used format a lot on that project.
Re: Converting text to XML; Millions of records. by Anonymous Monk on Jul 07, 2009 at 08:14 UTC
"Any initial ideas would be greatly appreciated."

Here is a search clue: perl MARCXML. For MARCXML, see also: marc2xml, xml2marc.
Re: Converting text to XML; Millions of records. by Anonymous Monk on Jul 08, 2009 at 08:58 UTC
Hi - your query came up in the context of a search on London Symphony Orchestra - I'm their technologist and we're working on a very similar data set; curious about the project - maybe we can compare notes? Jeremy dot Garside at lso.co.uk

