Hi Monks,
I need to write a script that converts a text file containing several million records into an XML (MARCXML) file. I have a rough idea how to do this with shell scripting, but given the size of the file to be parsed I thought it might be better to use Perl.
The basic text record is as follows:
*** DOCUMENT BOUNDARY ***
.000. |aam 0c --> This can be ignored
.001. |aa1292700
.003. |aSIRSI
.299. |aSymphonies, no.7/Vaughan Williams
.702. |aThomson, Bryden,|b1928-1991|cConductor
.702. |aBott, Catherine|b1952|cSoprano
.702. |aLondon Symphony Chorus
.702. |aLondon Symphony Orchestra
.315. |aS
.021. |aND 7382902
.301. |a83'31"
.551. |aSt Jude's Kilburn London
.260. |c1989.06.21/22
.509. |a1989 Original recording (P) date
.971. |ade
.976. |aND
.087. |a1CD0027302
.087. |a1CD0043184
.001. |aCKEY1292700 --> This can be ignored
*** DOCUMENT BOUNDARY ***
Each record should then be converted to XML as follows:
<record>
<controlfield tag="001">a1292700</controlfield>
<controlfield tag="003">SIRSI</controlfield>
<datafield tag="299" ind1=" " ind2=" ">
<subfield code="a">Symphonies, no.7/Vaughan Williams</subfield>
</datafield>
<datafield tag="702" ind1=" " ind2=" ">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
<datafield tag="702" ind1=" " ind2=" ">
<subfield code="a">Bott, Catherine</subfield>
<subfield code="b">1952</subfield>
<subfield code="c">Soprano</subfield>
</datafield>
<datafield tag="702" ind1=" " ind2=" ">
<subfield code="a">London Symphony Chorus</subfield>
</datafield>
<datafield tag="702" ind1=" " ind2=" ">
<subfield code="a">London Symphony Orchestra</subfield>
</datafield>
<datafield tag="315" ind1=" " ind2=" ">
<subfield code="a">S</subfield>
</datafield>
<datafield tag="021" ind1=" " ind2=" ">
<subfield code="a">ND 7382902</subfield>
</datafield>
<datafield tag="301" ind1=" " ind2=" ">
<subfield code="a">83'31"</subfield>
</datafield>
<datafield tag="551" ind1=" " ind2=" ">
<subfield code="a">St Jude's Kilburn London</subfield>
</datafield>
<datafield tag="260" ind1=" " ind2=" ">
<subfield code="c">1989.06.21/22</subfield>
</datafield>
<datafield tag="509" ind1=" " ind2=" ">
<subfield code="a">1989 Original recording (P) date</subfield>
</datafield>
<datafield tag="971" ind1=" " ind2=" ">
<subfield code="a">de</subfield>
</datafield>
<datafield tag="976" ind1=" " ind2=" ">
<subfield code="a">ND</subfield>
</datafield>
<datafield tag="087" ind1=" " ind2=" ">
<subfield code="a">1CD0027302</subfield>
</datafield>
<datafield tag="087" ind1=" " ind2=" ">
<subfield code="a">1CD0043184</subfield>
</datafield>
</record>
Note that tags 001 to 009 are controlfields (only 001 and 003 occur in the records), whilst all other tags are datafields.
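To make the tag rule concrete, here is a minimal sketch of the check I have in mind. It is in Python purely for illustration (a Perl equivalent would be just as welcome), and `field_kind` is a name I made up. It also assumes that tag 000 is always skipped, as in the sample record.

```python
def field_kind(tag: str) -> str:
    """Classify a three-digit tag as 'skip', 'controlfield' or 'datafield'."""
    number = int(tag)
    if number == 0:
        return "skip"          # the .000. line can be ignored
    if 1 <= number <= 9:
        return "controlfield"  # tags 001 to 009
    return "datafield"         # every other tag


for tag in ("000", "001", "003", "299", "702"):
    print(tag, field_kind(tag))
```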
Subfields (within datafields) are introduced by a pipe followed by a one-letter code (a, b, c), so that:
.702. |aThomson, Bryden,|b1928-1991|cConductor
becomes:
<datafield tag="702" ind1=" " ind2=" ">
<subfield code="a">Thomson, Bryden</subfield>
<subfield code="b">1928-1991</subfield>
<subfield code="c">Conductor</subfield>
</datafield>
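The subfield split could look roughly like the sketch below (Python for illustration only; `split_subfields` is a hypothetical helper). It assumes that a value never contains a literal pipe, and that a trailing comma such as the one after "Bryden" is trimmed, since the example output drops it.

```python
def split_subfields(data: str) -> list[tuple[str, str]]:
    """Split '|aThomson, Bryden,|b1928-1991' into (code, value) pairs."""
    pairs = []
    # Anything before the first pipe is discarded.
    for chunk in data.split("|")[1:]:
        if not chunk:
            continue  # empty chunk, e.g. from two pipes in a row
        code, value = chunk[0], chunk[1:]
        # Assumption: surrounding spaces and a trailing comma are trimmed.
        pairs.append((code, value.strip().rstrip(",").rstrip()))
    return pairs


print(split_subfields("|aThomson, Bryden,|b1928-1991|cConductor"))
# [('a', 'Thomson, Bryden'), ('b', '1928-1991'), ('c', 'Conductor')]
```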
The records run sequentially, i.e.
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
record
*** DOCUMENT BOUNDARY ***
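Because the file holds several million records, I assume it should be read one line at a time rather than loaded into memory. A rough sketch of that idea follows (Python for illustration; `read_records` is a made-up name, the second record id in the sample is invented, and lines that do not look like `.NNN. data` are silently skipped, which assumes there are no continuation lines):

```python
import io
import re

BOUNDARY = "*** DOCUMENT BOUNDARY ***"
FIELD_RE = re.compile(r"^\.(\d{3})\.\s*(.*)$")  # e.g. ".702. |aBott, Catherine"


def read_records(lines):
    """Yield one record at a time as a list of (tag, data) pairs."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == BOUNDARY:
            if record:        # a boundary closes the record collected so far
                yield record
            record = []
            continue
        match = FIELD_RE.match(line)
        if match:
            record.append((match.group(1), match.group(2)))
    if record:                # last record if the file lacks a final boundary
        yield record


sample = io.StringIO(
    "*** DOCUMENT BOUNDARY ***\n"
    ".001. |aa1292700\n"
    ".003. |aSIRSI\n"
    "*** DOCUMENT BOUNDARY ***\n"
    ".001. |aa1292701\n"
    "*** DOCUMENT BOUNDARY ***\n"
)
for rec in read_records(sample):
    print(rec)
```

With a real file, an open file handle can be passed in place of `sample`, so only one record is held in memory at any time.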
I need a routine that converts the flat file into the XML file using the rules above. Each record may have a varying number of datafields, and each datafield a varying number of subfields.
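For what it is worth, this is roughly how I picture the conversion of a single record, as a sketch under several assumptions rather than a finished routine (Python for illustration; `record_to_xml` is a hypothetical name). It assumes that tag 000 and any 001 after the first (the CKEY line) are dropped, that the leading pipe and code letter are removed from controlfields, that indicators are always blank, and that trailing commas are trimmed from subfield values.

```python
import re
from xml.sax.saxutils import escape

FIELD_RE = re.compile(r"^\.(\d{3})\.\s*(.*)$")  # e.g. ".702. |aBott, Catherine"


def record_to_xml(lines):
    """Convert the field lines of one record into a <record> element (text)."""
    out = ["<record>"]
    seen_001 = False
    for line in lines:
        match = FIELD_RE.match(line)
        if not match:
            continue                      # not a field line
        tag, data = match.groups()
        number = int(tag)
        if number == 0:
            continue                      # .000. can be ignored
        if number <= 9:                   # controlfield: 001 to 009
            if tag == "001":
                if seen_001:
                    continue              # assumption: later 001 (CKEY) is dropped
                seen_001 = True
            # Assumption: the pipe and its code letter are both removed.
            value = data[2:] if data.startswith("|") else data
            out.append(f'<controlfield tag="{tag}">{escape(value)}</controlfield>')
            continue
        out.append(f'<datafield tag="{tag}" ind1=" " ind2=" ">')
        for chunk in data.split("|")[1:]:
            if chunk:
                value = chunk[1:].strip().rstrip(",").rstrip()
                out.append(f'<subfield code="{chunk[0]}">{escape(value)}</subfield>')
        out.append("</datafield>")
    out.append("</record>")
    return "\n".join(out)


print(record_to_xml([
    ".000. |aam 0c",
    ".001. |aa1292700",
    ".003. |aSIRSI",
    ".702. |aThomson, Bryden,|b1928-1991|cConductor",
    ".001. |aCKEY1292700",
]))
```

The `escape` call is there because titles and names may contain `&` or `<`, which would otherwise produce invalid XML.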
Any initial ideas would be greatly appreciated.
Thanks
MikeE