PerlMonks  

XML File Slicing Golf(ish)

by gregor42 (Parson)
on Jun 20, 2001 at 00:54 UTC ( [id://89814]=perlmeditation )

I wrote some code today to solve a co-worker's problem. Apparently a client was concatenating a series of XML files into one big file to send via email (or some other numbskull reason) & then needed them split up afterwards for processing.

Here's what I came up with. Since TIMTOWTDI, what would be the Better/Stronger/Faster/SHORTEST possible way to accomplish the same thing?

use strict;

my @input_file = <>;
my @temp;
my $filecount = 0;
my $filename_prefix = "outfile_";
my $file_is_open = 0;
my $out_name;

foreach my $in (@input_file) {
    if ($in =~ m/^<\?xml version/) {
        if ($file_is_open) {
            close(OF) or die "Couldn't close $out_name!!\n";
            $file_is_open = 0;
        }
        ++$filecount;
        $out_name = $filename_prefix.$filecount.".xml";
        open(OF,"> $out_name") or die "Couldnt open $out_name for writing!!\n";
        print(STDOUT "Writing to $out_name.\n");
        $file_is_open = 1;
    }
    if ($file_is_open) {
        print(OF $in);
    }
}
close(OF);

My first thought was to use a command-line switch to put perl into looping mode, but I haven't gone there yet... Any other brilliant ideas? I find I learn more quickly when other people tackle the same problem as I do & we can share insights.



Wait! This isn't a Parachute, this is a Backpack!

Replies are listed 'Best First'.
Re: XML File Slicing Golf(ish)
by mirod (Canon) on Jun 20, 2001 at 01:58 UTC

    Of course you realize that this can be broken by perfectly valid XML:

    <?xml version="1.0"?>
    <doc>
      <!-- <?xml version="1.0"?>-->       <!-- an XML comment -->
      <![CDATA[<?xml version="1.0"?> ]]>  <!-- a CDATA section -->
    </doc>

    I can't really see any "good" XML way to do this though. Why on Earth doesn't your customer use tar or zip or whatever instead of going a definitely non-XML (and risky) way?

      Note that this particular example will work with the program, because the regex is anchored at the start of the line. However, in general it is possible that a "fake" XML declaration will appear at the beginning of a line, so it is a good point. For example:
      <![CDATA[
      <?xml version="1.0"?>
      ]]>
      Also, while playing with the little program I wrote, I used the cat command to concatenate a bunch of XML files I have, and discovered that some of them did not have an end-of-line at the end of the last line, so the result would be like this:
      </ESP-Component><?xml version="1.0"?>
      Where the start declaration of one file would end up on the same line as the closing declaration of the previous file. In this case, this simple mechanism for recognizing when a file starts would not work, obviously.
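
One way to cope with a declaration that lands mid-line is to split the whole text on the declaration itself rather than testing line by line. This is a minimal sketch, assuming the input fits in memory; split_on_declaration is a hypothetical helper, and it does nothing about the "fake" declarations inside comments or CDATA sections discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a concatenated string into documents wherever
# an XML declaration starts, whether or not it begins a line.
sub split_on_declaration {
    my ($text) = @_;
    # The zero-width lookahead keeps each declaration with the document it opens.
    # grep drops any whitespace-only chunk that precedes the first declaration.
    return grep { /\S/ } split /(?=<\?xml\s+version)/, $text;
}

# The second declaration shares a line with the end of the first document.
my $joined = qq{<?xml version="1.0"?>\n<a/><?xml version="1.0"?>\n<b/>\n};
my @docs = split_on_declaration($joined);
print scalar(@docs), " documents\n";
```

Each element of @docs could then be written to its own file.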

      Of course, I agree that some other packing method would be much better. But people do weird things... :-)

      --ZZamboni

(tye)Re: XML File Slicing Golf(ish)
by tye (Sage) on Jun 20, 2001 at 18:13 UTC

    The poor man's "tar":

    binmode(STDOUT);
    for my $file ( @ARGV ) {
        open(FILE,"<$file")
            or die "Can't read $file: $!\n";
        binmode(FILE);
        print "$file: ",-s $file,$/;
        while(<FILE>) {
            print;
        }
    }
    followed by
    binmode(STDIN);
    my( $file, $rec );
    while( defined( $file= <STDIN> ) ) {
        chomp $file;
        $file =~ s/: (\d+)$//
            or die "Invalid header: $file\n";
        my $size= $1;
        open(FILE,">$file")
            or die "Can't create $file: $!\n";
        binmode(FILE);
        warn "Extracting $size bytes to $file...\n";
        my $len= $size < 4096 ? $size : 4096;
        my $read;
        # read(), not sysread(): the header line came through buffered
        # <STDIN>, and sysread() would bypass that buffer.
        while( $read= read(STDIN,$rec,$len) ) {
            $size -= $read;
            $len= $size if $size < $len;
            print FILE $rec;
        }
        close(FILE) or die "Can't close $file: $!\n";
    }

            - tye (but my friends call me "Tye")
Re: XML File Slicing Golf(ish)
by ZZamboni (Curate) on Jun 20, 2001 at 03:39 UTC
    This is what I came up with. It does essentially the same as your program, but it's a little bit shorter. Notice that I dropped the close altogether, because open automatically closes the file handle if it was previously opened (see perlopentut -- I wouldn't consider this a good programming practice, but I'm assuming this is a one-shot program here), so you don't have to keep track of whether a file is open. Also, you don't need to read the whole file into memory at once.
    use strict;
    my $n="0000";
    while (<>) {
        if (/^<\?xml version/) {
            open OF, ">outfile_$n.xml" or die "open: $!\n";
            $n++;
        }
        print OF $_;
    }
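
The same streaming approach can be sketched with a lexical filehandle and three-argument open. This is only an illustration under stated assumptions: slice_xml is a hypothetical helper, the four-digit numbering and scratch directory are choices made for the example, and the in-memory input handle needs Perl 5.8 or later.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: stream lines from $in and start a new output file
# in $dir each time a line begins with an XML declaration.
sub slice_xml {
    my ($in, $dir) = @_;
    my ($out, @names);
    while (my $line = <$in>) {
        if ($line =~ /^<\?xml version/) {
            my $name = sprintf "%s/outfile_%04d.xml", $dir, scalar @names;
            # open on an already-open handle closes it first,
            # so no separate "is a file open?" flag is needed.
            open $out, '>', $name or die "open $name: $!\n";
            push @names, $name;
        }
        print {$out} $line if $out;   # skip anything before the first declaration
    }
    if ($out) {
        close $out or die "close: $!\n";
    }
    return @names;
}

# Usage with an in-memory input handle and a scratch directory.
my $data = qq{<?xml version="1.0"?>\n<a/>\n<?xml version="1.0"?>\n<b/>\n};
open my $in, '<', \$data or die "open: $!\n";
my @files = slice_xml($in, tempdir(CLEANUP => 1));
print "wrote ", scalar(@files), " files\n";
```

Unlike the version above, this one silently ignores lines that appear before the first declaration instead of printing to an unopened handle.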

    --ZZamboni

Re: XML File Slicing Golf(ish)
by Abigail (Deacon) on Jun 20, 2001 at 01:55 UTC
    This is not a good question, or a good challenge. What's lacking is a specification. All you are saying is "do the same as the following program", without saying what the program does, what the program is supposed to do, what the input looks like, or what the output is supposed to be.

    I think you have a better chance of a useful answer if you specify the problem you want to solve first.

    -- Abigail

      Abigail-sama, I am righteously admonished. Please allow me to correct myself...

      (BTW, good to see you posting here again.)

      I suppose I was more concentrating on code, rather than phrasing my 'question'. I was looking more for a conversation about coding style & technique so that I could learn a better way to do something. I'm sure someone of much brain such as yourself could do this in a few lines. Perhaps I should rephrase it in a more generic, theoretical way, rather than a real world example?

      I'm attempting to simply split a file into smaller files based on a search criterion. I've explained the real world application. But my approach is simply one of finding the opening XML Processing Instruction which identifies the version of the XML file. One subtopic of discussion might be why I shouldn't do that, perhaps.. For now I'm splitting up a file into smaller pieces based on a regular expression. I'm asking if there is a simpler way to do this, or if my coding style is perhaps obtuse. Comments of all temperatures are welcome. (I'm wearing my asbestos weave robes today. (; )

      Here is some sample data:

      <?xml version = "1.0"?>
      <!DOCTYPE AdvanceShippingNotice SYSTEM "AdvanceShippingNotice.dtd">
      <AdvanceShippingNotice>
        <AdvanceShipping>
          <AdvanceShippingDetail>
          </AdvanceShippingDetail>
        </AdvanceShipping>
      </AdvanceShippingNotice>
      <?xml version = "1.0"?>
      <!DOCTYPE AdvanceShippingNotice SYSTEM "AdvanceShippingNotice.dtd">
      <AdvanceShippingNotice>
        <AdvanceShipping>
          <AdvanceShippingDetail>
          </AdvanceShippingDetail>
          <AdvanceShippingDetail>
          </AdvanceShippingDetail>
          <AdvanceShippingDetail>
          </AdvanceShippingDetail>
        </AdvanceShipping>
        <GlobalDocumentFunctionCode>ASN</GlobalDocumentFunctionCode>
      </AdvanceShippingNotice>
      <?xml version = "1.0"?>
      <!DOCTYPE AdvanceShippingNotice SYSTEM "AdvanceShippingNotice.dtd">
      <AdvanceShippingNotice>
        <AdvanceShipping>
          <AdvanceShippingDetail>
          </AdvanceShippingDetail>
        </AdvanceShipping>
        <fromRole/>
        <toRole/>
        <thisDocumentGenerationDateTime>
          <DateTimeStamp>20010619120949</DateTimeStamp>
        </thisDocumentGenerationDateTime>
        <thisDocumentIdentifier>
          <ProprietaryDocumentIdentifier/>
        </thisDocumentIdentifier>
        <GlobalDocumentFunctionCode>ASN</GlobalDocumentFunctionCode>
      </AdvanceShippingNotice>

      I have left the details of the data out for the sake of brevity, and since they don't matter.

      The results of this should be three new files written to the filesystem which are themselves valid XML files.

      Actually, let me correct that... Known bugs:

      No zero padding of numbers in filenames to support alphabetical ordering

      No actual XML validation
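
The first known bug (no zero padding) could be addressed with sprintf. A minimal sketch, assuming a four-digit width is enough; out_name is a hypothetical helper, while $filename_prefix mirrors the variable in the original program.

```perl
use strict;
use warnings;

my $filename_prefix = "outfile_";

# Hypothetical helper: build a zero-padded output name so that
# alphabetical order matches numeric order (up to 9999 files).
sub out_name {
    my ($count) = @_;
    return sprintf "%s%04d.xml", $filename_prefix, $count;
}

print out_name($_), "\n" for 1, 2, 10;
```

With padding in place, a plain directory listing shows the files in the order they appeared in the input.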



      Wait! This isn't a Parachute, this is a Backpack!
