Human-readable serialization formats other than YAML?

by jasonk (Parson)
on Apr 22, 2008 at 22:13 UTC ( [id://682292]=perlquestion )

jasonk has asked for the wisdom of the Perl Monks concerning the following question:

History: I'm working on an application that involves a huge amount of text-processing, scraping useful information from thousands and thousands of files in about 3 dozen different formats. One of the problems I'm trying to overcome is that although these text files were all generated by different versions of the same tool, their contents vary pretty drastically depending on which version of the tool generated them and which version of the system the tool was collecting data for.

So, with that in mind, I wanted to be damn sure that changes made to the extraction tools to correct problems with one version of the data didn't break other versions of the data, so I wrote a script to take the known-working version, and store both the original contents and the known-working output so I could use it later to make sure the same input would produce the same output. Sounds simple enough, right?

Apparently it's only simple if you don't care whether that file is human readable or not. I tried using YAML (even though I seem to have a history of discovering new bugs every time I try to use YAML) because the output is so nicely readable. Unfortunately I fairly quickly found a simple case that YAML couldn't round-trip. I've actually tried several different YAML implementations, and all of them had problems.

This is a sample of the text I was working with:

F S UID PID PPID C PRI NI CMD
040 S root 14 0 0 75 0 [kupdated]

One of the things I liked about using YAML to store this stuff is that it has a 'block-folding' operator, where this text can (theoretically) be stored like this:

content: |
  F S UID PID PPID C PRI NI CMD
  040 S root 14 0 0 75 0 [kupdated]

Nice and readable, and easy to work with, right? Not really; YAML 0.66 serialized it to this:

--- |2
   F S UID PID PPID C PRI NI CMD
  040 S root 14 0 0 75 0 [kupdated]
  040 S root 13 0 0 85 0 [bdflush]

And then puked when attempting to deserialize it...

YAML Error: Inconsistent indentation level
   Code: YAML_PARSE_ERR_INCONSISTENT_INDENTATION
   Line: 3
   Document: 1
 at /usr/local/lib/perl5/site_perl/5.8.8/YAML.pm line 33

Even a very simple string that starts with spaces can't be round-tripped...

use strict;
use warnings;
use YAML qw( LoadFile Dump Load );
use Data::Dump qw( dump );

my $in  = " FOO\nBAR BAZ\n";
my $out = Load( Dump( $in ) );
print dump( $in, $out );

------ output -----
(
  " FOO\nBAR BAZ\n",
  " FOO\nBAR BAZ\n",
)

So, since there are several different YAML modules on CPAN, I figured one of them must meet my needs. No such luck...

YAML::Syck successfully roundtripped all the examples above, but it did it by not even attempting to do the folding, instead it just wrapped the whole thing in double quotes, and converted all the newlines to "\n", turning the whole document into one long, unreadable string.
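
For illustration, here's a minimal sketch of that behaviour (the exact quoting may vary between YAML::Syck versions, but the effect is the one described above):

use YAML::Syck;

my $content = " F S UID PID PPID C PRI NI CMD\n040 S root 14 0 0 75 0 [kupdated]\n";

# Instead of a readable block scalar, the whole thing comes back on one
# double-quoted line with every newline escaped, something like:
#   --- " F S UID PID PPID C PRI NI CMD\n040 S root 14 0 0 75 0 [kupdated]\n"
print Dump( $content );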

YAML::Tiny couldn't even dump the original data, as the output that I'm storing is an object, and it doesn't seem to be able to serialize a blessed object.
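
For what it's worth, this is the minimal case I mean (the class name 'Something' is made up purely for illustration; depending on the YAML::Tiny version the dump either dies or produces nothing):

use YAML::Tiny;

my $obj = bless { foo => 'bar' }, 'Something';

# Either way, the blessed reference is not serialized.
my $yaml = eval { YAML::Tiny::Dump( $obj ) };
print defined $yaml ? $yaml : "no YAML produced ($@)\n";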

So, for the time being I'm doing it the ugly way, storing the source data in a separate file and using YAML::Syck for the comparison data. This is not an ideal solution, as it requires that those two files be carefully kept together, so I'm wondering if anyone has suggestions for human-readable serialization formats that actually work on real-world data?


www.jasonkohles.com
We're not surrounded, we're in a target-rich environment!

Replies are listed 'Best First'.
Re: Human-readable serialization formats other than YAML?
by BrowserUk (Patriarch) on Apr 23, 2008 at 00:01 UTC

    FWIW: I find the output of Data::Dump far more readable than YAML, and it doesn't make the Pythonesque mistake of relying upon invisible characters for structural integrity.

    I did find the standard maximum width of 60 rather limiting, so I patched my version to make it a global so that I can adjust it.

    Of course, if you're squeamish about using eval it may not be what you want, but I see no greater risk in the contents of local data files, than local source files. And I see no reason to trade Perl's well-proven, reasonably efficient parser, for less well-proven, invariably highly inefficient, implementations of a design-by-committee spec. in order to avoid using eval.
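
    As a minimal sketch of that approach (the data structure and class name here are made up for illustration): dump the structure to perl source and let perl's own parser rebuild it.

    use strict;
    use warnings;
    use Data::Dump qw( dump );

    my $data = bless {
        filename => 'test-data/some-filename',
        content  => " FOO\nBAR BAZ\n",
    }, 'MyApp::Store';

    # Serialise: dump() emits valid perl source for the structure...
    my $perl = dump( $data );

    # ...so eval (or do, for a file on disk) can rebuild it, blessed class and all.
    my $copy = eval $perl or die $@;
    print ref $copy, "\n";    # MyApp::Store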


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Data::Dump suffers from the same problem as YAML::Syck, in that the huge block of text data gets turned into a long, unruly string. And while I do use Data::Dump a lot for debugging, when dealing with deeply-nested objects I find it gets too wide to read far too quickly, while YAML is usually much easier to follow, thanks to the amount of whitespace that ends up at the beginning of each line... i.e.

      ### An object like this...
      my $obj = bless( {
          this_object_has_some_long_keys => {
              and_some_of_them_are_nested => {
                  fairly_deeply => 'Foo!',
              }
          },
      }, 'Some::Random::ClassName' );

      ### Will end up like this with Data::Dump...
      bless({
        "this_object_has_some_long_keys" => {
          "and_some_of_them_are_nested" => { fairly_deeply => "Foo!" }
        },
      }, "Some::Random::ClassName")

      ### YAML output is more readable, IMHO
      --- !!perl/hash:Some::Random::ClassName
      this_object_has_some_long_keys:
        and_some_of_them_are_nested:
          fairly_deeply: Foo!

      Ultimately, though, your comment did lead me to a solution which works. This is what I'm using now...

      From the serializer...

      $fh->print(
          "package MyApp::TestData::$digest;\n",
          "use parent 'Class::Data::Accessor';\n\n\n",
          "use YAML qw( Load );\n",
          'sub data($$) { __PACKAGE__->mk_classaccessor( @_ ) }'."\n\n",
          "data filename => ".dump( $self->filename ).";\n\n",
          "data store => Load( <<'$tag' );\n",
          YAML::Dump( $store ),
          "$tag\n\n",
          "data content => <<'$tag';\n",
          $self->content,
          "$tag\n",
          "return __PACKAGE__;\n",
      );

      Which produces an output file that looks like this:

      package MyApp::TestData::PUHmx80zfPHoDloIOrexGA;
      use parent 'Class::Data::Accessor';

      use YAML qw( Load );
      sub data($$) { __PACKAGE__->mk_classaccessor( @_ ) }

      data filename => "test-data/some-filename";

      data store => Load( <<'___END_OF_TEST_DATA_SECTION___' );
      --- &1 !!perl/hash:MyApp::Store
      children:
        - &2 !!perl/hash:MyApp::Store
      ___END_OF_TEST_DATA_SECTION___

      data content => <<'___END_OF_TEST_DATA_SECTION___';
      Original file contents here...
      ___END_OF_TEST_DATA_SECTION___

      return __PACKAGE__;

      Then, my regression test scripts can just do this:

      for my $file ( <*.pl> ) {
          ok( my $class = do $file, "Loaded $file" );
          my $p = MyApp::Processor->new(
              filename => $class->filename,
              content  => $class->content,
          );
          eq_or_diff( $p->store, $class->store );
      }

      www.jasonkohles.com
      We're not surrounded, we're in a target-rich environment!
Re: Human-readable serialization formats other than YAML?
by tachyon-II (Chaplain) on Apr 22, 2008 at 23:28 UTC

    Why do you need a human readable serialisation format when your data looks perfectly human readable currently?

    The problem appears simple. You have multiple data formats. You want to compare data. To do this you need a series of filters to standardise the data into a common format. You need to validate your conversions.

    The desire to use a human-readable format suggests that you (a human) plan to read the data and decide if it is valid. Why do that after every change? Why not leverage modules, t/*.t test files, and make test to do this?

      The point of making it human-readable wasn't that it was going to be reviewed by a human at test time (in fact I am using test modules just as you suggest); it was simply to make it easier for me to review the original data in the event that the test fails, so I can more easily determine what went wrong.

      True, the data is currently perfectly readable, but I would like to store it in the same file as the processed data that is associated with it to make it easier to manage by not having the data scattered across multiple files.

      In essence, what I'm trying to do is something like this...

      my $data = file( shift() )->slurp;

      ### At dev-time, when this processor is known to be working
      my $test1 = MyApp::Processor->process( $data );
      YAML::DumpFile( $file, $data, $test1 );

      ### Then, sometime later in a test script...
      my @files = <test-files/*.yml>;
      plan tests => scalar @files;
      for my $file ( @files ) {
          my ( $data, $test1 ) = YAML::LoadFile( $file );
          my $test2 = MyApp::Processor->process( $data );
          eq_or_diff( $test1, $test2, "$file not broken yet!" );
      }

      www.jasonkohles.com
      We're not surrounded, we're in a target-rich environment!

        Nothing you have said so far explains the need for YAML/Data::Dump etc. serialisation to me. Your data is already human readable. Storing the original and processed data in the same file is as simple as adding a separator. You don't need any fancy modules to do it.

        while (<DATA>) {
            if ( m/<ORIG DATA ABOVE MUNGE BELOW>/ ) {
                $munge .= $_ while <DATA>;    # slurp the rest
            }
            else {
                $orig .= $_;
            }
        }

        As an added benefit of keeping it simple, you can leverage diff to do the data comparison for your eq_or_diff() routine if a simple eq test fails. I really think you are over-complicating the task by adding a middleware serialisation layer. You are not actually using it to reconstitute a data structure, nor is there any real need, as all you want to do is reformat the old data into the new format so you can process it. Why add useless middleware that only offers the opportunity to include bugs for no real gain?
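
        A small sketch of that fallback (Text::Diff is just one module choice, picked here for illustration; the sample strings stand in for the values produced above):

        use Text::Diff qw( diff );

        my ( $munge, $cur_format ) = ( "old munge\n", "new munge\n" );

        # Only produce the human-oriented diff when the cheap check fails.
        print diff( \$munge, \$cur_format ) if $munge ne $cur_format;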

        As I see it you need a base class that has the functions:

        my ($orig, $munge) = load_file($file);      # munge may be NULL
        my $data           = parse($orig);          # process current format data only
        my $cur_format     = serialise($data);      # output current format
        write_file($orig, $cur_format);             # write to file with separator
        my $invalid        = eq_or_diff($munge, $cur_format);
        print "$file\n$invalid\n" if $invalid;      # diff output, null if OK

        Each filter class only requires a parse() method to generate whatever data structure you want to work with in your ultimate program.

        You probably already have parse code to work with current data. The serialise method simply writes this data struct back into a string that you can save. For current data this may or may not be identical to the current data format, but the process is valid if a base-class parse of the original and the munged data serialises to the same result, as it is then round-tripping.

        Essentially what I am saying is don't use serialisation middleware. Write your own code that takes your data structure (which you need) and serialises it *into the current format* (which you need, mostly for validation). The filters become simply a parse method to generate your standard internal data structure. Note that if your internal data structure uses hashes ensure you apply a sort or a list ordering to the keys during serialisation. If you don't it will probably bite you. It has bitten me before as key return order is not guaranteed and is different in different versions of perl on different OSs for exactly the same data.
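
        A minimal sketch of that kind of hand-rolled serialise() with sorted keys (the flat key/value layout is invented purely for illustration):

        sub serialise {
            my ($data) = @_;
            my $out = '';
            # Sort the keys so the same structure always serialises to the
            # same text, regardless of perl's internal hash ordering.
            for my $key ( sort keys %$data ) {
                my $value = $data->{$key};
                $value =~ s/\n/\\n/g;    # keep one record per line
                $out .= "$key: $value\n";
            }
            return $out;
        }

        print serialise( { beta => "2\n", alpha => 1 } );   # always alpha before beta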

        Doing it this way gives you:

        1. Human readable output
        2. The old data in new data format in the same file
        3. A format that can easily be munged by diff to show the exact differences, probably in the most intuitively understandable format.
        4. No useless middleware bugs to deal with. You will personally own all bugs :-)
        5. A simple one method filter that does the absolute minimal task required - convert old data into a standardised internal representation ready to either work with or write back to file.
Re: Human-readable serialization formats other than YAML?
by AK108 (Friar) on Apr 23, 2008 at 00:28 UTC
    Depending on your particular audience, JSON might be a good format. It's a very simple format, especially for programmers. I use the module JSON::DWIW to generate and parse JSON in my programs.
      JSON uses the same subset of data structures as YAML::Tiny.

      If he wants a "human-readable" storage format that supports objects, JSON won't help.
      So is Data::Dumper, especially for perl programmers :p

      I like JSON myself and use it extensively for data transfer purposes (it's about as compact as a plaintext serialization format could ever be), but it's not particularly human-readable, even when pretty-printed. This is especially the case with the multi-line strings mentioned in this example.

      Still, it might be enough for debugging purposes. But so is any other text serialization format. Even (G-d forbid!) XML. With CDATA sections.
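
      For example, a tiny sketch (using the JSON module's encode_json from its 2.x functional interface rather than JSON::DWIW; the escaping itself is required by the JSON spec either way):

      use JSON qw( encode_json );

      my $record = {
          content => "F S UID PID PPID C PRI NI CMD\n040 S root 14 0 0 75 0 [kupdated]\n",
      };

      # The newlines must be escaped, so the block becomes one long line:
      #   {"content":"F S UID PID PPID C PRI NI CMD\n040 S root 14 0 0 75 0 [kupdated]\n"}
      print encode_json( $record ), "\n";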

Re: Human-readable serialization formats other than YAML?
by Arunbear (Prior) on Apr 23, 2008 at 09:45 UTC

      JSON can't round-trip perl objects without extra work, no matter which module you use to produce it.

      I seem to have missed YAML::XS when looking at other YAML implementations though, and it passes all the failing test scripts I sent in with my YAML bug reports, so I'll probably start using that for all my YAML needs now.
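
      Re-running the earlier leading-space test against YAML::XS looks like this (a quick sketch of the check; YAML::XS exports the same Dump/Load interface as YAML):

      use strict;
      use warnings;
      use YAML::XS qw( Dump Load );

      my $in  = " FOO\nBAR BAZ\n";
      my $out = Load( Dump( $in ) );

      # With YAML::XS this string should round-trip intact (it passes the failing cases above).
      print $in eq $out ? "round-trips ok\n" : "still broken\n";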


      www.jasonkohles.com
      We're not surrounded, we're in a target-rich environment!
Re: Human-readable serialization formats other than YAML?
by perrin (Chancellor) on Apr 23, 2008 at 14:32 UTC
    Although it's a lot more verbose than YAML, XML is a lot more robust. Have you tried XML::Simple for dump and restore?

      Unless there is an option I missed somewhere, XML::Simple can't round-trip perl objects either...

      use Test::More tests => 1;
      use Test::Differences;
      use XML::Simple;

      my $in = bless( { foo => 'bar' }, 'Something' );
      my $out = XMLin( XMLout( $in ) );
      eq_or_diff( $in, $out );
      __END__
      1..1
      not ok 1
      #   Failed test at foo.pl line 8.
      # +----+------------------+----------------+
      # | Elt|Got               |Expected        |
      # +----+------------------+----------------+
      # *   0|bless( {          |{               *
      # |   1|  foo => 'bar'    |  foo => 'bar'  |
      # *   2|}, 'Something' )  |}               *
      # +----+------------------+----------------+
      # Looks like you failed 1 test of 1.

      www.jasonkohles.com
      We're not surrounded, we're in a target-rich environment!
        Sorry, I missed that you wanted to handle blessed objects. I think you'd better stick with something that understands perl then, like Data::Dumper.
        (Got, Expected) => (Expected, Got)

        Please note that the key return order in a hash is not guaranteed by perl. While it will be constant for a given version of perl on one OS, it may well change with a different version of perl, or even the same version on a different OS. This has bitten me before when writing tests for CPAN modules. If you are stringifying hashes for comparison purposes, make sure you apply a sort to the keys first.
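
        If you do end up comparing Data::Dumper output, it has a switch for exactly this (a minimal sketch; 'Something' reuses the made-up class name from above):

        use Data::Dumper;

        # Sort hash keys so the dump is stable across perl versions and OSs,
        # which is what you want when comparing stringified structures.
        local $Data::Dumper::Sortkeys = 1;

        my $obj = bless { foo => 'bar', baz => 'quux' }, 'Something';
        print Dumper( $obj );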
