PerlMonks  

Re^4: Complex file manipulation challenge

by Tux (Canon)
on Aug 14, 2019 at 06:32 UTC ( [id://11104432] )


in reply to Re^3: Complex file manipulation challenge
in thread Complex file manipulation challenge

Your view of CSV is indeed very limited :)

Consider not only Excel (or other spreadsheet application) exports, but also:

  • Database exports (including images, BLOBs, XML, Unicode, …)
  • Log exports (I know of a situation that has to read 4 TB (terabytes!) a day)
  • CSV exports where not only the data but also the header row has embedded newlines (and commas) in the fields
  • CSV files with mixed encodings (you should know that Oracle supports field-scoped encodings in its most recent versions)
  • Nested CSV: each/any field in the CSV is itself (correctly or incorrectly quoted) CSV, but the final result is valid CSV
  • I've seen CSV files with more than 65535 columns.

All of the above should remind you never to use regular expressions or read-by-line algorithms to parse CSV. It only looks easy.
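The embedded-newline case alone is enough to sink a read-by-line loop. A minimal sketch with Text::CSV (Text::CSV_XS behaves the same), using an in-memory file handle for illustration:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# The second record contains an embedded newline: a naive
# while (<$fh>) loop would see four "lines" but there are
# only three records (header + 2 data rows).
my $data = qq{id,comment\n1,"line one\nline two"\n2,plain\n};
open my $fh, '<', \$data or die $!;

my @rows;
while (my $row = $csv->getline($fh)) {   # reads *records*, not lines
    push @rows, $row;
}
close $fh;

printf "%d records\n", scalar @rows;   # prints "3 records"
print "field: $rows[1][1]\n";          # the multi-line field, intact
```

getline() honors the quoting rules, so the newline inside the quoted field is just data, not a record separator.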

Now reconsider your last line: a CSV file does not have a humongous line length; it is likely to have a humongous record length. (Think of a database export where a table stores movies in parts and each record holds up to 4 pieces of a movie, so each CSV record can be gigabytes in size.) People use databases and CSV for weird things.


Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^5: Complex file manipulation challenge
by karlgoethebier (Abbot) on Aug 14, 2019 at 11:28 UTC

    I'm wondering why CSV isn't replaced with JSON. Doesn't PostgreSQL have row_to_json? One row as a JSON array and done? I don't know offhand what Sybase, Oracle or MySQL provide, but I guess they come with something similar. Writing a stored procedure might be an option. And probably there is some fubar Excel macro that does the same. Processing such a file line by line with JSON::Tiny and/or something from the MCE toolbox should work like a charm. Regards, Karl
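    PostgreSQL does provide row_to_json(), and a one-record-per-line export ("JSON Lines" style) is indeed easy to stream. A sketch of the consumer side, using core JSON::PP in place of JSON::Tiny (the sample data here is made up for illustration):

```perl
use strict;
use warnings;
use JSON::PP;   # core module; JSON::Tiny or JSON::XS work the same way

my $json = JSON::PP->new;

# One record per line, each line a self-contained JSON array:
my $export = <<'EOF';
[1,"Alice","alice@example.com"]
[2,"Bob, Jr.","bob@example.com"]
EOF

open my $fh, '<', \$export or die $!;
my @rows;
while (my $line = <$fh>) {
    # No CSV quoting rules to worry about: the embedded comma
    # in "Bob, Jr." is unambiguous in JSON.
    push @rows, $json->decode($line);
}
close $fh;

print "$rows[1][1]\n";   # prints "Bob, Jr."
```

    Note the caveat from earlier in the thread still applies: this only works line by line because each record is guaranteed to sit on one line, which JSON itself does not promise.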

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

      I'm wondering why CSV isn't replaced with JSON.

      "Millionen Fliegen können nicht irren." ("Millions of flies can't be wrong.") (1)

      CSV is so widely used to connect systems you haven't even dreamed of that it won't go away anytime soon. If all parties agree on a common subset of CSV (i.e. they define away the evil edge cases Text::CSV cares about), a CSV parser is trivial to implement in many systems. A fully featured JSON parser is much harder, plus it may generate unexpected data structures. Assume your CSV-replacement JSON is defined as an array of arrays. What if your CSV-replacement-JSON parser is suddenly fed an array of hashes, or an AoAoA? What if array elements are not simple scalars? Plus, JSON requires some punctuation (e.g. for keys) that is optional in CSV, so the JSON files might get bigger.
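      The shape problem is easy to demonstrate: a CSV-replacement-JSON consumer would have to validate that whatever it decoded really is an array of arrays of plain scalars. A sketch (the is_valid_table helper is hypothetical, not a published API):

```perl
use strict;
use warnings;
use JSON::PP;

# Accept only an array of arrays of plain scalars; reject
# hashes, nested arrays, and anything else JSON allows.
sub is_valid_table {
    my ($doc) = @_;
    return 0 unless ref $doc eq 'ARRAY';
    for my $row (@$doc) {
        return 0 unless ref $row eq 'ARRAY';
        return 0 if grep { ref $_ } @$row;   # no nested structures
    }
    return 1;
}

my $json = JSON::PP->new;
print is_valid_table($json->decode('[[1,"a"],[2,"b"]]')) ? "ok\n" : "reject\n";  # ok
print is_valid_table($json->decode('[{"id":1}]'))        ? "ok\n" : "reject\n";  # reject
```

      None of this validation is needed with the agreed-upon CSV subset, because the format cannot express anything but flat rows in the first place.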

      Alexander

      (1) Did you know that electronic delivery notes for blood products are specified as floppy disks containing CSV files? See for yourself: https://dgti.de/docs/doclink/10158/Lieferscheindiskette_Version2_2.doc (MS Word format, German). And yes, this format is still in use, still on floppy disks, but also via e-mail. I know at least 10 different systems, implemented in at least five different languages, that use exactly this format, at least in Germany and Austria. Every change in the format requires changing ALL of those systems, followed by a complete verification of each. No one wants to pay for that, so changes are rare, even if the format sucks.

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        "Millionen Fliegen..."

        Sure, I know. And some day the sky will fall on our heads.

        But wait: Data for exchange (CSV is made for this) should be sound and well defined.

        To me, a line of a file is an array with n slots. If something else shows up, something went wrong: a reason to die.
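        That "n slots or die" policy is straightforward to enforce while reading (recent Text::CSV_XS versions can also do this for you via the strict attribute). A sketch, assuming Text::CSV is installed:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

my $data = qq{a,b,c\n1,2,3\n4,5\n};   # last record is one slot short
open my $fh, '<', \$data or die $!;

my ($width, $err, @rows);
eval {
    while (my $row = $csv->getline($fh)) {
        $width //= scalar @$row;   # n slots, fixed by the first record
        die "record @{[ @rows + 1 ]} has @{[ scalar @$row ]} fields, expected $width\n"
            if @$row != $width;
        push @rows, $row;
    }
    1;
} or $err = $@;

# prints: rejected: record 3 has 2 fields, expected 3
print $err ? "rejected: $err" : "all records consistent\n";
```

        Dying on the first inconsistent record is exactly the "sound and well defined" contract: better a loud failure than silently shifted columns downstream.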

        And using CSV for data exchange is often outright abuse. Often it's because the different departments in some company restrict or don't allow access to their holy grail, AKA their database. Remember the Turnschuhnetzwerk ("sneakernet"). They send it as an e-mail attachment, hand over USB sticks, a DVD, etc. A practice that makes admins shudder. Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
