Re^2: Finding Minimum Value

by Laurent_R (Canon)
on Sep 22, 2014 at 21:46 UTC ( [id://1101570] )


in reply to Re: Finding Minimum Value
in thread Finding Minimum Value

When working with data and manipulating the data in various ways, always think "database"!

I am sorry, but I have to object very strongly to this, especially to the word "always".

Databases are very good at two things: data persistence, and the ability to manage data volumes too big for the computer's memory. There is also the fact that, if your database is an SQL one, SQL is a very high-level and practical language that hides many implementation details.

But databases also have a lot of limitations. First, they are horribly slow compared to hashes in memory. And the languages used to manipulate them, such as PL/SQL, are often also horribly slow. Of course, this probably does not matter if you have just tens of thousands of records. But when you get to millions or tens of millions of records, the difference is huge.
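
To give an idea of the difference, here is a minimal sketch (not what I actually ran; the table name, row count and choice of DBD::SQLite are assumptions for illustration) benchmarking an in-memory hash lookup against an indexed lookup in an in-memory SQLite database:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use DBI;

    # Build the same 50_000-entry mapping twice: once as a Perl hash,
    # once as an indexed SQLite table (in memory, so no disk I/O at all).
    my %hash;
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE codes (old TEXT PRIMARY KEY, new TEXT)');
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO codes (old, new) VALUES (?, ?)');
    for my $i (1 .. 50_000) {
        $hash{"old$i"} = "new$i";
        $ins->execute("old$i", "new$i");
    }
    $dbh->commit;

    my $sel = $dbh->prepare('SELECT new FROM codes WHERE old = ?');

    # One lookup per iteration on each side.
    cmpthese(-2, {
        hash   => sub { my $v = $hash{'old25000'}; },
        sqlite => sub {
            $sel->execute('old25000');
            my ($v) = $sel->fetchrow_array;
        },
    });

The exact ratio will vary, but every SQL lookup pays statement-execution overhead that a hash lookup simply does not have, and a networked Oracle or DB2 server pays a round trip on top of that.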

So, if you don't need to store data in a persistent fashion, you should probably not use a database, or at least think twice before you do.

About a year and a half ago, I was asked to try to improve the performance of a very complicated extraction process on a database. An initial duration test led to a projected execution time of 160 days. After some profiling and benchmarking work, I was able to reduce it to about 60 days, 59.5 of which were spent in a very complicated trans-codification process. Not too bad, but still obviously a no-go. I moved to an extract of raw data files and a reprocessing of the flat files in pure Perl. The overall extraction time fell to about 12 or 13 hours, and the trans-codification part, using half a dozen Perl hashes, fell from 59.5 days to just about an hour, i.e. an improvement by a factor of about 1,400.
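
For those who have not used this technique, a hedged sketch of what such a hash-based trans-codification can look like (the file names, separator and field positions are invented; the real process was of course far more complicated):

    use strict;
    use warnings;

    # Load the code-mapping file (old_code;new_code) into a hash once.
    my %trans;
    open my $map_fh, '<', 'code_map.csv' or die "code_map.csv: $!";
    while (<$map_fh>) {
        chomp;
        my ($old, $new) = split /;/;
        $trans{$old} = $new;
    }
    close $map_fh;

    # Then a single pass over the extract: each lookup is O(1) in memory,
    # with no round trip to a database server.
    open my $in,  '<', 'extract.csv'    or die "extract.csv: $!";
    open my $out, '>', 'translated.csv' or die "translated.csv: $!";
    while (<$in>) {
        chomp;
        my @fields = split /;/, $_, -1;
        # Trans-codify the third field, leaving unknown codes unchanged.
        $fields[2] = $trans{ $fields[2] } // $fields[2];
        print {$out} join(';', @fields), "\n";
    }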

No, it is a bit more complicated than that. Databases are very useful, there is no doubt about it, but they are certainly not the solution to everything, far from it. Especially when performance is important.

Replies are listed 'Best First'.
Re^3: Finding Minimum Value
by CountZero (Bishop) on Sep 23, 2014 at 06:32 UTC
    I said "think database", not that you always have to use one or that it is always the best option.

    However, in actual real-world cases there is usually a need to persist the data: there is more data than you can keep in the __DATA__ section of your script, or the data changes from time to time. So it pays to "think database" in many, if not most, day-to-day jobs.

    I am not surprised that Perl outruns any database in doing a complicated trans-codification process. That is one of Perl's main strengths!

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      I might have over-reacted, but, true, you said "think database", not "use a database". Sorry if I misinterpreted what you said.

      My work is to guarantee data consistency between several hundred applications in the information system of a Tier 1 telecommunications operator (cell phones, landlines, Internet, TV, VoIP, VoD, etc.). The various applications run on MVS, AS/400, AIX, HP-UX, Solaris, VMS, at least half a dozen distributions of Linux, etc. The database systems used are perhaps even more varied.

      Now, how do you compare data from a DB2 database running on MVS with an Oracle database running on AIX, or an RMS database running on VMS? As you might imagine, this is not straightforward. In most cases, we just extract flat CSV data files from the various applications and compare those files. There are a few cases where we load the data from two applications into the same database (different tables), say Oracle, and compare the tables using SQL queries or some other means. But that is usually when we also need to run manual SQL queries to investigate further.
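
      For illustration, a minimal sketch of what that "same database, different tables" comparison can look like (the connection string, table and column names are all invented, and it assumes DBI with a suitable DBD driver):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:Oracle:recon', 'user', 'password',
                                 { RaiseError => 1 });

          # Orphans: customer keys present in application A's table only.
          my $orphans = $dbh->selectall_arrayref(q{
              SELECT a.cust_key
              FROM   app_a a
              LEFT JOIN app_b b ON b.cust_key = a.cust_key
              WHERE  b.cust_key IS NULL
          });

          # Discrepancies: same key on both sides, but different details.
          my $diffs = $dbh->selectall_arrayref(q{
              SELECT a.cust_key, a.details, b.details
              FROM   app_a a
              JOIN   app_b b ON b.cust_key = a.cust_key
              WHERE  a.details <> b.details
          });

          printf "%d orphans in B, %d discrepancies\n",
                 scalar @$orphans, scalar @$diffs;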

      In most cases, though, this is too inefficient (we have about 35 million customers, and the extraction files commonly run to tens of gigabytes, sometimes several hundred GB). It is most of the time much, much faster to simply sort the data from both applications according to the same sorting key (using, for example, the Unix sort utility), and to compare them line by line with Perl to detect duplicates, orphans (that's what we call a piece of data present in one application and not in the other) and discrepancies in the details. We've done a lot of benchmarking. In most cases, the complete comparison method described just above takes less time than just loading the data into a database and constructing the indexes, without having even started comparing anything.
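
      A simplified sketch of that line-by-line comparison (assuming both extracts have already been sorted on the key, e.g. with LC_ALL=C sort, and hold one key;details record per line; the file names are invented, and duplicate-key handling is omitted for brevity):

          use strict;
          use warnings;

          open my $fh_a, '<', 'app_a.sorted' or die "app_a.sorted: $!";
          open my $fh_b, '<', 'app_b.sorted' or die "app_b.sorted: $!";

          my $ra = <$fh_a>;
          my $rb = <$fh_b>;
          while (defined $ra and defined $rb) {
              chomp $ra; chomp $rb;
              my ($ka, $va) = split /;/, $ra, 2;
              my ($kb, $vb) = split /;/, $rb, 2;
              if ($ka lt $kb) {            # key only in A
                  print "orphan in A: $ka\n";
                  $ra = <$fh_a>;
              }
              elsif ($ka gt $kb) {         # key only in B
                  print "orphan in B: $kb\n";
                  $rb = <$fh_b>;
              }
              else {                       # same key: compare the details
                  print "discrepancy on $ka: '$va' vs '$vb'\n" if $va ne $vb;
                  $ra = <$fh_a>;
                  $rb = <$fh_b>;
              }
          }

          # Whatever remains in either file has no counterpart in the other.
          for ([$ra, $fh_a, 'A'], [$rb, $fh_b, 'B']) {
              my ($rec, $fh, $side) = @$_;
              while (defined $rec) {
                  chomp $rec;
                  my ($k) = split /;/, $rec, 2;
                  print "orphan in $side: $k\n";
                  $rec = <$fh>;
              }
          }

      The point of doing it this way is that memory use stays constant no matter how big the files are: only one line of each file is held at a time.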

      I explained all this just because I wanted to let you know why I sometimes react strongly when someone suggests using a database for such one-off processes. Sure, it is often easier when the data is small or medium-sized, but for really large data it simply scales much worse than the sort-and-compare strategy.

      Now, of course, I admit that our activity might not be typical, especially because it does not involve persistent data. We compare different databases at a given point in time, once a week, once a month, or on some other schedule, but we don't need to keep data from one run to the next (well, that is a bit of an oversimplification: when we do a comparison, we usually compare its results with the results of the previous identical comparison, in order to check that the corrections we submitted were duly applied, or, if they were not, to investigate why).

      Again, sorry if I overreacted; I hope my description of my job explains why.

        In most cases, the complete comparison method described just above takes less time than just loading the data into a database and constructing the indexes, without having even started comparing anything.

        Ah! Someone who understands, picking up the baton :)

        Having been (seemingly) the lone voice making that exact point for the past 10+ years, it's good to see.

        Warning: be careful with your examples. erix has had the delightful habit of proving me wrong if my 'slow DB' examples are poorly chosen.

        Also be(a)ware of Oracle's latest hardware/DB monsters. Not that I would ever use anything that company makes -- assuming I had any influence -- but if you can afford it (the lock-in as much as the price), Oracle's latest hardware/in-memory DB is capable of some pretty amazing performance as they recently demonstrated to an old client of mine.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use every day'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
