Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

A database table size issue

by sophate (Beadle)
on Apr 27, 2012 at 08:20 UTC ( #967544=perlquestion: print w/ replies, xml ) Need Help??
sophate has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a huge database table which is over 30GB. I would like to create a Perl script that reads the table row by row. Process the data after reading each row.

I've written the script below but I run into a capacity issue right away as the script eats up all the memory when it tries to store the whole table into an array. Here are the codes.

#!/usr/bin/perl use DBI; $ENV{'FREETDSCONF'} = '/home/test/config/freetds.conf'; $ENV{'SYBASE'} = '/home/test/sql_server_15_0-64bit'; my $SERVER = 'TEST'; my $DB = 'test'; my $USER = 'admin'; my $PASSWD = 'admin'; $DB_Ref = DBI->connect("DBI:Sybase:server=$SERVER;database=$DB", "$USER", "$PASSWD", {syb_err_handler => \&syb_err_h}) or die "Couldn't connect to $DB : " . DBI->errstr; $DB_Ref ->{RaiseError} = 1; $DB_Ref ->{PrintError} = 1; $DB_Ref ->do("set dateformat ymd"); ## The table testTable is over 30GB ## The line below eats up all RAM on the server my $testArray = $DB_Ref->selectall_arrayref('select * from testTable') +; ## Do something on each row here

Can I output the results of the query one row at a time? Something like an input stream so that I can process the results row by row with something like "while <QUERY_STREAM> {}" to save memory? Many thanks.

Comment on A database table size issue
Download Code
Re: A database table size issue
by Anonymous Monk on Apr 27, 2012 at 08:24 UTC

      It's exactly what I'm looking for !! Thanks so much !!

Re: A database table size issue
by roboticus (Canon) on Apr 27, 2012 at 09:57 UTC

    sophate:

    You've already gotten a good answer, so I'll just digress a bit.

    I don't often see a task to process *every* row in a table where using SQL isn't better choice. It's pretty expensive to serialize each row and ship it to the other machine which does some processing, and then repeats the process to get the data back into the database. Since you haven't mentioned your level of experience with databases, I thought I'd mention a couple things, just in case.

    1. Can your task be done with SQL? It's a pretty simple language, and since the database server is optimized to use it, it's frequently faster than anything you'll be able to do locally (especially as you already have the overhead of shipping the data across a pipe).
    2. Even if your modifications are too complicated for a simple update, it can be better to make a temporary table and then do a set of transformations in SQL before updating the original table.
    3. If you find that you really need to get the data to the local desktop, be sure to get only the data you need to work on (via the WHERE clause, appropriate joins, etc.) so you can minimize the impact to your application, and all other users of the database.
    4. If you need to process all the rows and it's going to be a frequent operation, you might consider using the bulk-loader capabilities of your database to export the table into a flat file, which you can process easily with perl, and then bulk load the results back into the database. The bulk-loading tools for the databases I use (Sybase, MSSQL server, Oracle) are pretty efficient for whacking large amounts of data (and 30GB certainly qualifies).

    If you want, feel free to share the types of transformations you're wanting and I'll try to help with the SQL.

    Update: Typo fix (s/to/too/).

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      ++ for the good answer but just to add to it, use set based logic rather than procedural row by row processing;Use the server resources rather that the client's. Furthermore even opening just firehose cursors for fetching that amount of rows will place loads of stress over the connection line if not optimised for batch fetching
        Use the server resources rather that the client's.
        Really? Why? Typically, the number of servers is far less than the number of clients. Using server resources that benefit one client hurt all the clients.

        I'm not saying that in this case the task should be done on the client (we've no idea about the task, nor about the setup), but I disagree with the blanket "use server resources rather than the clients". In way too many cases that isn't scalable to do without thinking.

      Thanks for the suggestion. The script was really slow when I processed the table row by row. I took your advice and rewrote the SQL to let the DB to handle part of the processing. The script now runs much faster than before :-)

Re: A database table size issue
by sundialsvc4 (Monsignor) on Apr 27, 2012 at 12:31 UTC

    Echoing (and upvoting) Roboticus' suggestions, I have found that there can simply be tremendous improvements to be had when the TCP/IP communication link (no matter how fast it is) is eliminated from the processing picture.   If you’ve got a lot of data to be processed, first get it in table-form (or if necessary, flat-file form) to that server or to another server that might be connected to it, say, by an optical fiber link, e.g. a SAN or what-have-you.   Then, do the processing as “locally” as you possibly can, provided of course that you do not overload a server that has been dedicated to and configured for a particular task, with a task for which it is not intended.

    Then, construct your processing in a way that features these considerations, for instance:

    Note that in the following as originally writ, I assumed the update file was 30GB.   As far as database-tables go, 30GB is merely “moderate size.”   The basic principles pontificated :-) here, still hold.
    • It is done mostly with SQL, but it is not an arduous task.   Use EXPLAIN on the queries that you are contemplating, to see exactly how the system will do it.   Explore the process you have in mind; map it out on paper before you build it.   Avoid “build it just to see if it works,” because then you’re always scrambling, and losing credibility quickly.   The best way to stay on top of the ball, is to studiously avoid falling off.
    • Bulk-operations are done as much as possible with tools and/or database-facilities that are designed by the vendor for the purpose.
    • Pay attention to transactions:   use ’em.   Transactions are a clear indication to the server as to how much data it can keep “in memory” for a few milliseconds, and how much must be immediately committed to disk.   Disk-I/O delays add up very quickly.   But, don’t let your transactions become too big.   COMMIT regularly throughout, and of course, once more at the end.   If an operation fails, ROLLBACK.   It is desirable to use “highly exclusive of others” transaction isolation levels, but this might need to be coordinated with other simultaneous activities.   (You also must consider whether other users/applications should be permitted to see incompletely-loaded or incompletely-modified data.)
    • What IBM used to call “checkpoint/restart.”   If the process of churning through 30GB of data fails at any point, it should be possible to fairly-automatically resume.   Likewise, if it is discovered that the data which was applied was incorrect, a strategy must exist for getting the database back to a previous known-good state.   Also consider how you intend to identify, catalog, keep, and then eventually discard those 30GB data files.

      Thanks for your detailed and in-depth reply. Really appreciate the help from you and other monks.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://967544]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (15)
As of 2014-07-23 16:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (146 votes), past polls