Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Strategy for managing a very large database with Perl

by punkish (Priest)
on Jun 18, 2010 at 05:34 UTC ( #845309=perlquestion: print w/ replies, xml ) Need Help??
punkish has asked for the wisdom of the Perl Monks concerning the following question:

The "with Perl" part is not as gratuitous as it might seem, although the question is more db-centric. I am planning a rather large database table that I will query with DBD::Pg. The table itself is rather simple; just about half a dozen, mixed INT and REAL values as shown below

CREATE TABLE t { t_id BIGSERIAL NOT NULL, year SMALLINT, yday, SMALLINT, a INT, b INT, c REAL, d INT, e REAL, f REAL, point_id INT );

These are daily values per year, and they repeat for each point on land. So, for one point I have one row for each day of the year for 20 years, for the next point I have one row for each day of the year for 20 years, and so on.

The expected table size is going to be in 100 billion rows plus.

A typical search would be to retrieve all the values of a <variable> for or for a particular year+yday combination. In other words, besides the automatic INDEX on t_id, the PRIMARY KEY, there will be INDEXes on year, yday, and point_id. The INDEX on year, yday might be a composite INDEX because they will be searched together.

Anyway, as I said, 100 billion+ rows totaling about 10 TB without the extra INDEXes (that is, the INDEXes other than the PK).

Any suggestions if there are any better ways to do this? Should I try to save space by converting the REALs into INTs? Or, who cares about space... disks are cheap.

Should I combine the different values into a single blob using some kind of data serialization technique and then store them in a single column? To be sure, that sounds like a silly idea to begin with, because I would have to extract the individual variables out of that blob on every query.

Should I partition the table into smaller tables using the technique discussed at http://http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html, or is it too much trouble as Pg will handle this data very well anyway.

Once the db is built, it will be relatively static. In other words, it will be readonly.

Fishing for ideas here.

--

when small people start casting long shadows, it is time to go to bed

Comment on Strategy for managing a very large database with Perl
Download Code
Re: Strategy for managing a very large database with Perl
by dws (Chancellor) on Jun 18, 2010 at 05:55 UTC

    A typical search would be to retrieve all the values of a <variable> for or for a particular year+yday combination.

    If those are the typical search patterns, is a relational database the right hammer to reach for? Why not flat files? Appending to a flat file is a quick operation. If you keep one flat file per day, scanning for the subset of records for that day is trivial. If you have to do a full scan through all the data, one file per day is also easy, and allows the scan to be parallelized if you ever need to throw more machines at the problem.

    10Tb over 20 years is 500Mb a year, or very roughly 1.5Mb a day. That's not an unreasonable size for a flat file.

Re: Strategy for managing a very large database with Perl
by moritz (Cardinal) on Jun 18, 2010 at 06:34 UTC
    To me, partitioning by year sounds like a good idea.
      moritz> To me, partitioning by year sounds like a good idea.
      
      Thanks. I might try that. That would result in 23 tables, each with 4_850_174_538 rows by however many columns. That is still almost 5 billion rows, but better than 115 billion.

      Now, here is where some db chops are needed. Is a db really inefficient managing 115 billion rows as opposed to 5 billion? Well, yes, for certain operations it can be so. But, as far as I understand, a db manages data in pages, so, it is analogous to reading data from file line by line, except, each line, in this case, is a page. The db knows which page to go to with the help of indexes. So, it doesn't matter how big the table is. The db goes to the right page to grab the data. In this way, it is analogous to knowing the correct file offset to go to the right place in the file to read the data. Sure, I could implement all that crap myself, but the db has already figured it out.

      Since my data are pretty much read-only, what do I gain by tinkering with the rather simple scheme I have right now?

      I can optimize for space, but 10 TB doesn't sound very big, especially when some of my colleagues are deploying 100s of TB for satellite imagery.

      I can optimize for rows, and while 115 billion is indeed a very large number, does it affect my operation?

      In the end, I want to optimize for ease and speed of retrieval, so that is where I want to concentrate on. The partitioning docs indicate that that cause might be helped by virtue of having smaller INDEXes for specific searches.

      Thanks for the suggestion.

      --

      when small people start casting long shadows, it is time to go to bed
Re: Strategy for managing a very large database with Perl
by JavaFan (Canon) on Jun 18, 2010 at 07:15 UTC
    Considering your title, I wonder, would you have asked different questions, or expect a different answer, if you were going to manage the database with Python? Or C? Or something else?

    I really think you'd get much better answers in a database forum than here.

      JavaFan> Considering your title, I wonder, would you have 
      JavaFan> asked different questions, or expect a different answer, 
      JavaFan> if you were going to manage the database with Python? 
      JavaFan> Or C? Or something else?
      
      Don't know if would have asked a different question if managing the db with Python or C, because I would probably have had a different mindset.
      JavaFan> I really think you'd get much better answers in a 
      JavaFan> database forum than here.
      
      You are possibly correct, but there are db wizards here on Perlmonks, and they are Perl-friendly and knowledgable, so I wouldn't be surprised if they came up with good suggestions. They already have. See BrowserUk's wonderfully whacked out suggestion: Re: Strategy for managing a very large database with Perl (Video). That is why I love Perlmonks.
      --

      when small people start casting long shadows, it is time to go to bed
Re: Strategy for managing a very large database with Perl (Video)
by BrowserUk (Pope) on Jun 18, 2010 at 08:33 UTC

    I'd store it as video. Bear with me.

    If we take a look at your table definition, we can exclude (for now) t_id and point_id as they are just a part of the RDBMS data representation rather than the actual data. Year and day effectively denote 'frame'. That leaves 3 ints and 3 reals.

    Assuming (I know foolish), that 2 of those reals represent X & Y coordinates. We might also think of the 3 integers as being roughly equivalent to RGB. and the last as gamma (or alpha).

    20 years of days is just over 7000 frames, which at say 30 frames per second equates to a video clip of just over 4 minutes.

    100 billion rows / 7000 frames (assuming a roughly 16/9 frame shape) gives roughly a 6000 x 2500 frame size. That's approximately 10x the size of an HD video frame. So 40 minutes of HD movie. And 1 hour of HD movie is about 1 GB to download.

    Now, that's mostly an academic exercise as whilst video codec compression is highly developed, it is lossy, and so not really much good for academic data. However, it could lead to thinking about a different way of storing your data that might both reduce the storage requirements and lend itself to the type of queries that will be made of it.

    For example, if your data can be represented by a (say 6000 x 2500) 2D array of 3 ints and 1 float per point per day. Then that equates to a 6e3 x 2.5e3 x 16-bytes frame per day. Or 228 MB raw. * 20 years (7000) = 1.6 TB. Much less than your 10 TB projection.

    Where has the savings come from? First we've moved the year/day information out of every record. It's now just stored once in the name of the file. We've also moved the X & Y coordinates out of every record as they are now implicit in the position of the data within the file.

    And your typical queries are straight forward: "the values of a <variable> for a given point or a given set of points for a particular year+yday combination.", becomes: locate the appropriate file (date), and a direct read of 16 bytes from a directly calculated offset for each point or given set of points.

    No data loss, just a more compact representation that lends itself directly to the types of queries anticipated.

    Of course, once you store the data this way, you can then use some other video techniques to compress it further--without loss. You might for example store a "full frame" for the first day of every month, and then a set of { X,Y new value(s) } changes per day in between. This might make sense if only a small subset of the full frame is likely to change on any given day. (Much like video phones transmit change sets that are applied to periodic full frames.)

    The full dataset for any given date can quickly be reconstructed (losslessly) by grabbing the preceding "full frame" and then applying the change sets to bring it up to date. And if the particular query being satisfied only applies to a small area of the full frame then pre-filtering the change sets speeds the processing.

    Finally, if the 3 ints and the real for each X,Y position are compatible with representation as RGB & gamma (or one of the other color encodings), then displaying (sub-areas of) the frames as a movie would be a powerful way of visualising the changes in the datasets over time.

    Just a thought that normalised relational forms are often not the best way of storing and manipulating datasets.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowerUk> I'd store it as video. Bear with me.
      
      First, a wonderfully whacked out suggestion. I wouldn't expect any less from you, BrowserUk, as you always come up with an intriguing possibility.

      A few specific responses: First, a series of video frames is not that outlandish, actually. A video frame is nothing but a series of 2D images that flutter by our eyes at 30 fps. So, it is akin to storing a bunch of 2D images. Doing locational searches is, as you suggested, finding the correct file, and then finding the subset from that file.

      But, these would be a lot of very large images... a pretty bad combination. My dataset's 2D dims are 4,587 X 2,889 = 13,251,843 points. 23 years is 23 * 365 (or 366) = 8395 (or 8418) images. Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.

      See my response to moritz at Re^2: Strategy for managing a very large database with Perl. What am I optimizing for?

      • Space? I don't really care about space, because disk drive is a lot cheaper than my time.
      • Integrity? Once loaded, the data are read-only.
      • Ease and speed of retrieval? Yes and Yes.
      I only care about how easily and quickly I can get data that I want. Say, I want to create an image of variable a over an area for a particular day. That is a simple db query SELECT var FROM table WHERE year = ? AND yday = ?. Note: A columnar database such as Vertica or MonetDB might be very speed efficient for these queries, but those db lack spatial searches, and suffer from the lack of ease aspect.

      Nevertheless, your idea is very intriguing, and I am going to do a few tests with array storage. Of course, as mentioned in an earlier posting, the data are already in NetCDF, an array storage format, so I have to do something new that overcomes the shortcomings of the current format. The main shortcoming of the current format is the inability to do spatial searches for arbitrary spatial bounds.

      Thanks again, for a wonderful response.

      --

      when small people start casting long shadows, it is time to go to bed
        Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.

        I don't know why you say that?

        The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days worth of data files:

        C:\test>dir tmp 18/06/2010 14:28 371,260,960 2010.168.bin 18/06/2010 14:22 371,260,960 2010.169.bin 2 File(s) 742,521,920 bytes

        And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The code is just a little math, a read and an unpack--most of the posted code is just parsing the arguments and formatting the output and timing:

        for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul [2010 169 2293:9 0:9] Took 0.020 seconds [2010 169 2293:9 500:9] Took 0.017 seconds [2010 169 2293:9 1000:9] Took 0.017 seconds [2010 169 2293:9 1500:9] Took 0.017 seconds [2010 169 2293:9 2000:9] Took 0.019 seconds [2010 169 2293:9 2500:9] Took 0.017 seconds for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul [2010 168 2293:9 0:9] Took 0.021 seconds [2010 168 2293:9 500:9] Took 0.017 seconds [2010 168 2293:9 1000:9] Took 0.017 seconds [2010 168 2293:9 1500:9] Took 0.066 seconds [2010 168 2293:9 2000:9] Took 0.023 seconds [2010 168 2293:9 2500:9] Took 0.017 seconds

        And here 100 x 100 x 7 data points. Very linear as expected.

        for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul [2010 169 2293:99 0:99] Took 0.115 seconds [2010 169 2293:99 500:99] Took 0.115 seconds [2010 169 2293:99 1000:99] Took 0.117 seconds [2010 169 2293:99 1500:99] Took 0.116 seconds [2010 169 2293:99 2000:99] Took 0.115 seconds [2010 169 2293:99 2500:99] Took 0.116 seconds for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul [2010 168 2293:99 0:99] Took 0.125 seconds [2010 168 2293:99 500:99] Took 0.116 seconds [2010 168 2293:99 1000:99] Took 0.114 seconds [2010 168 2293:99 1500:99] Took 0.115 seconds [2010 168 2293:99 2000:99] Took 0.115 seconds [2010 168 2293:99 2500:99] Took 0.115 seconds

        So, very simple code and very fast. And the entire uncompressed dataset (23 * 365.25 * 354MB) = < 3TB.

        With compression, that could be as little as 1.3 TB. Though you'd have a pay the price for unpacking--~30 seconds per file.

        18/06/2010 14:28 237,173,932 2010.168.bin.gz 18/06/2010 14:22 175,868,626 2010.169.bin.bz2

        But the main point of partitioning your dataset this way is that you reduce the search space to 1/8th of 1% as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.

        Anyway, t'is your data and your employers money :)

        The (flawed) test demo code.

        #! perl -slw use strict; use Time::HiRes qw[ time ]; use constant { XMAX => 4587, YMAX => 2889, REC_SIZE => 7 * 4, }; my( $year, $day, $xRange, $yRange ) = @ARGV; my( $xStart, $xEnd ) = split ':', $xRange $xEnd += $xStart; my( $yStart, $yEnd ) = split ':', $yRange; $yEnd += $yStart; my $start = time; open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!; binmode BIN; my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE; for my $y ( $yStart .. $yEnd ) { my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE; seek BIN, $pos, 0; my $read = sysread( BIN, my $rec, $xLen ) or die $!; my @recs = unpack '(A28)*', $rec; for my $x ( $xStart .. $xEnd ) { my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $x +Start ]; printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n", $year, $day, $a, $b, $c, $d, $e, $f, $g//0; } } close BIN; printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Strategy for managing a very large database with Perl
by Fletch (Chancellor) on Jun 18, 2010 at 14:06 UTC

    TMTOWTDI, and sometimes they don't involve Perl: this size data set is getting into the realm where something like Pig might be more suitable.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Strategy for managing a very large database with Perl
by metaperl (Curate) on Jun 18, 2010 at 16:18 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://845309]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (17)
As of 2014-10-23 16:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (126 votes), past polls