Possible faster way to do this?

by Anonymous Monk
on Jun 25, 2019 at 08:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!
I have a very large file, ~5 TB in size, with 95 tab-separated columns. I need to load this into a DB, so, in order to work out the column specifications I should use, I need to know what type of data each column holds. What I do so far is cut out each column and then make its contents unique:
cut -f1 my_big_file > column1
sort -u column1 > column1.uniq

and then study each of them. But the cut alone has been running for about 2 days now and is still not done, so can you suggest a faster way perhaps? Maybe with a hash? But will a hash be able to handle such a large file?
Thank you!

Replies are listed 'Best First'.
Re: Possible faster way to do this?
by Eily (Monsignor) on Jun 25, 2019 at 08:42 UTC

    Hello Anonymous Monk.

    Your code is not Perl; it's calls to the cut and sort programs, probably from a shell script.

    But, will a hash be able to handle such a large file?
    Probably not: unless you have very few unique values, you are going to end up with several GB of data in your hash. But do you actually need all the values? If you're just trying to get the type of each column, you only need to keep a few values per column:
    • whether a value that can't be encoded as a number has been seen for this column
    • the max length (in characters) of this column
    • whether a non-integer value has been seen for this column
    • the min and max values for this column (unless you already know that it's a string)

    In Perl that might look something like this (untested, since it's barely more than pseudo-code):

    use strict;
    use warnings;
    use Scalar::Util 'looks_like_number';

    my @result;
    open my $file, "<", "my_big_file" or die "Can't open my_big_file: $!";
    while (<$file>) {
        chomp;
        my @data = split "\t", $_, -1;   # -1 keeps trailing empty fields
        for my $col (0..$#data) {
            my $len = length($data[$col]);
            if (!looks_like_number($data[$col])) {
                $result[$col]{is_a_string} = 1;
            }
            $result[$col]{max_length} = $len
                if !defined $result[$col]{max_length} or $len > $result[$col]{max_length};
            next if $result[$col]{is_a_string};   # Don't check number range if this is a string
            # Check number range here ...
            # Other stuff ...
        }
    }
    You might notice the use of looks_like_number from Scalar::Util to help with detecting numbers. Of course, that's if you actually want to use Perl, rather than whatever language your script is currently being executed as.

Re: Possible faster way to do this?
by Corion (Patriarch) on Jun 25, 2019 at 08:44 UTC

    While I wrote SQL::Type::Guess, it really, really wants to have all the data in memory already, which is asking a bit much for a 5 TB file.

    If you are content with a quick approach, consider taking either the first 100 GB or a random sampling of rows from your input file(s) and using these with SQL::Type::Guess to determine good types for your columns.

    Alternatively, a more "manual" approach of reading the file line by line and feeding the information to SQL::Type::Guess could also work:

    while (my $hashref = read_next_row_from_file_as_hashref( $fh )) {
        $sqltypes->guess( $hashref );
    }
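    As a concrete (untested) sketch of that loop, combined with the random-sampling idea: it assumes a tab-separated file whose first line holds the column names, and it relies on SQL::Type::Guess accepting one hashref per guess() call as shown above; the file name, table name and sampling rate are placeholders.

    use strict;
    use warnings;
    use SQL::Type::Guess;

    my $guesser = SQL::Type::Guess->new();

    open my $fh, '<', 'my_big_file' or die "Can't open my_big_file: $!";

    # Assumption: the first line holds column names; otherwise invent col1..col95
    chomp( my $header = <$fh> );
    my @names = split /\t/, $header;

    while (my $line = <$fh>) {
        next unless rand() < 0.001;            # crude random sampling, ~0.1% of rows
        chomp $line;
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        my %row;
        @row{@names} = @fields;
        $guesser->guess( \%row );
    }
    close $fh;

    # as_sql() should emit a CREATE TABLE statement; check the SQL::Type::Guess
    # documentation for the exact output methods available
    print $guesser->as_sql( table => 'my_big_table' ), "\n";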

    Have you looked at how long it takes to read the 5TB file without doing any processing? Maybe two days isn't all that bad.
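    For a rough baseline, something like this (the file name is a placeholder) just reads and discards every line and reports elapsed time and throughput:

    use strict;
    use warnings;
    use Time::HiRes 'time';

    my $file = 'my_big_file';
    my $t0   = time;
    my ($lines, $bytes) = (0, 0);

    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        $lines++;
        $bytes += length $line;
    }
    close $fh;

    my $elapsed = time - $t0;
    printf "%d lines, %.1f GB in %.0f s (%.1f MB/s)\n",
        $lines, $bytes / 2**30, $elapsed, $bytes / 2**20 / $elapsed;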

      Maybe two days isn't all that bad.
      When it's for just one column out of a total of 95, my guess would be that this is pretty bad :P. But rather than the 5 TB file itself, it's the sorting and searching for unique values in ~50 GB (~5 TB / 95) per column that I'd be most worried about personally.

Re: Possible faster way to do this?
by hippo (Bishop) on Jun 25, 2019 at 10:32 UTC
    I need to know what type of data each column has.

    Ask the author of the file. Seriously, why second-guess the formats when you can have the definitive answer, and probably in less time, too?

Re: Possible faster way to do this?
by bliako (Monsignor) on Jun 25, 2019 at 09:59 UTC
    sort -u column1 > column1.uniq

    Ouch!

    You are right to seek a hash-based solution, if only to eliminate sorting as the way to achieve uniqueness, which sounds like a poor man's dinner at a fat industrialist's gala!

    IMPORTANT EDIT: I was too fast to croak, sorry: uniq only filters out repeated adjacent lines. It seems you have to write your own...

    Which, btw, can be achieved by GNU's uniq which - my guess - is the most efficient way to do that.

    Thank you Richard Stallman:
    cut -f1 my_big_file | uniq > column1.uniq

    Edit 2: just to remedy my previous gaffe, here are a Perl and a C++ uniq which work without the "adjacent lines" restriction; they read only from STDIN. Edit 3: added an awk solution too:

    perl -ne 'chomp; $uniq{$_}++; END{print $_ . " => " . $uniq{$_} . "\n" for keys %uniq}'

    or

    perl -lne '$uniq{$_}++; END{print $_ . " => " . $uniq{$_} for keys %uniq}'

    using awk

    awk -F'\t' -vN=1 '{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'
    /* reads whole lines from stdin and counts how many times each was repeated,
       overall and NOT "adjacently"
       by bliako, 25.06.2019
       for https://perlmonks.org/?displaytype=edit;node_id=11101856
       compile with: g++ -O puniq.cpp -o puniq
    */
    #include <unordered_map>
    #include <iostream>
    #include <string>

    using namespace std;

    int main(void){
        unordered_map<string,int> uniq;
        string aline;
        while(getline(cin, aline)){
            if( uniq.count(aline) == 0 ) uniq[aline] = 1;
            else uniq[aline]++;
        }
        for(unordered_map<string,int>::iterator itr=uniq.begin(); itr!=uniq.end(); itr++){
            cout << itr->first << " => " << itr->second << endl;
        }
        return 0;
    }

    End Edit 2

    But Eily is right to suspect that, depending on your data, you may end up with a hash of several GB - all the values as unique keys! And I love Corion's idea of random sampling, except that it can never be 100% accurate. Especially if this is yet another of those bioinformatics projects where you have quite a few instrument errors causing outliers, impossible values and NAs. But the approach of making an educated guess based on a few samples and then finding out whether it is *mostly* correct through further random sampling could work. The key is the percentage of disagreements and whether you can afford to throw a whole row away just because it is too expensive to re-work your assumptions. Beware that even 1% discards per column can eliminate up to 95% of the rows of 95-column data (with independent errors you would expect to lose about 1 - 0.99^95 ≈ 61% of the rows; 95% is the worst case, where no two bad values ever land on the same row).

    And there is the parallelise way, which may, just may, be able to overcome the bottleneck of N threads reading sequentially from the same file on disk, provided that the data chunk per thread is large enough so that overall you gain. Each thread reads a chunk of the data and processes it; then, when all threads are done, re-unique their combined output, which will hopefully be much smaller.

    Edit 3: how to parallelise on the GNU bash shell - I will use the awk-based solution, but any of the other solutions combined with cut will also do:

    Ncols=95
    Nthreads=4
    for acol in $(seq 1 ${Ncols}); do
        echo "awk -F'\t' -vN=${acol} '"'{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'"' my_big_file > column${acol}.uniq" > tmpscript${acol}.sh
        echo doing column $acol
        sem -j${Nthreads} sh tmpscript${acol}.sh
    done

    End Edit3

    Storing the data in one solid file vs. splitting it into smaller chunks/files: splitting will benefit the parallelised way a little bit, and a lot if those chunks are stored on physically separate disks.

    An additional possible problem: col1's type could be integer or string depending on the value of col2. So give us some sample data.

    bw, bliako

      Which, btw, can be achieved by GNU's uniq which - my guess - is the most efficient way to do that.

      That isn't what uniq does. It removes adjacent duplicate lines only. This is why everyone sorts before running uniq.

      $ echo -e "foo\nbar\nfoo" | uniq
      foo
      bar
      foo
      $ echo -e "foo\nbar\nfoo" | sort | uniq
      bar
      foo
      $

        yep, sorry for spreading the wrong information, I have edited my answer (and added a Perl and a C++ proper uniq).

      Hi all, many interesting points here!
      Corion, I do like your approach and will probably use it at some point; however, in order to create the tables correctly, I would need ALL of the unique values, in order to see whether e.g. VARCHAR(15) is enough, or whether 10,000 rows out of the hundreds of millions in this big file need e.g. VARCHAR(16). So, as I see it, there is not really much I can do, except use uniq directly, without sorting...
      The values vary a lot but, as correctly pointed out, they are biology-related data (it is a bioinformatics project), so some of them are e.g. chromosome positions (so, only numbers), others are text, etc. And most of them are repeated, of course. However, I really need to know all the unique values before I can design my table schema.
        I don't know why it is necessary to find all of the unique values for a particular column in order to get started.

        One thought would be to build a histogram of the number of chars in each field of a row. This would spot weirdo outlier cases where a column is almost always 12 chars but very rarely 112 chars. This might also help with CHAR(x) versus VARCHAR(x) decisions.
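        A minimal sketch of such a per-column length histogram, assuming a tab-separated file called my_big_file (the file name and the output format are just placeholders):

        use strict;
        use warnings;

        my @hist;    # $hist[$col]{$length} = number of rows with that field length

        open my $fh, '<', 'my_big_file' or die "Can't open my_big_file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
            $hist[$_]{ length $fields[$_] }++ for 0 .. $#fields;
        }
        close $fh;

        for my $col (0 .. $#hist) {
            print "column ", $col + 1, ":\n";
            printf "  length %4d : %d rows\n", $_, $hist[$col]{$_}
                for sort { $a <=> $b } keys %{ $hist[$col] };
        }

        A quick look at the tail of each histogram shows whether a rare long value forces a wider VARCHAR or whether a fixed-width CHAR would do.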

        This is going to be a big DB and I assume that you have an appropriate DB engine for it. It might be interesting to know the target DB?

        I would also consider an iterative approach... Write a one-pass analyzer program as I mentioned. Create a DB based upon that info and your judgement. Then, if you want to know the unique values for a column, ask the DB! In other words, scan the data in one pass to make "good enough" judgments for an initial DB. Then query this first DB to help you find the best schema for the final DB. The DB will be more efficient than Perl at finding uniques, and a fancy DB can use multiple cores to speed things up.
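        For the "ask the DB" step, a rough DBI sketch; the DSN, table name and column names are placeholders, and the SQL is generic enough that it may need tweaking for the actual engine:

        use strict;
        use warnings;
        use DBI;

        # Hypothetical connection to the staging database holding the raw import
        my $dbh = DBI->connect('dbi:Pg:dbname=staging', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });

        for my $col (qw(col1 col2 col3)) {    # placeholder column names
            my ($max_len)  = $dbh->selectrow_array(
                "SELECT MAX(LENGTH($col)) FROM raw_import");
            my ($distinct) = $dbh->selectrow_array(
                "SELECT COUNT(DISTINCT $col) FROM raw_import");
            print "$col: max length $max_len, $distinct distinct values\n";
        }

        $dbh->disconnect;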

        Update:
        Figuring out the optimal data types for your specific DB can be difficult if you want to minimize storage (and I guess you do want that? - but note that sometimes minimal storage means less-than-optimal query speed).
        For example MySQL has many data types: MySQL data types.

        Two days to analyze one column from 95 is way outside the boundaries of what is needed. It is difficult for me to believe that you don't know what most of these columns represent. You should be able to figure out a reasonable data type for most of the columns without even seeing any of the actual data.

        I really need to know all the unique values before I can design my table schema.
        No, you don't need to know all the unique values. You only need to know a few pieces of metadata describing the kind of data in each column (a sketch of turning this metadata into type declarations follows below), such as:
        • the length of the longest value (so you know whether VARCHAR(15) is enough or if one or more values require VARCHAR(16))
        • possibly the length of the shortest value (if you want to try to use CHAR instead of VARCHAR where possible)
        • a flag for whether the column is numeric-only (if you want to use numeric types instead of VARCHAR where possible)
        • for numeric columns, the minimum/maximum values (so you can choose a numeric type with an appropriate range)

        The only scenario which would require you to know all possible values for each column would be if you were intending to define each column's type as an enum, which I would consider to fall under the "pathological database designs" heading.
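        Building on the list above, here is a minimal sketch of turning such per-column metadata into type declarations. It assumes a per-column stats hash like the one sketched in Eily's pseudo-code earlier in the thread; the field names, thresholds and resulting SQL types are illustrative assumptions, not a definitive mapping:

        use strict;
        use warnings;

        # Assumed per-column stats collected in a single pass over the file:
        #   is_a_string, is_integer, max_length, min, max
        sub column_type {
            my ($stats) = @_;
            return "VARCHAR($stats->{max_length})" if $stats->{is_a_string};
            if ($stats->{is_integer}) {
                # Pick the narrowest integer type whose range covers min..max
                return "SMALLINT" if $stats->{min} >= -32768      && $stats->{max} <= 32767;
                return "INTEGER"  if $stats->{min} >= -2147483648 && $stats->{max} <= 2147483647;
                return "BIGINT";
            }
            return "DOUBLE PRECISION";
        }

        # Example with made-up stats for a numeric position column
        print column_type({ is_a_string => 0, is_integer => 1, min => 1, max => 248956422 }), "\n";
        # prints: INTEGER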

        Sorry, GNU's uniq just filters out adjacent repeated lines. See the edits in my first post, Re: Possible faster way to do this?, for a Perl one-liner and a C++ uniq command with the "proper" functionality using hash maps.

        bw, bliako

Re: Possible faster way to do this?
by QM (Parson) on Jun 26, 2019 at 10:41 UTC
    A tool I used just looked at the first 20 rows, and guessed from that. It was interactive, so you could change the type for each column at time of ingestion. You could also change the number of rows to preread.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Possible faster way to do this?
by daxim (Curate) on Jun 26, 2019 at 07:15 UTC
    in order to find out the column specifications I should use, I need to know what type of data each column has
    You don't need to figure that out up-front! Load all the data into a database right now, with every field provisionally of type text or blob; that means the data passes into the database completely losslessly and unmolested.

    Next, use database queries and other features to figure out what the data in each column looks like; the database can handle large amounts of data much better than you fumbling around with command-line tools.

    After you have made a conservative guess at the accurate type of a column, alter the column type. If you were wrong and a row's datum does not meet the constraints of the new type, then the conversion will throw an error and abort the type change. You can then modify your guess and try again.
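    A minimal sketch of that tighten-and-retry step via DBI; the DSN, table name, column names and guessed types are placeholders, and the ALTER syntax shown is PostgreSQL-flavoured (other engines spell it differently):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=staging', 'user', 'password',
                           { RaiseError => 1, PrintError => 0, AutoCommit => 1 });

    # Guessed tighter types for a few columns of the provisional all-text table
    my %guess = (
        position => 'BIGINT',
        quality  => 'DOUBLE PRECISION',
    );

    for my $col (sort keys %guess) {
        my $sql = sprintf 'ALTER TABLE raw_import ALTER COLUMN %s TYPE %s USING %s::%s',
                          $col, $guess{$col}, $col, $guess{$col};
        if ( eval { $dbh->do($sql); 1 } ) {
            print "$col converted to $guess{$col}\n";
        }
        else {
            print "$col stays text for now (conversion failed: $@)\n";
        }
    }

    $dbh->disconnect;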

      Good luck running (multiple) ALTER TABLE statements against a 5 TB database.


      holli

      You can lead your users to water, but alas, you cannot drown them.
      There's a lot of ambiguity here, so I would second the notion of just throwing it all into a database to analyze. I imagine even indexing will take patience, so maybe make one table per column so that the initial data-type determinations are standalone. Then dump it into the final DB spec when the decisions are made. Normalizing it may dictate having multiple tables at the end of the day after all, and the dump/restore might benefit from having the columns separated at that point as well.
