PerlMonks  

Biggest file?

by BrowserUk (Pope)
on Dec 17, 2011 at 10:46 UTC ( #944061=perlquestion )
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

What is the biggest file that you routinely manipulate with Perl?

Ideally, I'd like the following information:

  • Typical size.
  • Realistically anticipated maximum size.
  • OS used.
  • Filesystem used.
  • A generic description of the contents.
  • A rough description of the format of the contents.
    • I.e. fixed- or variable-length records (approx. sizes min/avg/max?).
    • Binary or text.
    • Freeform (e.g. XML or FASTA).
    • Fixed form (e.g. image or music).

I'm writing a file handling utility and I'd like to get a feel for what size and format files people are actually manipulating.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Comment on Biggest file?
Re: Biggest file?
by Tux (Monsignor) on Dec 17, 2011 at 12:14 UTC

    For bulk file work, avg about 8 MB, which is not that big. For backup work (tar/rsync/scp etc.) up to 600 GB per file. The biggest "transactions" are however database work: up to 80_000_000 record traversals per process, several times a day.

    Linux (ext3 and ext4), HP-UX (jfs) and AIX (jfs) and several NFS based processes (small files).

    Data files are mostly plain CSV, or CSV file(s) inside a ZIP. I'm not doing a lot of XML; binary, HTML and other formats occasionally. CSV is not fixed-length, but some binary files are (though we are pushing the organizations that give us those to move to CSV/UTF-8).


    Enjoy, Have FUN! H.Merijn
Re: Biggest file?
by keszler (Priest) on Dec 17, 2011 at 12:20 UTC

    It depends on what the meaning of manipulate is.

    Manipulate: create by issuing commands to other programs (i.e. DB), copy/move/delete (including across networks), etc. without opening/reading.

    • Typical size: 4GB
    • maximum size: 8GB
    • OS used: Solaris 10, CentOS 5
    • Filesystem used: UFS, NFS, Ext3
    • description: database/application/custom_compiled_code backups
    • format: binary

    Manipulate: read/write

    • Typical size: 20MB
    • maximum size: 100MB
    • OS used: Solaris 10
    • Filesystem used: UFS
    • description: Proviso data
    • format: variable-length text records w/ default separator '|_|', 1/20/100 chars

Re: Biggest file?
by erix (Priest) on Dec 17, 2011 at 12:23 UTC

    1. The UniProt (= SwissProt + TrEMBL) protein info database, updated monthly. We load these datafiles into a database. Uniprot.org also makes this data available in XML form (same URL as below), but I find those files too large to download/handle/process. The (smaller) .dat files are regular text files:

                 URL   max size   OS     fs    description   length    format
    --------------------------------------------------------------------------
    Swiss-Prot  (1)     2.4 GB    linux  ext3  protein info  variable  free text (multiline)
    Trembl      (2)    47.5 GB    linux  ext3  protein info  variable  free text (multiline)
    
    
      (1): Swiss-Prot (curated data): uniprot_sprot.dat
      (2): Trembl (uncurated data): uniprot_trembl.dat
    
    

    Uniprot grows pretty fast too: see the graphs on the SwissProt and TrEMBL stats pages.

    2. Sometimes it's necessary to munge a database dump (in text form). They can be 100s of GB.

    3. Semi-continuously processed data-files vary from tiny to 1 GB (xml+csv, linux).
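    Records in the UniProt flat-file (.dat) format are terminated by a line containing just "//", which makes them a natural fit for Perl's input record separator. A minimal sketch (the ID-line pattern is a simplification of the real format) for streaming such multiline records one at a time, so even the 47.5 GB Trembl file never has to fit in memory:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream a UniProt-style .dat file one record at a time.
# Records in the flat-file format are terminated by a line
# containing just "//", so we use that as the input record
# separator instead of slurping the whole (multi-GB) file.
sub count_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    local $/ = "//\n";               # one record per read
    my $count = 0;
    while ( my $record = <$fh> ) {
        # The first line of each record is the ID line, e.g. "ID   NAME ..."
        $count++ if $record =~ /^ID\s+\S+/m;
    }
    close $fh;
    return $count;
}
```

    Because $/ is localized, the changed record separator cannot leak into the rest of the program.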

Re: Biggest file?
by Corion (Pope) on Dec 17, 2011 at 13:28 UTC

    My scripts routinely process files sized 1GB (gzip compressed). The content is either 28 column CSV (ca. 250 bytes per line) or 500 column fixed width (ca. 2k bytes per record) transaction data. Both types get converted to tab separated output plus two administrative columns and then bulk loaded into database tables, as the bulk loader does not like to talk to a fifo or pipe, unfortunately.

    The content of the files is ASCII text, all packed decimals for the fixed width files have already been decoded to numbers.
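    The fixed-width-to-tab-separated conversion described above maps naturally onto Perl's unpack. A sketch with an invented 3-field template (the real records are 500 columns wide, and the two administrative columns here are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one fixed-width record to a tab-separated line,
# appending two administrative columns (here: source file and
# line number -- stand-ins for whatever the bulk loader expects).
my $TEMPLATE = 'A10 A8 A12';    # hypothetical field widths

sub record_to_tsv {
    my ( $record, $source, $lineno ) = @_;
    # The 'A' template strips trailing spaces from each field
    my @fields = unpack $TEMPLATE, $record;
    return join "\t", @fields, $source, $lineno;
}
```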

Re: Biggest file?
by roboticus (Canon) on Dec 17, 2011 at 13:30 UTC

    BrowserUk:

    In my current job, not much. (Currently I write database query tools, reports & monitors.) In my previous job:

    800MB .. 4GB

    6GB

    Windows XP, Windows Server 2003

    NTFS

    Credit card transaction information, customer billing information.

    A mixture of:

    • Flat files, fixed format: one line per record, fixed-width fields. 128-3200 bytes per record. Usually ASCII text or EBCDIC text + binary COMP fields.
    • Flat files, delimited format: one line per record. 64-400(ish) bytes per record. Usually ASCII text.
    • Hierarchical files, fixed format: like several flat file fixed format files shuffled together: Usually 128, 256 bytes per record. EBCDIC+binary COMP fields.
      • Merchant record
      • Chain record
      • Store record
      • Transaction record
    • Various "printer report" files (mainframe printer files, each line with a prefix for carriage control & such).

    I have a small collection of utilities I use to crank through 'em. For example, for some printer report files, I have a program that accepts an excel spreadsheet and it creates a C program to parse it and reformat it to a fixed format for importing into excel or a database. I also have a few programs that analyze files to help determine their contents and format.
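    Decoding the EBCDIC-plus-binary-COMP records described above is straightforward in Perl: the core Encode module ships the common EBCDIC code pages (e.g. cp1047), and big-endian COMP fields unpack with 'N'. The 24-byte layout below is invented for illustration; real layouts come from the COBOL copybook:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Parse one hypothetical 24-byte mainframe record:
#   bytes  0-19 : name, EBCDIC text (code page 1047)
#   bytes 20-23 : amount, 32-bit unsigned big-endian (COBOL COMP)
sub parse_record {
    my ($raw) = @_;
    my ( $name_raw, $amount ) = unpack 'a20 N', $raw;
    my $name = decode( 'cp1047', $name_raw );
    $name =~ s/\s+\z//;              # strip space padding
    return ( $name, $amount );
}
```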

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Biggest file?
by wazoox (Prior) on Dec 17, 2011 at 14:09 UTC

    A few years back I wrote a script to scan a broken filesystem for video files by combing through the raw disk device. The RAID was 13 TB, and the individual files saved went from a couple GB to 50 or 60 GB.

    Else I wrote a few disk benchmark scripts that work fine indeed... Perl moves data around more than fast enough to saturate the speediest storage.

    • OS: Linux
    • File size: from 8 to 250 GB
    • Filesystem: XFS, JFS, PVFS2
    • Filesystem size: 8 to 250 TB
    • Content: video, various binary data
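
    A minimal sketch of the kind of benchmark described above: sequential writes via syswrite (which bypasses PerlIO buffering) timed with the core Time::HiRes module. Block count and size are arbitrary; for an honest disk number you would also fsync and write considerably more than RAM:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Time sequential writes of $blocks blocks of $blocksize bytes
# and return throughput in MB/s. syswrite bypasses PerlIO
# buffering, so we measure the OS/disk path, not Perl's buffers.
sub write_throughput {
    my ( $path, $blocksize, $blocks ) = @_;
    my $buf = "\0" x $blocksize;
    open my $fh, '>', $path or die "open $path: $!";
    my $start = time;
    for ( 1 .. $blocks ) {
        syswrite( $fh, $buf ) == $blocksize or die "short write: $!";
    }
    close $fh;    # note: data may still sit in the OS page cache
    my $elapsed = time - $start;
    return ( $blocksize * $blocks ) / ( 1024**2 ) / ( $elapsed || 1e-9 );
}
```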
Re: Biggest file?
by TJPride (Pilgrim) on Dec 17, 2011 at 16:25 UTC
    I work for a firm that does marketing for 90+ car dealerships around the US. A big part of what we do to prep for data mining is standardizing and importing data from assorted dealership systems, some archaic and some not (ADP, R&R, Advent, Arkona, Quorum, Scorekeeper, etc.), and this sometimes requires processing service files of up to 200-300 MB with hundreds of thousands of records. Theoretical maximum could be even larger. Input format might be CSV or more of a vertical text format (key value), depending on how we're acquiring the data, but it's always text and never fixed-length. We use custom Perl scripts / mySQL for the most part, and we recently upgraded to a pretty fast server with 4 GB RAM (Cari.net, their pricing and service is pretty good and we also had our previous server there). OS is of course some popular Unix variant that I forget.

    EDIT: We also import sales, leases, and a variety of other stuff, but the service file is just the largest part of that. I imagine the databases in uncompressed form could run upwards of 500 MB to a GB each over time.
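    A "vertical text format (key value)" like the one described above, with blank lines between records, can often be parsed with Perl's paragraph mode. A sketch, with invented field names (real feeds vary per dealership system):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a "vertical" key/value file where each record is a block
# of "Key: value" lines separated by blank lines.
sub read_vertical {
    my ($path) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    local $/ = '';                  # paragraph mode: one blank-line-
    my @records;                    # separated block per read
    while ( my $block = <$fh> ) {
        my %rec = $block =~ /^([^:\n]+):\s*(.*)$/mg;
        push @records, \%rec if %rec;
    }
    close $fh;
    return \@records;
}
```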

Re: Biggest file?
by choroba (Abbot) on Dec 17, 2011 at 21:54 UTC
    • Typical size: 1MB
    • Maxsize: 10MB
    • OS: linux
    • Filesystem: various (ext3, reiser, xfs)
    • Contents: graphs with spanning trees, each vertex and edge can be assigned a structure. Used to represent syntax and meaning of sentences.
    • Format: XML, sometimes gzipped.
    But, of course, sometimes I work with whatever else that drops on me.
Re: Biggest file?
by mbethke (Hermit) on Dec 17, 2011 at 23:57 UTC
    Currently it's mostly little bits of nothing, <10 MB of XML or rarely more than 50 MB of log files. In my previous job I did a lot of log file analysis for a major ISP/web hoster where mail and FTP server logs measured 500-800 MB from just a couple of hours on one box, of which they had a couple of hundred.
Re: Biggest file?
by CountZero (Chancellor) on Dec 18, 2011 at 13:33 UTC
    CSV and formatted Excel spreadsheet files of up to 5 MB. All containing insurance claims data and each insurance company uses its own format. The files get parsed by some Perl-scripts into a standard format which goes to the database.

    Otherwise a variety of small Excel spreadsheets (a few hundred rows at the most) too small and too much prone to change to warrant a proper database to be made for it: the data in these spreadsheets is used to produce insurance certificates, extracts of cover, ... thanks to Template::Toolkit and LaTeX.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Biggest file? (Conclusion?)
by BrowserUk (Pope) on Dec 19, 2011 at 13:37 UTC

    On the basis of the replies so far (many thanks to all respondents): a file handling utility that catered for files up to 256 terabytes, and individual lines and records up to 64k, would likely cater for most people's everyday requirements?


      Uh, sounds like it would be massive overkill for "most people's everyday requirements". :-)

        sounds like it would be massive overkill for "most people's everyday requirements".

        Maybe, but once you go bigger than 4GB, you have to start dealing with 64-bit integers, which at 16 million TB is really overkill :)

        So, since I also need to keep track of the length of each record/line, I figured that using the lower 48 bits for offsets (256 TB max) and the upper 16 bits for the length (64k) means that I can manipulate 'record descriptors' which are 64 bits each.

        Not only are these easily manipulated as 'integers', they are also a cache friendly size which might also yield some performance benefits.

        In an ideal world, the split point would be a runtime option which might allow (say) dealing with genomic stuff where individual sequences can be substantially bigger than 64k; but overall file sizes tend to be much smaller. But I cannot see an easy way to make that decision at runtime.
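
        The descriptor scheme described above is a few lines on a 64-bit perl: the 16-bit length goes in the top of the integer, the 48-bit offset in the bottom. A sketch (names are mine, not from any actual utility):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 64-bit record descriptors as described above: low 48 bits hold
# the file offset (max 256 TB), high 16 bits the record length
# (max 64 KB). Requires a perl built with 64-bit integers.
use constant OFFSET_BITS => 48;
use constant OFFSET_MASK => ( 1 << OFFSET_BITS ) - 1;

sub make_desc {
    my ( $offset, $length ) = @_;
    die "offset exceeds 48 bits" if $offset > OFFSET_MASK;
    die "length exceeds 16 bits" if $length > 0xFFFF;
    return ( $length << OFFSET_BITS ) | $offset;
}

sub split_desc {
    my ($desc) = @_;
    return ( $desc & OFFSET_MASK, $desc >> OFFSET_BITS );
}
```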

Node Type: perlquestion [id://944061]
Approved by Old_Gray_Bear