[SOLVED]finding repetitions

by baxy77bax (Deacon)
on Dec 10, 2014 at 23:19 UTC [id://1109981]

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

The situation is the following: memory is low and the data is large. What I need is an algorithm to locate repeated entries with a prespecified cutoff (let's say 5 repeats) in a file that contains 70,000,000 string entries (file example:

ent1 up to 100 characters..
ent2 up to 100 characters..
ent3 up to 100 characters..
ent2 up to 100 characters..
ent7 up to 100 characters..
ent3 up to 100 characters..
ent5 up to 100 characters..
ent5 up to 100 characters..
ent2 up to 100 characters..
..
) I only have approx. 500MB of RAM at my disposal (other things are running and I cannot afford to swap). Does anyone have a suggestion?

b

PS: I don't care if it is slow, as long as it can be processed within 6 hours.

Replies are listed 'Best First'.
Re: finding repetitions (sliding window)
by LanX (Saint) on Dec 10, 2014 at 23:30 UTC
      so you're saying I should sort the file and create a 5-line sliding window to locate my 5 repeats... ok, this sounds so trivial... thank you. I guess I am too tired to notice the solution...

      thnx
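
      A minimal sketch of that approach in Perl, assuming GNU sort is on the PATH, whole lines are the unit of comparison, and a hypothetical input file named entries.txt:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $cutoff = 5;    # report entries that occur at least this often

        # Let the external sort do the heavy lifting; it spills to disk
        # rather than holding the whole file in RAM.
        open my $sorted, '-|', 'sort', 'entries.txt'
            or die "cannot run sort: $!";

        my ($prev, $count) = (undef, 0);
        while (my $line = <$sorted>) {
            chomp $line;
            if (defined $prev && $line eq $prev) {
                $count++;
            }
            else {
                print "$prev\t$count\n" if defined $prev && $count >= $cutoff;
                ($prev, $count) = ($line, 1);
            }
        }
        print "$prev\t$count\n" if defined $prev && $count >= $cutoff;
        close $sorted or die "sort exited abnormally: $!";

      Only the previous line and a counter are held in memory, so the Perl side stays tiny no matter how big the file is; the RAM pressure all lives inside the external sort.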

Re: [SOLVED]finding repetitions
by sundialsvc4 (Abbot) on Dec 11, 2014 at 02:15 UTC

    Is the 70-million record file presently sorted? If so, then all repeating entries are forced to be adjacent to one another, and you can find repeats of arbitrary size just by comparing adjacent records. (You can find all of the repetitions in the entire sorted file in one sequential pass through it, considering only two adjacent records at any time. The “window size,” so to speak, is two.) This strategy is very commonly used, for example, in statistical analysis work, e.g. SPSS or SAS.

    I just finished up an SPSS job that involved a few dozen million records, and it had plenty of logic that consisted of SORT CASES BY followed by use of the LAG() function (which, given n, retrieves a field’s value from n cases ago; there is no function that looks ahead). The other procedures, including the ubiquitous MATCH FILES, SPLIT FILE, and descriptive statistics, also require sort order. (For various somewhat political reasons, neither databases nor indexed files could be used there.) SPSS’s sorter is very “smart,” as most sorters are, and the main price paid by the application was that of sorting; all other operations could proceed in purely linear time. The whole process ... sorts, analysis, the whole #!shebang ... could run through in a matter of minutes on a decent box. Hence, what I said above about “statistical analysis” work comes from hands-on professional experience. SAS behaves the same way, as does R.
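
    As an aside, a LAG()-style look-back is easy to emulate in Perl with a small buffer, and on a sorted stream it reproduces the five-repeat test directly: if the current line equals the line from four lines back, there are at least five identical lines in a row. An illustrative analogue (not SPSS code), reading sorted lines from STDIN:

        use strict;
        use warnings;

        my $lag = 4;    # LAG(4): compare against the line four back
        my @window;     # holds the last $lag lines seen

        while (my $line = <STDIN>) {
            chomp $line;
            # On sorted input, matching the line four back implies at
            # least five identical lines in a row (longer runs will be
            # reported once per extra repeat).
            print "$line repeats 5+ times\n"
                if @window == $lag && $line eq $window[0];
            push @window, $line;
            shift @window if @window > $lag;
        }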

    The number of records you are dealing with here is fairly routine for a disk-based sorting algorithm or command; I’d expect it to take a few minutes at most. Most sort engines will make opportunistic use of RAM but allow the RAM consumption to be controlled. They do produce a fairly intense disk-I/O load, however, which could present issues given that you say you “can’t afford to swap.” (Although 70 million modest-sized records, on a 500MB-RAM machine, might be “mostly sorted in RAM” anyway.)
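
    If the sorter’s appetite needs pinning down, GNU sort exposes knobs for exactly that; a hedged variation on a pipe-open like the one sketched earlier, noting that -S (buffer size) and -T (temporary directory) are GNU extensions and entries.txt is again a stand-in name:

        # Cap the sort buffer well under the 500MB ceiling and send
        # its spill files to a disk with room to spare.
        open my $sorted, '-|',
            'sort', '-S', '200M', '-T', '/var/tmp', 'entries.txt'
            or die "cannot run sort: $!";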

    Obviously, there are trade-offs. The I/O footprint can be large, which can turn your app into a thousand-pound elephant of a “batch job” on an interactive machine. (Some sorters automatically lower the thread’s priority during the sort, to signal to the operating system that the program’s runtime characteristics have changed.) If you can’t afford to swap, maybe you also can’t afford to sort. However, if you are doing a number of different operations on the file which can’t readily be done all at once and which would benefit from sorting, or if you need to match files and can’t use a database, and so on, this is a well-worn and practical way to go.

    Now, one more thought here ... the most advantageous way to arrange things is to prescribe that all of the input files to a particular process must be sorted in the same way, and to arrange a scheme of merge and selection steps, all of which require that ordering (and which explicitly check for it), such that you never have to do the sorting yourself during the job. This was far more common back when spools of magnetized plastic were the order of the day, but it is still blisteringly fast and miserly with memory. The operating system quickly figures out (or you tell it) that you are doing strictly sequential file access, so it starts buffering everything advantageously, and ... well, hang on to your hat.
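
    Such an explicit order check costs just one comparison per record; a minimal sketch, assuming plain lexicographic order on whole lines:

        use strict;
        use warnings;

        # Refuse to proceed if the supposedly-sorted input is not.
        my $prev;
        while (my $line = <STDIN>) {
            chomp $line;
            die "input out of order at line $.\n"
                if defined $prev && $line lt $prev;
            # ... merge/selection logic would go here ...
            $prev = $line;
        }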
