So you're saying I should sort the file and create a 5-line sliding window to locate my 5 repeats... OK, this sounds so trivial... thank you. I guess I am too tired to notice the solution... thanks.
Is the 70-million-record file presently sorted? If so, then all repeated entries are necessarily adjacent to one another, and you can find repeats of arbitrary size just by comparing adjacent records. (You can find all of the repetitions in the entire sorted file in one sequential pass, considering only two adjacent records at any time; the “window size,” so to speak, is two.) This strategy is very common in, for example, statistical-analysis work with tools such as SPSS or SAS.
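For instance, here is a minimal Perl sketch of that one-pass scan, assuming the file is already sorted, one record per line, and that a “repeat” means five or more identical consecutive records (the threshold and output format are purely illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One sequential pass over an already-sorted file: identical records are
    # adjacent, so a simple run counter is the only "window" required.
    my $threshold = 5;               # report records occurring at least this often
    my ($prev, $count) = (undef, 0);

    while (defined(my $line = <>)) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $count++;
        }
        else {
            print "$prev\toccurs $count times\n"
                if defined $prev && $count >= $threshold;
            $count = 1;
        }
        $prev = $line;
    }
    # Don't forget the final run at end-of-file.
    print "$prev\toccurs $count times\n"
        if defined $prev && $count >= $threshold;

You can run it directly against an already-sorted file, or pipe the output of a sort straight into it.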
I just finished up an SPSS job that involved a few dozen million records, and it had plenty of logic that consisted of SORT CASES BY followed by use of the LAG() function (which, given n, retrieves a field’s value from n cases ago; there is no function that looks ahead). The other operations, including the ubiquitous MATCH FILES, SPLIT FILE, and descriptive statistics, also depend on sort order. (For various somewhat-political reasons, neither databases nor indexed files could be used here.) SPSS’s sorter is very “smart,” as most sorters are, and the main price paid by the application was that of sorting; all other operations could proceed in purely linear time. The whole process ... sorts, analysis, the whole #! shebang ... could run through in a matter of a few minutes on a decent box. Hence, what I said above about “statistical analysis” work comes from hands-on professional experience ... SAS works the same way, as does R.
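Outside SPSS, the same look-back trick is easy to mimic: a rough Perl analogue of LAG(field, n) is just a small buffer holding the last n records (the look-back distance of 4 below is purely illustrative):

    use strict;
    use warnings;

    # Rough analogue of SPSS's LAG(): keep the previous $n records in a small
    # buffer so the current record can be compared with the one $n cases back.
    my $n = 4;      # look-back distance (illustrative)
    my @lag;        # up to the previous $n records

    while (defined(my $line = <>)) {
        chomp $line;
        if (@lag == $n && $line eq $lag[0]) {
            print "record matches the one $n cases back: $line\n";
        }
        push @lag, $line;
        shift @lag if @lag > $n;
    }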
The number of records you are dealing with here is fairly routine for a disk-based sorting algorithm or command; I’d expect it to take a few minutes at most. Most sort engines will make opportunistic use of RAM but allow the RAM consumption to be controlled. They do produce a fairly intense disk-I/O load, however, which could present issues given that you say you “can’t afford to swap.” (Although 70 million modest-sized records, on a 500MB-RAM machine, might be “mostly sorted in RAM” anyway.)
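On most Unix-like boxes you don’t even have to write that sorter yourself: the system sort is exactly such a disk-based engine, and (in the GNU version) its -S and -T options cap the in-RAM buffer and point the spill files at a scratch disk. A hedged sketch of driving it from Perl, with purely illustrative file names and sizes:

    use strict;
    use warnings;

    # Let the system's disk-based sort do the heavy lifting, capping its
    # in-RAM buffer (-S) and directing temporary spill files elsewhere (-T).
    system('sort', '-S', '300M', '-T', '/tmp',
           '-o', 'records.sorted', 'records.txt') == 0
        or die "sort failed: exit status $?\n";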
Obviously, there are trade-offs. The I/O footprint can be large, which can turn your app into a thousand-pound elephant of a “batch job” on an interactive machine. (Some sorters automatically lower the thread’s priority during the sort, to alert the operating system that the program’s runtime characteristics have changed.) If you can’t afford to swap, maybe you also can’t afford to sort. However, if you are doing a number of different operations on the file which can’t readily be done all at once and which could benefit from sorting, or if you need to match files and can’t use a database, and so on, this is a viable and widely used way to go.
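If your own process is the one doing the grinding and you want it to behave like a polite batch job, you can lower its priority yourself before starting; a tiny sketch (the nice value of 10 is arbitrary):

    # Lower this process's priority so interactive users aren't starved.
    # WHICH=0 means PRIO_PROCESS, WHO=0 means "this process"; 10 is a
    # fairly gentle, arbitrary nice value.
    setpriority(0, 0, 10)
        or warn "setpriority failed: $!\n";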
Now, one more thought here ... the most advantageous way to arrange things is to prescribe that all of the input files to a particular process must be sorted in the same way, and to arrange a scheme of merge and selection steps, all of which require that ordering (and which explicitly check for it ...), such that you never have to be the one to do the sorting yourself during the job. This was a much more common practice back when spools of magnetized plastic were the order of the day, but it is still blisteringly(!!) fast and miserly with memory. The operating system quickly figures out (or you tell it) that you are doing strictly sequential file access, so it starts buffering everything advantageously, and ... well, hang on to your hat.
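As a concrete (and entirely hypothetical) illustration of that pattern, here is a sketch that merges two pre-sorted files into one sorted stream, comparing whole records as strings and dying the moment either input turns out not to be in order:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge two pre-sorted files into one sorted stream, verifying the
    # ordering of each input as we go. File names are hypothetical.
    open my $fa, '<', 'master.sorted'  or die "master.sorted: $!";
    open my $fb, '<', 'updates.sorted' or die "updates.sorted: $!";

    my $ra = <$fa>;
    my $rb = <$fb>;
    my ($prev_a, $prev_b);

    while (defined $ra && defined $rb) {
        die "master.sorted is out of order\n"  if defined $prev_a && $ra lt $prev_a;
        die "updates.sorted is out of order\n" if defined $prev_b && $rb lt $prev_b;

        if ($ra le $rb) {
            print $ra;
            $prev_a = $ra;
            $ra = <$fa>;
        }
        else {
            print $rb;
            $prev_b = $rb;
            $rb = <$fb>;
        }
    }

    # Drain whichever input still has records left.
    print $ra if defined $ra;
    print while <$fa>;
    print $rb if defined $rb;
    print while <$fb>;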