So you're saying I should sort the file and create a 5-line sliding window to locate my 5 repeats... OK, this sounds so trivial... thank you. I guess I am too tired to notice the solution... thanks.
Is the 70-million-record file presently sorted? If so, then all repeated entries are necessarily adjacent to one another, and you can find repeats of arbitrary size just by comparing adjacent records. (You can find all of the repetitions in the entire sorted file in one sequential pass, considering only two adjacent records at any time; the “window size,” so to speak, is two.) This strategy is very common in, for example, statistical-analysis work with tools such as SPSS or SAS.
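For instance, here is a minimal Perl sketch of that one-pass scan, assuming the file is already sorted, one record per line, and that a “repeat” means five or more identical consecutive records (the threshold and output format are purely illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One sequential pass over an already-sorted file: identical records are
    # adjacent, so a simple run counter is the only "window" required.
    my $threshold = 5;               # report records occurring at least this often
    my ($prev, $count) = (undef, 0);

    while (defined(my $line = <>)) {
        chomp $line;
        if (defined $prev && $line eq $prev) {
            $count++;
        }
        else {
            print "$prev\toccurs $count times\n"
                if defined $prev && $count >= $threshold;
            $count = 1;
        }
        $prev = $line;
    }
    # Don't forget the final run at end-of-file.
    print "$prev\toccurs $count times\n"
        if defined $prev && $count >= $threshold;

You can run it directly against an already-sorted file, or pipe the output of a sort straight into it.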
I just finished up an SPSS job that involved a few dozen million records, and it had plenty of logic that consisted of SORT CASES BY followed by use of the LAG() function (which, given n, retrieves a field’s value from n cases ago; there is no function that looks ahead). The other operations, including the ubiquitous MATCH FILES, SPLIT FILE, and descriptive statistics, also depend on sort order. (For various somewhat-political reasons, neither databases nor indexed files could be used here.) SPSS’s sorter is very “smart,” as most sorters are, and the main price paid by the application was that of sorting; all other operations could proceed in purely linear time. The whole process ... sorts, analysis, the whole #! shebang ... could run through in a matter of a few minutes on a decent box. Hence, what I said above about “statistical analysis” work comes from hands-on professional experience ... SAS works the same way, as does R.
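Outside SPSS, the same look-back trick is easy to mimic: a rough Perl analogue of LAG(field, n) is just a small buffer holding the last n records (the look-back distance of 4 below is purely illustrative):

    use strict;
    use warnings;

    # Rough analogue of SPSS's LAG(): keep the previous $n records in a small
    # buffer so the current record can be compared with the one $n cases back.
    my $n = 4;      # look-back distance (illustrative)
    my @lag;        # up to the previous $n records

    while (defined(my $line = <>)) {
        chomp $line;
        if (@lag == $n && $line eq $lag[0]) {
            print "record matches the one $n cases back: $line\n";
        }
        push @lag, $line;
        shift @lag if @lag > $n;
    }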
The number of records you are dealing with here is fairly routine for a disk-based sorting algorithm or command; I’d expect it to take a few minutes at most. Most sort engines will make opportunistic use of RAM but allow the RAM consumption to be controlled. They do produce a fairly intense disk-I/O load, however, which could present issues given that you say you “can’t afford to swap.” (Although 70 million modest-sized records, on a 500MB-RAM machine, might be “mostly sorted in RAM” anyway.)
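On most Unix-like boxes you don’t even have to write that sorter yourself: the system sort is exactly such a disk-based engine, and (in the GNU version) its -S and -T options cap the in-RAM buffer and point the spill files at a scratch disk. A hedged sketch of driving it from Perl, with purely illustrative file names and sizes:

    use strict;
    use warnings;

    # Let the system's disk-based sort do the heavy lifting, capping its
    # in-RAM buffer (-S) and directing temporary spill files elsewhere (-T).
    system('sort', '-S', '300M', '-T', '/tmp',
           '-o', 'records.sorted', 'records.txt') == 0
        or die "sort failed: exit status $?\n";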
Obviously, there are trade-offs. The I/O footprint can be large, which can turn your app into a thousand-pound elephant of a “batch job” on an interactive machine. (Some sorters automatically lower the thread’s priority during the sort, to alert the operating system that the program’s runtime characteristics have changed.) If you can’t afford to swap, maybe you also can’t afford to sort. However, if you are doing a number of different operations on the file which can’t readily be done all at once and which could benefit from sorting, or if you need to match files and can’t use a database, and so on, this is a viable and widely used way to go.
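If your own process is the one doing the grinding and you want it to behave like a polite batch job, you can lower its priority yourself before starting; a tiny sketch (the nice value of 10 is arbitrary):

    # Lower this process's priority so interactive users aren't starved.
    # WHICH=0 means PRIO_PROCESS, WHO=0 means "this process"; 10 is a
    # fairly gentle, arbitrary nice value.
    setpriority(0, 0, 10)
        or warn "setpriority failed: $!\n";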
Now, one more thought here ... the most advantageous way to arrange things is to prescribe that all of the input files to a particular process must be sorted in the same way, and to arrange a scheme of merge and selection steps, all of which require that ordering (and which explicitly check for it ...), such that you never have to be the one to do the sorting yourself during the job. This was a much more common practice back when spools of magnetized plastic were the order of the day, but it is still blisteringly(!!) fast and miserly with memory. The operating system quickly figures out (or you tell it) that you are doing strictly sequential file access, so it starts buffering everything advantageously, and ... well, hang on to your hat.
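As a concrete (and entirely hypothetical) illustration of that pattern, here is a sketch that merges two pre-sorted files into one sorted stream, comparing whole records as strings and dying the moment either input turns out not to be in order:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge two pre-sorted files into one sorted stream, verifying the
    # ordering of each input as we go. File names are hypothetical.
    open my $fa, '<', 'master.sorted'  or die "master.sorted: $!";
    open my $fb, '<', 'updates.sorted' or die "updates.sorted: $!";

    my $ra = <$fa>;
    my $rb = <$fb>;
    my ($prev_a, $prev_b);

    while (defined $ra && defined $rb) {
        die "master.sorted is out of order\n"  if defined $prev_a && $ra lt $prev_a;
        die "updates.sorted is out of order\n" if defined $prev_b && $rb lt $prev_b;

        if ($ra le $rb) {
            print $ra;
            $prev_a = $ra;
            $ra = <$fa>;
        }
        else {
            print $rb;
            $prev_b = $rb;
            $rb = <$fb>;
        }
    }

    # Drain whichever input still has records left.
    print $ra if defined $ra;
    print while <$fa>;
    print $rb if defined $rb;
    print while <$fb>;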