Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.
It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.
I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)
UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program is 30 lines (not counting the 4 blank lines added for legibility):
(2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
That's actually a very good idea; I hadn't thought of this approach.
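%{
#include <stdio.h>
#include <stdlib.h>

void process(char *tok);
%}
%option noyywrap
%%
[0-9]+      process(yytext);
[ \t\n]+    /* ignore whitespace */
.           /* printf("Bad input character: %s\n", yytext); */
%%
/* Print one number per line; atoi() assumes the value fits in an int. */
void process(char *tok)
{
    printf("%d\n", atoi(tok));
}

int main(int argc, char **argv)
{
    yyin = stdin;
    if (argc > 1) {
        yyin = fopen(argv[1], "r");
        if (yyin == NULL) {
            perror(argv[1]);
            return 1;
        }
    }
    return yylex();
}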
Just run it through flex and compile the generated lex.yy.c.
Now that is long indeed! Assuming you can read and process a gigabyte of data per second, handling a line that long will take you more than a month.
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
My blog: Imperial Deltronics