Re^2: Reading a huge input line in parts

by kroach (Pilgrim)
on May 04, 2015 at 19:53 UTC (#1125620)

in reply to Re: Reading a huge input line in parts
in thread Reading a huge input line in parts

The lines in question can be up to 2 700 000 000 000 000 characters.
Replies are listed 'Best First'.
Re^3: Reading a huge input line in parts
by graff (Chancellor) on May 05, 2015 at 03:29 UTC
    Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.

    It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.

    I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)

    UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program is 30 lines (not counting the 4 blank lines added for legibility):

    (2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
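A minimal sketch of the kind of filter described above (not graff's original listing) might look like the following. The function names `split_chunk` and `split_stream`, the 1 MiB buffer size, and the choice of one digit string per output line are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Scan one chunk of input. Digits are copied to `out`; the first non-digit
 * after a run of digits ends that run with a newline. `in_run` carries the
 * "currently inside a number" state across chunk boundaries, so a number
 * that straddles two reads is not split. (Hypothetical helper name.) */
void split_chunk(const char *buf, size_t len, int *in_run, FILE *out)
{
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c >= '0' && c <= '9') {
            fputc(c, out);
            *in_run = 1;
        } else if (*in_run) {
            fputc('\n', out);
            *in_run = 0;
        }
    }
}

/* Filter a whole stream in fixed-size chunks. Returns 0 on success and -1
 * on an I/O error. (Hypothetical helper name.) */
int split_stream(FILE *in, FILE *out)
{
    static char buf[1 << 20];   /* 1 MiB per read; the size is an arbitrary choice */
    int in_run = 0;
    size_t n;

    assert(in != NULL && out != NULL);

    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        split_chunk(buf, n, &in_run, out);

    if (in_run)                 /* the stream ended in the middle of a number */
        fputc('\n', out);

    return (ferror(in) || ferror(out)) ? -1 : 0;
}
```

Wrapping this in a `main` that calls `split_stream(stdin, stdout)` gives a stdin-stdout filter. Because the digits are written as they are seen, no single number has to fit in memory.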
      That's actually a very good idea; I hadn't thought of this approach.

        You can also use flex to make a scanner with very little fuss. For example:

        %{
        void process(char *tok);
        %}

        %option noyywrap

        %%

        [0-9]+      process(yytext);
        [ \t\n]+    /* ignore */
        .           /* printf("Bad input character: %s\n", yytext); */

        %%

        void process(char *tok) {
            printf("%d\n", atoi(tok));
        }

        int main(int argc, char **argv) {
            yyin = stdin;
            if (argc > 1)
                yyin = fopen(argv[1], "r");
            return yylex();
        }

        Just run it through flex and compile the generated lex.yy.c.

Re^3: Reading a huge input line in parts
by CountZero (Bishop) on May 04, 2015 at 21:00 UTC
    Now that is long indeed!

    Assuming you can read and process a gigabyte of data per second, handling a line that long will take you more than a month.
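The estimate can be checked with a quick calculation, assuming one byte per character and a sustained rate of exactly 10^9 bytes per second (both are assumptions, not figures from the thread):

```latex
\frac{2.7 \times 10^{15}\ \text{bytes}}{10^{9}\ \text{bytes/s}} = 2.7 \times 10^{6}\ \text{s},
\qquad
\frac{2.7 \times 10^{6}\ \text{s}}{86\,400\ \text{s/day}} = 31.25\ \text{days}
```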


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics