PerlMonks  

Parse string greater than 2GB

by BigHoss (Initiate)
on Jun 30, 2013 at 01:13 UTC (#1041534)
BigHoss has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse a string which is larger than 2GB. Unfortunately "split()" fails with a "Split loop" error.

Looking for an alternative. Here is the minimized test case. Due to the nature of the original code, the "read()" of the file cannot be changed.

Thanks in advance for any help...

EDIT: 6/30/2013:
To clarify, this is trimmed down from the original to illustrate the problem with split.
I can only modify "BigParse". I have no control of the strings passed in.
The real code processes each split until end of data. The print statement in the loop is just to trim down the code.

Perl version: 5.12.4

Data File is binary file with embedded newline characters "\n".

$ wc -c data
2753110808 data
$ wc -l data
2753111 data

#!/usr/bin/perl
$FILE = "data";
open (INFILE, "$FILE") || die "Not able to open the file: $FILE \n";
binmode INFILE;
my $map;
read(INFILE, $map, 2147483648);
# Using this read instead, everything works.
# read(INFILE, $map, 2147483630);
BigParse($map);
exit;

sub BigParse {
    my $map = shift;
    print "string length = ", length($map), "\n";
    # This fails with "Split loop" error message.
    foreach my $l (split("\n", $map)) {
        print $l;
    }
    return;
}

Re: Parse string greater than 2GB
by rjt (Deacon) on Jun 30, 2013 at 01:57 UTC
    foreach my $l (split("\n", $map)) { print $l; }

    It seems to me your code is doing nothing more than printing out the input with newlines removed. You can achieve the same result by removing all newlines with:

    $map =~ y/\n//d;

    For me this took a few seconds on a 2GiB + 16 byte string (whereas creating the same string with the repetition operator took more than twice as long, and that was without any IO).

    Your approach with split runs out of memory on my 4GiB VM, because split generates a new list with new strings, more than doubling the memory requirement (depending on density of newlines). I strongly suspect, however, that even if it worked, the split would be much slower.

    I also wonder if this may be an XY Problem: You say the read cannot be changed, and try as I might, I can't imagine why you'd want to read a huge binary file and print out everything but the newlines. If my advice doesn't hit the mark, can you give us a few more details on what it is you're doing?

    open (INFILE, "$FILE") || die "Not able to open the file: $FILE \n";

    Be careful with open. If you ever intend $FILE to be user-specified (and even if you don't), I'd recommend using the 3-argument open:

    open INFILE, '<', $FILE or die "Not able...";

    See Two-arg open() considered dangerous.

    I'd also use a lexical filehandle (open my $infile, ...) instead of INFILE.
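
The memory cost of split described above comes from building the whole list of records at once. A minimal sketch of one way around that, assuming the records only need to be visited in order: walk the string with index and substr so only one record exists at a time. The name each_record and the callback interface are hypothetical, not part of the original code, and this has not been verified against a string larger than 2GB on Perl 5.12.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): visit each
# newline-delimited record in place instead of building a list with split.
sub each_record {
    my ($data_ref, $callback) = @_;
    my $len = length $$data_ref;
    my $pos = 0;
    while ($pos < $len) {
        my $nl = index($$data_ref, "\n", $pos);
        $nl = $len if $nl < 0;    # final record has no trailing newline
        $callback->(substr($$data_ref, $pos, $nl - $pos));
        $pos = $nl + 1;           # continue just past the newline
    }
    return;
}

my $sample = "alpha\nbeta\n\ngamma";
my @records;
each_record(\$sample, sub { push @records, $_[0] });
print join('|', @records), "\n";
```

Passing a reference avoids copying the string into the subroutine. Unlike split, this sketch also reports empty records at the end of the data, which may or may not matter for the real processing.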

Re: Parse string greater than 2GB
by kcott (Abbot) on Jun 30, 2013 at 02:15 UTC

    G'day BigHoss,

    Welcome to the monastery.

    You've written 'the "read()" of the file cannot be changed.' and then, in your code, you've shown what happens when you do change it. So, please clarify what you mean; I can provide a few suggestions but, until that ambiguity is sorted out, I'm really just guessing.

    The error you're getting is described in perldiag.

    Reading the entire file and then looping through the output from split can be achieved more simply with code like this:

    while (my $l = <INFILE>) { chomp $l; print $l; }

    You'll probably find that passing a lexical filehandle (see open) to BigParse() is easier than dealing with globrefs.

    Check the read documentation and man wc for discrepancies between what each considers a character and a byte to be.

    sysopen and sysread may be better options for dealing with your binary data.

    -- Ken
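
If only the parsing routine can change, the line-by-line loop above can still be applied to a string that is already in memory by opening a filehandle on a reference to it. A minimal sketch, assuming that is acceptable; the name big_parse_lines and the callback argument are hypothetical, and whether in-memory filehandles behave correctly past 2GB on a given Perl build is something to check rather than assume.

```perl
use strict;
use warnings;

# Hypothetical variant of the parsing routine: treat the string as a file
# and read one record at a time, so no list of all records is built.
sub big_parse_lines {
    my $callback = $_[1];
    # \$_[0] refers to the caller's scalar, so the large string is not copied.
    open my $fh, '<', \$_[0] or die "Cannot open in-memory handle: $!";
    binmode $fh;
    while (my $line = <$fh>) {
        chomp $line;
        $callback->($line);
    }
    close $fh;
    return;
}

my $map = "first\nsecond\nthird";
my @seen;
big_parse_lines($map, sub { push @seen, $_[0] });
print scalar(@seen), " records: @seen\n";
```

The caller still passes the string by value, as in the original test case; only the body of the routine differs.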

Re: Parse string greater than 2GB
by thomas895 (Hermit) on Jun 30, 2013 at 02:44 UTC

    In a one-liner:

    $ perl -pe 's/\n//' /path/to/data

    Of course, this means you don't get the length shown, but that is an easy fix: simply pipe the output into wc -.

    ~Thomas~ 
    "Excuse me for butting in, but I'm interrupt-driven..."

      You wrote:

          $ perl -pe 's/\n//' /path/to/data

      Your approach reads the data file (named on the command line) one line at a time (delimited by newlines), and searches every line in its entirety to replace one newline character before the implicit -p loop prints it out. One can accomplish the same thing in about half the CPU time (depending on average line length) with:

          $ perl -pe chomp /path/to/data

      The OP also indicated that they have to stick with the read() loop, so it's worth noting that solutions like these that read line by line don't fit the problem description. (Not that I don't have some significant doubts about the problem description...)

Re: Parse string greater than 2GB
by Laurent_R (Vicar) on Jun 30, 2013 at 08:22 UTC

    Data File is binary file with embedded newline characters "\n".

    This sounds a bit worrying. If your data is really binary, then it is quite likely that some of the bytes will, by accident, have the value of the newline character on your system. How can you tell the difference between actual newlines and binary bytes that happen to have the value of a newline character? Reading the file line by line is probably not an option in this case. It probably does not matter too much if all you want to do is print the data, but it does if you want to do any more subtle processing.

Re: Parse string greater than 2GB
by swampyankee (Parson) on Jun 30, 2013 at 11:01 UTC

    While posting fragments of code is nice, it's even nicer to explain what you're trying to do. From your explanation and your code fragment, it seems that you need not do anything except

    open(my $input, "<", $input_file) or die "Could not open $input_file because $!\n";
    while (<$input>) {
        print;
    }

    So, what's the point? Is this some way of writing od in Perl?


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: Parse string greater than 2GB
by sundialsvc4 (Monsignor) on Jul 01, 2013 at 11:23 UTC

    Unless it is reasonably possible that “the single thing that you are looking for” is actually ≥ 2GB in size by itself, then you will be, one way or the other, reading it in some more conveniently-sized sections and in some suitable way dealing with the “fragments” that are left-over at the end of each read.   (You move this unused portion to the start of your buffer, read more data to fill it up again, and keep going.)   If you can identify a record separator to Perl (it doesn’t have to be \n), Perl will even do a lot of the leg-work for you, using its own buffering scheme.

    One way that is sometimes useful to deal with very large static files is to memory-map them, e.g. PerlIO::mmap (or any of 64-or-so other packages I found in http://search.cpan.org using the key, “mmap.”)   This technique uses the operating system’s virtual memory subsystem to do some of the dirty-work for you, by mapping a portion of the file (a movable “window” into it, of some selected-but-not-2GB size) into the process’s virtual memory address space ... this avoids copying.   But in a 32-bit address space you still can’t map “all of” a very large file.
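
The chunked-read approach described above (read a block, process the complete records, carry the leftover fragment into the next read) can be sketched as follows. This is a minimal illustration that assumes the read can be restructured, which the original poster said was not possible in their case. The name process_in_chunks, the callback interface, and the tiny chunk size in the demo are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in blocks of $chunk_size bytes and pass each
# complete newline-terminated record to $callback. The partial record left
# at the end of a block stays in the buffer for the next read.
sub process_in_chunks {
    my ($fh, $chunk_size, $callback) = @_;
    my $buffer = '';
    while (1) {
        # Append the next block after whatever fragment is left over.
        my $got = read($fh, $buffer, $chunk_size, length $buffer);
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # end of file
        # Consume every complete record currently in the buffer.
        while ((my $nl = index($buffer, "\n")) >= 0) {
            $callback->(substr($buffer, 0, $nl));
            substr($buffer, 0, $nl + 1, '');    # drop record and its newline
        }
    }
    $callback->($buffer) if length $buffer;     # last record without newline
    return;
}

# Demo on an in-memory filehandle with a deliberately small chunk size.
my $data = "one\ntwo\nthree";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @out;
process_in_chunks($fh, 4, sub { push @out, $_[0] });
close $fh;
print join(',', @out), "\n";
```

With a real file, a chunk size in the range of a few megabytes would be a more typical choice, so memory use stays bounded by the chunk size plus the longest record.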

Re: Parse string greater than 2GB
by kcott (Abbot) on Jul 04, 2013 at 01:31 UTC

    I've just come across this in perl-5.19.1 > perldelta: Selected Bug Fixes which may be related to your problem.

    "Fixed a small number of regexp constructions that could either fail to match or crash perl when the string being matched against was allocated above the 2GB line on 32-bit systems. [RT #118175]"

    Note: I haven't investigated further. It may be completely unrelated.

    -- Ken
