PerlMonks  

Safely reading line by line

by martin (Pilgrim)
on Jun 27, 2007 at 10:38 UTC (#623579=perlquestion)
martin has asked for the wisdom of the Perl Monks concerning the following question:

It is very common practice to use variations of a pattern like this in Perl programs:
while (<$fh>) {
    # do something with one line of input
}

However, if I can't trust my input --and who can?-- this is unsafe, as the program might come upon a line that is too long to fit into memory. Perl will happily keep allocating space to store it until the process runs out of virtual memory, which may happen only after the system has suffered serious performance loss trying to satisfy the demand.

In order to survive malformed input while keeping the normal procedure line-oriented, I need perhaps a getline variant with some sort of length limitation. I'd like to be able to give up gracefully when excessively long lines are detected.

Do you, fellow monks, know of a good way to handle this?

I know about sysread and could with some effort write a wrapper around it to reassemble fixed-length chunks into lines, but am quite open to learn about alternatives first.
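A minimal sketch of such a wrapper, for illustration only: it uses buffered read rather than sysread, assumes "\n" as the only line terminator, and counts the limit in bytes excluding the newline. The name make_bounded_reader and its parameters are hypothetical, not part of any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one line per call,
# nothing (undef in scalar context) at end of input, and dies as soon as
# a line is known to exceed $max_len bytes (newline not counted).
sub make_bounded_reader {
    my ($fh, $max_len, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my $buf = '';
    my $eof = 0;

    return sub {
        while (1) {
            my $nl = index($buf, "\n");
            if ($nl >= 0) {
                die "line exceeds $max_len bytes\n" if $nl > $max_len;
                # 4-argument substr removes the line from the buffer
                # and returns it, newline included.
                return substr($buf, 0, $nl + 1, '');
            }
            # No newline yet: give up before the buffer grows any further.
            die "line exceeds $max_len bytes\n" if length($buf) > $max_len;
            if ($eof) {
                return unless length $buf;
                return substr($buf, 0, length($buf), '');  # unterminated last line
            }
            # Append the next chunk at the end of the buffer.
            my $got = read($fh, $buf, $chunk_size, length $buf);
            die "read failed: $!\n" unless defined $got;
            $eof = 1 if $got == 0;
        }
    };
}

# Usage with an in-memory handle standing in for real input.
my $input = "first line\nsecond line\n";
open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
my $next_line = make_bounded_reader($fh, 1024);
while (defined(my $line = $next_line->())) {
    print "got: $line";
}
```

With this approach the buffer never holds more than roughly $max_len + $chunk_size bytes, and the caller can catch the exception with eval to take whatever evasive action fits.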

Update: This topic has indeed been discussed here before: Read a line with max length ?. Thanks to runrig for a pointer to File::GetLineMaxLength. I'll follow up with a report once I have looked into it.

Re: Safely reading line by line
by moritz (Cardinal) on Jun 27, 2007 at 10:51 UTC
    I'd just impose a memory limit on the perl interpreter process, and let it die automatically if a line is too long.

    Of course that's only possible if you don't mind losing some data from possibly manipulated sources, and don't leave damaged data structures behind (on disk, that is).

      A total memory limit for a process will limit the impact a single failure will have on the rest of the system. This is a reasonable precaution.

      On my Debian GNU/Linux box I can call

      ulimit -v 10000
      in the shell before starting my program, and it will no longer be able to use more than 10000 kilobytes of virtual memory.

      However, that is not all I wanted. I would like to be able to stop processing the input file as soon as its contents are known to be malformed and take whatever evasive action is most appropriate. This would rule out plainly crashing in many cases.

Re: Safely reading line by line
by RMGir (Prior) on Jun 27, 2007 at 12:28 UTC
    Assuming $fh is a handle to a file, File::Util has an interesting idea. It has a "readlimit" method that limits the size of the file it will open.

    Of course, if your attacker has local access, or you're reading from a socket, that won't save you, since the file could be appended to or modified AFTER you've opened it.

    Letting the interpreter crash is looking quite tempting :) Of course, that's only an option if it's not going to result in a denial-of-service attack.

    I think your idea of writing your own buffering length-limited readline in terms of read or sysread is probably the way to go, but it's going to be mildly complex if you want to make it efficient... Of course, if you do work that out, it'd probably be a nice addition to IO::Handle

    A reasonable alternative may be to recast your loop in terms of fixed-length reads, rather than line reads. But for line-oriented data, that's a pain :(
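As a rough illustration of fixed-length reads, the sketch below sets $/ to a reference to an integer so that readline returns fixed-size records, and tracks the running line length across records without ever holding a whole line in memory. The helper name check_line_lengths and its parameters are made up for this example, and lengths are counted in bytes assuming no encoding layer on the handle.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $fh in fixed-size records and return the length
# (in bytes, newline not counted) of the longest line. Dies as soon as any
# line is known to exceed $max_len.
sub check_line_lengths {
    my ($fh, $max_len, $record_size) = @_;
    $record_size ||= 4096;
    local $/ = \$record_size;    # readline now returns at most $record_size bytes

    my ($current, $longest) = (0, 0);
    while (defined(my $chunk = <$fh>)) {
        my $pos = 0;
        # Handle every newline found inside this record.
        while ((my $nl = index($chunk, "\n", $pos)) >= 0) {
            $current += $nl - $pos;
            die "line longer than $max_len bytes\n" if $current > $max_len;
            $longest = $current if $current > $longest;
            $current = 0;
            $pos     = $nl + 1;
        }
        # The rest of the record belongs to a line that is still open.
        $current += length($chunk) - $pos;
        die "line longer than $max_len bytes\n" if $current > $max_len;
    }
    $longest = $current if $current > $longest;    # unterminated last line
    return $longest;
}

# Usage with an in-memory handle standing in for real input.
my $sample = "short\na somewhat longer line\n";
open my $in, '<', \$sample or die "cannot open in-memory handle: $!";
my $max_seen = check_line_lengths($in, 1024);
print "longest line: $max_seen bytes\n";
```

A check like this could serve as a validation pass over a seekable file before the usual line-based loop, with the caveat already raised above that the file may change between the two passes.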

    Hmmm, this wasn't a very helpful response, was it? Sorry about that. You've brought up an interesting problem, and I don't know what the right answer is, but hopefully one of these rambles sparks an idea for someone who DOES know.


    Mike
      Your post was helpful indeed, but there is a thing to consider (and the reason for me to propose crashing the interpreter ;-)

      If you read data line by line, that's usually because you need it line by line.

      Depending on your application it might be possible to handle incomplete lines without much change to your program, or it might not.

      If there is a good reason for line based reading, and the line doesn't fit into RAM, you're lost anyway. (Not always, but still rather often).

      On the other hand if the line based reading is just a method of chunking the data, then the approach that uses a read limit is probably the way to go.

Re: Safely reading line by line
by cdarke (Prior) on Jun 27, 2007 at 12:59 UTC
    You could just read one char at a time, looking for a newline to terminate the line. You can use read for that, or set local $/ = \1; so that each readline returns a single character. Assuming you have not switched buffering off, the data will not be transferred from the disc one char at a time, only from the buffer. It is then easy to impose a limit for a line length.
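The character-at-a-time idea might look roughly like this. It is a sketch under a few assumptions: the name getline_limited is invented here, "\n" is the only terminator, and the limit counts bytes excluding the newline (with an encoding layer on the handle, $/ = \1 would read characters instead).

```perl
use strict;
use warnings;

# Hypothetical helper: read one line from $fh a single byte at a time.
# Returns the line (newline included), undef at end of input, and dies
# once the line grows beyond $max_len bytes.
sub getline_limited {
    my ($fh, $max_len) = @_;
    local $/ = \1;    # each <$fh> now returns a single byte
    my $line = '';
    while (defined(my $ch = <$fh>)) {
        $line .= $ch;
        return $line if $ch eq "\n";
        die "line exceeds $max_len bytes\n" if length($line) > $max_len;
    }
    # End of input: return an unterminated last line if there is one.
    return length($line) ? $line : undef;
}

# Usage with an in-memory handle standing in for real input.
my $text = "alpha\nbeta\n";
open my $src, '<', \$text or die "cannot open in-memory handle: $!";
while (defined(my $line = getline_limited($src, 80))) {
    print "read: $line";
}
```

This is simple to get right, but it makes one readline call per byte, so a chunk-based reader is likely to be noticeably faster on large inputs.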
Re: Safely reading line by line
by martin (Pilgrim) on Jun 28, 2007 at 22:22 UTC
    In order to survive malformed input while keeping the normal procedure line-oriented, I need perhaps a getline variant with some sort of length limitation. I'd like to be able to give up gracefully when excessively long lines are detected.

    It turns out getline is not a trivial target to emulate, what with all those variants of $INPUT_RECORD_SEPARATOR. I have tried a proof of concept here, though it stands a good chance of winning an ugliness contest.

    Should Perl perhaps offer native support for such a feature?

Re: Safely reading line by line
by runrig (Abbot) on Jun 28, 2007 at 23:19 UTC
      A bit of searching on CPAN yields File::GetLineMaxLength. I don't know anything about it, but try it out and give it a review.

      I did: File::GetLineMaxLength

      Thanks for the pointer. It looks like the module still has some room for improvement.

Node Type: perlquestion [id://623579]
Approved by Corion