PerlMonks  

Removing the first record in a file containing fixed records

by sparkle (Novice)
on Jul 17, 2008 at 23:35 UTC ( [id://698472] )

sparkle has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a file that contains a bunch of fixed-length records. The first record of the file contains some control information that is useful while I process the file, but it needs to be removed once I'm done with it.

I'm trying to find an efficient way to remove the first record.

The solution I came up with is to read one record at a time and print it to a new file, skipping the first record.

$i = 0;
until ( eof(recordfile) ) {
    read( recordfile, $record, $record_length );
    if ( $i > 0 ) {
        print newrecordfile $record;
    }
    $i++;
}

These files can be huge, possibly 500+ MB, and it seems like a waste to read and rewrite the entire file just to remove the first record.

Is there a more efficient way to do this?

Replies are listed 'Best First'.
Re: Removing the first record in a file containing fixed records (updated with tested code)
by BrowserUk (Patriarch) on Jul 18, 2008 at 00:37 UTC

    If you're hoping for some method of telling the filesystem to ignore the first n bytes of the file, or to return them to the free space, and so avoid copying the file, both the short and the long answer are: no.

    Then your problem becomes how to copy the file most efficiently. And perhaps the best answer is to copy it in place and so avoid forcing the filesystem to find another 500MB to fit the copy into.

    Something like this (untested) code might fit the bill:

    #! perl -slw
    use strict;
    use Fcntl qw[ SEEK_CUR SEEK_SET ];

    use constant BUFSIZE => 64 * 1024;

    our $RECLEN || die "you must specify the length of the header. -RECLEN=nnn";
    @ARGV or die "No filename";

    open FILE, '+<:raw', $ARGV[ 0 ] or die "$!: $ARGV[ 0 ]";

    # Consume (and discard) the header record.
    sysread FILE, my $header, $RECLEN or die "sysread: $!";

    # Leapfrog through the file: read a buffer ahead, write it back
    # $RECLEN bytes earlier, remembering both positions as we go.
    my( $nextWrite, $nextRead ) = 0;
    while( sysread FILE, my $buffer, BUFSIZE ) {
        $nextRead = sysseek FILE, 0, SEEK_CUR
            or die "Seek query next read failed; $!";
        sysseek FILE, $nextWrite, SEEK_SET
            or die "Seek next write failed: $!";
        syswrite FILE, $buffer
            or die "Write failed: $!";
        $nextWrite = sysseek FILE, 0, SEEK_CUR
            or die "Seek query next write failed $!";
        sysseek FILE, $nextRead, SEEK_SET
            or die "Seek next read failed: $!";
    }

    # Chop the now-duplicated tail off the end.
    truncate FILE, $nextWrite or die "truncate failed: $!";
    close FILE or die "close failed: $!";

    A casual test showed the program took < 4 seconds on a 500 MB file, though you can hear the disk thrashing as the system flushes its file cache to disk for several seconds afterwards.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Removing the first record in a file containing fixed records
by Fletch (Bishop) on Jul 18, 2008 at 00:35 UTC

    Not really, without altering the structure. You can truncate files from the end, but you can't . . . precate (if I may coin a terrible word) things off the front.

    A kludge (although this is remotely similar to what some databases do) is to alter some part of the record to mark it as invalid (perhaps setting a field to say XXXDELETEDNOUSEXXX) and then have your downstream processing ignore those records; periodically you can have a utility "vacuum" out the deleted records by copying the live records into a new file, as you mention above. Alternately, if the ordering of records in the file is not important, you can overwrite deleted records with new ones in place (perhaps maintaining an external "free list" of offsets of deleted record slots).
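
    A minimal sketch of that tombstoning idea (the filename, record length, and marker text here are all placeholders):

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    my $rec_len   = 100;                    # hypothetical fixed record length
    my $tombstone = 'XXXDELETEDNOUSEXXX';

    sub mark_deleted {
        my ( $fh, $rec_no ) = @_;
        # Pad the marker to a full record so record boundaries stay intact.
        my $dead = $tombstone . ( ' ' x ( $rec_len - length $tombstone ) );
        sysseek $fh, $rec_no * $rec_len, SEEK_SET or die "seek: $!";
        syswrite $fh, $dead, $rec_len or die "write: $!";
    }

    open my $fh, '+<:raw', 'records.dat' or die "open: $!";
    mark_deleted( $fh, 0 );    # tombstone the header record in place
    close $fh or die "close: $!";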

    Of course doing this type of thing you're already on your way towards reimplementing your own database system so you might want to consider taking that plunge and letting someone else do the heavy lifting for you and concentrate on your processing tasks.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      If ordering is not important he could just read the very last record, overwrite the first record with it and truncate the file before the last record.
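
      A minimal sketch of that swap-and-truncate approach (the filename and record length are made up for illustration):

      use strict;
      use warnings;
      use Fcntl qw(SEEK_SET);

      my $rec_len = 100;             # hypothetical fixed record length
      my $file    = 'records.dat';   # hypothetical

      open my $fh, '+<:raw', $file or die "open: $!";
      my $size = -s $fh;

      # Read the last record ...
      sysseek $fh, $size - $rec_len, SEEK_SET or die "seek: $!";
      sysread $fh, my $last, $rec_len or die "read: $!";

      # ... overwrite the first (header) record with it ...
      sysseek $fh, 0, SEEK_SET or die "seek: $!";
      syswrite $fh, $last, $rec_len or die "write: $!";

      # ... and chop the now-duplicated last record off the end.
      truncate $fh, $size - $rec_len or die "truncate: $!";
      close $fh or die "close: $!";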
        Since sparkle asked for an efficient method that works even on large files (2 GB is sometimes the next limit), I agree with Crackers2's suggestion: it works well because it preserves disk space and implies a minimal number of copy operations. But if the header has a different size than the other records, if order is important, or if a copy of the file is still required, I would suggest handing that task to a tool that is optimised for that kind of operation. Where available, a system(...) call to tail or dd might be worth considering.
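
        For instance (hedged sketches: they assume GNU tail and a POSIX dd, and the record length and filenames are made up):

        my $rec_len = 100;            # hypothetical header length
        my $start   = $rec_len + 1;   # tail -c +N starts output at byte N (1-based)
        system( "tail -c +$start records.dat > records.stripped" ) == 0
            or die "tail failed: $?";

        # or with dd, skipping exactly one record-sized block:
        system( "dd if=records.dat of=records.stripped bs=$rec_len skip=1" ) == 0
            or die "dd failed: $?";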

      I think that word is decapitate.


      #my sig used to say 'I humbly seek wisdom. '. Now it says:
      use strict;
      use warnings;
      I humbly seek wisdom.
Re: Removing the first record in a file containing fixed records
by runrig (Abbot) on Jul 17, 2008 at 23:43 UTC
    After the first record, increase the record length...to 1 MB or perhaps several MB. There's no reason to read the entire file one record at a time. Or perhaps (same idea but less code on your part):
    use File::Copy qw(copy);

    sysread( recordfile, $record, $record_length );
    copy( \*recordfile, \*newrecordfile );
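
    And a sketch of the first variant, reading in 1 MB chunks after discarding the header (the filenames and record length are placeholders):

    use strict;
    use warnings;

    my $record_length = 100;    # hypothetical
    open my $in,  '<:raw', 'records.dat' or die "open: $!";
    open my $out, '>:raw', 'records.new' or die "open: $!";

    read $in, my $header, $record_length or die "read: $!";    # discard the header

    while ( read $in, my $chunk, 1024 * 1024 ) {               # 1 MB at a time
        print $out $chunk;
    }

    close $out or die "close: $!";
    close $in;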
Re: Removing the first record in a file containing fixed records
by jwkrahn (Abbot) on Jul 18, 2008 at 00:32 UTC

    I know this will work on Linux but I'm pretty sure that it won't work on Windows:

    my $filename = 'somefile';
    my $rec_len  = 567;

    open my $IN,  '<:raw',  $filename or die "Cannot open '$filename' $!";
    open my $OUT, '+<:raw', $filename or die "Cannot open '$filename' $!";

    my $file_size     = -s $IN;
    my $total_records = $file_size / $rec_len;

    # Read in fixed-length chunks of $rec_len bytes.
    $/ = \$rec_len;

    while ( <$IN> ) {
        next if $. == 1;    # skip the header record
        print $OUT $_;      # write it one record earlier
    }

    $total_records == $. or die "Read only $. records but expected $total_records records!\n";

    truncate $IN, $file_size - $rec_len;

    close $OUT;
    close $IN;

    Or you could do it using one of the memory-mapped file modules, which should work on Windows.
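
    For example, a hedged sketch with File::Map (just one such module; the filename and record length are placeholders). Note that Perl copies the right-hand side of the substr assignment first, so this briefly costs nearly a file-sized chunk of memory:

    use strict;
    use warnings;
    use File::Map qw(map_file);

    my $filename = 'records.dat';    # hypothetical
    my $rec_len  = 100;              # hypothetical

    my $new_size;
    {
        map_file my $map, $filename, '+<';    # $map now behaves like a scalar
        $new_size = length( $map ) - $rec_len;

        # Shift everything after the header down over it; the replacement
        # is the same length as the replaced span, as the mapping requires.
        substr( $map, 0, $new_size ) = substr( $map, $rec_len );
    }    # the mapping is released when $map goes out of scope

    # A mapping can't change the file's size, so truncate separately.
    open my $fh, '+<', $filename or die "Cannot open '$filename': $!";
    truncate $fh, $new_size or die "truncate: $!";
    close $fh;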

      What makes you so sure it won't work on Windows?

      The :raw IO layer is equivalent to binmode, and I see no unixisms in your code that would prevent the program from working on Windows.

      I just saw - you're opening the same file twice. This won't work under Windows - you'll have to seek back and forth in the one file to copy the data from the end to the front.

      Update2: I tested your code on Win32 and it just works.

        I don't use Windows so I have no way to verify if it does or does not work there, sorry.

Re: Removing the first record in a file containing fixed records
by Narveson (Chaplain) on Jul 18, 2008 at 05:48 UTC

    Given all the gloomy responses above, you won't believe how simple this is with Tie::File, but just try it.

    use strict;
    use warnings;
    use Tie::File;

    tie my @records, 'Tie::File', 'file.txt';
    shift @records;

    Update: Thanks are due to BrowserUK for the benchmark below.

    It's a beautiful example of a tradeoff between machine time and the time of

    • developers,
    • reviewers, and
    • maintainers.

    Further thought: The read-seek-write solution belongs in Tie::File under the fixed-record-length option that Dominus has listed as TO DO. Then the rest of us could enjoy the best of both worlds.

      I suggest you try it.

      First you should try it on a small file with fixed length records and no delimiters.

      When you've worked out why the file ends up empty, and how to fix that, then try it on a 500MB fixed record length file with no delimiters. And if you could time how long it takes and report back that would be interesting.

      Don't worry about being too accurate, the nearest week should be fine.


        I suggest you try it.

        I have done so. We have some large fixed-record-length extracts lying around, record length about 4 KB. The size of file.txt was 1,985,184 KB when I started and 1,985,180 KB when the shift was done. It took three minutes.

        It's true our fixed-length records are delivered to us with a newline at the end of each record. I always thought that was for the convenience of human readers with text editors, but it also enables Tie::File.

        As the manpage says, Tie::File does not support fixed-width records unless of course they end in a record separator. Whether the records in the original question have newlines at the end, only sparkle can tell us.

      ++Narveson - great, Tie::File was exactly what occurred to me on first thought!
Re: Removing the first record in a file containing fixed records
by pc88mxer (Vicar) on Jul 18, 2008 at 00:35 UTC
    Note that you can use the standard while (<>) { ... } idiom here:
    {
        local $/ = \$record_length;         # read fixed-length records
        my $first_record = <recordfile>;    # consume the header

        $/ = \( 512 * 1024 );               # then switch to 512 KB chunks
        while ( <recordfile> ) {
            print newrecordfile $_;
        }
    }
    Alternatively, you can seek() over the first record instead of reading it.
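
    For example (a minimal sketch, using the bareword handle from the question):

    use Fcntl qw(SEEK_SET);
    seek( recordfile, $record_length, SEEK_SET )
        or die "seek failed: $!";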
Re: Removing the first record in a file containing fixed records
by apl (Monsignor) on Jul 18, 2008 at 00:39 UTC
    Why not have whatever creates this file instead create two files? name.header would contain the first line, while name.data would have the balance of the file.

      If you are using Windows then store the control information in an ADS - an "Alternate Data Stream". This is just an ordinary file associated with the original. An ADS is identified by a ':' followed by the stream name, for example:
      open (my $handle, '>', $filename.':redtape') or die ...
      Perl is happy reading/writing ADS files just like any other.
      Gotchas: ADS files are not visible using Windows Explorer, dir, glob, or readdir (see Win32-StreamNames)
      ADS require NTFS 5 or later (not FAT)
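
      A small hedged sketch (the filename, stream name, and control record are made up; this only works on NTFS):

      use strict;
      use warnings;

      my $filename = 'records.dat';    # hypothetical
      my $control  = "HDR0001\n";      # hypothetical control record

      # stash the control record in the alternate stream ...
      open my $out, '>', "$filename:redtape" or die "Cannot open ADS: $!";
      print $out $control;
      close $out;

      # ... and read it back later
      open my $in, '<', "$filename:redtape" or die "Cannot open ADS: $!";
      my $header = do { local $/; <$in> };
      close $in;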
Re: Removing the first record in a file containing fixed records
by poolpi (Hermit) on Jul 18, 2008 at 09:30 UTC

    If I really understand your need:

    $ head -n3 data.test
    --header--
    sdfsdfsdfs
    sdfsdfsdfs

    $ ls -lh data.test
    -rw-r--r-- 1 user user 525M 2008-07-18 11:11 data.test

    # Linux debian 2.6.18-xen #1 SMP x86_64 GNU/Linux
    $ time perl -i -pe '1 .. s/^.*//ms' data.test

    real    0m48.445s
    user    0m35.550s
    sys     0m2.760s

    $ head -n3 data.test
    sdfsdfsdfs
    sdfsdfsdfs
    sdfsdfsdfs

    hth,
    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb
Re: Removing the first record in a file containing fixed records
by roboticus (Chancellor) on Jul 18, 2008 at 11:57 UTC
    sparkle:

    If the file supports a record type the other application would ignore (treat as a comment), or if you can create a record that won't affect any results from using the file (e.g., a $0.00 card transaction with card number 0000000000000000), you could just overlay the control information and not copy the file...

    ...roboticus
Re: Removing the first record in a file containing fixed records
by sparkle (Novice) on Jul 18, 2008 at 21:27 UTC
    Wow! I didn't expect to get so many responses so quickly. It's nice to see there's a great community here :)

    I decided to go with a loop reading the entire file and outputting each record (except the first) to a separate output file. It turns out that I needed to do some record-by-record validation anyway, so this makes the most sense. I just print OUTPUTFILE $record once the validation checks out for that record.

    I just want to make a quick comment about Tie::File. In my experience it's very nice and easy to use for smaller files, but once you get to even medium-sized files the performance suffers. I had a program using Tie::File which ran for over 15 minutes; switching over to the traditional open/read/write cut the time down to a few seconds! It does have its uses, though, and I use it for other things that aren't as intensive.

    I like the suggestions (like in-place rewriting and swapping the header with the last record) and I think I'll be able to use them in some of the other stuff I'm working on soon.

    Thanks again everyone!

Re: Removing the first record in a file containing fixed records
by perly-gates (Initiate) on Jul 18, 2008 at 16:52 UTC
    When reading the file in a loop, just test $. > 1 before printing; that condition is false for the first line, so the first record gets skipped ($. is the current input line number).
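
    For example (a minimal sketch, using the bareword handles from the question):

    while ( <recordfile> ) {
        print newrecordfile $_ if $. > 1;    # $. counts input records; 1 is the header
    }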
