PerlMonks  

Removing the first record in a file containing fixed records

by sparkle (Novice)
on Jul 17, 2008 at 23:35 UTC ( [id://698472] )

sparkle has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a file that contains a bunch of fixed-length records. The first record of the file contains some control information that is useful while I process the file, but it needs to be removed once I'm done with it.

I'm trying to find an efficient way to remove the first record.

The solution I came up with is to read one record at a time and print it to a new file, skipping the first record.

$i = 0;
until ( eof(recordfile) ) {
    read( recordfile, $record, $record_length );
    if ( $i > 0 ) {
        print newrecordfile $record;
    }
    $i++;
}

These files can be huge, possibly 500+ MB, and it seems like a waste to read and rewrite the entire file just to remove the first record.

Is there a more efficient way to do this?

Replies are listed 'Best First'.
Re: Removing the first record in a file containing fixed records (updated with tested code)
by BrowserUk (Patriarch) on Jul 18, 2008 at 00:37 UTC

    If you're hoping for some method of telling the filesystem to ignore the first n bytes of the file, or to return them to the free space, and so avoid copying the file, both the short and the long answer are: no.

    Then your problem becomes how to copy the file most efficiently. And perhaps the best answer is to copy it in place and so avoid forcing the filesystem to find another 500MB to fit the copy into.

    Something like this (untested) code might fit the bill:

    #! perl -slw
    use strict;
    use Fcntl qw[ SEEK_CUR SEEK_SET ];

    use constant BUFSIZE => 64 * 1024;

    our $RECLEN || die "you must specify the length of the header. -RECLEN=nnn";
    @ARGV or die "No filename";

    open FILE, '+<:raw', $ARGV[ 0 ] or die "$!: $ARGV[ 0 ]";

    # Consume (and discard) the header record.
    sysread FILE, my $header, $RECLEN or die "sysread: $!";

    # Leapfrog through the file: read a buffer ahead, write it back
    # $RECLEN bytes earlier, remembering both positions as we go.
    my( $nextWrite, $nextRead ) = 0;
    while( sysread FILE, my $buffer, BUFSIZE ) {
        $nextRead = sysseek FILE, 0, SEEK_CUR
            or die "Seek query next read failed; $!";
        sysseek FILE, $nextWrite, SEEK_SET
            or die "Seek next write failed: $!";
        syswrite FILE, $buffer
            or die "Write failed: $!";
        $nextWrite = sysseek FILE, 0, SEEK_CUR
            or die "Seek query next write failed $!";
        sysseek FILE, $nextRead, SEEK_SET
            or die "Seek next read failed: $!";
    }

    # Chop the now-duplicated tail off the end.
    truncate FILE, $nextWrite or die "truncate failed: $!";
    close FILE or die "close failed: $!";

    A casual test showed the program took < 4 seconds on a 500 MB file, though you can hear the disk thrashing as the system flushes its file cache to disk for several seconds afterwards.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Removing the first record in a file containing fixed records
by Fletch (Bishop) on Jul 18, 2008 at 00:35 UTC

    Not really, without altering the structure. You can truncate files from the end, but you can't . . . precate (if I may coin a terrible word) things off the front.

    A kludge (although this is remotely similar to what some databases do) is to alter some part of the record to mark it as invalid (perhaps setting a field to say XXXDELETEDNOUSEXXX) and then have your downstream processing ignore those records; periodically you can have a utility "vacuum" out the deleted records by copying the live records into a new file, as you mention above. Alternately, if the ordering of records in the file is not important, you can overwrite deleted records with new ones in place (perhaps maintaining an external "free list" of offsets of deleted record slots).
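
    A minimal sketch of that tombstoning idea (the filename, record length, and marker text here are all placeholders):

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    my $rec_len   = 100;                    # hypothetical fixed record length
    my $tombstone = 'XXXDELETEDNOUSEXXX';

    sub mark_deleted {
        my ( $fh, $rec_no ) = @_;
        # Pad the marker to a full record so record boundaries stay intact.
        my $dead = $tombstone . ( ' ' x ( $rec_len - length $tombstone ) );
        sysseek $fh, $rec_no * $rec_len, SEEK_SET or die "seek: $!";
        syswrite $fh, $dead, $rec_len or die "write: $!";
    }

    open my $fh, '+<:raw', 'records.dat' or die "open: $!";
    mark_deleted( $fh, 0 );    # tombstone the header record in place
    close $fh or die "close: $!";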

    Of course doing this type of thing you're already on your way towards reimplementing your own database system so you might want to consider taking that plunge and letting someone else do the heavy lifting for you and concentrate on your processing tasks.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      If ordering is not important he could just read the very last record, overwrite the first record with it and truncate the file before the last record.
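
      A minimal sketch of that swap-and-truncate approach (the filename and record length are made up for illustration):

      use strict;
      use warnings;
      use Fcntl qw(SEEK_SET);

      my $rec_len = 100;             # hypothetical fixed record length
      my $file    = 'records.dat';   # hypothetical

      open my $fh, '+<:raw', $file or die "open: $!";
      my $size = -s $fh;

      # Read the last record ...
      sysseek $fh, $size - $rec_len, SEEK_SET or die "seek: $!";
      sysread $fh, my $last, $rec_len or die "read: $!";

      # ... overwrite the first (header) record with it ...
      sysseek $fh, 0, SEEK_SET or die "seek: $!";
      syswrite $fh, $last, $rec_len or die "write: $!";

      # ... and chop the now-duplicated last record off the end.
      truncate $fh, $size - $rec_len or die "truncate: $!";
      close $fh or die "close: $!";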
        Since sparkle asked for an efficient method that works even on large files (2 GB is sometimes the next limit), I agree with Crackers2's suggestion: it works well because it preserves disk space and implies a minimal number of copy operations. But if the header has a different size than the other records, if order is important, or if a copy of the file is still required, I would suggest handing that task to a tool that is optimised for that kind of operation. Where available, a system(...) call to tail or dd might be worth considering.
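
        For instance (hedged sketches: they assume GNU tail and a POSIX dd, and the record length and filenames are made up):

        my $rec_len = 100;            # hypothetical header length
        my $start   = $rec_len + 1;   # tail -c +N starts output at byte N (1-based)
        system( "tail -c +$start records.dat > records.stripped" ) == 0
            or die "tail failed: $?";

        # or with dd, skipping exactly one record-sized block:
        system( "dd if=records.dat of=records.stripped bs=$rec_len skip=1" ) == 0
            or die "dd failed: $?";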

      I think that word is decapitate.


      #my sig used to say 'I humbly seek wisdom. '. Now it says:
      use strict;
      use warnings;
      I humbly seek wisdom.
Re: Removing the first record in a file containing fixed records
by runrig (Abbot) on Jul 17, 2008 at 23:43 UTC
    After the first record, increase the record length...to 1 MB or perhaps several MB. There's no reason to read the entire file one record at a time. Or perhaps (same idea but less code on your part):
    use File::Copy qw(copy);

    sysread( recordfile, $record, $record_length );
    copy( \*recordfile, \*newrecordfile );
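
    And a sketch of the first variant, reading in 1 MB chunks after discarding the header (the filenames and record length are placeholders):

    use strict;
    use warnings;

    my $record_length = 100;    # hypothetical
    open my $in,  '<:raw', 'records.dat' or die "open: $!";
    open my $out, '>:raw', 'records.new' or die "open: $!";

    read $in, my $header, $record_length or die "read: $!";    # discard the header

    while ( read $in, my $chunk, 1024 * 1024 ) {               # 1 MB at a time
        print $out $chunk;
    }

    close $out or die "close: $!";
    close $in;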
Re: Removing the first record in a file containing fixed records
by jwkrahn (Abbot) on Jul 18, 2008 at 00:32 UTC

    I know this will work on Linux but I'm pretty sure that it won't work on Windows:

    my $filename = 'somefile';
    my $rec_len  = 567;

    open my $IN,  '<:raw',  $filename or die "Cannot open '$filename' $!";
    open my $OUT, '+<:raw', $filename or die "Cannot open '$filename' $!";

    my $file_size     = -s $IN;
    my $total_records = $file_size / $rec_len;

    # Read in fixed-length chunks of $rec_len bytes.
    $/ = \$rec_len;

    while ( <$IN> ) {
        next if $. == 1;    # skip the header record
        print $OUT $_;      # write it one record earlier
    }

    $total_records == $. or die "Read only $. records but expected $total_records records!\n";

    truncate $IN, $file_size - $rec_len;

    close $OUT;
    close $IN;

    Or you could do it using one of the memory-mapped file modules, which should work on Windows.
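
    For example, a hedged sketch with File::Map (just one such module; the filename and record length are placeholders). Note that Perl copies the right-hand side of the substr assignment first, so this briefly costs nearly a file-sized chunk of memory:

    use strict;
    use warnings;
    use File::Map qw(map_file);

    my $filename = 'records.dat';    # hypothetical
    my $rec_len  = 100;              # hypothetical

    my $new_size;
    {
        map_file my $map, $filename, '+<';    # $map now behaves like a scalar
        $new_size = length( $map ) - $rec_len;

        # Shift everything after the header down over it; the replacement
        # is the same length as the replaced span, as the mapping requires.
        substr( $map, 0, $new_size ) = substr( $map, $rec_len );
    }    # the mapping is released when $map goes out of scope

    # A mapping can't change the file's size, so truncate separately.
    open my $fh, '+<', $filename or die "Cannot open '$filename': $!";
    truncate $fh, $new_size or die "truncate: $!";
    close $fh;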

      What makes you so sure it won't work on Windows?

      The :raw IO layer is equivalent to binmode, and I see no unixisms in your code that would prevent the program from working on Windows.

      I just saw - you're opening the same file twice. This won't work under Windows - you'll have to seek back and forth in the one file to copy the data from the end to the front.

      Update2: I tested your code on Win32 and it just works.

        I don't use Windows so I have no way to verify if it does or does not work there, sorry.

Re: Removing the first record in a file containing fixed records
by Narveson (Chaplain) on Jul 18, 2008 at 05:48 UTC

    Given all the gloomy responses above, you won't believe how simple this is with Tie::File, but just try it.

    use strict;
    use warnings;
    use Tie::File;

    tie my @records, 'Tie::File', 'file.txt';
    shift @records;

    Update: Thanks are due to BrowserUK for the benchmark below.

    It's a beautiful example of a tradeoff between machine time and the time of

    • developers,
    • reviewers, and
    • maintainers.

    Further thought: The read-seek-write solution belongs in Tie::File under the fixed-record-length option that Dominus has listed as TO DO. Then the rest of us could enjoy the best of both worlds.

      I suggest you try it.

      First you should try it on a small file with fixed length records and no delimiters.

      When you've worked out why the file ends up empty, and how to fix that, then try it on a 500MB fixed record length file with no delimiters. And if you could time how long it takes and report back that would be interesting.

      Don't worry about being too accurate, the nearest week should be fine.


        I suggest you try it.

        I have done so. We have some large fixed-record-length extracts lying around, record length about 4 KB. The size of file.txt was 1,985,184 KB when I started and 1,985,180 KB when the shift was done. It took three minutes.

        It's true our fixed-length records are delivered to us with a newline at the end of each record. I always thought that was for the convenience of human readers with text editors, but it also enables Tie::File.

        As the manpage says, Tie::File does not support fixed-width records unless of course they end in a record separator. Whether the records in the original question have newlines at the end, only sparkle can tell us.

      ++Narveson - great, Tie::File was exactly what occurred to me on first thought!
Re: Removing the first record in a file containing fixed records
by pc88mxer (Vicar) on Jul 18, 2008 at 00:35 UTC
    Note that you can use the standard while (<>) { ... } idiom here:
    {
        local $/ = \$record_length;         # read fixed-length records
        my $first_record = <recordfile>;    # consume the header

        $/ = \( 512 * 1024 );               # then switch to 512 KB chunks
        while ( <recordfile> ) {
            print newrecordfile $_;
        }
    }
    Alternatively, you can seek() over the first record instead of reading it.
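
    For example (a minimal sketch, using the bareword handle from the question):

    use Fcntl qw(SEEK_SET);
    seek( recordfile, $record_length, SEEK_SET )
        or die "seek failed: $!";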
Re: Removing the first record in a file containing fixed records
by apl (Monsignor) on Jul 18, 2008 at 00:39 UTC
    Why not have whatever creates this file instead create two files? name.header would contain the first line, while name.data would have the balance of the file.

      If you are using Windows then store the control information in an ADS - an "Alternate Data Stream". This is just an ordinary file associated with the original. An ADS is identified by a ':' followed by the stream name, for example:
      open (my $handle, '>', $filename.':redtape') or die ...
      Perl is happy reading/writing ADS files just like any other.
      Gotchas: ADS files are not visible using Windows Explorer, dir, glob, or readdir (see Win32-StreamNames)
      ADS require NTFS 5 or later (not FAT)
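
      A small hedged sketch (the filename, stream name, and control record are made up; this only works on NTFS):

      use strict;
      use warnings;

      my $filename = 'records.dat';    # hypothetical
      my $control  = "HDR0001\n";      # hypothetical control record

      # stash the control record in the alternate stream ...
      open my $out, '>', "$filename:redtape" or die "Cannot open ADS: $!";
      print $out $control;
      close $out;

      # ... and read it back later
      open my $in, '<', "$filename:redtape" or die "Cannot open ADS: $!";
      my $header = do { local $/; <$in> };
      close $in;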
Re: Removing the first record in a file containing fixed records
by poolpi (Hermit) on Jul 18, 2008 at 09:30 UTC

    If I really understand your need:

    $ head -n3 data.test
    --header--
    sdfsdfsdfs
    sdfsdfsdfs

    $ ls -lh data.test
    -rw-r--r-- 1 user user 525M 2008-07-18 11:11 data.test

    # Linux debian 2.6.18-xen #1 SMP x86_64 GNU/Linux
    $ time perl -i -pe '1 .. s/^.*//ms' data.test

    real    0m48.445s
    user    0m35.550s
    sys     0m2.760s

    $ head -n3 data.test
    sdfsdfsdfs
    sdfsdfsdfs
    sdfsdfsdfs

    hth,
    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb
Re: Removing the first record in a file containing fixed records
by roboticus (Chancellor) on Jul 18, 2008 at 11:57 UTC
    sparkle:

    If the file supports a record type the other application would ignore (treat as a comment), or if you can create a record that won't affect any results from using the file (e.g., a $0.00 card transaction with card number 0000000000000000), you could just overlay the control information and not copy the file...

    ...roboticus
Re: Removing the first record in a file containing fixed records
by sparkle (Novice) on Jul 18, 2008 at 21:27 UTC
    Wow! I didn't expect to get so many responses so quickly. It's nice to see there's a great community here :)

    I decided to go with a loop reading the entire file and outputting each record (except the first) to a separate output file. It turns out that I needed to do some record-by-record validation anyway, so this makes the most sense. I just print OUTPUTFILE $record once the validation checks out for that record.

    I just want to make a quick comment about Tie::File. In my experience it's very nice and easy to use for smaller files, but once you get to even medium-sized files the performance suffers. I had a program using Tie::File which ran for over 15 minutes; switching over to the traditional open/read/write cut the time down to a few seconds! It does have its uses, though, and I use it for other things that aren't as intensive.

    I like the suggestions (like in-place rewriting and swapping the header with the last record) and I think I'll be able to use them in some of the other stuff I'm working on soon.

    Thanks again everyone!

Re: Removing the first record in a file containing fixed records
by perly-gates (Initiate) on Jul 18, 2008 at 16:52 UTC
    When reading the file in a loop, just test $. > 1 before printing; that condition is false for the first line, so the first record gets skipped ($. is the current input line number).
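
    For example (a minimal sketch, using the bareword handles from the question):

    while ( <recordfile> ) {
        print newrecordfile $_ if $. > 1;    # $. counts input records; 1 is the header
    }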
