Re: Removing the first record in a file containing fixed records (updated with tested code)
by BrowserUk (Patriarch) on Jul 18, 2008 at 00:37 UTC
If you're hoping for some method of telling the filesystem to ignore, or return the first n bytes of the file to the freespace and so avoid copying the file, both the short and long answers are: no.
Then your problem becomes how to copy the file most efficiently. And perhaps the best answer is to copy it in place and so avoid forcing the filesystem to find another 500MB to fit the copy into.
Something like this code (originally posted untested, since verified) might fit the bill:
#! perl -slw
use strict;

use Fcntl qw[ SEEK_CUR SEEK_SET ];

use constant BUFSIZE => 64 * 1024;

our $RECLEN or die "you must specify the length of the header: -RECLEN=nnn";
@ARGV or die "No filename";

open FILE, '+<:raw', $ARGV[ 0 ]
    or die "$!: $ARGV[ 0 ]";

## Read (and discard) the header, then shuffle the rest of the file down over it
sysread FILE, my $header, $RECLEN or die "sysread: $!";

my( $nextWrite, $nextRead ) = ( 0, 0 );

while( sysread FILE, my $buffer, BUFSIZE ) {
    $nextRead = sysseek FILE, 0, SEEK_CUR
        or die "Seek query next read failed: $!";

    sysseek FILE, $nextWrite, SEEK_SET
        or die "Seek next write failed: $!";

    syswrite FILE, $buffer
        or die "Write failed: $!";

    $nextWrite = sysseek FILE, 0, SEEK_CUR
        or die "Seek query next write failed: $!";

    sysseek FILE, $nextRead, SEEK_SET
        or die "Seek next read failed: $!";
}

truncate FILE, $nextWrite or die "truncate failed: $!";
close FILE or die "close failed: $!";
A casual test showed the program took under 4 seconds on a 500 MB file, though you can hear the disk thrashing as the system flushes its file cache to disk for several seconds afterwards.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Removing the first record in a file containing fixed records
by Fletch (Bishop) on Jul 18, 2008 at 00:35 UTC
Not really, without altering the structure. You can truncate files from the end, but you can't . . . precate (if I may coin a terrible word) things off the front. One kludge (remotely similar to what some databases do) is to alter some part of the record to mark it as invalid (perhaps setting a field to, say, XXXDELETEDNOUSEXXX) and have your downstream processing ignore such records; periodically a utility can "vacuum" out the deleted records by copying the live records into a new file, as you mention above. Alternately, if the ordering of records in the file is not important, you can overwrite deleted records with new ones in place (perhaps maintaining an external "free list" of offsets of deleted record slots).
Of course, doing this type of thing you're already well on your way towards reimplementing your own database system, so you might want to consider taking that plunge: let someone else do the heavy lifting and concentrate on your processing tasks.
The cake is a lie.
If ordering is not important he could just read the very last record, overwrite the first record with it and truncate the file before the last record.
Since sparkle asked for an efficient method that works even on large files (2 GB is sometimes the next limit), I agree with Crackers2's suggestion: it preserves disk space and needs a minimal amount of copying. But if the header has a different size than the other records, if order is important, or if a copy of the file is required anyway, I would suggest handing that task to a tool that is optimised for this kind of operation. Where available, a system() call to tail or dd might be worth considering.
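For example, with an invented 10-byte header, either tool can produce the trimmed copy (file names are made up; on a real 500 MB file you would probably give dd a larger output block size rather than writing in header-sized chunks):

```shell
RECLEN=10
printf '##HEADER##AAAAAAAAAABBBBBBBBBB' > big.dat   # header + two records

# dd: block size = header length, skip one block
dd if=big.dat of=trimmed_dd.dat bs=$RECLEN skip=1 2>/dev/null

# tail: byte offsets are 1-based, hence +(RECLEN + 1)
tail -c +$((RECLEN + 1)) big.dat > trimmed_tail.dat

cmp -s trimmed_dd.dat trimmed_tail.dat && echo identical   # identical
```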
Re: Removing the first record in a file containing fixed records
by runrig (Abbot) on Jul 17, 2008 at 23:43 UTC
After the first record, increase the read length...to 1 MB or perhaps several MB. There's no reason to read the entire file one record at a time. Or perhaps (same idea, but less code on your part):
use File::Copy qw(copy);

sysread( recordfile, my $record, $record_length );   # read past the first record
copy( \*recordfile, \*newrecordfile );               # then bulk-copy the rest
Re: Removing the first record in a file containing fixed records
by jwkrahn (Abbot) on Jul 18, 2008 at 00:32 UTC
my $filename = 'somefile';
my $rec_len  = 567;

open my $IN,  '<:raw',  $filename or die "Cannot open '$filename' $!";
open my $OUT, '+<:raw', $filename or die "Cannot open '$filename' $!";

my $file_size     = -s $IN;
my $total_records = $file_size / $rec_len;

$/ = \$rec_len;
while ( <$IN> ) {
    next if $. == 1;
    print $OUT $_;
}

$total_records == $. or die "Read only $. records but expected $total_records records!\n";

truncate $IN, $file_size - $rec_len;

close $OUT;
close $IN;
Or you could do it using one of the memory mapped file modules which should work on windows.
What makes you so sure it won't work on Windows?
The :raw IO layer is equivalent to the binmode and I see no unixisms in your code that would prevent the program from working on Windows.
I just saw - you're opening the same file twice. This won't work under Windows - you'll have to seek back and forth in the one file to copy the data from the end to the front.
Update2: I tested your code on Win32 and it just works.
Re: Removing the first record in a file containing fixed records
by Narveson (Chaplain) on Jul 18, 2008 at 05:48 UTC
Given all the gloomy responses above, you won't believe how simple this is with Tie::File, but just try it.
use strict;
use warnings;
use Tie::File;
tie my @records, 'Tie::File', 'file.txt';
shift @records;
Update: Thanks are due to BrowserUK for the benchmark below.
It's a beautiful example of a tradeoff between machine time and the time of
- developers,
- reviewers, and
- maintainers.
Further thought: The read-seek-write solution belongs in Tie::File under the fixed-record-length option that Dominus has listed as TO DO. Then the rest of us could enjoy the best of both worlds.
I suggest you try it.
First you should try it on a small file with fixed length records and no delimiters.
When you've worked out why the file ends up empty, and how to fix that, then try it on a 500MB fixed record length file with no delimiters. And if you could time how long it takes and report back that would be interesting.
Don't worry about being too accurate, the nearest week should be fine.
I suggest you try it.
I have done so. We have some large fixed-record-length extracts lying around, record length about 4 KB. The size of file.txt was 1,985,184 KB when I started and 1,985,180 KB when the shift was done. It took three minutes.
It's true our fixed-length records are delivered to us with a newline at the end of each record. I always thought that was for the convenience of human readers with text editors, but it also enables Tie::File.
As the manpage says, Tie::File does not support fixed-width records unless of course they end in a record separator. Whether the records in the original question have newlines at the end, only sparkle can tell us.
++Narveson - great, Tie::File was exactly what occurred to me on first thought!
Re: Removing the first record in a file containing fixed records
by pc88mxer (Vicar) on Jul 18, 2008 at 00:35 UTC
Note that you can use the standard while (<>) { ... } idiom here:
{
    local( $/ ) = \$record_length;
    my $first_record = <recordfile>;    # read (and discard) the first record

    $/ = \( 512 * 1024 );               # then switch to 512 KB chunks
    while ( <recordfile> ) {
        print newrecordfile $_;
    }
}
Alternatively, you can seek() over the first record instead of reading it.
Re: Removing the first record in a file containing fixed records
by apl (Monsignor) on Jul 18, 2008 at 00:39 UTC
Why not have whatever creates this file instead create two files? name.header would contain the first line, while name.data would have the balance of the file.
If you are using Windows then store the control information in an ADS - "Alternate Data Stream". This is just an ordinary file associated with the original. An ADS is identified by a ':' followed by the stream name, for example:
open (my $handle, '>', $filename.':redtape') or die ...
Perl is happy reading/writing ADS files just like any other. Gotchas:
- ADS files are not visible using Windows Explorer, dir, glob, or readdir (see Win32-StreamNames)
- ADS requires NTFS 5 or later (not FAT)
Re: Removing the first record in a file containing fixed records
by poolpi (Hermit) on Jul 18, 2008 at 09:30 UTC
$ head -n3 data.test
--header--
sdfsdfsdfs
sdfsdfsdfs
$ ls -lh data.test
-rw-r--r-- 1 user user 525M 2008-07-18 11:11 data.test
# Linux debian 2.6.18-xen #1 SMP x86_64 GNU/Linux
$ time perl -i -pe '1 .. s/^.*//ms' data.test
real 0m48.445s
user 0m35.550s
sys 0m2.760s
$ head -n3 data.test
sdfsdfsdfs
sdfsdfsdfs
sdfsdfsdfs
hth, PooLpi
Re: Removing the first record in a file containing fixed records
by roboticus (Chancellor) on Jul 18, 2008 at 11:57 UTC
sparkle:
If the file supports a record type the other application would ignore (treat as a comment), or if you can create a record that won't affect any results from using the file (e.g., a $0.00 card transaction with card number 0000000000000000), you could just overlay the control information and not copy the file...
...roboticus
Re: Removing the first record in a file containing fixed records
by sparkle (Novice) on Jul 18, 2008 at 21:27 UTC
Wow! I didn't expect to get so many responses so quickly. It's nice to see there's a great community here :)
I decided to go with a loop reading the entire file and outputting each record (except the first) to a separate output file. It turns out that I needed to do some record by record validation anyway so this makes the most sense. I'd just print OUTPUTFILE ($record) after the validation checks out for that record.
I just want to make a quick comment about Tie::File. In my experience it's very nice and easy to use for smaller files, but once you get to even medium-sized files the performance will suffer. I had a program using Tie::File which ran for over 15 minutes, and switching over to the traditional open/read/write cut the time down to a few seconds! It does have its uses though, and I use it for other things that aren't as intensive.
I like the suggestions (like in place writing and swapping the header with the last record) and I think I'll be able to use them in some of the other stuff I'm working on soon.
Thanks again everyone!
Re: Removing the first record in a file containing fixed records
by perly-gates (Initiate) on Jul 18, 2008 at 16:52 UTC
When reading the file in a loop, just use
next unless $. > 1;
The above statement skips the first line of the file ($. is the current input line number).