Re: Get record separator of a file

in reply to Get record separator of a file

You are right, the question was badly asked, sorry. Perhaps it should have been something like: "How can I determine the record separator ... in the most efficient way?"

Background: I have many files to process with different record separators. The filenames change, as do the suffixes, and so on. And the files are large.

I have an XML file that says where to look for the files, what to do with them, and so on. It would be easy to put the right Tie::File recsep option there.

Otherwise I have to get the recsep from each file before I process it, and benchmark for myself what works best.

I hope this doesn't make it worse. Please also keep in mind that I'm not a native English speaker.

Thank you for your help and best regards, Karl

«The Crux of the Biscuit is the Apostrophe»


Replies are listed 'Best First'.
Re^2: Get record separator of a file by davido (Cardinal) on Nov 13, 2012 at 20:10 UTC
Let's ask this. Without worrying about efficiency yet, how would you determine the record separator of one of the files? Until I am clear on how you're able to detect the record separator in the first place, I have a difficult time suggesting an efficient means of doing so. What heuristics are you using? Or is the record separator specified in some sort of header for the file? Or does it depend on the file's extension?

If you haven't figured that part out yet, you have to step back from the problem and look at it as a human would. Ask yourself, "If I opened the file in an editor (possibly one that displays non-printables too), how would I spot the record separator?" Once you've figured that out, the next step is to isolate the rules, and put them to code. After you get that working, tests and all, you're done. Only then, if you feel the outcome isn't efficient enough for your needs, should you begin profiling and determining what needs to be made more efficient.

There's an old expression, that Perl is great for prototyping, and often the result is good enough that there's no need to rewrite in C. The same applies here; get it working, and it may be good enough that you don't need to be further concerned with efficiency.

Dave
Re^3: Get record separator of a file by karlgoethebier (Abbot) on Nov 13, 2012 at 22:38 UTC
OK, I will try my best to explain. I can see the recsep of a file with:

    Karls-Mac-mini:Desktop karl$ hexdump -c -n 8 file.txt
    0000000   f   o   o   ;   b   a   r  \n
    0000008

Or I hope so. But I really don't want to check it this way. I have many large files with `\r\n` or `\n` as recsep. So I thought about efficiency and found that `Tie::File` is faster than `IO::File` for my needs (I benchmarked it, but that is another issue). But when I tied `my @array` to the original file, all data was put into the first slot of `my @array`. After setting the `recsep` option of `Tie::File` to `\n`, everything was good. So I thought it would be a good idea to do something like the `hexdump` command in Perl to get the recsep, without losing the performance boost that `Tie::File` gives me.

I hope very much that this is a better explanation of what I wanted to do. Thank you very much for your patience and help.

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»
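The effect of the `recsep` option described above can be reproduced with a small sketch. The sample data and the temporary file are made up for illustration; only `Tie::File`, its `recsep` option and its default `autochomp` behaviour come from the module itself.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw( tempfile );

# Hypothetical sample data: two CRLF-terminated records.
my ($out, $path) = tempfile( UNLINK => 1 );
binmode $out;                        # write the bytes exactly as given
print {$out} "foo;bar\r\nbaz;qux\r\n";
close $out or die $!;

# Tell Tie::File explicitly which string ends a record.
tie my @records, 'Tie::File', $path, recsep => "\r\n"
    or die "Cannot tie $path: $!";

my $count = scalar @records;         # number of records found
my $first = $records[0];             # autochomp (on by default) strips the recsep

untie @records;

print "$count records, first is '$first'\n";
```

If the separator given does not match the one in the file, the whole content can end up in a single element, which would match the behaviour described in the post.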
Re^4: Get record separator of a file by davido (Cardinal) on Nov 14, 2012 at 00:08 UTC
Now we're getting somewhere (I think). You should be able to take advantage of Perl's `:crlf` IO layer to handle the problem for you. ~~I'll let you test this yourself~~ I've tested this, and here is how ~~I think~~ it would work out.

First, Tie::File seems to be "layers" unaware, which is fine, except that you'll have to open the file explicitly, and close it again when you're done, rather than letting Tie::File handle those operations. This gives you control over what layers are applied to the file handle.

    use strict;
    use warnings;
    use Tie::File;
    use Scalar::Util qw( weaken );

    open my $fh, '+<:crlf', 'filename.ext' or die $!;
    my @array;
    my $t = tie @array, 'Tie::File', $fh;
    weaken $t; # tie holds its own ref. We don't want a mem leak.

    # Work, work, work...

    untie @array;
    close $fh or die $!;

The relevant explanation of `:crlf` from the POD is: "On read converts pairs of CR,LF to a single "\n" newline character. On write converts each "\n" to a CR,LF pair."

Since this happens behind the scenes, it should play nice with Tie::File, but I would test on some copies of the files first to be sure.

Updated: Added weaken to eliminate a potential memory leak, since tie also holds a ref to its own object.

Dave
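To see what the `:crlf` layer does on its own, here is a minimal sketch that writes a CRLF file and reads it back through the layer. The sample data and the temporary file are assumptions for illustration, and the sketch deliberately leaves Tie::File out, so it says nothing about how the two interact.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical sample data, written byte for byte with CRLF line ends.
my ($out, $path) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "foo;bar\r\nbaz;qux\r\n";
close $out or die $!;

# Read it back through the :crlf layer; each CR,LF pair arrives as "\n".
open my $in, '<:crlf', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in or die $!;

chomp @lines;                        # removes the "\n" the layer produced
print "$_\n" for @lines;
```

A file that already uses plain `\n` should read the same way through this layer, which is why one open mode could cover both kinds of file.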
[SOLVED]Re^5: Get record separator of a file by karlgoethebier (Abbot) on Nov 14, 2012 at 13:38 UTC
Re^5: Get record separator of a file by karlgoethebier (Abbot) on Nov 14, 2012 at 19:00 UTC
Re^2: Get record separator of a file by Anonymous Monk on Nov 13, 2012 at 21:56 UTC
Translation: read a chunk of data that you know to be big enough, then search for known character strings that could be the right answer ... taking care to search for longer strings first.
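The approach just described could be sketched roughly as follows, using only core Perl. The helper name `detect_recsep`, the 4096-byte default and the three candidate separators are assumptions made for this illustration, not something specified in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "\r\n", "\n" or "\r", or undef if none is found.
sub detect_recsep {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 4096;            # assumed to be longer than the first record

    # :raw so that no layer translates line endings before we look at them.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $chunk = '';
    my $got = read $fh, $chunk, $chunk_size;
    die "Cannot read $path: $!" unless defined $got;
    close $fh or die $!;

    # Longest candidate first, so "\r\n" is not mistaken for a bare "\r".
    return $chunk =~ /(\r\n|\n|\r)/ ? $1 : undef;
}
```

The result could then be passed to the `recsep` option of `Tie::File`. One caveat of this sketch: if the chunk happens to end exactly between `\r` and `\n`, it reports a bare `\r`, so the chunk size should comfortably exceed the first record.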
Re^3: Get record separator of a file by karlgoethebier (Abbot) on Nov 14, 2012 at 09:59 UTC
Sure, yes.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use IO::All;

    # Read the first 10 bytes of the file named on the command line.
    io(shift)->read(my $chunk, 10);

    # Show every byte value, then the characters themselves.
    print Dumper(unpack("C*", $chunk));
    for (unpack("C*", $chunk)) { print chr() }

    # Longer separator first; print its byte values if one was found.
    print unpack("C*", $1) if $chunk =~ m/(\r\n|\n)/;

    __END__
    $ ./recsep.pl dos.csv
    $VAR1 = 102;
    $VAR2 = 111;
    $VAR3 = 111;
    $VAR4 = 59;
    $VAR5 = 98;
    $VAR6 = 97;
    $VAR7 = 114;
    $VAR8 = 13;
    $VAR9 = 10;
    foo;bar
    1310
    $ ./recsep.pl unix.csv
    $VAR1 = 102;
    $VAR2 = 111;
    $VAR3 = 111;
    $VAR4 = 59;
    $VAR5 = 98;
    $VAR6 = 97;
    $VAR7 = 114;
    $VAR8 = 10;
    $VAR9 = 10;
    foo;bar
    10

I don't know any other way.

Thank you and regards, Karl

«The Crux of the Biscuit is the Apostrophe»

In Section Seekers of Perl Wisdom