http://www.perlmonks.org?node_id=1003696


in reply to Get record separator of a file

You are right, that was a badly asked question, sorry. Perhaps it should have been something like: "How can I determine the record separator ... in the most efficient way?"

The background is: I have many files to process with different record separators. The filenames change, as do the suffixes, and so on. And the files are large.

I have an XML file that provides information about where to search for the files, what to do with them, and so on. It would be easy to place the right Tie::File recsep option there.

Otherwise I have to get the recsep from each file before I process it, and benchmark for myself what works best.

I hope this doesn't make it worse. Please also keep in mind that I'm not a native English speaker.

Thank you for your help and best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Re^2: Get record separator of a file
by davido (Cardinal) on Nov 13, 2012 at 20:10 UTC

    Let's ask this. Without worrying about efficiency yet, how would you determine the record separator of one of the files? Until I am clear on how you're able to detect the record separator in the first place, I have a difficult time suggesting an efficient means of doing so. What heuristics are you using? Or is the record separator specified in some sort of header for the file? Or does it depend on the file's extension?

    If you haven't figured that part out yet, you have to step back from the problem and look at it as a human would. Ask yourself, "If I opened the file in an editor (possibly one that displays non-printables too), how would I spot the record separator?" Once you've figured that out, the next step is to isolate the rules, and put them to code. After you get that working, tests and all, you're done. Only then, if you feel the outcome isn't efficient enough for your needs, should you begin profiling and determining what needs to be made more efficient.

    There's an old expression, that Perl is great for prototyping, and often the result is good enough that there's no need to rewrite in C. The same applies here; get it working, and it may be good enough that you don't need to be further concerned with efficiency.


    Dave

      OK, I will try my best to explain.

      I can see the recsep of a file with:

      Karls-Mac-mini:Desktop karl$ hexdump -c -n 8 file.txt
      0000000   f   o   o   ;   b   a   r  \n
      0000008

      Or I hope so.

      But I really don't want to check it this way.

      I have many larger files with \r\n or \n as recsep.

      So I thought about efficiency and figured out that Tie::File is faster than IO::File for my needs (I benchmarked it, but that is another issue).

      But when I tied my @array to the original file, all the data was put into the first slot of my @array. After setting the recsep option of Tie::File to \n, everything was good.
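
For illustration, here is a minimal sketch of passing the recsep option to Tie::File. The sample data, the temporary file, and the choice of "\r\n" as separator are assumptions made up for the example, not the real files from this thread.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw( tempfile );

# Build a small sample file with CRLF line endings (made-up data).
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "foo;bar\r\nbaz;qux\r\n";
close $tmp_fh or die $!;

# Tell Tie::File explicitly which string ends each record.
tie my @records, 'Tie::File', $tmp_name, recsep => "\r\n"
    or die "Cannot tie $tmp_name: $!";

my $count = @records;       # number of records seen by Tie::File
my $first = $records[0];    # separator is stripped (autochomp is on by default)

untie @records;

print "$count records, first is '$first'\n";
```

If the recsep does not match what is really in the file, Tie::File cannot split the data into the expected records, which fits the "everything in the first slot" symptom described above.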

      So I thought it would be a good idea to do something like the hexdump command in Perl to get the recsep, without losing the performance boost that Tie::File gives me.

      I hope very much that this is a better explanation of what I wanted to do.

      Thank you very much for your patience and help.

      Regards, Karl


        Now we're getting somewhere (I think). You should be able to take advantage of Perl's :crlf IO layer to handle the problem for you. I've tested this, and here is how I think it would work out.

        First, Tie::File seems to be "layers" unaware, which is fine, except that you'll have to open the file explicitly, and close it again when you're done, rather than letting Tie::File handle those operations. This gives you control over what layers are applied to the file handle.

        use strict;
        use warnings;
        use Tie::File;
        use Scalar::Util qw( weaken );

        open my $fh, '+<:crlf', 'filename.ext' or die $!;

        my @array;
        my $t = tie @array, 'Tie::File', $fh;
        weaken $t; # tie holds its own ref. We don't want a mem leak.

        # Work, work, work...

        untie @array;
        close $fh or die $!;

        The relevant explanation of ':crlf' from the POD is: " On read converts pairs of CR,LF to a single "\n" newline character. On write converts each "\n" to a CR,LF pair." Since this happens behind the scenes, it should play nice with Tie::File, but I would test on some copies of the files first to be sure.
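
To see the read side of that conversion in isolation, here is a small sketch without Tie::File. The sample data and the temporary file are assumptions for the example; it only shows what the ':crlf' layer does to plain line-by-line reads.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write a sample file with CRLF line endings, byte for byte (made-up data).
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "foo;bar\r\nbaz;qux\r\n";
close $tmp_fh or die $!;

# Read it back through the :crlf layer.
open my $in, '<:crlf', $tmp_name or die "Cannot open $tmp_name: $!";
my @lines = <$in>;
close $in or die $!;

# Count the lines that still contain a carriage return.
my $has_cr = grep { /\r/ } @lines;

print scalar(@lines), " lines, $has_cr of them still contain a CR\n";
```

A file that already uses a bare "\n" passes through the same layer unchanged on read, which is why one open mode can serve both kinds of file.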

        Updated: Added weaken to eliminate a potential memory leak, since tie also holds a ref to its own object.


        Dave

Re^2: Get record separator of a file
by Anonymous Monk on Nov 13, 2012 at 21:56 UTC
    Translation: read a chunk of data that you know to be big enough, then search for known character strings that could be the right answer ... taking care to search for longer strings first.
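
A minimal sketch of that idea, assuming the only candidates are "\r\n", "\n" and "\r". The function name guess_recsep and the default chunk size of 4096 bytes are made up for the example; the chunk has to be large enough to contain at least one complete record.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the record separator from the first
# $size bytes of a file. Returns "\r\n", "\n", "\r", or undef.
sub guess_recsep {
    my ($path, $size) = @_;
    $size ||= 4096;    # assumed to be larger than one record

    # :raw so that no layer translates the line endings before we look.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $chunk, $size;
    die "Cannot read $path: $!" unless defined $got;
    close $fh or die $!;

    # Longest candidate first, so "\r\n" is not mistaken for a lone "\r".
    return $1 if $chunk =~ /(\r\n|\n|\r)/;
    return;
}
```

The result could then be handed to Tie::File as its recsep option. One caveat: if the chunk happens to end exactly between a "\r" and its "\n", this sketch would report "\r", so the chunk size should be chosen generously.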

      Sure, yes.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Data::Dumper;
      use IO::All;

      io(shift)->read(my $chunk, 10);
      print Dumper(unpack("C*", $chunk));
      for(unpack("C*", $chunk)){print chr()}
      $chunk =~ m/(\n|\r\n)/g;
      print unpack("C*", $1);
      __END__

      $ ./recsep.pl dos.csv
      $VAR1 = 102;
      $VAR2 = 111;
      $VAR3 = 111;
      $VAR4 = 59;
      $VAR5 = 98;
      $VAR6 = 97;
      $VAR7 = 114;
      $VAR8 = 13;
      $VAR9 = 10;
      foo;bar
      1310

      $ ./recsep.pl unix.csv
      $VAR1 = 102;
      $VAR2 = 111;
      $VAR3 = 111;
      $VAR4 = 59;
      $VAR5 = 98;
      $VAR6 = 97;
      $VAR7 = 114;
      $VAR8 = 10;
      $VAR9 = 10;
      foo;bar
      10

      I don't know another way. Thank you and regards, Karl
