PerlMonks  

Get record separator of a file

by karlgoethebier (Curate)
on Nov 13, 2012 at 18:32 UTC (#1003686=perlquestion)
karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:

Good evening monks,

I would like to ask you how I can determine the record separator of a file before I process it using Tie::File.

Thank you and regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Re: Get record separator of a file
by MidLifeXis (Prior) on Nov 13, 2012 at 18:44 UTC

    Tie::File::Schrödinger?

    Unless you know of a way to identify the record separator prior to inspecting it (in which case, congrats), you need to look at it before you can know what is in it.

    --MidLifeXis

Re: Get record separator of a file
by duff (Vicar) on Nov 13, 2012 at 18:47 UTC

    Um ... you have to know the record separator beforehand. That's not something that's intrinsic to a file.

    If you're asking how to specify the record separator for Tie::File, that's in the Tie::File documentation.
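A minimal sketch of what specifying the record separator looks like, assuming a CRLF-terminated file. The sample data and the temporary file are made up for illustration; `recsep` is the option named in the Tie::File documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical sample data: two records, each terminated by CRLF.
my ($fh, $file) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "foo;bar\r\nbaz;qux\r\n";
close $fh or die "Cannot close $file: $!";

# recsep tells Tie::File which string terminates a record.
tie my @records, 'Tie::File', $file, recsep => "\r\n"
    or die "Cannot tie $file: $!";

my $count = scalar @records;   # number of records found
my $first = $records[0];       # autochomp (on by default) strips the recsep
print "$count records, first is '$first'\n";

untie @records;
```

Because the separator is passed explicitly, the result does not depend on the platform default for `$/`.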

Re: Get record separator of a file
by karlgoethebier (Curate) on Nov 13, 2012 at 19:42 UTC

    You are right, badly asked question, sorry. Perhaps it should have been something like: "How can I determine the record separator...in the most efficient way?"

    Background is: I have many files to process with different record separators. Filenames change, as do suffixes and so on. And the files are large.

    I have an XML file that provides information on where to search for the files, what to do with them and so on. It would be easy to place the right Tie::File recsep option there.

    Otherwise I have to get the recsep from each file before I process it. And benchmark myself what's best.

    I hope this doesn't make it worse. Please also keep in mind that I'm not a native English speaker.

    Thank you for your help and best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      Let's ask this. Without worrying about efficiency yet, how would you determine the record separator of one of the files? Until I am clear on how you're able to detect the record separator in the first place, I have a difficult time suggesting an efficient means of doing so. What heuristics are you using? Or is the record separator specified in some sort of header for the file? Or does it depend on the file's extension?

      If you haven't figured that part out yet, you have to step back from the problem and look at it as a human would. Ask yourself, "If I opened the file in an editor (possibly one that displays non-printables too), how would I spot the record separator?" Once you've figured that out, the next step is to isolate the rules, and put them to code. After you get that working, tests and all, you're done. Only then, if you feel the outcome isn't efficient enough for your needs, should you begin profiling and determining what needs to be made more efficient.

      There's an old expression, that Perl is great for prototyping, and often the result is good enough that there's no need to rewrite in C. The same applies here; get it working, and it may be good enough that you don't need to be further concerned with efficiency.


      Dave

        OK, I will try my best to explain.

        I can see the recsep of a file with:

        Karls-Mac-mini:Desktop karl$ hexdump -c -n 8 file.txt
        0000000   f   o   o   ;   b   a   r  \n
        0000008

        Or I hope so.

        But I really don't want to check it this way.

        I have many larger files with \r\n or \n as recsep.

        So I thought about efficiency and figured out that Tie::File is faster than IO::File for my needs (I benchmarked it, but that is another issue).

        But when I tied my @array to the original file, all data was put into the first slot of my @array. After setting the recsep option of Tie::File to \n, everything was good.

        So I thought it would be a good idea to do something like the hexdump command in Perl to get the recsep, without losing the performance boost that Tie::File gives me.

        I hope very much that this is a better explanation of what I wanted to do.

        Thank you very much for your patience and help.

        Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

      Translation: read a chunk of data that you know to be big enough, then search for known character strings that could be the right answer ... taking care to search for longer strings first.

        Sure, yes.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;
        use IO::All;

        io(shift)->read(my $chunk, 10);
        print Dumper(unpack("C*", $chunk));
        for(unpack("C*", $chunk)){print chr()}
        $chunk =~ m/(\n|\r\n)/g;
        print unpack("C*", $1);
        __END__
        $ ./recsep.pl dos.csv
        $VAR1 = 102;
        $VAR2 = 111;
        $VAR3 = 111;
        $VAR4 = 59;
        $VAR5 = 98;
        $VAR6 = 97;
        $VAR7 = 114;
        $VAR8 = 13;
        $VAR9 = 10;
        foo;bar
        1310
        $ ./recsep.pl unix.csv
        $VAR1 = 102;
        $VAR2 = 111;
        $VAR3 = 111;
        $VAR4 = 59;
        $VAR5 = 98;
        $VAR6 = 97;
        $VAR7 = 114;
        $VAR8 = 10;
        $VAR9 = 10;
        foo;bar
        10

        Don't know another way. Thank you and regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
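The "read a chunk, search for the longer string first" advice can also be sketched with core Perl only. This is an illustration and not code from the thread: `detect_recsep` is a hypothetical helper, the 4096-byte default is an assumption about how much data is enough to contain one complete record, and a lone "\r" is included as a third candidate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: inspect the first $size bytes of $file and return
# the first line ending found ("\r\n", "\n" or "\r"), or undef if none.
sub detect_recsep {
    my ($file, $size) = @_;
    $size ||= 4096;   # assumed to be enough to contain one whole record

    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;      # raw bytes, no newline translation
    my $got = read $fh, my $chunk, $size;
    die "Cannot read $file: $!" unless defined $got;
    close $fh;

    # Longer candidate first, so "\r\n" is not reported as a lone "\r".
    return $chunk =~ /(\r\n|\n|\r)/ ? $1 : undef;
}

# Usage: perl recsep.pl some_file
if (@ARGV) {
    my $recsep = detect_recsep($ARGV[0]);
    print defined $recsep ? join(' ', unpack 'C*', $recsep) : 'none found', "\n";
}
```

The returned string could then be handed to Tie::File as its `recsep` option. One limitation of this sketch: if the chunk happens to end exactly between "\r" and "\n", the separator is reported as "\r".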

Re: Get record separator of a file
by RichardK (Priest) on Nov 14, 2012 at 09:35 UTC

    You could use the unix file command to determine the file type and then you'll know what you need to do for the file types you're handling.

    Also file will try to guess the separator for text files.

    I think there's a cpan module for this but I've never used it -- does anyone know anything about it?

      You could use the unix file command to determine the file type and then you'll know what you need to do for the file types you're handling.

      file(1) just guesses, based on some magic constants, and it often guesses wrong.

      Also file will try to guess the separator for text files.

      No. It detects line endings, but not record separators for CSV files:

      /tmp>echo "foo;bar;baz" > testme
      /tmp>echo "1;2;3" >> testme
      /tmp>cat testme
      foo;bar;baz
      1;2;3
      /tmp>file testme
      testme: ASCII text
      /tmp>

      I think there's a cpan module for this but I've never used it -- does anyone know anything about it?

      File::Type and File::MMagic; both share the guessing problem.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        No. It detects line endings, but not record separators for CSV files:

        /tmp>echo "foo;bar;baz" > testme
        /tmp>echo "1;2;3" >> testme

        You seem to be confusing record separators and field separators. The line ending is the record separator and the ';' is the field separator.
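The distinction can be made concrete with a small sketch. The data and variable names below are made up for illustration: records are split on the line ending first, and only then is each record split into fields on ';'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two records, three fields each.
my $data       = "foo;bar;baz\r\n1;2;3\r\n";
my $record_sep = "\r\n";   # separates records (lines)
my $field_sep  = ';';      # separates fields within one record

# Step 1: split the data into records.
my @records = split /\Q$record_sep\E/, $data;

# Step 2: split every record into its fields.
my @table = map { [ split /\Q$field_sep\E/, $_ ] } @records;

print join('|', @$_), "\n" for @table;
```

Tie::File's `recsep` option only concerns the first step; splitting a record into fields remains the caller's job.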

Re: Get record separator of a file
by erix (Vicar) on Nov 14, 2012 at 10:12 UTC

    I think you simply have to accept that it is not possible to ascertain a definite "record separator".

    There is always the possibility that there is really only one 'column' in the file, and that "the" field separator does not occur in the file at all.

      But see my post above. And if the file has only one line it doesn't matter. My tied array will then have only one slot - I hope ;-)

      Thank you and regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

Re: Get record separator of a file
by karlgoethebier (Curate) on Nov 14, 2012 at 13:11 UTC

    I tested my four subs (my basic intention).

    Unfortunately, it seems there was some confusion about recsep, line separator and record separator. Normally Tie::File treats every line of a file as a record, unless one changes this behaviour. Conversely, loading a file that has "\n" as its line separator under Windows led to Tie::File loading all data into $array[0]. After setting the recsep option to "\n", everything worked as expected.

    From the documentation of Tie::File:

    recsep

    What is a 'record'? By default, the meaning is the same as for the <...> operator: It's a string terminated by $/, which is probably "\n". (Minor exception: on DOS and Win32 systems, a 'record' is a string terminated by "\r\n".) You may change the definition of "record" by supplying the recsep option in the tie call: tie @array, 'Tie::File', $file, recsep => 'es';
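The effect described above can be reproduced on any platform by forcing the "wrong" separator explicitly. A minimal sketch, assuming a small made-up file with "\n" line endings; the file is tied read-only (`mode => O_RDONLY`) so that the sample is left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical sample: three lines terminated by a bare "\n".
my ($fh, $file) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "a\nb\nc\n";
close $fh or die "Cannot close $file: $!";

# Mismatched separator (the DOS/Win32 default): "\r\n" never occurs in
# the file, so the whole content ends up as a single record.
tie my @wrong, 'Tie::File', $file, recsep => "\r\n", mode => O_RDONLY
    or die "Cannot tie $file: $!";
my $wrong_count = scalar @wrong;
untie @wrong;

# Matching separator: one record per line.
tie my @right, 'Tie::File', $file, recsep => "\n", mode => O_RDONLY
    or die "Cannot tie $file: $!";
my $right_count = scalar @right;
untie @right;

print "recsep CRLF: $wrong_count record(s), recsep LF: $right_count record(s)\n";
```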

    The correct question should have been: "How can I get the line separator from a file...?".

    Thank you all for your help and best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      The correct question should better have been: "How can i get the line separator from a file...?"

      I think a better question would have been, "How can I use Tie::File with text files from diverse platforms?"

      But we eventually got the point, and I'm glad you got it worked out (:crlf layer).


      Dave
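The thread does not show how the :crlf layer was applied. As an illustration only, here is a minimal sketch of that layer on an ordinary filehandle (not Tie::File), with made-up sample data: each "\r\n" is translated to "\n" while reading, so the same chomp works for files from either platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: a DOS-style file with CRLF line endings.
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "foo;bar\r\nbaz;qux\r\n";
close $out or die "Cannot close $file: $!";

# The :crlf layer turns each "\r\n" into "\n" on input; lines that
# already end in a bare "\n" pass through unchanged.
open my $in, '<:crlf', $file or die "Cannot open $file: $!";
chomp(my @lines = <$in>);
close $in;

print "$_\n" for @lines;
```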

        You're right, Dave.

        Karl

        «The Crux of the Biscuit is the Apostrophe»
