Re^2: Get record separator of a file

by davido (Archbishop)
on Nov 13, 2012 at 20:10 UTC (#1003699)

in reply to Re: Get record separator of a file
in thread Get record separator of a file

Let's ask this. Without worrying about efficiency yet, how would you determine the record separator of one of the files? Until I am clear on how you're able to detect the record separator in the first place, I have a difficult time suggesting an efficient means of doing so. What heuristics are you using? Or is the record separator specified in some sort of header for the file? Or does it depend on the file's extension?

If you haven't figured that part out yet, you have to step back from the problem and look at it as a human would. Ask yourself, "If I opened the file in an editor (possibly one that displays non-printables too), how would I spot the record separator?" Once you've figured that out, the next step is to isolate the rules, and put them to code. After you get that working, tests and all, you're done. Only then, if you feel the outcome isn't efficient enough for your needs, should you begin profiling and determining what needs to be made more efficient.

There's an old expression, that Perl is great for prototyping, and often the result is good enough that there's no need to rewrite in C. The same applies here; get it working, and it may be good enough that you don't need to be further concerned with efficiency.


Replies are listed 'Best First'.
Re^3: Get record separator of a file
by karlgoethebier (Monsignor) on Nov 13, 2012 at 22:38 UTC

    OK, I will try my best to explain.

    I can see the recsep of a file with:

    Karls-Mac-mini:Desktop karl$ hexdump -c -n 8 file.txt
    0000000   f   o   o   ;   b   a   r  \n
    0000008

    Or I hope so.

    But I really don't want to check it this way.

    I have many larger files with \r\n or \n as recsep.

    So I thought about efficiency and figured out that Tie::File is faster than IO::File for my needs (I benchmarked it, but that is another issue).

    But when I tied my @array to the original file, all data was put into the first slot of my @array. After setting the recsep option of Tie::File to \n, everything was good.

    So I thought it would be a good idea to do something like the hexdump command in Perl to get the recsep - without losing the performance boost that Tie::File gives me.

    I hope very much that this is a better explanation of what I wanted to do.
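A rough sketch of what "something like the hexdump command in Perl" might look like: read only the first few kilobytes through the :raw layer and report the first line ending found. The helper name detect_recsep and the 4096-byte default are assumptions made for illustration, not anything from the thread, and the heuristic assumes the file uses one line ending consistently.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the record separator from the first
# $limit bytes of a file. Returns "\r\n", "\n", "\r", or undef if
# no line ending shows up in the sampled bytes.
sub detect_recsep {
    my ($path, $limit) = @_;
    $limit ||= 4096;    # assumed sample size, tune as needed

    # :raw keeps the bytes untouched, so CR and LF are both visible.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $chunk = '';
    my $got = read($fh, $chunk, $limit);
    die "Cannot read $path: $!" unless defined $got;

    # If the sample ends in CR, fetch one more byte so a CRLF pair
    # split across the boundary is not mistaken for a bare CR.
    if ($chunk =~ /\r\z/) {
        read($fh, $chunk, 1, length($chunk));
    }
    close $fh;

    # The alternation tries "\r\n" first, so CRLF wins over bare CR.
    return $chunk =~ /(\r\n|\n|\r)/ ? $1 : undef;
}
```

The returned string could then be handed to the recsep option of Tie::File mentioned above. Only the sampled bytes are read, so the cost does not grow with file size.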

    Thank you very much for your patience and help.

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      Now we're getting somewhere (I think). You should be able to take advantage of Perl's :crlf IO layer to handle the problem for you. I've tested this, and here is how I think it would work out.

      First, Tie::File seems to be "layers" unaware, which is fine, except that you'll have to open the file explicitly, and close it again when you're done, rather than letting Tie::File handle those operations. This gives you control over what layers are applied to the file handle.

      use strict;
      use warnings;
      use Tie::File;
      use Scalar::Util qw( weaken );

      open my $fh, '+<:crlf', 'filename.ext' or die $!;
      my @array;
      my $t = tie @array, 'Tie::File', $fh;
      weaken $t; # tie holds its own ref. We don't want a mem leak.

      # Work, work, work...

      untie @array;
      close $fh or die $!;

      The relevant explanation of ':crlf' from the POD is: "On read converts pairs of CR,LF to a single "\n" newline character. On write converts each "\n" to a CR,LF pair." Since this happens behind the scenes, it should play nicely with Tie::File, but I would test on some copies of the files first to be sure.
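To see the quoted read-side behaviour in isolation, here is a small sketch that leaves Tie::File out entirely. It writes a throwaway file with CRLF endings and reads it back through :crlf. The temporary file and the sample records are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file with CRLF line endings. The :raw layer
# keeps Perl from translating anything on the way out.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} "foo;bar\r\n", "baz;qux\r\n";
close $out or die "close failed: $!";

# Read it back through :crlf; each CR,LF pair arrives as one "\n".
open my $in, '<:crlf', $path or die "Cannot open $path: $!";
my @records = <$in>;
close $in or die "close failed: $!";

# A plain chomp now strips the whole line ending, with no stray CR.
chomp @records;
print "$_\n" for @records;
```

A file that already uses bare "\n" passes through the same layer unchanged on read, which is why one open mode can cover both kinds of input. Note that writing through :crlf emits CR,LF pairs, so records written back to an "\n" file would come out with CRLF endings.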

      Updated: Added weaken to eliminate a potential memory leak, since tie also holds a ref to its own object.



        Sorry, I saw it too late. Cool, I didn't know this.

        Works when called with $fh. Very nice!

        Thank you very much and best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        Just thank you very much, Karl

        «The Crux of the Biscuit is the Apostrophe»
