PerlMonks  

Modify values of tied, split lines in a file

by glemley8 (Acolyte)
on Oct 22, 2012 at 17:48 UTC ( #1000393=perlquestion )
glemley8 has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Perl to create a very long CSV file based off conditions determined from an input file. The input file has lines that consist of variables that are comma-separated, which I've "split" and defined as $a,$b,$c. The output file is similar to the input file, except for $c, which is determined by certain conditions in the input file (let's say $c changes to $z in the new file). I'm currently using "open (NEWFILE, ">C:/...");", then "print NEWFILE "$a,$b,$z\n"" (actually much more complicated than this, but you get the idea...) to write the new file. It seems silly to create an entire new file when I could just be editing the $c variable in the input file... am I right?

I'm using Tie::File to read the input file line-by-line as an array. I understand that I can also use this method to modify records in a file, which would cut out the seemingly wasteful step of making a new file... however, I'm not sure how to modify just one variable of a split line using this method (e.g. change $c to $z in each line). This is probably simple, but I can't seem to get the right syntax to make this happen. Can anyone lead me in the right direction? Let me know if you need more info about the script.

Re: Modify values of tied, split lines in a file
by sundialsvc4 (Abbot) on Oct 22, 2012 at 18:06 UTC

    My very-candid opinion is that you are creating an unholy monster that you will regret for the entire brief remaining tenure of your employment.   Don’t tie to a file just to avoid reading the thing line-by-line and using split; or, better, using a CSV-file handling package of known provenance.   Don’t try to “cut out seemingly-wasteful steps” only to have the program, for example, crash-and-burn in the middle and in so doing leave your both-input and-output file destroyed.

    Step back completely from your present approach and reconsider the whole thing.   You are being led-on into unknown territory by the allure of the unfamiliar.   There are no words of warning strong enough to use here.

      I appreciate your honest criticism, but it is extremely vague and not very constructive. You've told me what not to do, but not why, nor any suggestions.

      I'm working on a set of scripts that were jimmy-rigged together and my task is to streamline them. They do currently work, but in a very poor manner. The input file is opened, read, and closed repeatedly during the process, which I'm working to minimize. I understand that it is not good practice to load very large files into memory, hence why I'm using Tie::File to read the input line-by-line, which is working very well. There is no risk of losing my input file, since my script creates a copy and works from it.

      You seem to think I'm going to blindly forward my work without any debugging or testing. For each addition, I run multiple tests to make sure the script is moving in a positive direction and that the output data is verified. Your philosophy seems to be "if it ain't broke, don't fix it". My job is to fix it and perfect it.

      I invite any constructive suggestions.

        And I will take those rebuttals at face value and now try to address them as best I can ... since now I see that you were not the source of your problem.   Your situation is a familiar one, and if you interpret what I have said (alas, reasonably so ...) as a personal affront, then I now personally and publicly apologize for it.

        (Let me reiterate that:   my initial response, I now see, was that of a horse’s asterisk.)   :*{   Okay, I said it first.   I am sorry.   May we please move on.

        The use of Tie::File is basically an inefficient way to handle the input file, but for the moment “it is one that works” and I am not personally familiar with whether or not it loads the entire file in memory.   If it does, then that part of the program must immediately be replaced at whatever the cost might be, because it just might be the camel’s straw.

        In any case, the notion of modifying the file, if it remains a file in its present form, should be immediately and categorically excluded.   You need to consume a file as input, and to produce a file as output, without altering the input and with complete replacement of the output.   That is, if the output file in question must be of the same format and cannot possibly be, say, an SQLite database file instead.

        I cordially suggest that your task is destined to be more than “streamline.”   The best strategy would be to work with a file format that is specifically designed to be a read/write file, such as SQLite.   You definitely do not want to be working in terms of explicit print statements, even if they work “at the moment,” because they are destined to be maintenance PITAs forevermore.  

        The present modus operandi of this collection of legacy scripts is ... doomed, unsalvageable.   And so, not to be continued.   Deeper cuts, made carefully but made once, will lift this long-standing headache out of its present mire and could well dramatically transform it.   I suggest that you need to advocate for permission to make this deeper approach.

        (Please re-read the next to last sentence of paragraph #2.)

      The first thing I would consider is whether the script is or will always be run on a server I have control over or not. If not or not sure, avoid creating a new file, database, or use modules that you may regret later. KISS principle.

      Next I would consider whether it is or will become a web app... Then same as above try to keep it lean. (as few modules as possible)

      Then, if you still have to write a file, consider whether it really needs to be a CSV file. In my work (stock indexes) I find it's easier to use SDF to write the file then read it back in and split.

      I do large scrapes and manipulate years' worth of data line by line without creating a file, and that's by choice, without much penalty, for what it's worth.

Re: Modify values of tied, split lines in a file
by aitap (Deacon) on Oct 22, 2012 at 19:10 UTC
    You can use the combination of -p and -i command line parameters to make Perl edit files in-place (like sed does). Actually, it does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print() statements (a quote from perlrun).
    Sorry if my advice was wrong.
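
    The same in-place mechanism that -p and -i use is also available from inside a script through the $^I variable. A minimal sketch follows, assuming plain comma-separated lines with no quoted fields; the sub name replace_field_in_place and the '.bak' backup suffix are illustrative choices, not something from the thread. Perl still writes a complete new file behind the scenes, so this saves typing rather than I/O.

```perl
use strict;
use warnings;

# Hypothetical helper: replace one comma-separated field on every line
# of $file, roughly what `perl -i.bak -p` would do from the command line.
sub replace_field_in_place {
    my ($file, $index, $replacement) = @_;

    # Setting $^I switches on in-place editing for the <> operator.
    # '.bak' keeps the original contents as "$file.bak".
    local ($^I, @ARGV) = ('.bak', $file);

    while (my $line = <>) {
        chomp $line;
        my @fields = split /,/, $line, -1;   # -1 keeps trailing empty fields
        $fields[$index] = $replacement;
        print join(',', @fields), "\n";      # print goes to the rewritten file
    }
    return;
}
```

    Calling replace_field_in_place('fred', 2, 'Z') would rewrite the third field of every line in fred and leave the previous contents in fred.bak.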
Re: Modify values of tied, split lines in a file
by Crackers2 (Vicar) on Oct 22, 2012 at 19:11 UTC
    It seems silly to create an entire new file when I could just be editing the $c variable in the input file... am I right?

    IMO, probably not, for two reasons:

    1. Speed. Say your file is 100MB in size, your first line is "a,b,c" and you want it to be "a,b,doublec" instead. If you edit in-place, that means that everything in the file after "c" will now have to be moved 6 bytes to accommodate the extra characters, so in effect you'll be doing a 100MB write just to fix the first line. Then on the second line you'll have to move everything except the first two lines again, etc. Caching probably makes it not quite as bad as it can be, but it'll likely still be much slower than just writing a new file.
    2. Error recovery; if something happens while you're in the middle of processing the file, you'll end up with half a file in the new format and half in the old format. (You could of course keep track of how many lines you've converted, but you'd have to make sure to persist that value and sync the file at the same time)
    So my suggestion would be to just write the new file.
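
    One way to sketch the "just write the new file" approach while also covering the error-recovery concern is to write to a scratch file and rename it over the original only once every line has been processed. The sub name rewrite_csv_safely, the ".tmp" suffix, and the callback interface are assumptions made for illustration, and the split is naive (no quoted fields).

```perl
use strict;
use warnings;

# Hypothetical helper: stream $file through $transform (a code ref that
# receives the fields of one line and returns the new list of fields),
# writing to a scratch file that replaces the original only at the end.
sub rewrite_csv_safely {
    my ($file, $transform) = @_;
    my $tmp = "$file.tmp";                   # assumed scratch file name

    open my $in,  '<', $file or die "Cannot read $file: $!";
    open my $out, '>', $tmp  or die "Cannot write $tmp: $!";

    while (my $line = <$in>) {
        chomp $line;
        my @fields = $transform->(split /,/, $line, -1);
        print {$out} join(',', @fields), "\n" or die "Write failed: $!";
    }

    close $in;
    close $out or die "Cannot close $tmp: $!";

    # Until this rename succeeds, the original file is untouched.
    rename $tmp, $file or die "Cannot rename $tmp to $file: $!";
    return;
}
```

    A call such as rewrite_csv_safely($file, sub { my @f = @_; $f[2] = 'z' if $f[0] eq 'a'; return @f }) changes the third field only on lines that meet the condition. If the script dies partway through, only the scratch file is incomplete.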

Re: Modify values of tied, split lines in a file
by Marshall (Prior) on Oct 22, 2012 at 19:16 UTC
    It seems silly to create an entire new file when I could just be editing the $c variable in the input file... am I right?
    I don't think so. Creating a whole new file seems the right way to do it in this case.

    A file on the disk is essentially like a stream of bits. There is no way to "insert" some extra bits in that stream without re-writing all of the bits that come after that insertion. Tie::File can "hide" some of this reality from you, but the physical situation remains the same.

    A database can allow the modification of a column, but the data structures are much more complex than a simple line oriented file. If you want to modify say field 4 in a .CSV file, yes re-writing the entire file is the right way to go.

    There is a DBI module that handles .csv files, but again it will wind up re-writing the entire .csv file. It will cache stuff, but at the end of the day, a new file will be written.

Re: Modify values of tied, split lines in a file
by BrowserUk (Pope) on Oct 22, 2012 at 19:31 UTC
    I'm using Tie::File to read the input file line-by-line as an array. I understand that I can also use this method to modify records in a file, which would cut out the seemingly wasteful step of making a new file...

    Don't! (*)

    Tie::File is horribly inefficient when modifying large files, in-place. Think about what it has to do when you modify a line.

    Say your file consists of:

    1, 2, 3
    4, 5, 6
    7, 8, 9
    ...{10 million more lines here}

    And you decide to modify line 2 so that it looks like this:

    1, 2, 3
    4, 55, 6
    7, 8, 9
    ...{10 million more lines here}

    In order to accommodate that single extra character, every one of the 10,000,001 lines following it, will have to be read, and then re-written.

    And then you add or delete, a character in line 3 and the same process has to be repeated again.

    Of course, Tie::File is more intelligent than that and it goes to great lengths to buffer changed records in memory and defer the re-writing of the file until it has accumulated a bunch of changes. But that caching of changes does not come for free. It needs substantial memory and substantial cpu to be effective.
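
    For reference, the deferred writing described here can also be requested explicitly through the object that tie returns, using Tie::File's defer and flush methods. The sketch below is only an illustration (the sub name replace_field_deferred is made up, and the split is naive); whether it actually helps on a given file would need to be measured, since the buffering costs memory and CPU.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: change one field on every line through Tie::File,
# asking it to buffer the writes instead of rewriting the file per change.
sub replace_field_deferred {
    my ($file, $index, $replacement) = @_;

    my $tie = tie my @records, 'Tie::File', $file
        or die "Cannot tie $file: $!";

    $tie->defer;                             # hold modified records in memory
    for my $record (@records) {
        my @fields = split /,/, $record, -1; # -1 keeps trailing empty fields
        $fields[$index] = $replacement;
        $record = join ',', @fields;         # assigning updates the tied line
    }
    $tie->flush;                             # write the buffered changes out

    undef $tie;                              # release the object before untie
    untie @records;
    return;
}
```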

    And if you need to make changes to a substantial proportion of the lines, in the end, even with the caching, lots more data gets read and re-written for each change than is the case when you read from one file, make a change, and write to another, in a simply linear flow.

    (*The only exception I would make is for fixed record-length files, where changes to one line do not require all the following lines to be rewritten to accommodate each change.)
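
    To illustrate the fixed record-length exception: when every record occupies the same number of bytes, a single record can be overwritten with seek and print, and nothing after it has to move. This is a sketch under assumed conditions (records of exactly $record_len bytes including the newline, space-padded text); the sub name overwrite_fixed_record is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical layout: every record is exactly $record_len bytes,
# newline included, with the text padded on the right by spaces.
sub overwrite_fixed_record {
    my ($file, $record_len, $record_no, $new_text) = @_;

    my $width = $record_len - 1;             # bytes available before "\n"
    die "Replacement is longer than $width bytes"
        if length($new_text) > $width;
    my $record = sprintf "%-${width}s\n", $new_text;   # pad to full width

    open my $fh, '+<', $file or die "Cannot open $file: $!";
    binmode $fh;                             # raw bytes, so offsets are exact
    seek $fh, $record_no * $record_len, SEEK_SET or die "Cannot seek: $!";
    print {$fh} $record or die "Write failed: $!";
    close $fh or die "Cannot close $file: $!";
    return;
}
```

    Records are counted from zero, so overwrite_fixed_record($file, 6, 1, 'ZZ') replaces the second 6-byte record and touches no other byte in the file.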


Re: Modify values of tied, split lines in a file
by Kenosis (Priest) on Oct 22, 2012 at 19:40 UTC
Re: Modify values of tied, split lines in a file
by kcott (Abbot) on Oct 23, 2012 at 06:35 UTC

    G'day glemley8,

    "... I'm not sure how to modify just one variable of a split line using this method ..."

    I've provided an example below of one way to do it. Do read all the caveats already provided. Benchmark may prove useful.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Tie::File;

    my ($file, $index, $replacement) = @ARGV;

    tie my @records, 'Tie::File', $file or die $!;

    for (@records) {
        my @fields = split /,/;
        $fields[$index] = $replacement;
        $_ = join ',' => @fields;
    }

    untie @records;

    Example run:

    $ cat fred
    A,B,C
    D,E,F
    G,H,I
    $ pm_tie_file_split_fields.pl fred 2 Z
    $ cat fred
    A,B,Z
    D,E,Z
    G,H,Z

    -- Ken

Re: Modify values of tied, split lines in a file
by glemley8 (Acolyte) on Oct 23, 2012 at 16:24 UTC
    Thanks for all the constructive input! (minus the snark from sundialsvc4)
