
Text file manupulation to create test files

by ajguitarmaniac (Sexton)
on Jan 10, 2011 at 13:17 UTC

ajguitarmaniac has asked for the wisdom of the Perl Monks concerning the following question:

Namaste Monks,

I'm looking to change the format of an existing input file that I'm using to a new one. I must admit that I am in a fix as to how to start. Let me explain the conditions here.

My current input file looks something like this.

    Loop 1
    Line 1- SBSB_ID = 123456789,First_name = "Ajay", Last name = "George", WMDS_SEQ_NO1 = 2, WMDS_SEQ_NO2 = 3,WMDS_SEQ_NO3 = 5
    Line 2-
    Line 3-
    Line 4-
    ....
    ....
    Line n
    Loop2
    Line 1- SBSB_ID = 123456782,First_name = "Ryan", Last name = "George", WMDS_SEQ_NO1 = 2, WMDS_SEQ_NO2 = 3,WMDS_SEQ_NO3 = 5
    Line 2-
    Line 3-
    Line 4-
    ....
    ....
    Line n
    EOF

The contents of Line 2 through Line n can be considered irrelevant for now. Line 1 through Line n together form a single "Subscriber loop". The entire file can contain several such "Subscriber loops".

I'm looking to create test files in the new format. There are two steps. 1) Convert the existing format to the new one. In the new format the file should look like this:

    Loop 1
    Line 1- SBSB_ID = 123456789,First_name = "Ajay", Last name = "George"
    Line 2- WMDS_SEQ_NO1 = 2
    Line 3- WMDS_SEQ_NO2 = 3
    Line 4- WMDS_SEQ_NO3 = 5
    Line 5-
    Line 6-
    ....
    ....
    Line n
    Loop 2
    Line 1- SBSB_ID = 123456782,First_name = "Ryan", Last name = "George"
    Line 2- WMDS_SEQ_NO1 = 2
    Line 3- WMDS_SEQ_NO2 = 3
    Line 4- WMDS_SEQ_NO3 = 5
    Line 5-
    Line 6
    ....
    ....
    Line n
    EOF

2) In the generated file, subscriber loops are to be duplicated, changing only the SBSB_ID, in order to generate large files. So if an input file has 2 subscriber loops, the output file must have, say, 100 loops, each with a different SBSB_ID.

Please guide me as to how I should go about this. I'm not asking for the code here; something like an algorithm would do. I'll start with the code and post further questions if I run into difficulties. Thanks in advance!!

Replies are listed 'Best First'.
Re: Text file manupulation to create test files
by ELISHEVA (Prior) on Jan 10, 2011 at 13:43 UTC

    If I'm reading what you wrote above correctly you want to split the first line of each subscriber loop into four or more lines, let's say 1a, 1b, 1c, 1d. These will be followed by the remaining lines of the subscriber loop.

    To do this task, you are going to need to review working with hashes if you haven't done so already. The algorithm would look something like this:

    1. Read line 1 of subscriber loop - convert into hash with hash entries being the property-value pairs found on the first subscriber line. Thus line 1 would look like this at the end of this step.
      $hLine = {
          SBSB_ID      => '123456782',
          First_name   => "Ryan",
          Last_name    => "George",
          WMDS_SEQ_NO1 => 2,
          WMDS_SEQ_NO2 => 3,
          # .... and so on for each field ...
      };
    2. Next print line 1a, extracting and printing property names and values you want (SBSB_ID, First_name, Last_name) from the hash.
    3. Delete the hash keys that you've printed. What you'll have left is the properties that you want to have one per line.
    4. For each remaining key-value pair in the hash, print one per line. That completes the processing of line 1 of the subscriber loop.
    5. For the remaining lines of the subscriber loop, read in each line and print as is
    6. When you detect the end of the subscriber loop, return to step 1, unless you are at the end of the file.
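
Steps 1 to 4 above can be sketched roughly as follows. This is only an illustration under assumptions not confirmed in the thread: the sub names `parse_props` and `split_first_line` are made up for this example, the "Line 1-" label is assumed to have been stripped already, values are assumed never to contain commas or `=` signs, and the key is spelled `Last name` (with a space) as in the posted data rather than `Last_name`.

```perl
use strict;
use warnings;

# Step 1: turn 'name = value, name = value, ...' into a hash reference.
# Assumption: values contain neither commas nor '=' signs.
sub parse_props {
    my ($line) = @_;
    my %props;
    for my $pair (split /\s*,\s*/, $line) {
        my ($name, $value) = $pair =~ /^\s*(.+?)\s*=\s*(.*?)\s*$/ or next;
        $props{$name} = $value;
    }
    return \%props;
}

# Steps 2-4: return line 1a followed by one line per remaining property.
sub split_first_line {
    my ($line) = @_;
    my $props = parse_props($line);

    my @head;
    for my $name ('SBSB_ID', 'First_name', 'Last name') {
        # Step 3: delete() hands back the value and removes the key.
        push @head, "$name = " . delete($props->{$name});
    }
    my @lines = (join ', ', @head);    # step 2: line 1a

    # Step 4: everything still in the hash goes one property per line.
    push @lines, map { "$_ = $props->{$_}" } sort keys %$props;
    return @lines;
}

my $sample = 'SBSB_ID = 123456789,First_name = "Ajay", Last name = "George", '
           . 'WMDS_SEQ_NO1 = 2, WMDS_SEQ_NO2 = 3,WMDS_SEQ_NO3 = 5';
print "$_\n" for split_first_line($sample);
```

For the sample line this prints the three identifying properties on one line and then each WMDS_SEQ_NO field on its own line. Note that a plain `sort` orders keys as strings, so a field such as WMDS_SEQ_NO10 would sort before WMDS_SEQ_NO2.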

    Of course, there are many ways to do this, depending on what you know about the properties that belong on lines 1b, 1c, 1d, etc. For example, in step 3, if you know that the one-property-per-line properties always have the format WMDS_SEQ_NO# where # is some number, then you can skip the "delete each key" part above and just do a grep with a regex, something like this (not checked for typos):

    foreach my $k (grep { /^WMDS_SEQ_NO\d+$/ } sort keys %$hLine) { print $fh "$k=".$hLine->{$k}."\n" }

    Update: I just noticed the requirement to replicate each subscriber loop. This requires some changes to the above algorithm.

    • In step 1, use two hashes: one for the keys that belong in line 1a and another for the keys that belong one-per-line, i.e. $hLine1a and $hOnePerLine. This way you don't need to delete any keys.
    • In step 2, print the property-value pairs in $hLine1a
    • Step 3, skip - no longer applicable.
    • In step 4, print the property-value pairs in $hOnePerLine
    • In step 5, instead of immediately printing each line as is, save it in an array @aRestOfSubscriberLoop and then print as is. This will preserve the remaining lines of the subscriber loop. At the end of step 5, duplicate the subscriber loop as follows:
      1. Replace the value of $hLine1a->{SBSB_ID} with a random value
      2. Repeat steps 2 & 4 to print out line 1a,1b,1c, etc with the new SBSB_ID
      3. Print the remaining lines of the subscriber loop by printing out all of the lines in @aRestOfSubscriberLoop, one per line.
      4. Repeat for as many times as you want to duplicate this particular subscriber loop
    • In step 6, no change :-)
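
A minimal sketch of the duplication step described in this update, under stated assumptions. The sub name `duplicate_loop` and the `@HEAD_KEYS` list are hypothetical, and the sketch increments the SBSB_ID for each copy instead of picking a random value, since a counter cannot collide with an earlier copy. It also assumes the SBSB_ID is purely numeric with no leading zeros.

```perl
use strict;
use warnings;

# Properties that stay together on line 1a (key names taken from the posted data).
my @HEAD_KEYS = ('SBSB_ID', 'First_name', 'Last name');

# Return the output lines for $copies copies of one subscriber loop.
#   $head     : hash ref with the line-1a properties ($hLine1a above)
#   $per_line : hash ref with the one-per-line properties ($hOnePerLine above)
#   $rest     : array ref with the remaining lines (@aRestOfSubscriberLoop above)
sub duplicate_loop {
    my ($head, $per_line, $rest, $copies) = @_;
    my @out;
    for my $n (0 .. $copies - 1) {
        # Work on a copy so the caller's hash is left untouched.
        my %h = (%$head, SBSB_ID => $head->{SBSB_ID} + $n);

        push @out, join(', ', map { "$_ = $h{$_}" } @HEAD_KEYS);          # line 1a
        push @out, map { "$_ = $per_line->{$_}" } sort keys %$per_line;   # lines 1b, 1c, ...
        push @out, @$rest;                                                # rest, printed as is
    }
    return @out;
}
```

The first copy keeps the original SBSB_ID, so calling the sub with `$copies` set to 50 for each of two input loops would give the 100 loops mentioned in the question, provided the two ID ranges do not overlap.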

    Update: fixed instruction numbering error.

      Thanks Elisheva, your reply was very helpful.

Re: Text file manupulation to create test files
by Ratazong (Monsignor) on Jan 10, 2011 at 13:45 UTC

    for creating the new format, you might use the following approach:

    1. read the input-file line-by-line
    2. if a line starts with SBSB_ID
      • extract everything before the WMDS_SEQ and write it to your output-file
      • split the rest into WMDS_SEQ-blocks and print all of them to your output-file (separated by \n)
    for creating many SBSB_ID-blocks you might modify step 2 of the algorithm above as follows:
    • instead of writing the text to a file, write it to a string
    • additionally extract the SBSB_ID
    • now create a loop, running e.g. 100 times
      • create a new SBSB_ID (e.g. by increasing the SBSB_ID by one) and put it into the string
      • write the string to the output-file
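
The string-based approach above might look something like this. It is a sketch, not a tested solution for the real data: the sub name `expand_first_line` is invented for the example, the "Line 1-" label is assumed to be absent, and the SBSB_ID is assumed to be a plain number without leading zeros.

```perl
use strict;
use warnings;

# Build $count copies of the reformatted first line, each with its own SBSB_ID.
sub expand_first_line {
    my ($line, $count) = @_;

    # Everything before the first WMDS_SEQ field stays on one line.
    my ($head, $tail) = $line =~ /^(.*?),\s*(WMDS_SEQ.*)$/
        or return;

    # The rest is split into WMDS_SEQ blocks, separated by newlines.
    my $block = join "\n", $head, split(/\s*,\s*/, $tail);

    # Extract the original SBSB_ID so new ones can be derived from it.
    my ($id) = $block =~ /SBSB_ID\s*=\s*(\d+)/
        or return;

    my @blocks;
    for my $n (0 .. $count - 1) {
        my $new_id = $id + $n;
        # Put the new ID into a copy of the string.
        (my $copy = $block) =~ s/(SBSB_ID\s*=\s*)\d+/$1$new_id/;
        push @blocks, $copy;
    }
    return @blocks;
}
```

Each returned element is a multi-line string, so it can be written to the output file with a single `print`, followed by the unchanged remaining lines of the loop.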

    HTH, Rata

      Thank you! Will implement your logic and get back if i'm stuck at any point!

Re: Text file manupulation to create test files
by ww (Archbishop) on Jan 10, 2011 at 13:45 UTC

    I'm having trouble understanding your intent... for several reasons, including

    • your formatting (put data inside <code> ...</code> tags so we can see what it really looks like)
      ... and
    • Your paired, but (IMO) contradictory statements "Line 2 till Line n can be considered as irrelevant info for now. Line 1 till Line n is called a single "Subscriber loop""

    Is the "n" in line n a variable number? Are lines 2- ..n really empty? Do they actually contain headers such as "line 1-", "line 2-"...? If they exist, are the trailing hyphens inconsistent as in what you posted?

    Please clarify.

    For a shot in the dark: if the "Loop n" header consistently follows an empty "line n", then you have data which is probably susceptible to attack with a regex. And if the line-numbered data is more or less the way it looks here, then splitting on commas and combining (concatenating) the first three elements, or a regex with captures into AoAs, might be a useful step one.

      And if the line-numbered data is more or less the way it looks here, then splitting on commas and combining (concatenating) the first three elements, or a regex with captures into AoAs, might be a useful step one.

      If the actual data looks like this excerpt, I would just do
      s/,\s*WMDS/\nWMDS/g;

      Of course, without knowing the data it's impossible to tell whether comma-whitespace-WMDS is in fact a good separator string to use.
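
As a small illustration of that substitution on the posted sample line: the helper name `break_before_wmds` is hypothetical, and the `/g` modifier is used so that every WMDS field is moved to its own line, not only the first one. Whether comma-whitespace-WMDS is a safe separator for the real data remains an open assumption.

```perl
use strict;
use warnings;

# Return a copy of $line with a newline in place of each
# comma (plus optional whitespace) that precedes 'WMDS'.
sub break_before_wmds {
    my ($line) = @_;
    (my $copy = $line) =~ s/,\s*WMDS/\nWMDS/g;
    return $copy;
}

my $sample = 'SBSB_ID = 123456789,First_name = "Ajay", Last name = "George", '
           . 'WMDS_SEQ_NO1 = 2, WMDS_SEQ_NO2 = 3,WMDS_SEQ_NO3 = 5';
print break_before_wmds($sample), "\n";
```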
