
Regex copy and paste help

by HtorneDK (Novice)
on Aug 31, 2012 at 08:56 UTC (#990950)
HtorneDK has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow holy men. I have a big problem, which tests my faith. Help me find the right path of Perl. The problem is this: I am getting the output below from our IBM-provided TSM system (it's a backup system). I need to process the information back into one of my databases. I started using Perl 1 year ago, and it has been great. But now I have been hit by a nut I can't crack.

A00002   \\a00002\c$   2   WinNT    NTFS   Yes   73,626MB   42.5
A00003   /             1   Linux86  EXT3   No    38,289MB   29.8
A00004
A00004   \Sys-         1   WinNT    VSS    Yes   0 KB       0.0
         temState\-
         NULL\Syst-
         em

In the line with A00004 the second column is broken up where the - sign is. Is there a way to fix this using regex? So I end up with something that looks like this
A00004 \SystemState\NULL\System 1 WinNT VSS Yes 0 KB 0.0
The original document contains about 800-900 lines. Also please keep in mind that I'm a newbie at perl so writing the solution using simple answers such as "ju

Replies are listed 'Best First'.
Re: Regex copy and paste help
by roboticus (Chancellor) on Aug 31, 2012 at 10:56 UTC


    I'd suggest something like this:

    Keep two sets of fields: the temporary set and the full set you're building.

    • Read a line and parse it into the temporary fields (unpack, substr)
    • If the first column exists (e.g. A00004) then it looks like you're on a new item, so:
      • If you have anything in the "full" set:
        • process it
        • write the processed data
      • Copy the temporary data into your "full" set
    • Otherwise
      • Append the existing temporary fields into your full set
    • then go back and process the next line

    Finally, when you're done, if you have anything in your full set, process it.
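The steps above could be sketched roughly as follows. This is only an illustration under assumptions not stated in the thread: a line that starts with whitespace is treated as a continuation, fields are pulled out with `split` rather than `unpack`/`substr`, and "processing" just means collecting the finished record. The sub name `merge_records` and the sample layout are made up for the example.

```perl
use strict;
use warnings;

# Sample input, laid out as the report presumably looked (an assumption).
my $report = <<'END';
A00002   \\a00002\c$   2   WinNT    NTFS   Yes   73,626MB   42.5
A00003   /             1   Linux86  EXT3   No    38,289MB   29.8
A00004
A00004   \Sys-         1   WinNT    VSS    Yes   0 KB       0.0
         temState\-
         NULL\Syst-
         em
END

# Hypothetical helper: merge continuation lines into the record above them.
sub merge_records {
    my @lines = @_;
    my (@done, @full);                      # @full is the record being built

    for my $line (@lines) {
        next unless $line =~ /\S/;          # skip blank lines
        my @temp = split ' ', $line;        # temporary fields for this line

        if ($line =~ /^\S/) {               # first column present: new item
            push @done, [@full] if @full;   # "process" the finished record
            @full = @temp;
        }
        else {                              # continuation line
            next unless @full > 1;          # nothing to append to
            $full[1] =~ s/-$//;             # drop the line-break hyphen
            $full[1] .= $temp[0];
        }
    }
    push @done, [@full] if @full;           # flush the last record
    return @done;
}

print join(' ', @$_), "\n" for merge_records(split /\n/, $report);
```

Note that the line holding only A00004 still comes through as its own one-field record; whether to drop it is a separate decision.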


    When your only tool is a hammer, all problems look like your thumb.

Re: Regex copy and paste help
by aaron_baugher (Curate) on Aug 31, 2012 at 10:42 UTC

    What have you tried so far? It seems to me the logic would work something like this:

        read a line into an array, store it
        read the next line into an array
        if it only has one field preceded by whitespace
            append that field to the second field of the previous array
        else
            print out the stored array
            put this array into storage
        at the end, print the stored array

    Coding that should be fairly simple. I don't know if the third line with only the first field is significant, so you may need to add some code to handle that too.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Regex copy and paste help
by MidLifeXis (Monsignor) on Aug 31, 2012 at 12:44 UTC

    This looks similar to what I see in sqlplus when data is wider than a format allows for. Is there any way to change the width of your formats when generating the data to make your parsing easier?

    The next question I would ask is whether the gaps between columns are tabs or spaces. I would use split, a regular expression or substr to pull out the data for each column.

    After I had the columnized data for each row, I would feed the data into a state machine. If $current_line[0] eq $last_line[0], then you have a multi-line parsing situation (although having nothing in the first column seems to indicate the same thing). Build up your data for each record, and print out the processed information once the record is complete.
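If the report really is fixed-width, a sketch along these lines might work, using `unpack` in place of `substr`. The column widths in `$template` are invented for the example (the real TSM layout would have to be measured), and the sample lines are generated with `pack` purely so that they match those assumed widths. The sub name `parse_fixed` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical column widths; measure the real report before relying on them.
my $template = 'A9 A14 A4 A9 A7 A6 A11 A6';

# Build fixed-width sample lines ('A' pads each field with spaces).
my @lines = map { pack($template, @$_) } (
    [ 'A00002', '\\\\a00002\c$', 2, 'WinNT',   'NTFS', 'Yes', '73,626MB', '42.5' ],
    [ 'A00003', '/',             1, 'Linux86', 'EXT3', 'No',  '38,289MB', '29.8' ],
    [ 'A00004', '',              '', '',       '',     '',    '',         ''     ],
    [ 'A00004', '\Sys-',         1, 'WinNT',   'VSS',  'Yes', '0 KB',     '0.0'  ],
    [ '',       'temState\-',    '', '',       '',     '',    '',         ''     ],
    [ '',       'NULL\Syst-',    '', '',       '',     '',    '',         ''     ],
    [ '',       'em',            '', '',       '',     '',    '',         ''     ],
);

sub parse_fixed {
    my @records;
    for my $line (@_) {
        my @cur = unpack $template, $line;   # 'A' strips trailing spaces
        if ($cur[0] ne '') {                 # state: start of a record
            push @records, [@cur];
        }
        elsif (@records) {                   # state: continuation line
            $records[-1][1] =~ s/-$//;       # drop the line-break hyphen
            $records[-1][1] .= $cur[1];
        }
    }
    return @records;
}

print join(' | ', @$_), "\n" for parse_fixed(@lines);
```

One reason to prefer fixed-width parsing over a plain `split` here is that a value such as "0 KB" stays in a single column instead of being split in two.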

    If you show some code, you will be led along the path to enlightenment.


Re: Regex copy and paste help
by Anonymous Monk on Aug 31, 2012 at 09:12 UTC

    How are you parsing it? This is an important step, which will determine how you reassemble/join the string back together.

    Once you have an array of arrays, it's as simple as checking for a trailing "-" on a field, then iterating/appending until the trailing "-" is missing.

    Parse::Report (parse Perl format-ed reports) might help; it's like the opposite of Perl6::Form.

Re: Regex copy and paste help
by talexb (Canon) on Aug 31, 2012 at 16:58 UTC

    This brings back nice memories, because it closely resembles the problem I solved as part of my first paying Perl gig, and that led to a contract that lasted about three years, back in about 1998.

    As has already been explained, just collect the lines, putting each line into an array (I would use an array of arrays, or AoA). You could merge the continuation lines as you go, or you could do it at the end -- if a line has a '-' in the second column, replace that with whatever's in the next line's second column.
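A minimal sketch of the "collect first, merge at the end" variant might look like this. It assumes whitespace-separated fields and that a continuation line holds exactly one field; `merge_aoa` and the sample layout are names and data made up for the illustration.

```perl
use strict;
use warnings;

# Sample input, laid out as the report presumably looked (an assumption).
my $report = <<'END';
A00002   \\a00002\c$   2   WinNT    NTFS   Yes   73,626MB   42.5
A00003   /             1   Linux86  EXT3   No    38,289MB   29.8
A00004
A00004   \Sys-         1   WinNT    VSS    Yes   0 KB       0.0
         temState\-
         NULL\Syst-
         em
END

# Hypothetical helper: build the AoA first, then merge continuation rows.
sub merge_aoa {
    my @aoa = map { [ split ' ', $_ ] } @_;   # one array ref per line
    my @out;

    while (my $row = shift @aoa) {
        # While column two ends in '-', pull in the next one-field row.
        while (@$row > 1 && $row->[1] =~ /-$/ && @aoa && @{ $aoa[0] } == 1) {
            my $next = shift @aoa;
            $row->[1] =~ s/-$//;
            $row->[1] .= $next->[0];
        }
        push @out, $row;
    }
    return @out;
}

print join(' ', @$_), "\n" for merge_aoa(split /\n/, $report);
```

Because the merge only looks at rows already held in memory, headers and footers could be filtered out of the array before this step.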

    I also had to deal with headers and footers, but they were easier.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Regex copy and paste help
by nemesdani (Friar) on Aug 31, 2012 at 09:05 UTC
    One way would be to slurp the whole file (800-900 lines is not much) and check whether something is followed by a newline and then by itself. You could delete the first occurrence. This is a very roundabout way, but it is what came to me first.
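One possible reading of this idea as a pair of substitutions on the slurped text is sketched below. The first regex drops a line that holds only a name when the next line starts with the same name. The second, which goes beyond what is described above, folds an indented one-field line into the second column of the line before it. Both patterns assume space- or tab-separated columns and `\n` line endings; the sub name `rejoin` is made up.

```perl
use strict;
use warnings;

# Sample input, laid out as the report presumably looked (an assumption).
my $report = <<'END';
A00002   \\a00002\c$   2   WinNT    NTFS   Yes   73,626MB   42.5
A00003   /             1   Linux86  EXT3   No    38,289MB   29.8
A00004
A00004   \Sys-         1   WinNT    VSS    Yes   0 KB       0.0
         temState\-
         NULL\Syst-
         em
END

# Hypothetical helper working on the whole file as one string.
sub rejoin {
    my ($text) = @_;

    # Drop a line holding only a name if the next line starts with that name.
    $text =~ s/^(\S+)[ \t]*\n(?=\1[ \t])//mg;

    # Repeatedly fold an indented one-field line into the hyphenated
    # second column of the line above it, removing the hyphen.
    1 while $text =~ s/^(\S+[ \t]+\S*)-([ \t].*)\n[ \t]+(\S+)[ \t]*$/$1$3$2/m;

    return $text;
}

print rejoin($report);
```

On a real file the text would come from reading the whole file into one scalar; on Windows the line endings may need normalizing first.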

    I'm too lazy to be proud of being impatient.
Re: Regex copy and paste help
by HtorneDK (Novice) on Aug 31, 2012 at 09:36 UTC
    I'm not currently parsing it; I don't know where to start. By the way, I'm on Windows and have close to zero Linux experience.
Re: Regex copy and paste help
by CountZero (Bishop) on Sep 01, 2012 at 13:17 UTC
    I would do it this way:
        use Modern::Perl;

        while (<DATA>) {
            my @record = split;
            # A trailing '-' on the second field means it continues on the next line.
            if ($record[1] =~ s/-$//) {
                while (<DATA>) {
                    my ($continuation) = split;
                    $record[1] .= $continuation;
                    # Stop once the appended piece no longer ends in '-'.
                    last unless $record[1] =~ s/-$//;
                }
            }
            say join ' ', @record;
        }
        __DATA__
        A00002   \\a00002\c$   2   WinNT    NTFS   Yes   73,626MB   42.5
        A00003   /             1   Linux86  EXT3   No    38,289MB   29.8
        A00004
        A00004   \Sys-         1   WinNT    VSS    Yes   0 KB       0.0
                 temState\-
                 NULL\Syst-
                 em
        A00005   /xyz          4   Linux86  EXT2   No    12,345MB   30
    And yes, you get a few "Use of uninitialized value $record[1]" warnings because of the anomalous third record, which has only one field ("A00004") in it. You can safely ignore these warnings, or add another test to drop such a record if you do not need it.


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Regex copy and paste help
by HtorneDK (Novice) on Sep 03, 2012 at 12:33 UTC
    Thanks a 10x10^6 times, best community ever :) You have helped proliferate Perl and fostered a newbie Monk, Me :) Thanks again

Node Type: perlquestion [id://990950]
Approved by Corion
Front-paged by MidLifeXis