
Hi guys. I've been mulling this one over for a while now, trying to work out how best to handle it.

One of my users has a report she would like to run, which results in over 1000 pages of rather poorly presented text. Needless to say, she would like me to trim this considerably. My plan is to build this up in stages: first get the data extraction right, then use Spreadsheet::WriteExcel to put the results into a much friendlier format for her.

See below for an example of the kind of text I'm working with. Note that this is just part of one "order", and these are repeated numerous times throughout the file. All the dates etc. can differ, so each "order" and each "distribution" of each order needs to be extracted and handled individually.



                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM


       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT 
    15)   requisition number:                      copies:9    
                call number:XX(9999999.999)                          
                  ISBN/ISSN:9999-999X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999      
            holding code:CODEINFO1                   copies:1    
           date received:27/6/2012                             date loaded:27/6/2012
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2    
           date received:27/9/2012                             date loaded:27/6/2012
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1    
           date received:25/8/2012                             date loaded:27/6/2012

So, out of that, I need to grab the values from the (in order of appearance) Order ID, fiscal cycle, Vendor ID, the number to the left of "requisition number", copies, title, ISBN/ISSN, holding code, copies, date received, date loaded.

My idea at this point is to loop over the file, looking for the start of the data I need to grab. I have managed to get this part done using:

open(my $in, '<', $distfile) or die "Can't open $distfile: $!\n";
print "File opened\n";
while (my $line = <$in>) {
    chomp $line;
    # An order block starts at the indented "Order ID:" line.
    if ($line =~ /^\s+Order ID:/) {
        print "Found Order ID\n";
    }
}
close $in;

That part works fine. I then moved on to actually trying to extract the data from the crud. The idea was to just loop through each bracket of data, stripping the data out of each line separately.
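
One way this per-block loop might be sketched is as a small state machine: remember the current order and the current distribution, and let each line either start a new one or fill in a field. This is only a sketch under assumptions: the wrapped "date loaded" lines are taken to be single lines in the real file, the values are assumed to contain no spaces (apart from the title), and `parse_orders` and the hash keys are hypothetical names, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn report lines into a list of orders, each holding
# its own list of distributions. Labels are taken from the sample above.
sub parse_orders {
    my @lines = @_;
    my (@orders, $order, $dist);
    for my $line (@lines) {
        if ($line =~ /Order ID:(\S+)\s+fiscal cycle:(\S+)/) {
            # A new order starts here; forget the previous distribution.
            $order = { order_id => $1, fiscal_cycle => $2, distributions => [] };
            $dist  = undef;
            push @orders, $order;
        }
        elsif (!$order) {
            next;    # still in the page header, nothing to fill in yet
        }
        elsif ($line =~ /Vendor ID:(\S+)/) {
            $order->{vendor_id} = $1;
        }
        elsif ($line =~ /^\s*Title:(.*?)\s*$/) {
            $order->{title} = $1;
        }
        elsif ($line =~ /^\s*Distribution--/) {
            # Each "Distribution--" marker opens a fresh record.
            $dist = {};
            push @{ $order->{distributions} }, $dist;
        }
        elsif ($dist && $line =~ /holding code:(\S+)\s+copies:(\d+)/) {
            @{$dist}{qw(holding_code copies)} = ($1, $2);
        }
        elsif ($dist && $line =~ /date received:(\S+)\s+date loaded:(\S+)/) {
            @{$dist}{qw(date_received date_loaded)} = ($1, $2);
        }
    }
    return @orders;
}

# Lines copied from the sample, for illustration only.
my @sample = (
    '       Order ID:PO-9999                  fiscal cycle:21112',
    '      Vendor ID:VEND99                     order type:SUBSCRIPT ',
    '         Title:Item title here.',
    '        Distribution--',
    '            holding code:CODEINFO1                   copies:1    ',
    '           date received:27/6/2012                             date loaded:27/6/2012',
);
my @orders = parse_orders(@sample);
printf "%s has %d distribution(s)\n",
    $orders[0]{order_id}, scalar @{ $orders[0]{distributions} };
```

Keeping the distributions as a list inside each order means every order/distribution pair can later be written out as its own spreadsheet row. The remaining fields (requisition number, copies, ISBN/ISSN) would follow the same pattern.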

So, for the first one, I designed a regex which matches everything except the two pieces of data I want from the "Order ID" line. The plan was to negate that match, and dump the results into a variable, then move on to the next line. Sounded relatively easy, but I've not been able to work out where I've gone wrong with it... I think if I can get a little help working out how to do this one line, then the rest of it should fall into place pretty readily...

The test I have been trying to use for this is :

  my ($order,$fiscal) = ($line !~ m/(\s+Order ID:|\s+fiscal cycle:)/g);
  print "Order # $order, Fiscal year: $fiscal\n";

As it stands above, I get a string printed with null values. If I change the "!~" to "=~" then I get the output:

Order # Order ID:, Fiscal year: fiscal cycle:

... which is why I was trying to negate the regex match. So... could you please help me understand where it is I'm going wrong with this? Am I going about this the right way, or should I be thinking along different lines for processing all this text?
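
On the `!~` attempt: `!~` is simply the negation of `=~`, so it only ever yields a true/false result and cannot hand back the unmatched remainder of the line. With `=~` and `/g` in list context, the match returns whatever the parentheses captured, and here the parentheses wrap the labels rather than the values. A minimal sketch of capturing the values instead, assuming the Order ID and fiscal cycle never contain spaces (the hard-coded `$line` is copied from the sample and is only for illustration):

```perl
use strict;
use warnings;

my $line = '       Order ID:PO-9999                  fiscal cycle:21112';

# Put the parentheses around the values, not the labels: each (\S+) grabs
# the run of non-space characters that follows its label.
my ($order, $fiscal) = $line =~ /Order ID:(\S+)\s+fiscal cycle:(\S+)/;

if (defined $order) {
    print "Order # $order, Fiscal year: $fiscal\n";
}
else {
    # A failed match in list context returns an empty list.
    print "No order header on this line\n";
}
```

For the sample line this prints `Order # PO-9999, Fiscal year: 21112`. The same label-then-capture pattern should carry over to the other lines, for example `holding code:(\S+)`.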

Thanks in advance for any assistance you can give.

In reply to How best to strip text from a file? by bobdabuilda
