How best to strip text from a file?
by bobdabuilda (Beadle) on Nov 02, 2012 at 02:39 UTC
bobdabuilda has asked for the
wisdom of the Perl Monks concerning the following question:
Hi guys. I've been mulling this one over for a while now, trying to decide/work out how best to handle it.
One of my users has a report she would like to run which results in over 1000 pages of rather poorly-presented text. Needless to say, she would like it if I could arrange for this to be trimmed considerably. My intention is to build this up in stages: first get the data extraction right, then use Spreadsheet::WriteExcel to put it into a much friendlier format for her.
See below for an example of the kind of text I'm working with - note this is just part of one "order", and these are repeated numerous times throughout the file. All the dates etc. have the "ability" of being different, so each "order" and each "distribution" of each order need to be extracted and handled on an individual basis.
So, out of that, I need to grab the values from the (in order of appearance) Order ID, fiscal cycle, Vendor ID, the number to the left of "requisition number", copies, title, ISBN/ISSN, holding code, copies, date received, date loaded.
My idea at this point is to loop through the file, looking for the start of the data I need to grab. I have managed to get this part done using:
That part works fine. I then moved on to actually trying to extract the data from the crud. The idea was to just loop through each bracket of data, stripping the data out of each line separately.
So, for the first one, I designed a regex which matches everything except the two pieces of data I want from the "Order ID" line. The plan was to negate that match, and dump the results into a variable, then move on to the next line. Sounded relatively easy, but I've not been able to work out where I've gone wrong with it... I think if I can get a little help working out how to do this one line, then the rest of it should fall into place pretty readily...
The test I have been trying to use for this is:
As it stands above, I get a string printed with null values. If I change the "!~" to "=~" then I get the output:
Order # Order ID:, Fiscal year: fiscal cycle:
... which is why I was trying to negate the regex match. So... could you please help me understand where it is I'm going wrong with this? Am I going about this the right way, or should I be thinking along different lines for processing all this text?
Thanks in advance for any assistance you can give.
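
For illustration only, here is a minimal sketch of the difference between the two binding operators discussed above. The sample line is hypothetical (the real report layout is not reproduced in this post), as are the variable names. It assumes each value is a run of non-whitespace characters following its label. The idea is to capture the wanted values with parentheses under "=~" rather than to negate a match of the unwanted text, since "!~" only ever yields true or false.

```perl
use strict;
use warnings;

# Hypothetical sample line; the actual report format may differ.
my $line = '   Order ID: PO-12345      fiscal cycle: 2012';

# In list context, =~ returns the captured groups when the match succeeds,
# so the parentheses go around the values, not around the labels.
my ($order_id, $fiscal_cycle) =
    $line =~ /Order ID:\s*(\S+)\s+fiscal cycle:\s*(\S+)/;

# !~ is just the logical negation of =~; it returns a boolean and
# never hands back the text that the pattern did not match.
my $no_match = ($line !~ /Order ID:/) ? 1 : 0;

print "Order # $order_id, Fiscal year: $fiscal_cycle\n"
    if defined $order_id;
```

With the assumed sample line, the captures hold the two values and $no_match is 0, because the line does contain "Order ID:". The same pattern of one capturing regex per labelled line could be repeated for the remaining fields, adjusted to whatever the real lines look like.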