Hi guys. I've been mulling this one over for a while now, trying to work out how best to handle it.

One of my users has a report she would like to run which results in over 1000 pages of rather poorly presented text. Needless to say, she would like it if I could arrange for this to be trimmed considerably. My plan is to get the data extraction right first, and then use Spreadsheet::WriteExcel to put the results into a much friendlier format for her.

See below for an example of the kind of text I'm working with. Note that this is just part of one "order", and blocks like it are repeated many times throughout the file. The dates and other values can differ from one entry to the next, so each "order", and each "distribution" within an order, needs to be extracted and handled individually.

    List of Distributions
    Produced Tuesday, 9 October, 2012 at 1:38 PM

    Order ID:PO-9999    fiscal cycle:21112
    Vendor ID:VEND99    order type:SUBSCRIPT
    15) requisition number:    copies:9
    call number:XX(9999999.999)    ISBN/ISSN:9999-999X
    Title:Item title here.
    ISSN:9999-999X
    Publication info:More text here about stuff
    Distribution-- packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO1    copies:1
    date received:27/6/2012    date loaded:27/6/2012
    Distribution-- packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO3    copies:2
    date received:27/9/2012    date loaded:27/6/2012
    Distribution-- packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO2    copies:1
    date received:25/8/2012    date loaded:27/6/2012

So, out of that, I need to grab the following values, in order of appearance: Order ID, fiscal cycle, Vendor ID, the number to the left of "requisition number", copies, ISBN/ISSN and title, and then, for each distribution, holding code, copies, date received and date loaded.
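One way to picture the target data is one record per order, each holding a list of its distributions. The following is a minimal sketch under stated assumptions, not a tested solution for the real report: the sub name `parse_orders`, the hash keys, and the line layout of the sample text are all hypothetical, and it assumes one line item per order, a Title that sits on a line of its own, and other values that contain no whitespace.

```perl
use strict;
use warnings;

# Sketch only. Assumptions: "Order ID:" opens a new order, "Distribution--"
# opens a new distribution inside it, Title sits on a line of its own, and
# every other labelled value contains no whitespace.
sub parse_orders {
    my @lines = @_;
    my ( @orders, $order, $dist );
    for my $line (@lines) {
        if ( $line =~ /Order ID:(\S+)/ ) {
            $order = { order_id => $1, distributions => [] };
            push @orders, $order;
            $dist = undef;
        }
        next unless $order;    # ignore page headers before the first order
        if ( $line =~ /Distribution--/ ) {
            $dist = {};
            push @{ $order->{distributions} }, $dist;
        }
        $order->{fiscal_cycle} = $1 if $line =~ /fiscal cycle:(\S+)/;
        $order->{vendor_id}    = $1 if $line =~ /Vendor ID:(\S+)/;
        $order->{item_no}      = $1 if $line =~ /(\d+)\)\s*requisition number:/;
        $order->{isbn_issn}    = $1 if $line =~ m{ISBN/ISSN:(\S+)};
        $order->{title}        = $1 if $line =~ /^\s*Title:(.*\S)/;

        # "copies:" occurs at both levels; file it under whichever is open.
        my $target = $dist || $order;
        $target->{copies} = $1 if $line =~ /copies:(\S+)/;
        if ($dist) {
            $dist->{holding_code}  = $1 if $line =~ /holding code:(\S+)/;
            $dist->{date_received} = $1 if $line =~ /date received:(\S+)/;
            $dist->{date_loaded}   = $1 if $line =~ /date loaded:(\S+)/;
        }
    }
    return @orders;
}

# Sample lines in an assumed layout, based on the excerpt above.
my @sample = split /\n/, <<'END';
  Order ID:PO-9999        fiscal cycle:21112
  Vendor ID:VEND99        order type:SUBSCRIPT
  15) requisition number:     copies:9
      call number:XX(9999999.999)   ISBN/ISSN:9999-999X
      Title:Item title here.
      Distribution--  packing list:STUFF-I-DONT-NEED-999
        holding code:CODEINFO1   copies:1
        date received:27/6/2012  date loaded:27/6/2012
      Distribution--  packing list:STUFF-I-DONT-NEED-999
        holding code:CODEINFO3   copies:2
        date received:27/9/2012  date loaded:27/6/2012
END

# One output row per distribution, repeating the order-level values.
for my $o ( parse_orders(@sample) ) {
    for my $d ( @{ $o->{distributions} } ) {
        print join( "\t", $o->{order_id}, $o->{title},
            $d->{holding_code}, $d->{copies}, $d->{date_received} ), "\n";
    }
}
```

Keeping the distributions as an array inside each order means a later step can emit one spreadsheet row per distribution without re-reading the file.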

My idea at this point is to loop over the file, looking for the start of the data I need to grab. I have managed to get this part done using:

    open( my $in, '<', $distfile ) or die "Can't open $distfile: $!\n";
    print "File opened\n";
    while ( my $line = <$in> ) {
        chomp $line;
        if ( $line =~ /^(\s+.+)Order ID/ ) {
            print "Found Order ID\n";
        }
    }
    close $in;

That part works fine. I then moved on to actually extracting the data from the crud. The idea was to loop through each block of data, stripping the values out of each line separately.

So, for the first one, I designed a regex which matches everything except the two pieces of data I want from the "Order ID" line. The plan was to negate that match, and dump the results into a variable, then move on to the next line. Sounded relatively easy, but I've not been able to work out where I've gone wrong with it... I think if I can get a little help working out how to do this one line, then the rest of it should fall into place pretty readily...

The test I have been trying to use for this is:

    my ($order,$fiscal) = ($line !~ m/(\s+Order ID:|\s+fiscal cycle:)/g);
    print "Order # $order, Fiscal year: $fiscal\n";

As it stands above, I get a string printed with null values. If I change the "!~" to "=~" then I get the output:

Order #        Order ID:, Fiscal year:                   fiscal cycle:

... which is why I was trying to negate the regex match. So... could you please help me understand where it is I'm going wrong with this? Am I going about this the right way, or should I be thinking along different lines for processing all this text?
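For what it's worth, `!~` only returns true or false, so the list assignment receives a single boolean rather than any text. With `=~`, the parentheses in the regex surround the labels, so the labels are what get captured. One possible alternative is to put the capture groups around the values that follow the labels. This is a minimal sketch: the sub name `order_and_fiscal` is hypothetical, and it assumes both values sit on the same line and contain no whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (order id, fiscal cycle), or an empty list
# when the line is not an "Order ID" line. Assumes neither value contains
# whitespace.
sub order_and_fiscal {
    my ($line) = @_;

    # In list context a successful match returns its captures ($1, $2),
    # so the parentheses go around the values rather than the labels.
    return $line =~ /Order ID:(\S+)\s+fiscal cycle:(\S+)/;
}

my $line = '        Order ID:PO-9999                   fiscal cycle:21112';
if ( my ( $order, $fiscal ) = order_and_fiscal($line) ) {
    print "Order # $order, Fiscal year: $fiscal\n";
}
```

Because a failed match returns an empty list, the `if` doubles as the "is this an Order ID line?" test, so no separate check is needed before extracting.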

Thanks in advance for any assistance you can give.


In reply to How best to strip text from a file? by bobdabuilda
