PerlMonks  

Getting lines in a file between two patterns

by Daikini (Initiate)
on Oct 30, 2015 at 00:02 UTC ( [id://1146421]=perlquestion )

Daikini has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm an absolute newbie to Perl and I'm trying to print the contents of a table (in a docx) into a tab-delimited or xlsx file, either is fine.

There are over 100 docx's that have tables in them with identical delimiters ("Version History", "Table of Contents").

I can print the entire contents of the file to a .txt, but I can't yet get the data between the delimiters.

Any suggestions?

Thank you!



An update with all of the code I'm currently using:
use strict;
use warnings;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Variant;
$|=1;

sub Parse{
    my $document_name = 'C:\TestPolicy.rtf';
    my $word = Win32::OLE->GetActiveObject('Word.Application')
        || Win32::OLE->new('Word.Application','Quit')
        or die Win32::OLE->LastError();
    my $document = $word->Documents->Open($document_name)
        or die Win32::OLE->LastError();
    my $paragraphs = $document->Paragraphs ();
    my $n_paragraphs = $paragraphs->Count ();
    my $outputfile = 'C:\testfile.txt';
    open(INPUT, $document_name) or die "Failed to open $document_name\n";
    while (<INPUT>){
        if ($_ =~ /HISTORY/../TABLE/){
            open(OUTPUT, '>'.$outputfile) or die "Can't create $outputfile.\n";
            print OUTPUT "$_\n";
            close OUTPUT;
        }
    }
    close INPUT;
}

Parse()

Replies are listed 'Best First'.
Re: Getting lines in a file between two patterns
by Discipulus (Canon) on Oct 30, 2015 at 08:03 UTC
    Hello Daikini, and welcome to the monastery and to the wonderful world of Perl.

    The name of what you are looking for is a strange one: the flip-flop operator.

    A more serious name is 'bistable operator'. It is described in the docs under Range Operators.
    The normal usage looks like:
    while (<FH>) { print if $_ =~ /start/../end/ }
    That is, it evaluates false until 'start' is matched, then true on every line up to and including the one where 'end' is matched (which is why the example above prints those lines), and false again after that.
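
    A small sketch of that behaviour, assuming the two delimiters from the question ("Version History", "Table of Contents") and some made-up sample lines in place of a real file. It relies on the documented return value of the flip-flop: a sequence number starting at 1, with "E0" appended on the line that matches the end pattern, which makes it possible to drop the delimiter lines themselves.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from a file.
my @lines = (
    "Intro text",
    "Version History",
    "1.0\tFirst draft",
    "1.1\tFixed typos",
    "Table of Contents",
    "Chapter 1",
);

my (@with_markers, @between);
for (@lines) {
    # In scalar context .. is the flip-flop. While true it returns a
    # sequence number: 1 on the start line, counting up, with "E0"
    # appended on the line that matches the end pattern.
    my $seq = /Version History/ .. /Table of Contents/;
    next unless $seq;

    push @with_markers, $_;                               # delimiters included
    push @between, $_ unless $seq == 1 or $seq =~ /E0\z/; # delimiters excluded
}

print "$_\n" for @between;
```

    With this sample data, @with_markers ends up holding the two delimiter lines plus the two rows, and @between holds only the two rows.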

    Please note that, for our convenience and for historical reasons (?), if either operand of the flip-flop .. operator is a constant, it is considered true if it equals the current line number $. so if you write:
    if (101 .. 200) { print; }
    it means:
    if ($. == 101 .. $. == 200) { print; }
    Or, in words: 'print only lines 101 to 200'.
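
    The line-number form can be tried without a real file. The sketch below is only an illustration: it reads from an in-memory filehandle holding five invented lines and keeps lines 2 to 4.

```perl
use strict;
use warnings;

# Five made-up lines, read through an in-memory filehandle so that $.
# (the current input line number) is maintained as it would be for a file.
my $text = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";

my @picked;
while (<$fh>) {
    chomp;
    # Constant operands are compared against $. behind the scenes,
    # so this means: $. == 2 .. $. == 4
    push @picked, $_ if 2 .. 4;
}
close $fh;

print "$_\n" for @picked;
```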

    Also have a read of Flipin good, or a total flop? and flip-flop interpolation.

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      So, I've implemented your code, which is getting me closer to the goal. Thank you for that.

      Here is what I've got running:

      while (<INPUT>){
          if ($_ =~ /HISTORY/../TABLE/){
              open(OUTPUT, '>'.$outputfile) or die "Can't create $outputfile.\n";
              print OUTPUT "$_\n";
              close OUTPUT;
          }
      }

      The data I'm trying to pull out is in a table (in Word) and it will find the flip/flop, print it to the OUTPUT. That much is working. However, the data it returns is a jumbled mess of characters. It isn't intelligible.

      So, I tried saving a docx as rtf... no luck (it won't copy the table data, only gobbledegook), then I tried .txt... That returns the text/data, but I lose the neat and tidy layout of the table.

      Any suggestions on how to copy the table as a table or at least delimited somehow?

      Thank you again,
      Daikini

        You have your file-opening code inside the loop through each line. It should be moved outside, so the file is only opened once instead of once for every line (and overwritten each time), and also would be better written as:

        open my $OUTPUT, '>', $outputfile or die "File open fail: $!\n";
        See open.
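
        Put together, the loop might look something like the sketch below. The sub name copy_between and its two path arguments are invented for illustration; the patterns are the ones from the question. Both files are opened once, before the loop, with lexical filehandles and three-argument open.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the lines from the first one matching /HISTORY/
# through the first one matching /TABLE/ (both included) into $out_path.
sub copy_between {
    my ($in_path, $out_path) = @_;

    # Open each file exactly once, outside the loop.
    open my $in,  '<', $in_path  or die "Can't read $in_path: $!\n";
    open my $out, '>', $out_path or die "Can't create $out_path: $!\n";

    while (my $line = <$in>) {
        # $line still carries its newline, so none is added when printing.
        print {$out} $line if $line =~ /HISTORY/ .. $line =~ /TABLE/;
    }

    close $in;
    close $out or die "Can't close $out_path: $!\n";
}

# Usage: perl script.pl input.txt output.txt
copy_between(@ARGV) if @ARGV == 2;
```

        One thing to keep in mind: the flip-flop remembers its state between calls, so if an input file never matches the end pattern, the next file processed by the same sub would start in the 'on' state.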

        As for your data formatting question, you should probably use a module for reading your Windows file format, maybe Win32::OLE? (Disclaimer: I haven't programmed on Windows for years; search CPAN and you'll almost certainly find what you need if that's not it.)
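
        For what it's worth, here is an untested sketch of that idea. It assumes Windows with Word installed, and it goes through Word's own object model (Tables, Rows, Columns, Cell, Range->Text) instead of reading the raw file. The sub names clean_cell and dump_tables are made up for the example. It writes every table in the document as tab-delimited rows; picking out only the table between "Version History" and "Table of Contents" is not attempted, and tables with merged cells may need extra handling because Cell(row, column) can fail on them.

```perl
use strict;
use warnings;

# Word ends each cell's text with a carriage return plus an end-of-cell
# mark (\x07). Strip that, and flatten any line breaks or tabs inside
# the cell so each cell stays in a single tab-delimited field.
sub clean_cell {
    my ($text) = @_;
    $text =~ s/[\r\x07]+\z//;
    $text =~ s/[\r\n\t]+/ /g;
    return $text;
}

# Hypothetical helper: write all tables of one document as tab-delimited text.
sub dump_tables {
    my ($doc_path, $out_path) = @_;

    require Win32::OLE;    # Windows only; loaded here so the rest runs anywhere
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die Win32::OLE->LastError();
    my $doc = $word->Documents->Open($doc_path)
        or die Win32::OLE->LastError();

    open my $out, '>', $out_path or die "Can't create $out_path: $!\n";

    my $tables = $doc->Tables;
    for my $t (1 .. $tables->Count) {
        my $table = $tables->Item($t);
        for my $r (1 .. $table->Rows->Count) {
            my @cells;
            for my $c (1 .. $table->Columns->Count) {
                my $cell = $table->Cell($r, $c);   # undef if the cell does not exist
                push @cells, $cell ? clean_cell($cell->Range->Text) : '';
            }
            print {$out} join("\t", @cells), "\n";
        }
    }

    close $out or die "Can't close $out_path: $!\n";
    $doc->Close(0);        # 0 = close without saving changes
}

# Usage: perl script.pl C:\path\to\input.docx C:\path\to\output.txt
dump_tables(@ARGV) if @ARGV == 2;
```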

        Hope this helps!

        The way forward always starts with a minimal test.
      Thank you very much, Discipulus, for the quick reply. I'll give this a shot and let you know how I do.
