http://www.perlmonks.org?node_id=1028637

better has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse a table from an RTF file and read each row of the table into an array. I tried to read it as a normal file into an array.

while (<$fhIn>) { @arr = split("row", $_); }

I would like to remove all the table-defining parameters, like:

;cell}\pard\intbl{\fs20\f2\cf0\cb1

After parsing the table into an array, I tried a substitution: $_ =~ s/foo//; But the pattern doesn't match the cell parameters. They stay unchanged. All I can do is delete single words like 'cell', 'cb1' etc.

Any hint how to start?

better
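
A likely reason the substitution leaves the fragment untouched is that backslashes and braces are regex metacharacters, so the fragment cannot be pasted into s/// as it stands. A minimal sketch, assuming the fragment is matched literally; the words "first" and "second" are made-up stand-ins for real cell text:

```perl
use strict;
use warnings;

# Fragment copied from the question. Single quotes keep the backslashes literal.
my $junk = ';cell}\pard\intbl{\fs20\f2\cf0\cb1';

# Hypothetical input line; "first" and "second" stand in for real cell text.
my $line = 'first' . $junk . 'second';

# \Q...\E quotes every regex metacharacter in $junk (backslashes, braces),
# so the pattern matches the fragment character for character.
( my $clean = $line ) =~ s/\Q$junk\E/ /g;

print "$clean\n";
```

This only removes one exact fragment; formatting that varies from cell to cell needs a more general pattern or a tokenizer module.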

Re: parsing a table
by hdb (Monsignor) on Apr 14, 2013 at 19:56 UTC

    Would it be possible to post a few rows of your table?

    Have you looked at the RTF Parser module on CPAN?

Re: parsing a table
by Anonymous Monk on Apr 15, 2013 at 01:41 UTC

    Any hint how to start?

    Use a module like RTF::Tokenizer, or read the RTF RFC if one exists.

Re: parsing a table
by space_monk (Chaplain) on Apr 15, 2013 at 11:41 UTC
    The RTF specification changed with each release of Word; the latest appears to be 1.9.1. The Wikipedia article links to the specifications. There are many RTF packages that may help; I suggest RTF::Parser, and a quick browse of The RTF Cookbook may also help.
    A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: parsing a table
by hdb (Monsignor) on Apr 15, 2013 at 12:38 UTC

    Looking at an .rtf file, the spec and the available modules, the situation seems difficult. RTF::Tokenizer seems helpful for reducing the complexity a bit. I created a sample RTF file using MS Word which contains one table only, and the following script gets me most of the contents of the table (and some more). I do not dare say whether this helps in your situation.

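    use strict;
    use warnings;
    use RTF::Tokenizer;

    my $rtf = RTF::Tokenizer->new( file => "A.rtf" );
    my $on  = 0;    # true while we are inside a table row

    while (1) {
        # get_token returns the token type, its argument and its parameter
        my ( $type, $arg, $param ) = $rtf->get_token();
        last if $type eq "eof";

        print "TYPE|$type|ARGUMENT|$arg|PARAMETER|$param|\n"
            if $on and $type eq "text";

        # \ltrrow starts a row in the Word sample, \row ends it
        $on = 1 if $type eq "control" and $arg eq "ltrrow";
        $on = 0 if $type eq "control" and $arg eq "row";
    }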
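
For very simple tables, the rows and cells can also be pulled out with core Perl alone. The following is only a sketch under strong assumptions: the sample markup is invented and far cleaner than real Word output, the helper names parse_rows and strip_controls are hypothetical, and nested groups or escaped characters are not handled. It treats \row as the end of a row and \cell as the end of a cell, then strips the remaining control words.

```perl
use strict;
use warnings;

# Hypothetical, heavily simplified table markup (not real Word output).
my $rtf = join '',
    '\trowd\cellx1000\cellx2000',
    '\pard\intbl{\fs20 Name}\cell',
    '\pard\intbl{\fs20 Age}\cell\row',
    '\trowd\cellx1000\cellx2000',
    '\pard\intbl{\fs20 Alice}\cell',
    '\pard\intbl{\fs20 30}\cell\row';

# Remove control words (with optional numeric parameter and one
# delimiting space), group braces, and surrounding whitespace.
sub strip_controls {
    my ($s) = @_;
    $s =~ s/\\[a-zA-Z]+-?\d* ?//g;
    $s =~ s/[{}]//g;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Return a list of array refs, one per row, each holding the cell texts.
sub parse_rows {
    my ($text) = @_;
    my @rows;
    # Everything up to each \row terminator is one row; the lookahead
    # keeps longer control words from being mistaken for \row.
    while ( $text =~ /(.*?)\\row(?![a-zA-Z])/gs ) {
        my $row = $1;
        my @cells;
        # Everything up to each \cell terminator is one cell (\cellx is skipped).
        while ( $row =~ /(.*?)\\cell(?![a-zA-Z])/gs ) {
            push @cells, strip_controls($1);
        }
        push @rows, \@cells;
    }
    return @rows;
}

my @rows = parse_rows($rtf);
print join( " | ", @$_ ), "\n" for @rows;
```

Anything beyond such trivial input is better served by RTF::Tokenizer or RTF::Parser, as suggested in the replies above.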