kevyt has asked for the wisdom of the Perl Monks concerning the following question:
Can someone please help me search a rtf file?
I'm able to find desired text in a txt file but not a rtf file.
I would like to find the date after the text "Date of Last Update:" and count the number of signatures in the file.
I've not had luck with RTF::Tokenizer or Win32::OLE.
I will eventually print the directory, filename, date the file was uploaded, date of last update, and number of signatures.
Thanks for your help.
use strict;
use warnings;
use File::stat;
use Time::localtime;
use RTF::Tokenizer;
require Win32::OLE;
sub get_data_from_file ;
my $fullpath = my $dir = 'C:\Test';
my %hash;
opendir (DIR, $dir) or die $!;
while (my $dir = readdir(DIR)) {
my $file = $fullpath = 'C:\Test\\' . $dir;
opendir (FILE, $file) or die $!;
while (my $file = readdir(FILE)) {
if ($file =~/RTF/ || $file =~/rtf/ ){
# if ( $file =~/txt/ ){
# print "fullpath = $fullpath file is $file\n";
# $hash{$dir}{$file}{PATH} = $fullpath . '\\' . $file;
my $upload_date = ctime(stat($fullpath . '\\' . $file)
+->mtime);
$hash{$dir}{$file}{UPLOAD_DATE} = $upload_date;
# print "dir is $dir file is $file date is $hash{$dir}
+{$file}{UPLOAD_DATE}\n";
# print $hash{$dir}{$file}{PATH};
open (IN, $fullpath . '\\' . $file) or die ("Can't ope
+n the input file $!");
while ( <IN> ){
# Try to find date on this line "Document Owner: D
+ate of Last Update: 10/15/2013
#print $_;
chomp;
if( m/Date of Last Update:(.*)/) {
print $1;
next;
}
# Count the number of signature lines
}
close (IN);
}
}
#print "dir is $dir\n";
}
closedir(DIR);
closedir(FILE);
#print_data (\%hash);
sub print_data {
my ($h_ref) = @_;
foreach my $dir (keys %$h_ref){
foreach my $file (keys $h_ref->{$dir}){
print "$dir, $file, $h_ref->{$dir}{$file}{UPLOAD_DATE}
+\n";
}
}
}
Sample Document
Document #: 000 Version #: 1
Document Owner: Someone Date of Last Update: 10/15/2013
Written by: Someone Status: Approved
General Description
Purpose
Definitions
Procedures
Sign-Off Approvals
______________________________________________________/_____/_____
Jane Doe, CEO Date
______________________________________________________/_____/_____
John Doe, CFO Date
Re: How can I find a line in a RTF file?
by wjw (Priest) on Aug 08, 2014 at 01:21 UTC
|
Just out of curiosity: Would the RTF::TEXT::Converter module work instead? Have not used it myself, but was theorizing that it might be simpler to process plain text output from this module...? Different approach... Just a thought...
Update:
I became curious about this so tried the following:
#!/usr/bin/perl
use Modern::Perl;
use RTF::TEXT::Converter;
my $string;
my $object = RTF::TEXT::Converter->new(output => \$string);
$object->parse_stream( "/home/wjw/tmp/RTF_test.rtf" );
chomp $string;
my @string = split("\n", $string);
foreach my $line (@string) {
chomp $line;
next if $line !~ /\w+/;
next if $line =~ /^\_+\/\_+\/\_+$/;
say $line;
}
.. which is in-complete and with a leaning toothpick eye-sore in the second regex, but gets one to the point of only having to deal with the text.
Output is as follows:
Use of uninitialized value $cstylename in string ne at /home/wjw/perl5
+/perlbrew/perls/perl-5.20.0/lib/site_perl/5.20.0/RTF/Control.pm line
+1489, <GEN1> line 16.
Document #: 000 Version #: 1
Document Owner: Someone Date of Last Update: 10/15/2013
Written by: Someone Status: Approved
General Description
Purpose
Definitions
Procedures
Sign-Off Approvals
Jane Doe, CEO Date
John Doe, CFO Date
I did not dive into the warning issued. Note that the install of the referred to module actually depends on two others, one of which I think you were using already.
cpanm RTF::TEXT::Converter
--> Working on RTF::TEXT::Converter
Fetching http://www.cpan.org/authors/id/S/SA/SARGIE/RTF-Parser-1.12.ta
+r.gz ... OK
Configuring RTF-Parser-1.12 ... OK
==> Found dependencies: RTF::Tokenizer
--> Working on RTF::Tokenizer
Fetching http://www.cpan.org/authors/id/S/SA/SARGIE/RTF-Tokenizer-1.18
+.tar.gz ... OK
Configuring RTF-Tokenizer-1.18 ... OK
Building and testing RTF-Tokenizer-1.18 ... OK
Successfully installed RTF-Tokenizer-1.18
Building and testing RTF-Parser-1.12 ... OK
Successfully installed RTF-Parser-1.12
2 distributions installed
...the majority is always wrong, and always the last to know about it...
Insanity: Doing the same thing over and over again and expecting different results...
A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a facct
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Hello wjw,
OMG, everyday that I keep reading posts here I discover so many new modules that Perl has that I can not even keep up. :)
It looks as possible solution though, thanks for sharing.
Seeking for Perl wisdom...on the process...not there...yet!
| [reply] [Watch: Dir/Any] [d/l] |
|
I got it to work thanks to everyone's help.
I placed a # in front of "use warnings" in C:\Perl64\site\lib\RTF\Control.pm so I would not see a screen of warning about an uninitialized value in that file.
I have 3 warnings in the print statement for using of an initialized value.
Thank you very much!
Test code:
#!/usr/bin/perl
use Modern::Perl;
use RTF::TEXT::Converter;
my $string;
my $object = RTF::TEXT::Converter->new(output => \$string);
$object->parse_stream( '..\Policies\Test\000144028.rtf' );
chomp $string;
my @string = split("\n", $string);
my $number_of_sigs = 0;
foreach my $line (@string) {
chomp $line;
if ($line =~ m/Date of Last Update:(.*)/){
my $date = $1;
$date =~ s/^\s+//;
my @date_of_last_update = split (/ /, $date);
print "Date of last update is: $date_of_last_update[0]\n";
}
if ($line =~ m/Document #:(.*)/){
print "\nDocument number is $1\n";
}
# Sign-Off Approvals
if ($line =~/_____\/_____\/_____/){
# print $line;
$number_of_sigs ++;
}
}
print "\nnumber of signatures is $number_of_sigs \n";
Finished Code:
use strict;
use warnings;
use HTTP::Date;
use File::stat;
use Time::localtime;
use Modern::Perl;
use RTF::TEXT::Converter;
my $fullpath = my $dir = 'C:\Policies\\';
my %hash;
my %months = (
"Jan" => "1",
"Feb" => "2",
"Mar" => "3",
"Apr" => "4",
"May" => "5",
"Jun" => "6",
"Jul" => "7",
"Aug" => "8",
"Sep" => "9",
"Oct" => "10",
"Nov" => "11",
"Dec" => "12"
);
#print "full path is $fullpath \n";
opendir (DIR, $dir) or die $!;
#print "dir is $dir \n";
while (my $dir = readdir(DIR)) {
my $file = $fullpath = 'C:\Policies\\' . $dir;
opendir (FILE, $file) or die $!;
while (my $file = readdir(FILE)){
# print "7 reading $file\n";
if (($file =~/RTF/ || $file =~/rtf/ )&& $file !~/^~/ ){
my $path_file = $fullpath . '\\'. $file;
#print '\nfullpath = ' . $fullpath . '\\' . $file . '\
+n';
my $string;
my $object = RTF::TEXT::Converter->new(output => \$str
+ing);
$object->parse_stream( $path_file );
# print ctime(stat($fullpath . '\\' . $file)->mtime);
my ($wday, $mon, $day, $time, $yr ) = split (/\s+/, ct
+ime(stat($fullpath . '\\' . $file)->mtime) ); # split on space
# print "\n$wday, $mon, $day, $time, $yr \n";
# print "file is $file my date is $months{$mon}/$day/$
+yr\n\n" ;
$hash{$dir}{$file}{UPLOAD_DATE} = $months{$mon} . "/"
+. $day . "/" .$yr;
chomp $string;
my @string = split("\n", $string);
my $line_num =0;
foreach my $line (@string) {
$line_num ++;
#print "Reading line number $line_num in $file\
+n";
chomp $line;
if ($line =~ m/Date of Last Update:(.*)/){
my $date = $1;
$date =~ s/^\s+//;
my @last_update_date = split (/ /, $date);
#print "Date of last update is: $last_update_da
+te[0] line number $line_num\n";
$hash{$dir}{$file}{LAST_UPDATE_DATE} = $last_up
+date_date[0];
}
# print "5 line is $line \n";
if ($line =~ m/Document #:(.*)/){ # not needed bec
+ause the file name is the policy number
#print "Document number is $1 line number $lin
+e_num\n";
my $Policy_number = $1;
$Policy_number =~ s/\t+/ /g; # replace all tab
+s with spaces
$Policy_number =~ s/^\s+//; # remove leading s
+paces
my @arr = split (/ /, $Policy_number );
$hash{$dir}{$file}{DOC_NUM_AND_VERSION} = $arr
+[0];
# print "array is @arr and element 0 is $arr[0
+]\n";
#print "line is $line \n";
#print "1Path is $fullpath " . "\\" . "$file\n
+";
}
# Sign-Off Approvals
if ($line =~/_____\/_____\/_____/){
#print $line;
$hash{$dir}{$file}{NUM_OF_SIGS} ++;
}
}
#print "$hash{$dir}{$file}{NUM_OF_SIGS}\n";
}
}
#print "dir is $dir\n";
}
closedir(DIR);
closedir(FILE);
print_data (\%hash);
sub print_data {
my ($h_ref) = @_;
open (OUT, '> C:\Dev\Policy_upload_dates.csv') or die ("Can't
+open the output file $!");
print OUT "Department,File_Name,Date_of_Last_Update,Upload_Dat
+e,Policy_Number,Number_of_Signatures\n";
foreach my $dir (keys %$h_ref){
foreach my $file (keys $h_ref->{$dir}){
printf OUT "%s,%s, %s,%s, %s,%s\n",
$dir,
$file,
$h_ref->{$dir}{$file}{LAST_UPDATE_DATE},
$h_ref->{$dir}{$file}{UPLOAD_DATE},
$h_ref->{$dir}{$file}{DOC_NUM_AND_VERSION},
$h_ref->{$dir}{$file}{NUM_OF_SIGS}
;
}
}
}
One of many input files:
COMPUTER APPLICATION SELECTION
Policy and Procedure
Document #: 123.456.789 Version #: 6
Document Owner: Date of Last Update: 04/22/2003
Written by: INFORMATION SYSTEMS Status: Approved
General Description
Purpose
To establish guidelines for computer application selection.
Policy
GENERAL INFORMATION
RESPONSIBILITY
Procedure
A. Pages of text …
DEAPRTMENT – INFORMATION TECHNOLOGY
Document Control
Revision History
Ver. 1 02/01/1996 INITIAL
Sign-Off Approvals
The person responsible for this document must verify accuracy and that
+ the steps for this procedure or work instruction have been tested an
+d validated. After you have approved this document, please sign and
+date below.
________________________________________________________________ _____
+/_____/_____
________________________________________________________________ _____
+/_____/_____
Output:
Department,File_Name,Date_of_Last_Update,Upload_Date,Policy_Number,Num
+ber_of_Signatures
Test,000777012.rtf, 08/23/2007,3/29/2010, 000.777.012,3
Test,000777034.rtf, 3/27/2013,6/5/2013, 000.777.034,3
Test,000777056.rtf, 05/10/2013,6/4/2013, 000.777.056,3
Test,000777078.rtf, 3/28/2013,6/13/2013, 000.777.078,3
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: How can I find a line in a RTF file?
by james28909 (Deacon) on Aug 08, 2014 at 01:01 UTC
|
I dont know if this will be of much help, but i worked for me and prints "Date of Last Update: 10/15/2013".
But but you may have to do a work around unless the dates are always displayed as "01/01/2014" (10 characters), if its "1/1/2014" (8 characters)then you will have to modify the pattern matches position like "@-"+1, and then read one less character as well. which should print " Date of Last Update: 1/1/2014"
use File::Slurp;
my $file = read_file( "file.rtf" );
$match = "Date of Last Update: ";
if ($file =~ /$match/){
open my $file, "<", "file.rtf";
my $startpos = "@-"+2; #if date is 8 chars then change to "@-"+1
seek $file, $startpos, 0;
read $file, my $calender_date, 31; #if date is 8 chars change to 30
print $calender_date;
}
And im not sure about the signatures, but hopefully the above will be helpful :)
You can also return the position in which the match happens with @- for beginning of the match or @+ for the end of the match.
nevermind the above <.< here is working code that will get the exact info you want.
use File::Slurp;
my $file = read_file( "1.rtf" );
$match0 = "Date of Last Update: ";
$match1 = "Sign-Off Approvals";
$match2 = "Jane Doe, CEO";
if ($file =~ /$match0/){
open my $file, "<", "1.rtf";
my $startpos0 = "@-";
seek $file, $startpos0 +2, 0;
read $file, my $calender_date, 29;
if ($calender_date =~ /\\/){
seek $file, $startpos0 +2, 0;
read $file, my $calender_date, 31;
print "\n";
print "$calender_date\n\n";
}
else {
print "\n";
print "$calender_date\n\n";
}
}
if ($file =~ /$match1/){
open my $file, "<", "1.rtf";
my $startpos1 = "@+";
seek $file, $startpos1 +42, 0;
read $file, my $first_sig, 66;
print "$first_sig\n";
print "Jane Doe\n\n";
}
if ($file =~ /$match2/){
open my $file, "<", "1.rtf";
my $startpos2 = "@+";
seek $file, $startpos2 +82, 0;
read $file, my $second_sig, 67;
print "$second_sig\n";
print "John Doe\n\n";
}
prints:
C:\Users\guy\Desktop\tests>test.pl
Date of Last Update: 10/15/2013
______________________________________________________/_____/_____
Jane Doe
______________________________________________________/_____/_____
John Doe
This simply just pattern matches then seeks from the match.
You should also be able to do something like
if (first_sig = "_____________________________________________________
+_/_____/_____"){
print "There is no first signature";
}
else{
print "Document has a first signature";
}
if (second_sig = "____________________________________________________
+__/_____/_____"){
print "There is no second signature\n";
}
else{
print "Document has a second signature\n";
}
this also includes if your date is only 8 characters eg: 1/1/2013. if not, then dont worry about it.
this is also based off of the example you provided and will only work if every rtf file your editing is formatted identical to it.
also fyi, if you open this or any document in a hex editor, you will see the formatting of the file, and can manually parse it to get whatever data you want. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
Thanks. I almost have this working. I dont know how the index works in the code above. I need to do more searching for @- and @+.
I'm scanning the documents on a windows OS because the dates will change if I transfer to linux. the documents might also have several pages of text between the "Date of Last Update" and signatures. I'm on a team of a few people trying to determine why policies take so long to get signed. I have not used perl extensively for several years. Thanks for your help. I'll update the page with the code and 2 sample files when I am done.
#!/usr/bin/perl
use Modern::Perl;
use RTF::TEXT::Converter;
my $string;
my $object = RTF::TEXT::Converter->new(output => \$string);
$object->parse_stream( '..\Policies\Test\000144027.rtf' );
chomp $string;
my @string = split("\n", $string);
my $number_of_sigs = 0;
foreach my $line (@string) {
chomp $line;
if ($line =~ m/Date of Last Update:(.*)/){
my $date_of_last_update = $1;
$date_of_last_update =~ s/^\s+//;
print "Date of last update is: $date_of_last_update\n";
}
if ($line =~ m/Document #:(.*)/){
print "\nDocument number is $1\n";
}
# Sign-Off Approvals
if ($line =~/'Sign-Off Approvals'/){
print $line;
$number_of_sigs ++;
}
# next;
#next if $line !~ /\w+/;
#next if $line =~ /^\_+\/\_+\/\_+$/;
#say $line;
}
print "number of signatures is $number_of_sigs \n";
| [reply] [Watch: Dir/Any] [d/l] |
|
honestly it shouldnt matter if there are 400 pages between dat and signature. as long as they are all in the same document and there is never another instance of it. also with the code i posted, it shouldnt matter if the dates change at all. they can be 01/01/0001, or 1/1/0001, you can add an extra elsif and add to include if dat is 1/1/01 as well
| [reply] [Watch: Dir/Any] |
|
and its just a simple pattern match then it will seek so many characters at the beginning or the end of the match. i put in the code i posted @- and @+ to show you how its used. if you want to see how it works, open the file in HxD editor and look on the text side (right side) and compare with the script.
| [reply] [Watch: Dir/Any] |
|
Wow. Thanks for all of the replies. I was offline for most of the day but I'll read / try these and let you know how it worked! Thanks so much. Kevin
| [reply] [Watch: Dir/Any] |
Re: How can I find a line in a RTF file?
by thanos1983 (Parson) on Aug 08, 2014 at 01:43 UTC
|
Hello Kevyt,
I am not really familiar with RTF Files but I created a few solutions that I think is exactly what you need.
I have tested them on a file test.rtf that I create. Based on what I see all solutions work fine.
Upsate
Forgot to wright, never forget to use the or die (function) when opening and closing files. They have saved me several times.
Small code modification updates
#!/usr/bin/perl
use strict;
use warnings;
sub regex {
my $file = 'test.rtf';
my $substr = 'Date of Last Update:';
open( my $in ,"<", $file)
or die "Can not open file: ".$file.": $!\n";
while ( <$in> ) {
chomp($_);
# Solution: 1 substring
# if (index($_, $substr) != -1) {
# Solution: 2 regex string{times,} at least string 1s
# if( $_ =~ /(?:$substr){1,}/) {
# Solution: 3 regex match string or more times: +
# if( $_ =~ /(?:$substr)+/ ) {
# Solution: 4 quotemeta match string
if( $_ =~ /\Q$substr\E/ ) {
chop($_); # I use chop to remove the last trailing character (
+})
print $_ . "\n";
push(@_,$_);
next;
}
}
# Count the number of signature lines
print "I found the string: ".@_." time(s)!\n";
close ($in)
or die "Can not close file: ".$file.": $!\n";
}
regex();
sub my_grep {
# Solution 5: grep in perl!!!! (My favorite)
my $file = 'test.rtf';
my $substr = 'Date of Last Update:';
open(my $in ,"<", $file)
or die "Can not open file: ".$file.": $!\n";
@_ = <$in>;
chomp @_;
my @out = grep { $_ =~ /Date of Last Update:/ } @_;
print "I found the string: ".@out." time(s)!\n";
close ($in)
or die "Can not close file: ".$file.": $!\n";
}
# &my_grep();
I hope this solves your problems.
Seeking for Perl wisdom...on the process...not there...yet!
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
I'm absent-minded, so instead of trying to remember to add a "die" to every operation that might need it, I just keep "use autodie;" in the standard boilerplate I put at the top of every script and module I write. For my purposes it does the trick fine.
The autodie module and pragma on CPAN
| [reply] [Watch: Dir/Any] [d/l] |
|
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Hello Anonymous Monk,
Well I am not an expert I consider my self just a beginner, so any suggestions for improvement are always welcome. :D
Thank you for your time and effort to write these comments. :D
Seeking for Perl wisdom...on the process...not there...yet!
| [reply] [Watch: Dir/Any] [d/l] |
Re: How can I find a line in a RTF file?
by bulrush (Scribe) on Aug 09, 2014 at 12:05 UTC
|
| [reply] [Watch: Dir/Any] |
|
There are over 1000 files.
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] [d/l] |
|
|