Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

How can I find a line in a RTF file?

by kevyt (Scribe)
on Aug 07, 2014 at 21:30 UTC ( [id://1096681]=perlquestion: print w/replies, xml ) Need Help??

kevyt has asked for the wisdom of the Perl Monks concerning the following question:

Can someone please help me search a rtf file?

I'm able to find desired text in a txt file but not a rtf file.

I would like to find the date after the text "Date of Last Update:" and count the number of signatures in the file.

I've not had luck with RTF::Tokenizer or Win32::OLE.

I will eventually print the directory, filename, date the file was uploaded, date of last update, and number of signatures.

Thanks for your help.

use strict; use warnings; use File::stat; use Time::localtime; use RTF::Tokenizer; require Win32::OLE; sub get_data_from_file ; my $fullpath = my $dir = 'C:\Test'; my %hash; opendir (DIR, $dir) or die $!; while (my $dir = readdir(DIR)) { my $file = $fullpath = 'C:\Test\\' . $dir; opendir (FILE, $file) or die $!; while (my $file = readdir(FILE)) { if ($file =~/RTF/ || $file =~/rtf/ ){ # if ( $file =~/txt/ ){ # print "fullpath = $fullpath file is $file\n"; # $hash{$dir}{$file}{PATH} = $fullpath . '\\' . $file; my $upload_date = ctime(stat($fullpath . '\\' . $file) +->mtime); $hash{$dir}{$file}{UPLOAD_DATE} = $upload_date; # print "dir is $dir file is $file date is $hash{$dir} +{$file}{UPLOAD_DATE}\n"; # print $hash{$dir}{$file}{PATH}; open (IN, $fullpath . '\\' . $file) or die ("Can't ope +n the input file $!"); while ( <IN> ){ # Try to find date on this line "Document Owner: D +ate of Last Update: 10/15/2013 #print $_; chomp; if( m/Date of Last Update:(.*)/) { print $1; next; } # Count the number of signature lines } close (IN); } } #print "dir is $dir\n"; } closedir(DIR); closedir(FILE); #print_data (\%hash); sub print_data { my ($h_ref) = @_; foreach my $dir (keys %$h_ref){ foreach my $file (keys $h_ref->{$dir}){ print "$dir, $file, $h_ref->{$dir}{$file}{UPLOAD_DATE} +\n"; } } }


Sample Document

Document #: 000 Version #: 1 Document Owner: Someone Date of Last Update: 10/15/2013 Written by: Someone Status: Approved General Description Purpose Definitions Procedures Sign-Off Approvals ______________________________________________________/_____/_____ Jane Doe, CEO Date ______________________________________________________/_____/_____ John Doe, CFO Date

Replies are listed 'Best First'.
Re: How can I find a line in a RTF file?
by wjw (Priest) on Aug 08, 2014 at 01:21 UTC

    Just out of curiosity: Would the RTF::TEXT::Converter module work instead? Have not used it myself, but was theorizing that it might be simpler to process plain text output from this module...?

    Different approach... Just a thought...

    Update:

    I became curious about this so tried the following:

    #!/usr/bin/perl use Modern::Perl; use RTF::TEXT::Converter; my $string; my $object = RTF::TEXT::Converter->new(output => \$string); $object->parse_stream( "/home/wjw/tmp/RTF_test.rtf" ); chomp $string; my @string = split("\n", $string); foreach my $line (@string) { chomp $line; next if $line !~ /\w+/; next if $line =~ /^\_+\/\_+\/\_+$/; say $line; }

    .. which is in-complete and with a leaning toothpick eye-sore in the second regex, but gets one to the point of only having to deal with the text.

    Output is as follows:

    Use of uninitialized value $cstylename in string ne at /home/wjw/perl5 +/perlbrew/perls/perl-5.20.0/lib/site_perl/5.20.0/RTF/Control.pm line +1489, <GEN1> line 16. Document #: 000 Version #: 1 Document Owner: Someone Date of Last Update: 10/15/2013 Written by: Someone Status: Approved General Description Purpose Definitions Procedures Sign-Off Approvals Jane Doe, CEO Date John Doe, CFO Date

    I did not dive into the warning issued. Note that the install of the referred to module actually depends on two others, one of which I think you were using already.

    cpanm RTF::TEXT::Converter --> Working on RTF::TEXT::Converter Fetching http://www.cpan.org/authors/id/S/SA/SARGIE/RTF-Parser-1.12.ta +r.gz ... OK Configuring RTF-Parser-1.12 ... OK ==> Found dependencies: RTF::Tokenizer --> Working on RTF::Tokenizer Fetching http://www.cpan.org/authors/id/S/SA/SARGIE/RTF-Tokenizer-1.18 +.tar.gz ... OK Configuring RTF-Tokenizer-1.18 ... OK Building and testing RTF-Tokenizer-1.18 ... OK Successfully installed RTF-Tokenizer-1.18 Building and testing RTF-Parser-1.12 ... OK Successfully installed RTF-Parser-1.12 2 distributions installed

    ...the majority is always wrong, and always the last to know about it...

    Insanity: Doing the same thing over and over again and expecting different results...

    A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a facct

      Hello wjw,

      OMG, everyday that I keep reading posts here I discover so many new modules that Perl has that I can not even keep up. :)

      It looks as possible solution though, thanks for sharing.

      Seeking for Perl wisdom...on the process...not there...yet!
      I got it to work thanks to everyone's help.
      I placed a # in front of "use warnings" in C:\Perl64\site\lib\RTF\Control.pm so I would not see a screen of warning about an uninitialized value in that file.
      I have 3 warnings in the print statement for using of an initialized value.
      Thank you very much!

      Test code:
      #!/usr/bin/perl use Modern::Perl; use RTF::TEXT::Converter; my $string; my $object = RTF::TEXT::Converter->new(output => \$string); $object->parse_stream( '..\Policies\Test\000144028.rtf' ); chomp $string; my @string = split("\n", $string); my $number_of_sigs = 0; foreach my $line (@string) { chomp $line; if ($line =~ m/Date of Last Update:(.*)/){ my $date = $1; $date =~ s/^\s+//; my @date_of_last_update = split (/ /, $date); print "Date of last update is: $date_of_last_update[0]\n"; } if ($line =~ m/Document #:(.*)/){ print "\nDocument number is $1\n"; } # Sign-Off Approvals if ($line =~/_____\/_____\/_____/){ # print $line; $number_of_sigs ++; } } print "\nnumber of signatures is $number_of_sigs \n";

      Finished Code:
      use strict; use warnings; use HTTP::Date; use File::stat; use Time::localtime; use Modern::Perl; use RTF::TEXT::Converter; my $fullpath = my $dir = 'C:\Policies\\'; my %hash; my %months = ( "Jan" => "1", "Feb" => "2", "Mar" => "3", "Apr" => "4", "May" => "5", "Jun" => "6", "Jul" => "7", "Aug" => "8", "Sep" => "9", "Oct" => "10", "Nov" => "11", "Dec" => "12" ); #print "full path is $fullpath \n"; opendir (DIR, $dir) or die $!; #print "dir is $dir \n"; while (my $dir = readdir(DIR)) { my $file = $fullpath = 'C:\Policies\\' . $dir; opendir (FILE, $file) or die $!; while (my $file = readdir(FILE)){ # print "7 reading $file\n"; if (($file =~/RTF/ || $file =~/rtf/ )&& $file !~/^~/ ){ my $path_file = $fullpath . '\\'. $file; #print '\nfullpath = ' . $fullpath . '\\' . $file . '\ +n'; my $string; my $object = RTF::TEXT::Converter->new(output => \$str +ing); $object->parse_stream( $path_file ); # print ctime(stat($fullpath . '\\' . $file)->mtime); my ($wday, $mon, $day, $time, $yr ) = split (/\s+/, ct +ime(stat($fullpath . '\\' . $file)->mtime) ); # split on space # print "\n$wday, $mon, $day, $time, $yr \n"; # print "file is $file my date is $months{$mon}/$day/$ +yr\n\n" ; $hash{$dir}{$file}{UPLOAD_DATE} = $months{$mon} . "/" +. $day . "/" .$yr; chomp $string; my @string = split("\n", $string); my $line_num =0; foreach my $line (@string) { $line_num ++; #print "Reading line number $line_num in $file\ +n"; chomp $line; if ($line =~ m/Date of Last Update:(.*)/){ my $date = $1; $date =~ s/^\s+//; my @last_update_date = split (/ /, $date); #print "Date of last update is: $last_update_da +te[0] line number $line_num\n"; $hash{$dir}{$file}{LAST_UPDATE_DATE} = $last_up +date_date[0]; } # print "5 line is $line \n"; if ($line =~ m/Document #:(.*)/){ # not needed bec +ause the file name is the policy number #print "Document number is $1 line number $lin +e_num\n"; my $Policy_number = $1; $Policy_number =~ s/\t+/ /g; # replace all tab +s with spaces $Policy_number =~ s/^\s+//; # remove leading s +paces my @arr = split (/ /, $Policy_number ); $hash{$dir}{$file}{DOC_NUM_AND_VERSION} = $arr +[0]; # print "array is @arr and element 0 is $arr[0 +]\n"; #print "line is $line \n"; #print "1Path is $fullpath " . "\\" . "$file\n +"; } # Sign-Off Approvals if ($line =~/_____\/_____\/_____/){ #print $line; $hash{$dir}{$file}{NUM_OF_SIGS} ++; } } #print "$hash{$dir}{$file}{NUM_OF_SIGS}\n"; } } #print "dir is $dir\n"; } closedir(DIR); closedir(FILE); print_data (\%hash); sub print_data { my ($h_ref) = @_; open (OUT, '> C:\Dev\Policy_upload_dates.csv') or die ("Can't +open the output file $!"); print OUT "Department,File_Name,Date_of_Last_Update,Upload_Dat +e,Policy_Number,Number_of_Signatures\n"; foreach my $dir (keys %$h_ref){ foreach my $file (keys $h_ref->{$dir}){ printf OUT "%s,%s, %s,%s, %s,%s\n", $dir, $file, $h_ref->{$dir}{$file}{LAST_UPDATE_DATE}, $h_ref->{$dir}{$file}{UPLOAD_DATE}, $h_ref->{$dir}{$file}{DOC_NUM_AND_VERSION}, $h_ref->{$dir}{$file}{NUM_OF_SIGS} ; } } }

      One of many input files:
      COMPUTER APPLICATION SELECTION Policy and Procedure Document #: 123.456.789 Version #: 6 Document Owner: Date of Last Update: 04/22/2003 Written by: INFORMATION SYSTEMS Status: Approved General Description Purpose To establish guidelines for computer application selection. Policy GENERAL INFORMATION RESPONSIBILITY Procedure A. Pages of text … DEAPRTMENT – INFORMATION TECHNOLOGY Document Control Revision History Ver. 1 02/01/1996 INITIAL Sign-Off Approvals The person responsible for this document must verify accuracy and that + the steps for this procedure or work instruction have been tested an +d validated. After you have approved this document, please sign and +date below. ________________________________________________________________ _____ +/_____/_____ ________________________________________________________________ _____ +/_____/_____

      Output:
      Department,File_Name,Date_of_Last_Update,Upload_Date,Policy_Number,Num +ber_of_Signatures Test,000777012.rtf, 08/23/2007,3/29/2010, 000.777.012,3 Test,000777034.rtf, 3/27/2013,6/5/2013, 000.777.034,3 Test,000777056.rtf, 05/10/2013,6/4/2013, 000.777.056,3 Test,000777078.rtf, 3/28/2013,6/13/2013, 000.777.078,3
Re: How can I find a line in a RTF file?
by james28909 (Deacon) on Aug 08, 2014 at 01:01 UTC
    I dont know if this will be of much help, but i worked for me and prints "Date of Last Update: 10/15/2013".

    But but you may have to do a work around unless the dates are always displayed as "01/01/2014" (10 characters), if its "1/1/2014" (8 characters)then you will have to modify the pattern matches position like "@-"+1, and then read one less character as well. which should print " Date of Last Update: 1/1/2014"
    use File::Slurp; my $file = read_file( "file.rtf" ); $match = "Date of Last Update: "; if ($file =~ /$match/){ open my $file, "<", "file.rtf"; my $startpos = "@-"+2; #if date is 8 chars then change to "@-"+1 seek $file, $startpos, 0; read $file, my $calender_date, 31; #if date is 8 chars change to 30 print $calender_date; }
    And im not sure about the signatures, but hopefully the above will be helpful :)

    You can also return the position in which the match happens with @- for beginning of the match or @+ for the end of the match.


    nevermind the above <.<
    here is working code that will get the exact info you want.
    use File::Slurp; my $file = read_file( "1.rtf" ); $match0 = "Date of Last Update: "; $match1 = "Sign-Off Approvals"; $match2 = "Jane Doe, CEO"; if ($file =~ /$match0/){ open my $file, "<", "1.rtf"; my $startpos0 = "@-"; seek $file, $startpos0 +2, 0; read $file, my $calender_date, 29; if ($calender_date =~ /\\/){ seek $file, $startpos0 +2, 0; read $file, my $calender_date, 31; print "\n"; print "$calender_date\n\n"; } else { print "\n"; print "$calender_date\n\n"; } } if ($file =~ /$match1/){ open my $file, "<", "1.rtf"; my $startpos1 = "@+"; seek $file, $startpos1 +42, 0; read $file, my $first_sig, 66; print "$first_sig\n"; print "Jane Doe\n\n"; } if ($file =~ /$match2/){ open my $file, "<", "1.rtf"; my $startpos2 = "@+"; seek $file, $startpos2 +82, 0; read $file, my $second_sig, 67; print "$second_sig\n"; print "John Doe\n\n"; } prints: C:\Users\guy\Desktop\tests>test.pl Date of Last Update: 10/15/2013 ______________________________________________________/_____/_____ Jane Doe ______________________________________________________/_____/_____ John Doe
    This simply just pattern matches then seeks from the match.

    You should also be able to do something like
    if (first_sig = "_____________________________________________________ +_/_____/_____"){ print "There is no first signature"; } else{ print "Document has a first signature"; } if (second_sig = "____________________________________________________ +__/_____/_____"){ print "There is no second signature\n"; } else{ print "Document has a second signature\n"; }
    this also includes if your date is only 8 characters eg: 1/1/2013. if not, then dont worry about it.
    this is also based off of the example you provided and will only work if every rtf file your editing is formatted identical to it. also fyi, if you open this or any document in a hex editor, you will see the formatting of the file, and can manually parse it to get whatever data you want.
      Thanks. I almost have this working. I dont know how the index works in the code above. I need to do more searching for @- and @+.

      I'm scanning the documents on a windows OS because the dates will change if I transfer to linux. the documents might also have several pages of text between the "Date of Last Update" and signatures. I'm on a team of a few people trying to determine why policies take so long to get signed. I have not used perl extensively for several years. Thanks for your help. I'll update the page with the code and 2 sample files when I am done.

      #!/usr/bin/perl use Modern::Perl; use RTF::TEXT::Converter; my $string; my $object = RTF::TEXT::Converter->new(output => \$string); $object->parse_stream( '..\Policies\Test\000144027.rtf' ); chomp $string; my @string = split("\n", $string); my $number_of_sigs = 0; foreach my $line (@string) { chomp $line; if ($line =~ m/Date of Last Update:(.*)/){ my $date_of_last_update = $1; $date_of_last_update =~ s/^\s+//; print "Date of last update is: $date_of_last_update\n"; } if ($line =~ m/Document #:(.*)/){ print "\nDocument number is $1\n"; } # Sign-Off Approvals if ($line =~/'Sign-Off Approvals'/){ print $line; $number_of_sigs ++; } # next; #next if $line !~ /\w+/; #next if $line =~ /^\_+\/\_+\/\_+$/; #say $line; } print "number of signatures is $number_of_sigs \n";
        honestly it shouldnt matter if there are 400 pages between dat and signature. as long as they are all in the same document and there is never another instance of it. also with the code i posted, it shouldnt matter if the dates change at all. they can be 01/01/0001, or 1/1/0001, you can add an extra elsif and add to include if dat is 1/1/01 as well
        and its just a simple pattern match then it will seek so many characters at the beginning or the end of the match. i put in the code i posted @- and @+ to show you how its used. if you want to see how it works, open the file in HxD editor and look on the text side (right side) and compare with the script.
      Wow. Thanks for all of the replies. I was offline for most of the day but I'll read / try these and let you know how it worked! Thanks so much. Kevin
Re: How can I find a line in a RTF file?
by thanos1983 (Parson) on Aug 08, 2014 at 01:43 UTC

    Hello Kevyt,

    I am not really familiar with RTF Files but I created a few solutions that I think is exactly what you need.

    I have tested them on a file test.rtf that I create. Based on what I see all solutions work fine.

    Upsate

    Forgot to wright, never forget to use the or die (function) when opening and closing files. They have saved me several times.

    Small code modification updates
    #!/usr/bin/perl use strict; use warnings; sub regex { my $file = 'test.rtf'; my $substr = 'Date of Last Update:'; open( my $in ,"<", $file) or die "Can not open file: ".$file.": $!\n"; while ( <$in> ) { chomp($_); # Solution: 1 substring # if (index($_, $substr) != -1) { # Solution: 2 regex string{times,} at least string 1s # if( $_ =~ /(?:$substr){1,}/) { # Solution: 3 regex match string or more times: + # if( $_ =~ /(?:$substr)+/ ) { # Solution: 4 quotemeta match string if( $_ =~ /\Q$substr\E/ ) { chop($_); # I use chop to remove the last trailing character ( +}) print $_ . "\n"; push(@_,$_); next; } } # Count the number of signature lines print "I found the string: ".@_." time(s)!\n"; close ($in) or die "Can not close file: ".$file.": $!\n"; } regex(); sub my_grep { # Solution 5: grep in perl!!!! (My favorite) my $file = 'test.rtf'; my $substr = 'Date of Last Update:'; open(my $in ,"<", $file) or die "Can not open file: ".$file.": $!\n"; @_ = <$in>; chomp @_; my @out = grep { $_ =~ /Date of Last Update:/ } @_; print "I found the string: ".@out." time(s)!\n"; close ($in) or die "Can not close file: ".$file.": $!\n"; } # &my_grep();

    I hope this solves your problems.

    Seeking for Perl wisdom...on the process...not there...yet!

      I'm absent-minded, so instead of trying to remember to add a "die" to every operation that might need it, I just keep "use autodie;" in the standard boilerplate I put at the top of every script and module I write. For my purposes it does the trick fine.

      The autodie module and pragma on CPAN

      A couple of suggestions for your code:

      1. Although it works, open(IN, ,"<", $file) has an extra comma in it. Also lexical filehandles are generally considered to be better (e.g. open my $in, '<', $file or die $!;)
      2. You use the array @_ in your subs for storing data. While that's possible, @_ is generally only used to access the arguments passed into a sub, and using it the way you're doing is very likely to confuse others working with your code.
      3. Although I'm not sure what your intentions were, I suspect the regex /Date of Last Update:+/ is not doing what you expect: It'll match the string "Date of Last Update" followed by one or more colons. If you want to match the string itself multiple times, you need a group: /(?:Date of Last Update:)+/, although I'm not sure if the input file will ever contain the string "Date of Last Update:Date of Last Update:". Also, {1,} is equivalent to +.
      4. You can use your variable $substr in your regular expressions, e.g. /\Q$substr\E/ (for the meaning of \Q...\E see quotemeta).
      5. Naming a sub the same as a Perl keyword, in this case "grep", is generally a bad idea - the only exception is when you're actually trying to replace the built-in grep - because it will create much confusion as to which function is supposed to be called when, both for the readers of your code and for Perl. This confusion is why you had to call your "grep" as &grep();. Also, inside your "grep", you're calling Perl's grep, any slight mistake in syntax may call your "grep" instead and create infinite recursion.
      6. A small hint: grep { $_ =~ /Date of Last Update:/ } can be written as grep { /Date of Last Update:/ } and "... file: ".$file.": $!\n"; can be written as "... file: $file: $!\n";
      7. On style: If you find yourself using $_ a lot, over multiple lines, then usually it's better to use a lexical variable instead. $_ is global and can (accidentally) be manipulated by other code, especially code that someone else inserts later. You can use something like while (my $line = <$in>) { instead.

        Hello Anonymous Monk,

        Well I am not an expert I consider my self just a beginner, so any suggestions for improvement are always welcome. :D

        Thank you for your time and effort to write these comments. :D

        Seeking for Perl wisdom...on the process...not there...yet!
Re: How can I find a line in a RTF file?
by bulrush (Scribe) on Aug 09, 2014 at 12:05 UTC
    Why can't you save this as a text file and search the text file? Any line breaks should come out properly in the text file. One purpose of RTF is to mark where line breaks are, among other things.
    Perl 5.8.8 on Redhat Linux RHEL 5.5.56 (64-bit)
      There are over 1000 files.

        ... So, automate.

        A few lines (using Win::OLE or even a system call to whatever RTF-capable editor you have) to begin your processing with the "save as text" operation shouldn't add more than a trivial coding job, and should incur only a small time penalty.


        check Ln42!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1096681]
Approved by Laurent_R
Front-paged by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2024-03-28 08:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found