http://www.perlmonks.org?node_id=11102663


in reply to Re^3: String manipulation in a document
in thread Grep a particular text

Dear all, I have written the script below, which works fine for extracting the query text and storing it in a variable. After that, I need to place the values after two carriage returns, but I don't know how to do that.

Kindly suggest, friends!

use strict;
use warnings;

my $InputXmlFile  = $ARGV[0];
my $OutputXmlFile = $InputXmlFile;
my $OutProcesText = "";
$OutputXmlFile =~ s#\.tex$#\.tex#gsi;
my $workingpath = $1 if $InputXmlFile =~ m/(.*)\\(.+)\.tex/i;
my $filename    = $2 if $InputXmlFile =~ m/(.*)\\(.+)\.tex/i;

open (INTXT, "$InputXmlFile") || die ("Can't open the input file $InputXmlFile");
my $inText = join("", <INTXT>);
close (INTXT);

while ($inText =~ m#\\section\{(.*?)\}\n\n#gi) {
    my $qrtext = $1;
    if ($qrtext =~ m/\\qr\{(.*?)\}/) {
        my $qrFindText = $1;
        print $qrFindText;
    }
}

Re^5: String manipulation in a document
by haukex (Archbishop) on Jul 11, 2019 at 15:50 UTC

    Thank you for showing your code. To begin, several recommendations:

    • You have several unused variables that don't seem relevant to this example, such as $filename or $workingpath. For an SSCCE (Short, Self-Contained, Correct Example), it's better to remove them.
    • The variable name $InputXmlFile is kind of confusing, since it's a .tex file.
    • It's best to use proper indentation and formatting. perltidy can help with that.
    • Regarding your open, see "open" Best Practices: open my $fh, '<', $filename or die "$filename: $!";
    • Reading an entire file at once is better done with the following "slurp" idiom instead of the join: my $inText = do { local $/; <INTXT> };
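
    The last two points can be combined into one small helper. This is only a minimal sketch: the sub name slurp_file is hypothetical and chosen for illustration, and it assumes the file is small enough to hold in memory.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script):
# return the whole content of a file as a single string.
sub slurp_file {
    my ($filename) = @_;

    # Three-argument open with a lexical filehandle; $! holds the OS error.
    open my $fh, '<', $filename or die "$filename: $!";

    # Localizing $/ leaves it undefined inside the do-block,
    # so <$fh> reads up to end-of-file in one go.
    my $text = do { local $/; <$fh> };

    close $fh;
    return $text;
}
```

    With such a helper, the join line in the original script could become: my $inText = slurp_file($InputXmlFile);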

    Now moving on to the question of how to search and replace text. First of all, reading from input files and writing to output files is described in "Files and I/O" in perlintro - I would recommend reading the whole document, since it's not very long and it gives a good overview of Perl.

    As for the algorithm, there are several general approaches I'd consider:

    1. You should look for modules that are able to parse and write TeX/LaTeX.
    2. You could read the entire file into memory and then use regular expressions to process it, the disadvantage being that it won't work well for large files.
    3. You could read the file line by line while keeping track of the current state (e.g. whether \section{...\qr{...}} has been seen, and the text that still needs to be added back in), which is a "state machine" type approach. While very powerful and flexible, it can sometimes be a bit more verbose.
    4. Another possibility would be to read the file in "chunks" - in this case, your example input appears to have a blank line between each line you want to process. If that really is the case, you could use $/ to read the file in "paragraph mode" (sections are separated by one or more blank lines).
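
    As an illustration of the second approach (whole file in memory plus a regular expression), here is a hedged sketch. The sub name move_qr is hypothetical, and the code assumes that \qr{...} contains no nested braces, that the heading sits on one line, that the next paragraph starts right after the blank line(s), and that the insertion point is after the second word of that paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: move each \qr{...} out of its \section{...} heading
# and into the paragraph that follows the blank line(s).
sub move_qr {
    my ($text) = @_;

    my $heading = qr/\\section\{[^\n]*?/i;   # heading text before the \qr
    my $query   = qr/\\qr\{[^{}]*\}/i;       # the \qr{...} itself (assumes no nested braces)
    my $words   = qr/\S+[^\S\n]+\S+/;        # first two words of the next paragraph

    # $1 = heading start, $2 = the \qr{...} to move,
    # $3 = closing brace, blank line(s) and the first two words.
    $text =~ s/($heading)($query)(\}\n(?:[^\S\n]*\n)+$words)/$1$3$2/g;

    return $text;
}
```

    The string passed to move_qr would be the slurped file content ($inText in the original script). As noted in the list above, this keeps the whole file in memory, so it is less suitable for very large files.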

    In the following, I'm using a combination of the third and fourth points, but note that this code makes quite a few assumptions about your input file format, which you haven't shown a lot of. In your real script, you'd have to replace DATA with the filehandle you opened.

    use warnings;
    use strict;

    local $/ = "";
    my $buffer;
    while (<DATA>) {
        if ( s/\\section\{.*?\K(\\qr\{.*?\})(?=\})//i ) {
            $buffer .= $1;
        }
        elsif (defined $buffer) {
            s/^\S+\s+\S+\K/$buffer/;
            undef $buffer;
        }
        print;
    }
    print $buffer if defined $buffer;
    __DATA__
    \section{Results\qr{text ... text}}

    Normal paragraph text here...

    \section{Funding\qr{text ... text}}

    Funding text here...

    Output:

    \section{Results}

    Normal paragraph\qr{text ... text} text here...

    \section{Funding}

    Funding text\qr{text ... text} here...

    I'm really not sure how you choose the insertion point for the \qr{} - in the original question, it seems you wanted it inserted after the fourth word in the paragraph, while in this example, it looks like you wanted it inserted after the second word.
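
    Since the insertion point is unclear, one option is to make the word count a parameter. The following is only a sketch under that assumption: the sub name insert_after_words is hypothetical, "word" here simply means a run of non-whitespace characters, and \K needs Perl 5.10 or newer.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $insert directly after the first $n
# whitespace-separated words of $para. A paragraph with fewer than
# $n words is returned unchanged.
sub insert_after_words {
    my ($para, $insert, $n) = @_;
    die "word count must be at least 1" if $n < 1;

    # After the first word, match ($n - 1) further "whitespace + word" groups.
    my $more = $n - 1;
    $para =~ s/^\S+(?:\s+\S+){$more}\K/$insert/;

    return $para;
}
```

    In the loop above, the elsif branch could then call something like $_ = insert_after_words($_, $buffer, 2); or pass 4 if the fourth word is the intended position.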