PerlMonks  

Simple Text Manipulation

by pseingalt (Initiate)
on Feb 05, 2011 at 09:28 UTC (#886378=perlquestion)
pseingalt has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a very long text file with lines similar to the following throughout the text:

Article 1 News

Article 2 Older News

Article 3 Even Older News

Article 4 Real Old News

Etc.

I'd like to change this so that the file reads:

\subsection*{Article 1 News}

\subsection*{Article 2 Older News}

\subsection*{Article 3 Even Older News}

\subsection*{Article 4 Real Old News}

I can't use a simple search and replace because the Article numbers change. Can Perl be used to do this?

Thanks in advance.

Re: Simple Text Manipulation
by Corion (Pope) on Feb 05, 2011 at 09:37 UTC

    You don't show what you've tried so far, so I'll give you a general description of how I'd approach the problem. The problem can be split up into three parts:

    1. Loop through each line of the file
    2. Check whether the current line matches your target ("starts with 'Article'")
    3. Output the reformatted line

    The code for each part also is relatively self-contained:

    1.  #!perl -w
        use strict;
        # loop through the file
        while (<>) { print };

    2.  if (/^Article\b/) {
            s/\s*$//; # remove trailing whitespace and newlines
            print "Line [$_] is a heading.";
        };

    3.  $_ = "Article 1 News";
        print "\\subsection*{$_}";

    You'll have to assemble these parts into a coherent whole and/or ask questions about the parts that are still unclear to you. If you chose a different approach, it will help us to help you better if you show your approach.
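For readers following along, here is one way the three parts above might be assembled. It is a minimal sketch, not Corion's own code: the helper name format_line and the sample text are made up for illustration, and an in-memory filehandle stands in for the real file so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap a heading line in \subsection*{...}; return any other line unchanged.
# format_line is a hypothetical helper name, not something from the thread.
sub format_line {
    my ($line) = @_;
    if ($line =~ /^Article\b/) {          # part 2: is this line a heading?
        $line =~ s/\s*$//;                # remove trailing whitespace and newline
        return "\\subsection*{$line}\n";  # part 3: the reformatted line
    }
    return $line;
}

# Sample data standing in for the real file (an assumption for this sketch).
my $sample = "Article 1 News\nSome body text.\nArticle 2 Older News\n";
open my $in, '<', \$sample or die "Cannot open sample: $!";

my $output = '';
while (my $line = <$in>) {                # part 1: loop through each line
    $output .= format_line($line);
}
close $in;

print $output;
```

With a real file, the in-memory handle would presumably be replaced by while (<>) so that "perl script.pl file1.txt > out.tex" reads the file named on the command line.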

Re: Simple Text Manipulation
by JavaFan (Canon) on Feb 05, 2011 at 11:10 UTC
    s/^(Article\s+[0-9]+\s+\N*\S)/\\subsection*{$1}/gm;

      This is the response I get:

      Missing braces on \N{} at article.pl line 2, within pattern
      Nested quantifiers in regex; marked by <-- HERE in m/^(Article\s+0-9+\s+* <-- HERE \S)/ at article.pl line 2.

        Upgrade your Perl.

        What JavaFan means is that \N (without curly braces) in Perl 5.12+ is an experimental backslash sequence that matches any character except a newline. The negated character class [^\n] does the same in pre-5.12 Perl (but I haven't tested JavaFan's actual regex). (See Backslashed sequences (5.12.0), fourth paragraph in the Whitespace subsection, for discussion.)
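As a rough illustration of the [^\n] replacement mentioned above, the sketch below applies JavaFan's substitution with \N swapped for [^\n]. The sample text is invented, and this is only one possible rewrite for older Perls, not something JavaFan posted.

```perl
use strict;
use warnings;

# Invented sample text: two headings (one with trailing spaces) and one other line.
my $text = "Article 1 News\nunrelated line\nArticle 2 Older News  \n";

# [^\n] stands in for \N, so the pattern also compiles on Perl older than 5.12.
# /m lets ^ match at the start of every line; /g repeats the substitution.
(my $converted = $text) =~ s/^(Article\s+[0-9]+\s+[^\n]*\S)/\\subsection*{$1}/gm;

print $converted;
```

Because the capture has to end on \S, trailing spaces on a heading line stay outside the closing brace.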

Re: Simple Text Manipulation
by Albannach (Prior) on Feb 05, 2011 at 15:21 UTC
    I can't use a simple search and replace because the Article numbers change. Can Perl be used to do this?

    Well, fortunately Perl doesn't restrict you to simple search and replace; instead you have arguably the most powerful regular expression engine in the known universe! Perl regexes eat changing numbers for breakfast whilst blindfolded with all eight limbs tied behind their backs. Some suggestions to add to Corion's sage advice:

    • figure out exactly how you identify the lines you want to change. Corion had to assume they were all starting with Article since you weren't specific about that
    • decide exactly how the lines may be constructed: are they always just letters and spaces, or may they have numerals, punctuation, carriage returns?

    As a trivial solution I offer this simple modification of JavaFan's solution which you would apply line by line as you read through the file, but note it will not work if you have a different specification for the lines. I prefer to be as specific as I can in regular expressions, so know your data.

    s/^(Article\s+[0-9]+\s+[\w\s]+$)/\\subsection*{$1}/;
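A small sketch of applying that substitution line by line may help show how strict it is. The helper name convert_heading and the sample lines are hypothetical. One detail worth checking: \s also matches a newline, so if the line is not chomped first, the capture can swallow the line ending and the closing brace lands after it.

```perl
use strict;
use warnings;

# Apply the substitution to a single line.
# convert_heading is a hypothetical name used only for this sketch.
sub convert_heading {
    my ($line) = @_;
    chomp $line;    # without this, [\w\s]+ would capture the trailing newline too
    $line =~ s/^(Article\s+[0-9]+\s+[\w\s]+$)/\\subsection*{$1}/;
    return "$line\n";
}

# Invented sample lines; the last has punctuation that [\w\s] does not accept.
my @lines = (
    "Article 3 Even Older News\n",
    "Not a heading\n",
    "Article 4 Real Old News!\n",
);

print convert_heading($_) for @lines;
```

The last sample line is left unchanged because of the exclamation mark, which is the "know your data" point: the character class has to cover everything a real heading can contain.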

    --
    I'd like to be able to assign to an luser

      I must be doing something wrong. When I type:

      $ perl script.pl file1.txt

      I'm returned to the command prompt. I'm using Mac OS, if that makes any difference.

      Here's the script. I added a colon because the word "Article", which always begins the line, is followed immediately by a colon. Using the syntax "$ perl script.pl test.tex -w", I'm returned to the command line and the file is not processed.

      #!/usr/bin/perl

      s/^(Article\s+{0-9}+{:}+\s+{\w\s}+$)/\\subsection*{$1}/;

      (I've replaced brackets [] with braces {} since they don't seem to show up here.)

        I gave you three steps you need to implement for your program. Which of these steps is your program supposed to implement?

        As an aside, if you use <code>...</code> tags around your code (and data), as suggested when you compose a node, your code will render as code, without HTML or bracket interpolation.

        I think you need to read our responses more carefully. Corion gave you most of the answer, and I helped you with the substitution, noting in my response: "...which you would apply line by line as you read through the file". If you don't understand what Corion wrote, then ask a specific question, but please don't just ignore the advice. As you seem to need a little more fundamental guidance, I'll say this: Perl will let you do what you want in many possible ways, but with the basic approach you appear to have chosen, running code in a file (your script.pl) from the command line, that code will have to:

        • open the input file (tip: check out the "null filehandle" in perlop)
        • read lines one by one,
        • apply the substitution regex you want, and
        • write the result somewhere. Your output could go to the console, pipe to a file, get written to a file you opened in your code, or you could even investigate Perl's handy feature for editing a file in place via the -i command line option (read up on this in perlrun).
        All your code (and please do use code tags - see Writeup Formatting Tips just above the editing box when you're composing your message) does is apply a substitution to nothing and ends, so of course you get no result.

        Quick tips on your regex modification: The colon has no special meaning so you need not put it in square brackets. Functionally there is no difference, but using unnecessary characters makes it harder for others to easily pick out what you mean to do. Also, unless you want to accept cases of more than one colon, don't put a + after the colon. Please read up on regular expressions; the + symbol is not used to assemble parts of a regex, it is a quantifier and means 'match one or more of the preceding element'.
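Putting the four steps and the regex tips together, a script might look roughly like the sketch below. It assumes the lines look like "Article 1: News" (a single colon right after the number, which is a guess based on the posted regex), and it uses in-memory filehandles and a hypothetical convert_line helper so that it runs without any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed line shape: "Article 1: News" (one colon, directly after the number).
# convert_line is a hypothetical helper name.
sub convert_line {
    my ($line) = @_;
    chomp $line;
    # A plain colon with no brackets and no + quantifier, as suggested above.
    $line =~ s/^(Article\s+[0-9]+:\s+[\w\s]+)$/\\subsection*{$1}/;
    return "$line\n";
}

# In-memory handles stand in for the real input and output files.
my $input  = "Article 1: News\nBody text, left alone.\nArticle 2: Older News\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";

while (my $line = <$in>) {             # read lines one by one
    print {$out} convert_line($line);  # apply the substitution, write the result
}
close $in;
close $out;

print $output;
```

With real files, the two in-memory handles would give way to while (my $line = <>) and a plain print, run as "perl script.pl test.tex > out.tex", or as "perl -i.bak script.pl test.tex" to edit in place and keep a backup. Switches such as -w belong before the script name; placed after it, they are passed to the script as arguments.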

        --
        I'd like to be able to assign to an luser

Re: Simple Text Manipulation
by 7stud (Deacon) on Feb 05, 2011 at 22:32 UTC

    Me too, check it out:

    $ perl my_macOSX_proggy.pl input.txt
    $

    Can anyone tell me what's wrong with my code?

      42

        Because of the way Perl counts lines in conditional blocks, the problem is actually on line 41!
