PerlMonks  

Massive File Editing

by Kage (Scribe)
on Dec 15, 2002 at 03:23 UTC (#219950, perlquestion)
Kage has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I need to parse through every file on my server that has the extension .shtml, find all instances of <a href="main.php?page=....">, and change the main.php?page= to /?id= ONLY if the main.php?page= is in an anchor href.

I have it down to the point of being nearly complete, but apparently substitution, transliteration, and the like only want to replace once.
sub changelinks {
    my ($file) = @_;
    if (-d $file) {
        opendir(DIR, $file) || die "$!";
        my @filenames = readdir(DIR);
        foreach my $filename (@filenames) {
            if (($filename ne "." && $filename ne "..")) {
                changelinks($file."/".$filename);
            }
        }
        closedir(DIR);
    } else {
        if ($file =~ /.shtml/i) {
            # And then whatever method of opening and editing each file goes here
        }
    }
}


Any ideas on how to do this?
A script is what you give the actors. A program is what you give the audience. ~ Larry Wall

Re: Massive File Editing
by Zaxo (Archbishop) on Dec 15, 2002 at 03:30 UTC
Re: Massive File Editing
by pg (Canon) on Dec 15, 2002 at 03:33 UTC
    Three suggestions:
    1. Use File::Find with File::Glob: one for descending into subdirectories, the other for matching .shtml files. Much less coding on your own.
    2. When you read a file, call read with a big buffer so that the whole .shtml file comes in with one call. This improves performance; handling it line by line is too slow. Memory is not an issue in your case (unless you have huge .shtml files). When you s///, use the /g modifier.
    3. Don't consider +< in this case, as your new content is shorter than the old one. (I am not saying you cannot use it, but using it here requires more coding effort.)
    use File::Find;
    use File::Glob ':glob';
    use strict;

    find(\&wanted, "c:/perl58/bin"); # replace with your directory

    sub wanted {
        if ((-d $File::Find::name) && ($_ ne ".") && ($_ ne "..")) {
            my @shtml_files = bsd_glob("*.shtml");
            foreach my $shtml_file (@shtml_files) {
                print "$shtml_file\n";
                open(SHTMLFILE, "<", $shtml_file) or die "Cannot read $shtml_file: $!";
                my $buffer;
                # read the whole file with one call; -s gives its size in bytes
                read(SHTMLFILE, $buffer, -s SHTMLFILE);
                close(SHTMLFILE);
                $buffer =~ s/<a href="main\.php\?page=(.*?)\"/<a href="\/?id=$1"/g;
                # write the modified content back over the original file
                open(SHTMLFILE, ">", $shtml_file) or die "Cannot write $shtml_file: $!";
                print SHTMLFILE $buffer;
                close(SHTMLFILE);
            }
        }
    }

      For each file or directory you find with File::Find, you're finding all of the files in its containing directory with File::Glob. You don't really need File::Glob.
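
The claim that File::Glob is not needed can be illustrated with a minimal sketch in which File::Find does the walking and a regex does the name matching. The helper name shtml_files is hypothetical, and no_chdir is used here only so that $_ holds the full path.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every *.shtml file below the given directories.
sub shtml_files {
    my @dirs = @_;
    my @found;
    find(
        {
            no_chdir => 1,    # keep $_ equal to the full path
            wanted   => sub {
                # plain files only, name must end in .shtml
                push @found, $_ if -f $_ && /\.shtml\z/i;
            },
        },
        @dirs
    );
    return @found;
}
```

Calling shtml_files('.') would then give a flat list of paths that any of the editing approaches in this thread could consume.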

        I know this. The purpose of using File::Glob here is to reduce coding effort, so you don't need to match file name patterns on your own. When File::Glob can do this for you, what is the point of reinventing it?

        I would agree that File::Glob is a waste if File::Find were improved to support patterns and to return only those entries whose names match a certain pattern (ideally a regexp). But why didn't the Perl community do that? The reason is simple: because File::Glob exists, there is no point in repeating the same functionality in another class.

        If you look at this from an OO view, it makes a lot of sense. Although File::Find and File::Glob might take care of different but tightly related tasks in some programs, they should still be abstracted as two different classes.
Re: Massive File Editing
by MarkM (Curate) on Dec 15, 2002 at 03:42 UTC

    Problem #1 : You are invoking opendir(DIR) recursively, meaning that closedir(DIR) will generate system errors on the way back up. Still, this should not present a problem, as you are invoking readdir(DIR) in list context before recursing, so your results should not be affected. Either move the closedir() immediately after the readdir(), or use a lexical (my $dir) instead of a global (DIR).

    Problem #2 : The regexp /.shtml/i is not anchored to the end of string (/.shtml$/i or /.shtml\z/i), and the '.' is not escaped to make it literal (/\.shtml$/i or /\.shtml\z/i). Meaning - any file that contains the string "shtml" after the first character will match. Again, this is not probable, and the problem would be that too many files were processed, instead of too few, so this is not likely to be your problem.
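
Both problems can be sketched together, assuming the recursive walk is kept rather than replaced by File::Find. The sub name find_shtml is hypothetical, and it returns the matching paths instead of editing them.

```perl
use strict;
use warnings;

# Hypothetical rewrite of the recursive walk: returns the .shtml files it finds.
sub find_shtml {
    my ($path) = @_;

    # Not a directory: keep it only if the name ends in a literal ".shtml".
    return ($path =~ /\.shtml\z/i ? $path : ()) unless -d $path;

    # A lexical handle is private to this call, so recursion cannot clobber it.
    opendir(my $dh, $path) or die "Cannot open directory $path: $!";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    return map { find_shtml("$path/$_") } @names;
}
```

The handle is closed before recursing, so only one directory handle is open at a time no matter how deep the tree is.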

    Sorry... I don't see any obvious errors other than the above two. What behaviour are you seeing that you find to be unexpected?

      Nothing unexpected, I just can't make any good way to replace every instance of an anchor href of "main.php?page=..." to "/?id=..." in every .shtml file.
      A script is what you give the actors. A program is what you give the audience. ~ Larry Wall

        The easiest solution (although not the most efficient) would be to invoke Perl from within Perl. For example, use the loops you already have defined to gather a set of path names into an array. Then, invoke:

        system('perl', '-i.bak', '-pe', 's[main\.php\?page=][/?id=]g', @pathnames) == 0
            or die "Subcommand failed with return code $?.\n";

        Try to make the regexp as accurate and complete as possible to avoid incorrect alterations. If you want a longer term solution that is a bit more efficient, see the perlrun manpage for an example of code that approximates the behaviour of -pi.bak.

        NOTE: If the above system() invocation does not work, try changing 'perl' to be $^X ($EXECUTABLE_NAME with 'use English'). If the regular expression becomes very complicated, it may be easier to store the command in a perl script, and use the sub-script name instead of -e '...' in the system() invocation.
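
A rough in-process equivalent of that system() call, for anyone who would rather not spawn a second perl: the sketch below localizes $^I (the variable behind -i) together with @ARGV and loops with the diamond operator. The sub name edit_in_place and the stricter anchor-only pattern are assumptions, not something from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: edit the given files in place, keeping "file.bak" copies.
sub edit_in_place {
    my @paths = @_;
    return unless @paths;    # with an empty @ARGV, <> would read STDIN

    # Localizing $^I and @ARGV limits the in-place behaviour to this sub.
    local ($^I, @ARGV) = ('.bak', @paths);
    local $_;
    while (<>) {
        # Only touch main.php?page= when it directly follows <a href="
        s{(<a\s+href=")main\.php\?page=}{$1/?id=}g;
        print;               # goes to the rewritten file, not STDOUT
    }
    return;
}
```

Because the substitution runs line by line, an anchor tag that is split across two lines would be missed; the tokenizer-based approach further down the thread does not have that limitation.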

        First open the file and read the content into a scalar.
        Then do a
        $fileContent =~ s/(<a href=")main\.php\?page=/$1\/?id=/g;
        And then write the new contents back to the same file (or to a new one with an added extension such as .new).
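
Those three steps can be sketched as one small sub. The name rewrite_links is hypothetical, and the sketch overwrites the original file without keeping a backup, so treat it as an illustration rather than something to run on a live site as is.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the links in one file, returning how many changed.
sub rewrite_links {
    my ($path) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";
    my $content = do { local $/; <$in> };    # undefined $/ reads the whole file
    close $in;

    # /g replaces every match; without it only the first link would change.
    my $count = ($content =~ s/(<a href=")main\.php\?page=/$1\/?id=/g) || 0;
    return 0 unless $count;                  # nothing to do, leave the file alone

    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
    return $count;
}
```

The /g modifier is what addresses the original complaint that the substitution "only wants to switch once".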
Re: Massive File Editing
by atcroft (Monsignor) on Dec 15, 2002 at 03:45 UTC

    Would this not be a case where File::Find would be appropriate to find those files you need? Perhaps something along the lines of:

    #
    # UNTESTED CODE
    #
    use File::Find;

    # If desired, change following to absolute path(s) to search
    my @directories = ('.');
    find(\&search_and_replace, @directories);

    sub search_and_replace {
        # Skip anything that is not a plain file whose name ends with '.shtml'
        return unless -f && m/\.shtml$/;

        # File::Find has chdir'ed into the file's directory, so $_ is the name to use
        my $file = $_;

        # Rename file so there is a backup
        rename($file, $file . '.bak') or die('Rename: ', $!, "\n");

        # Read from original (now .bak), writing to target
        open(INF, $file . '.bak') or die('Input: ', $!, "\n");
        open(OUTF, '>' . $file) or die('Output: ', $!, "\n");
        {
            # Localize $_ to prevent potential problems with call
            local($_);

            # Loop thru file, doing replacements
            while (my $line = <INF>) {
                $line =~ s!(href="?)main\.php\?page=!$1/?id=!g;
                print(OUTF $line);
            }
        }
        close(OUTF);
        close(INF);
    }

    Hope that the idea above at least helps.

    Update: Added comments for clarification.

Re: Massive File Editing
by Aristotle (Chancellor) on Dec 15, 2002 at 14:02 UTC
    Code lifted from the POD in File::Find::Rule and HTML::TokeParser::Simple and adapted. Not tested, so proceed with necessary caution, though I don't see any mistakes.
    #!/usr/bin/perl -w
    use strict;
    use File::Find::Rule;
    use HTML::TokeParser::Simple;

    print "Processing:\n";

    for my $file (File::Find::Rule->file->name('*.shtml')->in(@ARGV)) {
        print "\t$file,\n";
        rename $file, "$file.old"
            or die "Cannot rename $file to $file.old: $!";
        open my $fh, ">", $file
            or die "Cannot open $file for writing: $!";
        my $p = HTML::TokeParser::Simple->new("$file.old");
        while (my $token = $p->get_token) {
            print $fh +(
                $token->is_start_tag('a')
                    ? new_a_tag($token->return_attr, $token->return_attrseq)
                    : $token->as_is
            );
        }
        close $fh;
    }

    print "done.\n";

    sub new_a_tag {
        my ($attr, $attrseq) = @_;
        # rewrite the link target only when the tag actually has an href
        $attr->{href} =~ s<main\.php\?page=></?id=> if exists $attr->{href};
        map "<a $_>", join ' ', map qq<$_="$attr->{ $_ }">, @$attrseq;
    }
    Update: fixed some small left-overs from the POD version of the tokeparser code.

    Makeshifts last the longest.

Re: Massive File Editing
by Zapawork (Beadle) on Dec 16, 2002 at 05:28 UTC
    Hi Kage,
    I know the others are suggesting File::Find or some other module, which is a great idea if you can install modules on the end host. I recently had a similar problem where I could not easily install modules (long story) and had to write the function from scratch. What I changed is that I feed an ls -laR listing into the program to parse out the files I wanted, and then modified those files.

    Example:

    # Try to match the expected input line format (from "ls" output)
    #
    if ("$_" =~ /^\-.+ ([0-9]+) ([A-Z|a-z]+ [ ]?[0-9]+ [ ]?[:|0-9]+) (.+)$/) {

        # Set some defaults to avoid potentially problematic missing fields
        #
        $file1 = "FULLNAME";
        $file2 = "BASENAME";
        $fext = "NO EXTENSION";

        # Set file size, date and complete filename variables
        $fsize = $1;
        $date = $2;
        $file1 = $3;

        # Split the filename into base name and extension
        if ("$file1" =~ /^([\.]?.+)\.(.+)$/) {
            $file2 = $1;
            $fext = $2;
        }
    }
    Then for your example you would test whether the file extension is shtml and, if it is, open the file and read it. Whether to read the file in as one chunk or line by line really depends on two issues:

    1) How many times do you plan to run this? Let's be honest: if you're only going to run this once, you don't need a perfectly efficient piece of code, even though I hate to admit that.

    2) How many files there are, and the size of the files you'll be reading in.

    Then as you hit the line you could either do an s/// or just replace the contents of the substring. I like to cheat with a sanity test first, since substitute operations always seem to do bad things to my data.

    if ($_ =~ /<a href="main\.php\?page=/) { s{main\.php\?page=}{/?id=}g }
    After that, opening the files should be pretty straightforward (no more directory recursion, woo!). If you have problems with the output of the ls statement, you may have to embed the directory name. Other than that it should be pretty straightforward. I had to write this to deal with a terabyte file system in a lawsuit, so my solution may require more work than you are willing to deal with.

    Dave -- Saving the world one node at a time

      Bad advice.

      File::Find has been part of the core Perl distribution forever. If it isn't available on your host, it means their installation of Perl is incomplete. Complain to them and if they don't react, move to somewhere else. There is no excuse for not offering File::Find.

      I feed in a ls -laR listing
      How robust is your ls parsing pattern? And why not use find for the job of find? Something like the following does all you want, with minimal coding of your own.
      $ find . -type f -name '*.shtml' -print0 | xargs -0 ./myscript.pl
      Iterate over @ARGV using the diamond operator; it might even suffice to do something like
      $ find . -type f -name '*.shtml' -print0 | xargs -0 perl -i.old -pe's!main\.php\?page=!/?id=!g'
      See perldoc perlrun. Use the tools intended for your job to do your job, don't reinvent round wheels.

      Makeshifts last the longest.

        Hi Aristotle,

        Had no idea File::Find was part of the core distribution. The reason I did not use a find function in my example was I had to parse each file in the filesystem.

        Why? We had a 2 terabyte file system from a litigation that we needed to type, index, hash and store in a MySQL database. Then when we needed to find files of certain types, patterns, sizes or dates, we could query a hashed index instead of running find each time. I agree that if you are looking for a certain type of file this is not the best idea; however, in my situation it had to be done. If you know of a better way to do this, PPLLEEASSEE let me know.

        Not the best advice, but it worked for me.

        Dave -- Saving the world one node at a time

Node Type: perlquestion [id://219950]
Approved by talexb
Front-paged by wil