PerlMonks  

Faster file read, text search and replace

by sabas (Novice)
on Feb 13, 2018 at 22:16 UTC ( #1209090=perlquestion )
sabas has asked for the wisdom of the Perl Monks concerning the following question:

I have a large XML file with 946,388,628 lines. I created a simple .pl script to read and count the lines, but it took very long just to read the whole file, with no logic added beyond reading each line and counting it. Is there a way I can speed up the process in PERL? (I am new to PERL.) I am planning to search for a certain "old string" in each line and replace it with the "new string".

$ARGV[0] or die "ERROR: No file for 114 lines";
$ARGV[1] or die "ERROR: No file for 114 lines";
open my $bigfile, "<", $ARGV[0] or die "ERROR: Could not open big file $ARGV[0]: $!";
open my $outfile, ">", $ARGV[1] or die "Error: Could not open output file $ARGV[0]: $!";
my $datestring = localtime();
print $outfile "Processing started...at $datestring\n";
print "Processing started...at $datestring\n";
my $lctr = 0;
while (my $line = <$bigfile>) {
    chomp $line;
    $lctr++;
}
$datestring = localtime();
print $outfile "Processing Ended...at $datestring\n";
print $outfile "Total Lines read in $ARGV[0] = $lctr";
close $bigfile;
close $outfile;
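For the planned substitution step, a minimal sketch of the line-by-line approach (OLD_STRING and NEW_STRING are placeholders for the actual values, which the post does not give):

```perl
use strict;
use warnings;

# Sketch of the planned edit: copy input to output, replacing a literal
# string on every line. OLD_STRING/NEW_STRING are placeholders.
sub replace_in_line {
    my ($line) = @_;
    $line =~ s/\QOLD_STRING\E/NEW_STRING/g;   # \Q...\E treats the pattern as literal text
    return $line;
}

if (@ARGV == 2) {
    open my $in,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $out, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    print $out replace_in_line($_) while <$in>;
    close $in;
    close $out or die "Cannot close $ARGV[1]: $!";
}
```

The `\Q...\E` escape matters if the old string contains regex metacharacters such as `.` or `[`.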

Replies are listed 'Best First'.
Re: Faster file read, text search and replace
by hippo (Canon) on Feb 13, 2018 at 23:06 UTC
    Is there a way I can speed up the process in PERL

    Without changing the approach you can speed it up by losing the unused $line scalar and the pointless chomp. Change that loop to:

    while (<$bigfile>) { $lctr++; }

    That should buy you a few percent. Beyond that it would be better not to process the file line by line but rather block by block with a variable (i.e. tunable) block size. Maybe start with 16 MB or so. Then just count the newlines in each block once it is in memory.
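A sketch of the block-by-block counting approach described above (the 16 MB default and the `count_lines` name are illustrative choices, not anything from the thread):

```perl
use strict;
use warnings;

# Count newlines block-by-block instead of line-by-line.
sub count_lines {
    my ($path, $blocksize) = @_;
    $blocksize ||= 16 * 1024 * 1024;           # 16 MB default, tunable
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (read $fh, my $buf, $blocksize) {
        $count += ($buf =~ tr/\n//);           # tr/// in scalar context counts matches
    }
    close $fh;
    return $count;
}

print count_lines($ARGV[0]), "\n" if @ARGV;
```

Reading with `:raw` avoids any line-ending translation, and `tr/\n//` counts without building a list of lines.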

    BTW, did you spot the bug on this line?

    open my $outfile,">",$ARGV[1] or die "Error: Could not open output file $ARGV[0]:$!";
      < yes, I saw the bug: $ARGV[0] should be $ARGV[1] >
Re: Faster file read, text search and replace
by NetWallah (Canon) on Feb 14, 2018 at 04:56 UTC
    An XML file larger than ~ 500 MB is indicative of a poorly designed application system.

    The reason is that typically, XML files are serialized/processed after reading them into memory, and at over 500M, memory demands start to enter the region where they need special treatment for resource allocation.

    Consider loading the XML file into a database that can manage memory much better, while providing structured access.

    Something like this sqlite UI with an XML plug-in could help.

                    Python is a racist language what with its dependence on white space!

      While I agree about the poorly designed system, reading whole XMLs into memory is more often than not poor design as well. Whether the file is huge (already) or not, if you do not have to, do not load the whole file into a huge maze of interconnected objects, but rather process it in chunks. XML::Twig or XML::Rules make that fairly easy to do.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.
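A sketch of chunk-wise processing with XML::Twig along the lines described above (the `record` element name and the OLD_STRING/NEW_STRING values are placeholders; the real element name depends on the actual document):

```perl
use strict;
use warnings;
use XML::Twig;

# Process one <record> element at a time, then flush, so only a small
# part of the document is ever in memory. "record" is a placeholder
# element name; OLD_STRING/NEW_STRING are placeholder values.
my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ($t, $elt) = @_;
            $elt->subs_text(qr/OLD_STRING/, 'NEW_STRING');
            $t->flush;    # print everything handled so far and free it
        },
    },
);

if (@ARGV) {
    $twig->parsefile($ARGV[0]);   # edited XML goes to STDOUT
    $twig->flush;                 # print whatever remains after the last record
}
```

Because the substitution runs on parsed text nodes, entities and attribute values are not accidentally mangled the way a raw regex over the file could mangle them.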

Re: Faster file read, text search and replace
by Jenda (Abbot) on Feb 14, 2018 at 12:18 UTC

    It's Perl (the language) or perl (the "interpreter"), not PERL. And no, there's nothing on the Perl side that can make this quicker. The IO costs will greatly outweigh anything you can do on the Perl side. The data should not be in the XML format. It's one of the least space-efficient ways to store data, and when reading and writing are involved, space equals speed.

    If you can't change the way you store the data, you might at least store it compressed and then decompress as you read and compress as you write. While it will mean more work for the CPU, the IO costs ought to be much lower. See PerlIO::gzip and PerlIO::via::Bzip2.
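A sketch of the compressed-IO idea using PerlIO::gzip (the file handling mirrors the OP's script; the OLD_STRING/NEW_STRING substitution is a placeholder):

```perl
use strict;
use warnings;
use PerlIO::gzip;   # CPAN module: adds the :gzip IO layer

# Read gzip-compressed input and write gzip-compressed output,
# trading CPU time for much less disk IO.
if (@ARGV == 2) {
    open my $in,  '<:gzip', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $out, '>:gzip', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    while (my $line = <$in>) {
        $line =~ s/\QOLD_STRING\E/NEW_STRING/g;   # placeholder edit
        print $out $line;
    }
    close $in;
    close $out or die "Cannot close $ARGV[1]: $!";
}
```

The rest of the script is unchanged; only the IO layer differs, so the same line-by-line loop decompresses on read and compresses on write transparently.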

    Also ... making changes to an XML file without using a module that actually understands the format is dangerous. Sooner or later you'll run into problems with encoding, entities or comments. I'm not saying you may never ever do it ... if it's a one-time transformation of a known XML and the changes are simple enough, go ahead ... but do be careful.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Faster file read, text search and replace
by haukex (Abbot) on Feb 13, 2018 at 22:28 UTC

    Could you tell us a bit more, like how long did it take to process this file, and exactly what the "old string" and "new string" are?

Re: Faster file read, text search and replace
by Cristoforo (Curate) on Feb 14, 2018 at 20:29 UTC
    An example that edits an XML file was recently discussed here.
