http://www.perlmonks.org?node_id=929613


in reply to How to process with huge data's

700,000 lines is a small input file ...OK, a medium sized input file, especially when each line is 6 characters. It should take a few seconds to read and process.

Using the default variable $_ is great when you have a few lines of code, but when you have a real program, I find it better to use real variables all the time. For one thing, when you extend the code, you may introduce inner or outer loops which re-assign $_, so you are forced to use real variables anyway. And in your case, you explicity specify the variable you are chomping, so there's no saving. And since you assign $_ to $inputtext, why not use that variable right from the while loopo:

while ( my $input_text = <INP> ) { ....

Regular expressions are great, when you need them, but they're not always the best solution. I would split the input when I receive it into it's separate components:

while ( my $line = <INP> ) { chomp $input; my ( $cmd, $fnum, $snum ) = split ' ', $line; if ( $cmd eq 'q' ) { $answered++ if contains_pair( $connectionText, $fnum, $snum ) or contains_pair( $connectionText, $snum, $fnum ); $queries++; } ...

Encapsulating a search into a routine is better than having it inline, because the name or the routine provides some documentation. As well, splitting a complicated search into a pair of simpler searches is much easier to understand, and can be faster, too.

I think what you're trying to do is record pairs of numbers$fnum,$snum, and separate the pairs by vertical pipes. This is part of what is slowing your program down. For each line of input, all 700,000 of them, each regular expression searches linearly through the connectionText string... making your program complexity N^2. Instead of storing it as a string, you should use an array of pairs, or a hash of pairs, or a two layer hash.

But if you WERE going to stick with a string, you can simplify the last three lines by checking for blank components on the simple pair, rather than checking the whole string ... after all, you know after each pass that the connectionText has no half-pairs, so you don't need to check the whole string, only the new addition:

# instead of ... else { $connectionText.=$fnum.",".$snum."|"; } $connectionText=~s/,,/,/g; $connectionText=~s/,\|/\|/g; # how about ... else { my $pair = join ',', grep { defined } ( $fnum, $snum ); $connectiontext .= $pair . '|'; }

Given a list, ($fnum, $snum), grep for the defined components, discarding the rest. (This is a place where default variables ARE useful!) join() the resulting list with commas ... that is place commas BETWEEN components. If there is only one component, there's no comma, so no mess to clean up. But in any case, have you encountered this situation, or is it merely fear that makes you handle missing parts. because, you know, you assign $fnum and $snum from regular expressions that match numerals. If there is no numberal, there would be no match, so else() test would fail, and you'd never enter the else block at all.

I don't think your "trailing pipe or end of string" search (\||$) will ever find anything but a trailing pipe, since you add it with each addition, and preserve it with the changes. So why not simply check for and leave in place the trailing pipe?

But it would be a WHOLE lot simpler to use data structures instead of strings. If you use an array of pairs, you can grep through the list of pairs to search for those that match:

push @allPairs, grep { defined } ( $fnum, $snum ); # and at another time .. # find half-pairs, which I think is impossible my @halves = grep { scalar @$_ == 1 && ( $_->[0] == $fnum || $_->[0] = += $snum ) } @allpairs; # find true pairs with a leading fnum my @matches = grep ( scalar @$_ == 2 && $_->[0] == $fnum } @allpairs # or a trailing fnum my @matches = grep ( scalar @$_ == 2 && $_->[1] == $fnum } @allpairs

Searching the entire array for each line is still an N^2 algorithmn, but at least cleaner than searching the string character by character. Using hashes would definitely be faster.

It seems you're adding things to what I called 'pairs', to form growning lists. So maybe you want a hash of hashes, or a hash that uses $fnum as the key, and has a n array of $snum as the associated value. If you need to be able to find the sets that include a particular $snum, you could also have similar reverse data structure, keyed on $snum with an array of $fnum that use that $snum. Lots of different ways to generate data structures, depending on your needs, and all of them are better than a long string.

As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Replies are listed 'Best First'.
Re^2: How to process with huge data's
by arivu198314 (Sexton) on Oct 05, 2011 at 13:39 UTC

    Thanks for your reply

    After consulting all the replies, I have moved into array in hashes

    Please find my code below, but still i'm struggling lot since it is taking long time to finish this task.

    #!/usr/bin/perl use Data::Dumper; my $queryCount=0; my $answeredCount=0; my @connectionValues; open(INP, $ARGV[0]); while(my $inputtext=<INP>) { chomp($inputtext); my ( $cmd, $fnum, $snum ) = split ' ', $inputtext; if ($cmd eq 'q') { for( my $arrVal=0; $arrVal < @connectionValues; $arrVal++) { if (exists $connectionValues[$arrVal]{$fnum} and exists $c +onnectionValues[$arrVal]{$snum}) { $answeredCount++; } } $queryCount++; } elsif ($cmd eq 'c') { if ($fnum ~~ @connectionValues or $snum ~~ @connectionValues) { my $fHashVal; my $sHashVal; my $valid=0; for( my $arrVal=0; $arrVal < @connectionValues; $arrVal++) { next if (!$connectionValues[$arrVal]); if (exists $connectionValues[$arrVal]{$fnum} and not e +xists $connectionValues[$arrVal]{$snum}) { if ($valid) { $connectionValues[$arrVal]={%{$connectionValue +s[$arrVal]}, %{$connectionValues[$sHashVal]}}; delete $connectionValues[$sHashVal]; $sHashVal=0; } else { $connectionValues[$arrVal]{$snum}=1; $fHashVal=$arrVal; $valid=1; } } elsif (exists $connectionValues[$arrVal]{$snum} and no +t exists $connectionValues[$arrVal]{$fnum}) { if ($valid) { $connectionValues[$arrVal]={%{$connectionValue +s[$arrVal]}, %{$connectionValues[$fHashVal]}}; delete $connectionValues[$fHashVal]; $fHashVal=0; } else { $connectionValues[$arrVal]{$fnum}=1; $sHashVal=$arrVal; $valid=1; } } } } else { push(@connectionValues, {$fnum=>1, $snum=>1}); } } } close (INP); print "$answeredCount,".($queryCount-$answeredCount)."\n";

      Let's start by generating specifications for what you're trying to achieve.

      First start with a short, one-line summary of the project. I'll make one up ... Summarize activity of users at my web site.

      Then generate a description, in English, of what you want to achieve. I'll make stuff up, guessing what i think you might be thinking

      • Read the data file, line by line, specifying the edges of a graph.
        • Split the line into a command, the 'from' node, and the 'to' node.
      • For a connection 'command', if the 'from' node and 'to' node have not been seen, add them.
        • If the 'from' node has been seen, but the 'to' has not, then ... do something
        • if the 'from' node has not been seen, but the 'to' node has, then ... do something

      Once you have an idea of what you want to achieve, then you can consider how to implement it.

      At the moment, if you are asked to "connect 1 2", you create a hash { 1 => 1, 2 => 1}. How do you know whether that is "connect 1 2" or "connect 2 1"?

      If this is homework, you should ask your teacher or TA.

      As Occam said: Entia non sunt multiplicanda praeter necessitatem.