PerlMonks  

Sharing Hash Question

by jmmach80 (Initiate)
on Jul 04, 2012 at 16:49 UTC ( #979887=perlquestion )
jmmach80 has asked for the wisdom of the Perl Monks concerning the following question:

At work, I'm in the middle of multi-threading a script that parses a lot of very large text files, in addition to various other things. The script usually takes over an hour to run, which is annoying, so I decided to multi-thread the parts of the script that are taking forever. Originally I only needed a small subset of data out of the text files. The code looked something like this...

use Threads;
use Threads::Shared;

my %hash : shared;
.
.
.
while (my $line = <FH>) {
    chomp($line);
    lock(%hash);
    if ($line =~ /regex/) {
        # Escape the pipe: a bare | in a split pattern is regex alternation
        my @parts = split(/\|/, $line);
        $hash{$file} = &share({});
        share ( $hash{$file}{start} ); # for pulling start time from file
        share ( $hash{$file}{stop} ); # for pulling stop time from file
        $hash{$file}{start} = $parts[11];
        $hash{$file}{stop} = $parts[13];
    }
}
.
.
.
... This is a short, simplified, generic snippet from what I can remember off the top of my head. Doing the above worked perfectly fine; however, I realized that I needed additional information out of the text file. Basically, I'm trying to figure out how to correctly share a two-dimensional hash that references (I think) an array. I tried something like this, but it didn't work...

use Threads;
use Threads::Shared;

my %hash : shared;
.
.
.
my $i : shared = 0;
while (my $line = <FH>) {
    chomp($line);
    lock(%hash);
    if ($line =~ /regex/) {
        $hash{$file} = &share({});
        share ( $hash{$file}{certain_row_type}->[$i] );
        $hash{$file}{certain_row_type}->[$i] = $line;
        $i++;
    }
}
.
.
.
So, basically, every time we pattern match a certain row type we update the hash. I probably left some details out, but like I said, I'm trying to recreate this from memory. I hope what I'm asking makes sense. I have no idea if what I'm trying to do above is even legal; I guess it's not, because I always get the "can't start threads" message. I actually tried a bunch of other ways to share this hash, but nothing works. Any help would certainly be appreciated.

Thanks!

Re: Sharing Hash Question
by BrowserUk (Pope) on Jul 04, 2012 at 18:28 UTC

    There are no modules called Threads; or Threads::Shared.

    There is no such error message as "can't start threads".

    The code you've posted -- without even bothering with <code></code> tags --

    • Is totally incomplete to the point it demonstrates nothing.
    • Is a complete mess.
    • Shows no attempt to read, let alone understand the documentation.
    • Could never even begin to address your stated problem.

    In summary, you appear to have made no effort at all to:

    1. Explain why the code you are attempting to speed up with threading is so slow in the first place.
    2. Show what the actual code you are attempting to compile looks like.
    3. Report what actual error messages you are getting when you try to run it.

    With you having made so little effort, why should anyone here make the effort to help you?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Likely referring to threads::shared

      edit: I especially like the part where I get down-voted for wanting to be helpful to a first-time PM poster instead of being a dick.

        No shit Sherlock! :)

        But the point is, if he hasn't spelt the name correctly, how could his code work? How could he be getting the error message he claims? Etc.



      Douche-bag, I work in a classified environment and there is no way to port the code over to an unclassified area. I specifically said I'm trying to remember the code off the top of my head. If you take 20 seconds and look at the snippet of code, you'll be able to see what I'm asking for help with. I'm not asking you to syntactically review the code I typed. If you're not going to be helpful, please don't respond at all. Can someone please help me with how to correctly share a two-dimensional hash that references an array?

        You have to share each hash and array. You don't have to share the values you put into the hash/array.

        So, I believe these two lines should be dropped:

        share ( $hash{$file}{start} ); # for pulling start time from file
        share ( $hash{$file}{stop} ); # for pulling stop time from file

        More importantly, the second line needs to be done more like the first line:

        $hash{$file} = &share({});
        share ( $hash{$file}{certain_row_type}->[$i] );
        $hash{$file}{certain_row_type}->[$i] = $line;

        So, something more like:

        $hash{$file} = &share( {} );
        $hash{$file}{row_type} = &share( [] );
        $hash{$file}{row_type}[$i] = $line;

        - tye        
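The advice above (share every hash and array container, but not the values stored in them) can be put together as a small runnable sketch. The `add_row` helper and the file names `a.txt` and `b.txt` are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Top-level shared hash: file name => { row type => [ lines ] }
my %hash : shared;

# Hypothetical helper: append one line under $file / $row_type,
# creating and sharing each missing container level first.
sub add_row {
    my ($file, $row_type, $line) = @_;
    lock(%hash);    # released when this sub returns
    $hash{$file} = &share({}) unless exists $hash{$file};
    $hash{$file}{$row_type} = &share([])
        unless exists $hash{$file}{$row_type};
    # The stored value is a plain string and needs no share() call.
    push @{ $hash{$file}{$row_type} }, $line;
    return;
}

# Two threads, each filling in the entry for its own made-up file name.
my @workers;
for my $file (qw(a.txt b.txt)) {
    push @workers, threads->create(sub {
        add_row($file, 'row_type', "$file line $_") for 1 .. 3;
    });
}
$_->join for @workers;

for my $file (sort keys %hash) {
    print "$file: ", scalar @{ $hash{$file}{row_type} }, " rows\n";
}
```

Note that the containers are created only when they are missing, so repeated matches for the same file append to the existing array instead of replacing it with a fresh empty hash each time.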

Re: Sharing Hash Question
by ig (Vicar) on Jul 05, 2012 at 09:00 UTC

    It may not help, but I would solve the performance problem without threads if I could. If I had to use threads, I would limit the shared data to the simplest data structures possible. I can't tell from the information you provided whether threads are necessary or, if they are, whether it is necessary to share complex data structures, but you might consider a re-design to avoid threads entirely, thus avoiding all your current problems.
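One way to keep shared data to a minimum, if threads are used at all, is to share nothing: each thread builds an ordinary private structure and hands it back through `join`, which copies the return value into the parent. The sketch below assumes one thread per file and a made-up `^ERROR` pattern; `parse_file` and `parse_all` are hypothetical names, not part of the original script.

```perl
use strict;
use warnings;
use threads;

# Hypothetical parser: collect the lines of one file that match $pattern.
sub parse_file {
    my ($file, $pattern) = @_;
    my %found = (matches => []);
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        push @{ $found{matches} }, $line if $line =~ /$pattern/;
    }
    close($fh);
    return \%found;    # plain, unshared data; copied to the parent on join
}

# Start one thread per file, then collect each thread's return value.
sub parse_all {
    my ($pattern, @files) = @_;
    my %thread_for;
    for my $file (@files) {
        # Ask for scalar context so join() returns the single hash ref.
        $thread_for{$file} = threads->create(
            { context => 'scalar' }, \&parse_file, $file, $pattern
        );
    }
    my %result;
    $result{$_} = $thread_for{$_}->join for @files;
    return \%result;
}

# Usage: perl script.pl file1 file2 ...
if (@ARGV) {
    my $result = parse_all('^ERROR', @ARGV);    # pattern is an assumption
    printf "%s: %d matching lines\n", $_, scalar @{ $result->{$_}{matches} }
        for @ARGV;
}
```

With no `: shared` variables there is nothing to lock, at the cost of copying each result once when the thread is joined.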

Re: Sharing Hash Question
by sundialsvc4 (Monsignor) on Jul 05, 2012 at 14:31 UTC

    Let’s just all please forget the first misguided volley in this tennis-match and see what can be done to address the problem. I am not sure that spinning it off into multiple threads will be helpful at all, particularly since the to-be-shared hash data structure would necessarily be common to all of them and, hence, their execution would wind up being serialized anyhow. I think that BrowserUk was (correctly) trying to focus your attention onto that, even though his choice of wording was ... less-than-delicate.

    So, given that we have a technical problem here, let’s just stay focused on that, shall we? The only plausible reason to use threads is to achieve overlapping of I/O. If the root problem is, as I suspect, “paging churn,” having a bunch of threads or processes “churning” at once will merely make the completion time significantly worse than before.

    You don’t (and of course, you can’t) explain what the “various other things” might be, but my initial impression about almost any program that takes “a long time” to process “very large” files is that you are burning up too much memory and/or causing excessive paging behavior ... easy to do with large random-access data structures. I would suggest measuring the program as it runs, even informally, to see what kind of memory footprint it has and what’s actually causing the (single...) process to wait. Then, I would reconsider the possible solutions, but setting aside threading as one of the alternatives.
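An informal measurement of the kind suggested above could start with wall-clock timing of the separate phases. The sketch below is only a rough illustration: `time_passes` is a hypothetical helper, the `^ERROR` pattern is made up, and the second pass will usually benefit from the operating system's file cache, so the two numbers are a hint and not a benchmark. It does not measure memory footprint.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: time a read-only pass and a read-plus-match pass
# over the same file, returning line count, hit count and elapsed seconds.
sub time_passes {
    my ($file, $pattern) = @_;
    my %seconds;

    # Pass 1: just read every line.
    my $start = time();
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";
    my $lines = 0;
    while (my $line = <$fh>) { $lines++ }
    close($fh);
    $seconds{read_only} = time() - $start;

    # Pass 2: read every line and apply the pattern.
    $start = time();
    open($fh, '<', $file) or die "Can't open $file: $!\n";
    my $hits = 0;
    while (my $line = <$fh>) {
        $hits++ if $line =~ /$pattern/;
    }
    close($fh);
    $seconds{read_and_match} = time() - $start;

    return ($lines, $hits, \%seconds);
}

# Usage: perl script.pl file1 file2 ...
for my $file (@ARGV) {
    my ($lines, $hits, $seconds) = time_passes($file, '^ERROR');
    printf "%s: %d lines, %d hits, read %.3fs, read+match %.3fs\n",
        $file, $lines, $hits,
        $seconds->{read_only}, $seconds->{read_and_match};
}
```

If the read-only pass already accounts for most of the time, the script is probably I/O bound and extra parsing threads are unlikely to help much.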

      The only plausible reason to use threads is to achieve overlapping of I/O.

      That statement is bo.. er .. has no basis in reality.

      Likewise the rest of this misguided garbage.



        In case anybody cares, I finally got it to do what I was trying to do. If anybody needs to parse large text files using multi-threading, here's a simple script that might help. I apologize for the vagueness of my original post. I have never posted on here before and had forgotten about the tags that you can use. Anyway, if people can improve upon it, feel free. I'm always looking for better ways to do things.
        use strict;
        use warnings;
        use threads;
        use threads::shared;
        use Thread::Queue;

        # Constant that holds the maximum number of threads to start
        use constant MAX_THREADS => 10;

        # Main data structure that holds all the data
        my %hash : shared;

        # A new empty queue
        my $q = Thread::Queue->new();

        # Build list of files
        my @files = qw/<file1> <file2> <file3> <etc.>/;
        chomp(@files);

        # Enqueue the files
        $q->enqueue(map($_, @files));

        # Start the threads and wait for them to finish
        for(my $i=0; $i<MAX_THREADS; $i++) {
            threads->create( \&thread, $q )->join;
        }

        # Print out the data structure when we're finished
        foreach my $key1 (keys %hash) {
            print "$key1 =>\n";
            foreach my $key2 (keys %{$hash{$key1}}) {
                print "\t$key2 =>\n";
                print map("\t\t$_\n", @{$hash{$key1}{$key2}});
            }
        }

        #############################
        # This code runs inside of the thread
        #############################
        sub thread {
            my ($q) = @_;
            while (my $file = $q->dequeue_nb()) {
                my @array1 : shared;
                my @array2 : shared;
                my @array3 : shared;

                # Lock the main hash before writing
                lock(%hash);
                chomp($file);

                # Initialize hash with the file/key
                $hash{$file} = &share({});

                # Open the file and pattern match the lines
                open(FH, $file) or die "Can't open\n";
                while(my $line = <FH>) {
                    chomp($line);

                    # Build arrays of the things we're
                    # looking for in the file(s)
                    if($line =~ /^<regex1>/) {
                        push(@array1, $line);
                    } elsif($line =~ /^<regex2>/) {
                        push(@array2, $line);
                    } elsif($line =~ /^<regex3>/) {
                        push(@array3, $line);
                    }
                }
                close(FH);

                share ( $hash{$file}{<type1>} );
                share ( $hash{$file}{<type2>} );
                share ( $hash{$file}{<type3>} );

                # Can only assign arrays as a reference
                $hash{$file}{<type1>} = \@array1;
                $hash{$file}{<type2>} = \@array2;
                $hash{$file}{<type3>} = \@array3;
            }
        }
        exit;
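Since improvements were invited: in the script above, `threads->create(...)->join` waits for each thread before the next one is started, and `lock(%hash)` is held while the whole file is read, so the files are most likely processed one at a time. A possible variation is sketched below. It starts every worker before joining any of them, parses into thread-private arrays, and takes the lock only to publish the result. The patterns in `%pattern_for` and the names `worker` and `parse_files` are placeholders standing in for the elided regexes and types; they are assumptions, not the original code.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# Main data structure: file => { type => [ lines ] }
my %hash : shared;

# Placeholder patterns standing in for the elided regexes (assumptions).
my %pattern_for = (type1 => '^A', type2 => '^B', type3 => '^C');

sub worker {
    my ($q) = @_;
    # defined() keeps a file literally named "0" from ending the loop.
    while (defined(my $file = $q->dequeue_nb())) {
        # Parse into thread-private arrays; no lock is needed here.
        my %rows = map { $_ => [] } keys %pattern_for;
        open(my $fh, '<', $file) or die "Can't open $file: $!\n";
        while (my $line = <$fh>) {
            chomp $line;
            # First matching type wins, like the if/elsif chain above.
            for my $type (sort keys %pattern_for) {
                if ($line =~ /$pattern_for{$type}/) {
                    push @{ $rows{$type} }, $line;
                    last;
                }
            }
        }
        close($fh);

        # Publish under a short lock; shared_clone makes a shared deep copy.
        lock(%hash);
        $hash{$file} = shared_clone(\%rows);
    }
    return;
}

sub parse_files {
    my ($max_threads, @files) = @_;
    my $q = Thread::Queue->new();
    $q->enqueue(@files);    # queue is fully loaded before workers start
    my @workers = map { threads->create(\&worker, $q) } 1 .. $max_threads;
    $_->join for @workers;  # join only after every worker is running
    return;
}

# Usage: perl script.pl file1 file2 ...
if (@ARGV) {
    parse_files(4, @ARGV);    # 4 workers is an arbitrary choice
    for my $file (sort keys %hash) {
        printf "%s: %s\n", $file,
            join(', ', map { "$_=" . scalar @{ $hash{$file}{$_} } }
                       sort keys %{ $hash{$file} });
    }
}
```

Whether this is actually faster depends on where the time goes; if the disk is the bottleneck, more workers may not help.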
