Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

H of A optimization

by abitkin (Monk)
on Aug 30, 2002 at 20:11 UTC ( #194235=perlquestion: print w/replies, xml ) Need Help??

abitkin has asked for the wisdom of the Perl Monks concerning the following question:

I have a loop here, that is requiring a lot of time to run. (I believe) The purpose is to read an edge list and convert it into a hash, there's about 30,000 nodes in this graph, but 90,000 lines. So there are loops which need to be eliminated. (Graph is undirected.) The data is such that:
0 1
1 2
1 3
1 4
2 4
4 5

With my code looking like:
my %edgeHash; my %countHash; while(<stdin>){ # Reading an edge list if(/^\s*([0-9]+)\s+([0-9]+)\S*$/){ my $a = "$1"; my $b = "$2"; my $good = "1"; # Eliminate loops foreach my $item (keys %edgeHash){ foreach my $item2(@{$edgeHash{$item}}){ if($item2 eq $b){ $good = "0"; } } } # Add items to the edge hash if($good eq "1"){ if(exists($edgeHash{$a})){ push @{$edgeHash{$a}}, $b; }else{ push @{$edgeHash{$a}}, $b; } } } }

Replies are listed 'Best First'.
Re: H of A optimization
by sauoq (Abbot) on Aug 31, 2002 at 02:04 UTC

    This question has resulted in quite a mess of answers.

    The purpose is to read an edge list and convert it into a hash, there's about 30,000 nodes in this graph, but 90,000 lines. So there are loops which need to be eliminated. (Graph is undirected.)

    I am a little confused about what it is that you want. It seems there has been some confusion in the answers as well. The fact that the graph is undirected doesn't have anything to do with how many "loops" it has. An undirected graph may certainly have loops. It may even have multiple edges with the same vertices or edges where both ends are the same vertex. The undirected graph: 0-1,1-0,1-2,2-0,2-2 has loops with 1, 2, and 3 vertices.

    All "undirected" means is that the edges don't have a direction so it doesn't matter what order you list the vertices when defining an edge. In an undirected graph, a-b is the same as b-a. That's not the case in a directed graph.

    Your code seems to suggest that you don't want any cycles in the graph. Really, all it does is not allow the same node to show up more than twice on the right hand side of your input. If the graph is undirected, the side that the vertex shows up on shouldn't even matter. Most of the solutions presented to you are optimizations of your own code including those by dws, ferrency, fruiture, and Arien.

    With your example input:

    0 1 1 2 1 3 1 4 2 4 4 5

    Your code will simply eliminate the edge "2 4". Because 4 shows up more than once on the right.

    If this is what you want, then great; you've got lots of answers. This doesn't jive with the idea that they are undirected graphs though.

    When I first read it, I suspected that runrig had a better idea of what you wanted to do. I thought that maybe you wanted to keep only one of sets of duplicate edges. His (runrig's) code would do that for a directed graph but it would still allow two edges like a-b and b-a to exist together. I'm guessing Zaxo had the same thought and was under the impression that the Graph module would do what you wanted. It doesn't. (In fact, it seems to have a bug(?) such that I can use Zaxo's code to add 7 edges and actually end up with 9. I'll demonstrate that below in a readmore section.)

    In order to get a good answer here, I think you are going to have to more clearly explain what it is that you are trying to do. Is the graph really undirected? Are you trying to eliminate all duplicate edges? (You could have a 30K node graph with 90K unique edges, by the way.) Are you trying to eliminate all cycles? (There are likely to be many ways of doing that which result in totally different graphs.) Or are you trying to do something else entirely?

    If you want to know more about that bug(?) I encountered in Graph...
    -sauoq
    "My two cents aren't worth a dime.";
    
      Actually I was turning an aciclic undirected graph into a spanning tree, so removing 2 4 was the correct approach. The code I posted works correctly for me, with that and some help I got on the CB, I went from 80+ minutes of run time, to 15.

      Thanks for your thoughts though
        Actually I was turning an aciclic undirected graph into a spanning tree,

        But you weren't.

        Firstly, the data you started with was not acyclic (the edges 1-2, 1-4, and 2-4 composed a cycle) and you treated it as if it were directed by paying special attention to the nodes on the right side of your input edges. Secondly and more importantly, your algorithm won't guarantee that the resulting graph is connected or acyclic, the two requirements for a spanning tree.

        so removing 2 4 was the correct approach.

        If the nodes had been listed as 2-1, 1-4, 4-2 the cycle would not have been eliminated.

        The code I posted works correctly for me, with that and some help I got on the CB, I went from 80+ minutes of run time, to 15.

        I'm glad you got an answer that worked for you. Certainly the code you posted was not very efficient. But efficiency is meaningless without correctness and your statement of the problem leaves some question about the solutions.

        Maybe you could summarize the help you got on the CB for those monks who may stumble upon this node while seeking enlightenment on their own graph related problems.

        Good luck!

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: H of A optimization
by dws (Chancellor) on Aug 30, 2002 at 20:32 UTC
    I have a loop here, that is requiring a lot of time to run.

    Since the point of the loop is to look for cycles, stop as soon as you find one.

    SEARCH: foreach my $item (keys %edgeHash){ foreach my $item2(@{$edgeHash{$item}}){ if($item2 eq $b){ $good = "0"; last SEARCH; # one is enough, stop now. } } }
Re: H of A optimization
by runrig (Abbot) on Aug 30, 2002 at 20:29 UTC
    I would create a HoH in addition to the HoA and replace that double nested foreach loop:
    next if exists $edgeHoH{$a}{$b}; $edgeHoH{$a}{$b} = undef;
    Then you can get rid of the whole 'if $good' if/else statement, and just push unconditionally (and there was no difference in your 'if' and 'else' actions anyway).

    Update: Oops, misread another question. But now it seems like you just need a simple hash to store the b's, and skip if you've already seen it.

      I think abitkin wants to make sure that $b isn't in any of the lists in %edgeHash, not to make sure that this specific edge didn't already exist.

      I think a better answer is:

      next if $already_seen{$b}++;
      If the values are $b are densely packed small positive integers, an array may be a better choice of data structures for already_seen

      The new code:

      my %edgeHash; my %countHash; my %already_seen; while(<stdin>){ # Reading an edge list if(/^\s*([0-9]+)\s+([0-9]+)\S*$/){ my $a = "$1"; my $b = "$2"; # Eliminate loops next if $already_seen{$b}; push @{$edgeHash{$a}}, $b; } }
      I hope this helps.

      Alan

Re(ALL): Putting it together
by abitkin (Monk) on Aug 30, 2002 at 20:44 UTC
    Putting everything together, I have:
    my %edgeHash; my @alreadySeen; my %countHash; while(<stdin>){ # Reading data from the inet generator, # making sure the format is correct. if(/^\s*([0-9]+)\s+([0-9]+)\S*$/){ my $a = "$1"; my $b = "$2"; push @{$edgeHash{$a}}, $b if not $alreadySeen[int($b)]++; } }
    Thanks to all those who helped. ++ tomorrow, when I get the votes.

      I'd rather say:

      my (%edgeHash,@alreadySeen); while(<STDIN>){ # simpler regexp if(/(\d+)\D+(\d+)/){ # don't use '$a' and '$b'! # and don't quote "$vars" # in fact you can keep $1 and $2 push @{$edgeHash{$1}}, $2 unless $alreadySeen[int($2)]++; } }
      --
      http://fruiture.de

        Well, in that case...

        my (%edge, @seen); while (<STDIN>) { $seen[$2]++ or push @{$edge{$1}}, $2 if /(\d+)\D+(\d+)/; # push @{$edge{$1}}, $2 if /(\d+)\D+(\d+)/ and not $seen[$2]++; }

        — Arien

        Edit: added the commented alternative.

Re: H of A optimization
by ides (Deacon) on Aug 30, 2002 at 21:08 UTC

    Removing this would probably help performance a bit as it is totally unnecessary:

    if(exists($edgeHash{$a})){ push @{$edgeHash{$a}}, $b; }else{ push @{$edgeHash{$a}}, $b; }
    Unless this is a mistake, either condition of the if test does the same thing.

    -----------------------------------
    Frank Wiles <frank@wiles.org>
    http://frank.wiles.org

Re: H of A optimization
by abitkin (Monk) on Aug 30, 2002 at 20:12 UTC
    I'd like to note, the only reason I'm asking here, is every time I do, I learn a new trick to help myself out.

      Ok, dpuu suggested Graph, which I recommend. This does it:

      #!/usr/bin/perl -w use strict; use Graph::Undirected; my $net = Graph::Undirected->new; while (<DATA>) { $net = $net->add_edge( split ); } use Data::Dumper; print Dumper($net); __END__ 0 1 1 2 1 3 1 4 2 4 4 5

      Graph.pm takes care of circular references for you.

      After Compline,
      Zaxo

Re: H of A optimization
by dpuu (Chaplain) on Aug 30, 2002 at 21:45 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://194235]
Approved by Snuggle
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (11)
As of 2020-06-03 10:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you really want to know if there is extraterrestrial life?



    Results (22 votes). Check out past polls.

    Notices?