Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Managing a graph with large number of nodes

by nishant2984 (Novice)
on Nov 01, 2009 at 19:52 UTC ( #804383=perlquestion: print w/replies, xml ) Need Help??
nishant2984 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have an application where I need to parse a lot of files and create a graph of users from them. Each node will have some attributes and will be connected to some other nodes with some edge weights which can change depending upon the type of analysis. The number of nodes is extremely large (i.e. more than 10 million). I tried using hashes to represent these graphs once created using Storable but when I try to load them, the system goes out of memory. I wanted to know what is the best possible way of representing graphs in such applications. Should they be represented as disk files i.e. each node stored as a separate file in which case I would need a good distributing algorithm for putting them into a hierarchy in filesystem. Or they can be represented on single files without killing too much memory. Please enlighten. Thanks
  • Comment on Managing a graph with large number of nodes

Replies are listed 'Best First'.
Re: Managing a graph with large number of nodes
by BrowserUk (Pope) on Nov 01, 2009 at 23:26 UTC

    What does a node look like?

    If performance is important, before you consider paying the penalties of constructing, updating and traversing a disk-based graph, consider using a more memory efficient data structure.

    If you store 10 million of the following simple nodes as:

    1. Array of hashes:
      >perl -E"$a[$_] = {a=>1,b=>3.2,c=>'a short string'} for 1..10e6; <>"

      It requires ~3.9 GB.

    2. Array of arrays:
      >perl -E"$a[$_] = [1,3.2,'a short string'] for 1..10e6; <>"

      ~2.8 GB.

      As an array of simple (or packed) strings:

      >perl -E" $a[$_] = qq[1,3.2,'a short string'] for 1..10e6; <>" >perl -E" $a[$_] = pack 'VdA8', 1,3.2,'a short string' for 1..10e6; <> +"

      Only 1.1 GB.

    By saving 75% of the memory requirement, you may be able to fit all your data in to ram comfortably.

    It means that you need to split or unpack and join or repack each time you access or modify a node, but that will be far faster than doing disk IO. You can easily hide the details of the splitting and joining behind a tied array or an OO class. And you can even add a simple caching mechanism to avoid re-splitting in a read-update sequence.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Managing a graph with large number of nodes
by bart (Canon) on Nov 01, 2009 at 22:12 UTC
    That sounds like a typical application for a tied hash to me, the replacement of perl4's dbmopen and related functions.

    As for backend, you might want to choose between the various backends as supported by AnyDBM_File; BerkeleyDB, in particular BerkeleyDB::Hash might be a good choice.

    The advantage is that no rewriting of your code is necessary: it's just a hash. You will likely lose raw speed, though.

    If you don't mind rewriting, you might want to look into DBD::SQLite as a backend for DBI, which gives you a SQL interface to your data.

Re: Managing a graph with large number of nodes
by jethro (Monsignor) on Nov 01, 2009 at 21:41 UTC
    You might use a disk based hash, DBM::Deep. Seems to me your best option without knowing any more details

    Or an SQL database (lots of database interfaces around, for example DBIx::Simple, DBIx:Class), but you also would need to install a database engine.

    Files for individual nodes would be a bad idea, as every access of a node would entail the opening of a file, heavy overhead I would assume.

      Installing a database engine is not such a problem. There are quite a few to choose from. Even most of the big expensive ones have free developer versions. Plus you may use DBD::SQLite which is a database engine inside a module. No need to install anything more.

      Enoch was right!
      Enjoy the last years of Rome.

      Depending on the nature of the nodes and how they're going to be accessed/manipulated, an object store (such as Pixie, Tangram, or KiokuDB) seems likely to work well for something like this, while avoiding the need to worry about SQL schemas (even if the objects are ultimately stored into an SQL database).

        Well, in this case it seems that the whole schema will be just two tables

        CREATE TABLE dbo.Users ( Id int not null identity(1,1) primary key, Code varchar(...), FirstAttribute ..., SecondAttribute ..., ... )go CREATE INDEX idx_Users_Code ON dbo.Users(Code) go CREATE TABLE dbo.Links ( UserId1 int not null, UserId2 int not null, Weight ..., CONSTRAINT pk_Links PRIMARY KEY CLUSTERED (UserId1, UserId2) )go
        and that's about it.

        Object store is most likely an overcomplication in this case.

        Enoch was right!
        Enjoy the last years of Rome.

Re: Managing a graph with large number of nodes
by Ea (Hermit) on Nov 02, 2009 at 12:35 UTC

    I've made a graph using the Graph module that has 0.5 million nodes and 100 thousand links. It takes > 300M of memory, so tieing it to disk or increasing the amount of swap space might work. The other possibility, depending on the nature of the graph, is to try PDL which is implemented in C and use a matrix representation for the weighted graph with other data structures to handle the various attributes.

    If you get desperate enough to write it yourself, take a gander at the C++ code under Optimising "generalised modularities". For my graph, it runs very fast.


    perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?
Re: Managing a graph with large number of nodes
by Anonymous Monk on Nov 02, 2009 at 01:33 UTC
    May we know the type of data? May be a more elegant solution already exists. Database is a good solution. If RAM is the only problem, there could be many solutions. But you have to keep the application in mind too.
Re: Managing a graph with large number of nodes
by zentara (Archbishop) on Nov 02, 2009 at 16:46 UTC
    .... not saying its the best solution, but if i was presented with the problem, with my background, i would just use a Storable database ( memory is getting cheap...alot of systems have 4 GiG standard now).... and just load it into a well designed hash.... then use Tk, Gtk2, or Zinc to display the nodes as different colored little icons, that are interactive with the mouse.... which is a cool feature.... a graph that responds in realtime to the mouse..... but 1 million data points may overload the widget..... but there are ways around the problem, by breaking the data points into smaller sets for display..... a multi-planar layering probably could be possible too. See Tk Realtime data aquisition

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku
Re: Managing a graph with large number of nodes
by NiJo (Friar) on Nov 02, 2009 at 22:36 UTC
    If you are just trying to plot the nodes, I'd look no further than Graphviz. It even has a Perl module. In that case you don't need all the nodes in memory. Parse your input files and transform them line by line into a large dot file. Of coure GraphViz also takes edge weights into account.
Re: Managing a graph with large number of nodes
by MadraghRua (Vicar) on Nov 02, 2009 at 18:11 UTC
    I'm not sure of your ultimate app but if you are working with RDF triples and using these to draw graphs you might try RDF Store.It appears to have a lot of the functionality you are looking for. Like others I am interested in this topic too and would appreciate if you could share more information on what you need the graph for. Thanks!

    yet another biologist hacking perl....

Re: Managing a graph with large number of nodes
by spx2 (Deacon) on Nov 14, 2009 at 23:25 UTC
    I think you want to cluster them. Also, notice that in graphs that big, the biggest problem you will face is the big number of edges crossings which you don't want( and which you will probably not be able to avoid if a planar embedding does not exist ). If you do cluster them(on which criteria you decide) you will probably have less edge crossings inside the clusters so the drawing will be less confusing to the eye. Apart from that, you can look up force based algorithms, those can progressively find a planar embedding of the graph. Graphviz can use force-based algorithms if you use the fdp layout. Also see Visualizing very large datastructures thejit Graphing huge social networks.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://804383]
Approved by AnomalousMonk
Front-paged by biohisham
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2018-04-22 23:02 GMT
Find Nodes?
    Voting Booth?