|Problems? Is your data what you think it is?|
Re: Re: Re: Re: Re: Re: Re: Re: A (memory) poor man's hashby BrowserUk (Pope)
|on Nov 26, 2003 at 09:46 UTC||Need Help??|
Okay Tilly, if you want to play the game of revising history through selective quotes, let's do that. Your first reply contained 3 points.
To summarise and paraphrase.
Your statistics are bad, DB_File is perfect, and if you knew better, you wouldn't be so enamoured with hashes in the first place because they are old hat and destined to be replaced.
Does any of this help in solving or smoothing the performance knee that arises for perl 5 programmer dealing with large keyed data sets today?
Despite this, I attempted to address each of your points seriously in a constructive and open-minded manner.
I disagree that optimising for better use of multi-level memory is unimportant though....
Can you point to where I said such optimisation was "unimportant"?
I did say that such optimisations were
I'm still somewhat bemused by the relevance this entire subject had relative to the OP. The technique I outlined in the OP was aimed not at performance, but at conserving memory. And at trading memory conservation for performance whilst smoothing the knee of the transition. Perl's native hashes are already as fast as they are likely to be. Maybe Judy arrays would be faster, but having spent 3 hours reading the 10-minutes "explanation" of them, which gave little or no information on how to actually implement them I moved onto the 3-hours diatribe which repetitively extols their virtues without giving any real information upon their implementation. Whilst the code may be available for download,
Until someone is willing and capable of providing a native C implementation of Judy arrays for incorporation directly into the perl code base, Judy arrays and the cache-level performance optimisations that can, with considerable effort, be derived from them, will be nothing but a blue-sky desire for the average perl 5 programmer. Whether the performance benefits are required is questionable as perl's hashes are already pretty fast and very flexible. Whether they would in any way address the memory consumption issue is beyond my ability to predict, but it is unlikely that they will become available any time soon. It is this latter issue that I was attempting to solve.
Your third reply
On the low-memory hashing. I'm sure that what you are doing is possible. I'm sure that it saves memory.
Damned, with faint praise. I know that it is possible, I said as much.
With cache optimization, we need to specify our goals first.
Cache optimisation is not my goal.
...Sure, something else might now be optimal. But pretending that access times are flat will lose by more in the future than it does now.
I'm not pretending anything. I acknowledged the reality that perl's hashes, useful as they are now, have a limitation for certain applications. I attempted to find a solution for this limitation that could be made available now. I believed I had the germ of such a solution and meditated on it in the hope that the collective resource of PM could see a way of utilising that germ to bring a sufficiently generic solution to fruition. Now, or in the immediate future.
The current design and implementation of Perl has many such limitations that affect particular, specific applications. Many of these are un-addressable given the current code base. P6/Parrot and Ponie offer the potential for addressing these, but these remain some time away from being available to the average perl programmer for production use.
On relational databases, I think that you are missing the boat.
Sorry, but the RDBMS boat sailed in the 1970's and reached it's destination in the 1980's. Little or nothing has changed on the RDBMS front since then except that their proprietary internals have been tuned in ways that are specific to the implementations --each proprietary solution vying for supremacy in each area. Whilst open-source RDBMSs are now available, they are somewhere between 10 and 20 years behind in development and many (most?) do not share the benefits of the transactional mechanics, nor efficient stored procedures nor the triggers to effectively use them.
The only way to realise high performance from an RDBMS is to move as much of the processing into the domain of the RDBMS as possible which requires the use of triggers and stored procedures. The reason for this is that the communications between the application and the RDBMS and the required, associated locking are relatively inefficient. It therefore necessitates keeping the communication between the application and the RDBMS to a minimum. Where efficiency is a consideration, this requires the application programmer to move as much of the logic of his application into the RDBMS as possible so that as much work can be done as a result if each communication as possible. Preferably, the work should be done automatically (triggered) as a consequence of other communications (inserts) as possible. With the high-end RDBMSs, this is possible and efficient.
The downside is that utilising these proprietary mechanisms, besides causing lock-in, requires that all applications utilising the RDBMS be written and structured in a way that is friendly to it. This includes limiting the data formats and structures used to those the RDBMS supports, or imposes the expense of converting between the applications formats and those of the RDBMS. It also requires the use of SQL (plus the myriad proprietary extensions to it).
The abstract nature of SQL is very convenient for the RDBMS producer, but not at all for the application writer. The time spent normalising recursive data-structures, many-to-many relationships, and breaking apart/ reconstructing complex compound data structures in order to serialise them to and from an RDBMS is almost pure wasted effort in most cases. Just as hierarchal DBs were very good at representing hierarchal data, but inefficient and clumsy at handling other types, so imposing a relational view upon non-relational data is equally inefficient and clumsy.
The number of times I have been party to the whole 2inner-join versus outer-join quandry", or the "my data is variable length and greater than 255 chars but I need to sub-select the records using a full text search but blobs/clobs/* don't support that.", and the many other issues that arise solely as a result of using RDBMSs, are part of the source of my distaste for them. The levels of abstraction that RDBMSs impose upon the application programmer, and the time spent complying with and pandering to those abstractions are nothing more than a drain upon resources.
The only time data truly benefits from the use of such abstraction is when the data may be compounded with other data held within the same DB (Note: DB,, not RDBMS) and utilised in another application at some point in the future. This is the oft-quoted holy grail of the RDBMS, but it is so rarely actually used in the real world because most applications place their data in separate DBs. It is not possible to effectively compound the data from several applications unless there data reside in the same DB!. In order that the data from several application be effectively combined in a single DB, requires that a schema be designed that permits all required uses of the data be met efficiently. In order that the 'future, compounded data applications' requirements be met by the compound schema, requires the DBA designing the schema have knowledge of those requirements (or guess them) when designing the schema. Whilst a new schema can be devised at a later data and the existing data transformed to that new schema, the chances that the original applications will be unaffected by that transformation and not need to be re-written are very slim.
The singularly most pervasive and persistent paradigm in computing to date is OOD & OOP, and it would be a brave man to predict either it's demise or supercedence.
OO is not inherently relational nor is it easily transmuted into the relational abstraction. It is easy and productive to design OO-systems that contain recursive, many-to-many and inherited data-structures. Converting between these and a relational model for the purposes of persistence is slow, messy, laborious and fraught with traps for the application programmer. Whilst the rules are pretty clearly defined and a DBA fully conversant with relational theory is capable of performing the transformations required, in most shops I have been party to, the RDBMS and the departments that manage them are not set up to assist the application programmer. They are often structured such that problems encountered by the AP with his application data are considered "his problem", and the data he deposits in the RDBMS become detached from his control.
In your example,
(Just consider the amount of reprogramming that you have to do to match what a DBA accomplishes by adding the right index to speed up existing queries.), in most cases, any speed gained by an application by the addition of the right index, would be entirely unnecessary if the data was maintained locally by the application, already so indexed.
In fact, it is entirely possible to encode the rules for normalisation and conversion of these data structure types such that the conversion to a relational view can be performed auto magically. It should be possible for the OO-programmer to throw his objects at his central data store and have it perform whatever conversions are required to store them. Having every programmer produce code to perform these types of conversions is unproductive and risky. Structuring every application around the constraints, rules and demands of the relational model and the SQL abstraction similarly so. The purpose of an abstraction should be to hide the implementation detail from the concern of the user of the abstraction. Neither the relational model nor SQL do this. They serve only one purpose -- to simplify the task of the RDBMS writer.
As an OO programmer, I want to be able to create an instance of an object and be given an handle to it. If that instance need to persist beyond the life of the program, I want to be able to indicate this at creation time, or invoke a method that causes that to happen. And that should be that.
Retrieving that instance from another application should be transparent to me. I create the instance using whatever initialisation (derived from whatever source) and if that instance already exists in the DB, I should get a handle to it. If not, it should be created and I get a handle to that new instance. The process of serialisation, conversion to the underlying storage medium, normalisation, indexing etc. should be entirely transparent to me. Then, and only then, will I be able to concentrate on writing my application instead of spending my time trying to adapt the requirements of my application to the limitations, rules and specificities of the DMBS.
That's the source of my distaste for RDMBSs. They force me, as an application programmer, to adapt my application to them, rather than the other way around. Imagine that my garage was 4 feet wide and 30 feet long and required me to split my car in two in order to park it. A crude analogy, but pertinent.
Your dream of effectively utilizing the performance of our processors is the diametric opposite of my dream, and runs counter to every trend in programming.
My immediate reaction to this sentence was: What is the point in having hugely fast processors if we are unable to utilise them? The dream I described was the dream that the tools we use when writing programs would perform more of the detail of the programming function freeing the human being to concentrate upon the bigger picture of design. C compilers often laud the fact that they can compile n lines of code in milliseconds. I fail to see how this performance benefits the programmer, if the processor sits nearly idle while he spends hours writing and running tests or trying to find the reasons why the code produced in those milliseconds doesn't work?
What I was envisioning is the time when the compiler utilises the performance of the processor to automate as many of these tasks as possible. If there was a mechanism by which the programmer could clearly define the requirements, constraints and boundary conditions of his algorithms, then it would be possible for the compiler to generate test data and test code to exercise that algorithm. Perhaps my choice of example was flawed, but you introduced the cache-utilisation-optimisation scenario into the discussion, I simply ran with it. My point was that the rules for such optimisations should be encapsulated into the same, processor specific place as the mechanisms that produce the processor specific code -- the compiler, and not be the concern of the AP.
I agree that processor performance is cheap, and given so, it should be thrown at whatever tasks can be easily automated leaving the programmer to concentrate on those tasks that aren't. It will be a while before a CPU is capable of gathering and analysing requirements, and designing applications, but the process of dynamically analysing an algorithm and applying a set of processor specific rules to enhance the performance of that algorithm is well within it's grasp. Most compilers, and even interpreters already do this to some extent. Codifying whatever rules and methods that can be used by the programmer to do this (as exemplified by Judy arrays) would be much simpler and ultimately more productive. Doing this dynamically on a virtual representation of the application, performing benchmarking and profiling, optimising data structures used by objects and algorithms is both conceivable and would be considerably more effective than the current peep-hole style optimisations that result from static analysis. If my compiler spent 10 minutes or even 10 hours rather than 10 seconds compiling my application, but performed this type of analysis and optimisation rather than my having to do it myself, I would be extremely happy.
This is what I meant by more fully utilising the performance of my CPU. Having the tools I use perform more of the drudgery of the job, so that I can do less. This is the essence of VHLs and (IMO) the greatest benefit of Perl over C or assembler. I enjoy assembler as a hobby, just as I enjoy hand-crafting things in wood, but if cars were manufactured using nothing more than hacksaws, files, rules and scribes, then they would be far more expensive and much less developed than they are now. The secret of productivity is to have tools that do more of the work, more quickly, more accurately and with less supervision. Computers, as tools, are being applied in every area of human endeavour to this end with amazing results.
With one, notable and lamentable exception -- software development.
Your last post
As for the rest, please be specific in your complaints. Vaguely waving hands doesn't give me much feedback.
I consider that my last post was specific, contained no hand-waving and was very targeted in it's feedback.
The whole discussion regarding cache optimisations and Judy arrays is nothing more than an entertaining blind alley with respect to solving the original problem until all those that have complained of the memory hungry nature of perl's hashes have access to an implementation of Judy arrays optimised for their platform from perl 5. I'll admit to have continued the discussion, and having moved beyond it and into even more blue-sky arenas in my attempt to be seen to maintain my (rare) intellectual interaction. Perhaps that was a mistake on my behalf, but I did attempt to segregate that discussion from the purpose of the meditation.
My complaint is that by your choosing to combine
This combined with your reputation as (one of) PM's most valuable contributors means that my careful analysis and the time I spent documenting it becomes completely diluted by the FUD, to the point where the effort that went in to it, and the possible outcome are demerited beyond viability.
Through one piece of misinformation and arguments against several non-sequitous statements that I didn't make, you have effectively rendered the efforts I made to finding a solution to the original problem, which is real enough that a super-search of this site brings up many examples of it, completely discredited. As such, the possible solution will never see the light of day, and I will be even more reluctant to ever proffer my ideas in this forum again.
I make no claims to my abilities nor suggest that my opinions have any merit. I'm a coder pure and simple. I have tried very hard to limit the expression of my ideas to those areas where I was able to back up those ideas with measurable and demonstrable facts. This is not the first time that my expression of simple, verifiably, statements of fact, combined with a directly measurable solution, having been attacked and demerited on the basis of unsubstantiated doubts expressed by a reputable monk.
It will however, be the last time -- which was probably the aim.