PerlMonks  

Re^4: Removing digits until you see | in a string

by Animator (Hermit)
on Jan 08, 2007 at 14:11 UTC ([id://593537])


in reply to Re^3: Removing digits until you see | in a string
in thread Removing digits until you see | in a string

What is bad about it (IMHO):

  • file size is unknown
  • processing does not start until you are done reading
  • using at the very least three or four times the size of the file in memory
  • most people will not realize that you are reading it into memory first. If you really want to read it all at once then I would suggest reading it into an array first.
  • the processing will be slower than reading it with a while loop and immediately creating a hash element for each record. Now you read it into a temporary list, then you loop over that list, while looping over it you create a new list, and then you finally assign that list to a hash. Not that you will notice the speed/memory difference, but that doesn't mean it's not there.
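
To make the comparison concrete, here is a minimal sketch of the two styles side by side. It is not the code from the parent node: the sample data, the in-memory filehandle, and the choice of "first field is the key, rest of the line is the value" are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: a key, then pipe-delimited fields.
my $data = "100|alpha|a|1\n200|beta|b|2\n300|gamma|c|3\n";

# Style 1: map over the filehandle. <$fh> is in list context here, so
# every line is read into a temporary list before map starts working.
open my $fh, '<', \$data or die "open: $!";
my %by_map = map { chomp; split /\|/, $_, 2 } <$fh>;
close $fh;

# Style 2: read one line at a time and store it straight away.
open $fh, '<', \$data or die "open: $!";
my %by_line;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $key, $rest ) = split /\|/, $line, 2;
    $by_line{$key} = $rest;
}
close $fh;

printf "%d records via map, %d via while\n",
    scalar( keys %by_map ), scalar( keys %by_line );
```

Both end up with the same hash; the difference is only in how much is held in memory along the way and when the first record gets processed.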

And with the map technique you can't (easily) check for duplicate keys - but that wasn't asked.
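
For what it's worth, this is roughly how a duplicate check could slot into a line-by-line loop. It is only a sketch: the sample data is made up, and keeping the first value while reporting later clashes is an assumption about what one would want.

```perl
use strict;
use warnings;

# Hypothetical sample data with a repeated key (200).
my $data = "100|alpha\n200|beta\n200|gamma\n";

open my $fh, '<', \$data or die "open: $!";
my ( %record, @duplicates );
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $key, $rest ) = split /\|/, $line, 2;
    if ( exists $record{$key} ) {
        # Keep the first value seen and make a note of the clash.
        push @duplicates, 'line ' . $. . ": key $key already seen";
        next;
    }
    $record{$key} = $rest;
}
close $fh;

print "$_\n" for @duplicates;
```

Because each record is handled as it is read, the check costs one `exists` per line and needs no second pass over the data.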

Re^5: Removing digits until you see | in a string
by johngg (Canon) on Jan 08, 2007 at 16:55 UTC
    Taking your points in order:

    file size is unknown

    Not by us but kevyt is probably aware and can make a value judgement reconciling the size of his file with the memory resources available.

    processing does not start until you are done reading

    I can't think why that would be a problem here. Please could you expand on why this is bad.

    using at the very least three or four times the size of the file in memory

    Yes, but as in point one kevyt can decide whether he has the resources to accommodate this. We don't know what resources are available.

    most people will not realize that you are reading it into memory first. If you really want to read it all at once then I would suggest reading it into an array first

    This is a difficult topic. To what extent do you balance using the features of Perl, or any language, against making your code accessible to beginners in the language? It has to depend on the type of workplace, the experience level of the workforce and the amount of staff churn. An experienced, stable programming team can perhaps make greater use of language features. However, if you never expose people to new techniques, they will never learn them. This exposure can be via training/mentoring or by encouraging and rewarding self-study. Personally, I am in favour of educating programmers so they can make more informed choices from a larger tool bag in order to solve problems.

    the processing will be slower than reading it with a while loop and immediately creating a hash element for each record. Now you read it into a temporary list, then you loop over that list, while looping over it you create a new list, and then you finally assign that list to a hash. Not that you will notice the speed/memory difference, but that doesn't mean it's not there

    Well, let's test it. Using a data file kludged up from /usr/dict/words so that we have unique keys as the first of four pipe-delimited fields per line (file size just under 1MByte) I ran some benchmarks. Here's the code
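
    A minimal sketch of a benchmark along these lines follows. It is not the original listing: the generated test file, the sub names by_map, by_array and by_line, and the "first field is the key" layout are all assumptions, and the file is far smaller than the one described above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Hypothetical test file: unique numeric keys, four pipe-delimited fields.
my ( $out, $file ) = tempfile( UNLINK => 1 );
print {$out} join( '|', $_, "word$_", 'x', 'y' ), "\n" for 1 .. 2000;
close $out or die "close: $!";

# Whole file into a temporary list, fed straight into map.
sub by_map {
    open my $fh, '<', $file or die "open $file: $!";
    my %h = map { chomp; split /\|/, $_, 2 } <$fh>;
    return \%h;
}

# Whole file into a named array first, then loop over it.
sub by_array {
    open my $fh, '<', $file or die "open $file: $!";
    my @lines = <$fh>;
    my %h;
    for my $line (@lines) {
        chomp $line;
        my ( $key, $rest ) = split /\|/, $line, 2;
        $h{$key} = $rest;
    }
    return \%h;
}

# One line at a time, storing each record as it is read.
sub by_line {
    open my $fh, '<', $file or die "open $file: $!";
    my %h;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $key, $rest ) = split /\|/, $line, 2;
        $h{$key} = $rest;
    }
    return \%h;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { Array => \&by_array, ByLine => \&by_line, Map => \&by_map } );
```

    The table cmpthese prints has the same shape as the output shown in this post, although with a negative count it reports a rate per second rather than s/iter unless the subs are slow.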

    I ran the benchmark five times and the map solution came out faster than the line-by-line approach on four of them, although the difference is probably not statistically significant. Reading into an array was consistently the slowest by a larger margin. Here's the output

    $ spw593475
           s/iter  Array ByLine    Map
    Array    1.30     --   -14%   -15%
    ByLine   1.12    16%     --    -1%
    Map      1.10    18%     1%     --
    $ spw593475
           s/iter  Array    Map ByLine
    Array    1.43     --   -14%   -18%
    Map      1.22    17%     --    -5%
    ByLine   1.16    23%     5%     --
    $ spw593475
           s/iter  Array ByLine    Map
    Array    1.31     --   -14%   -15%
    ByLine   1.12    17%     --    -0%
    Map      1.12    17%     0%     --
    $ spw593475
           s/iter  Array ByLine    Map
    Array    1.31     --   -13%   -15%
    ByLine   1.13    16%     --    -1%
    Map      1.11    17%     1%     --
    $ spw593475
           s/iter  Array ByLine    Map
    Array    1.30     --   -14%   -16%
    ByLine   1.12    16%     --    -3%
    Map      1.09    19%     3%     --
    $

    I also ran each method in separate scripts to look at memory usage. As you would expect, line-by-line was the most frugal with an image of about 7MB, array came next at about 9MB and map was the most expensive at about 11MB, so your estimate of three to four times the data file size was spot on.

    The platform is SPARC/Solaris: an Ultra 30 with a 300MHz processor and 384MB of memory running Solaris 9. The data file was on a local disk and the Perl version was 5.8.4, compiled with gcc 3.4.2.

    Regarding your final (added?) point, yes, I would have approached the problem a different way had duplicate detection been a requirement.

    Cheers,

    JohnGG

    Update: Fixed typo

      (I tried to make this as readable as possible - but I'm not sure I succeeded)

      file size is unknown

      Not by us but kevyt is probably aware and can make a value judgement reconciling the size of his file with the memory resources available

      Yes he knows the file size and yes he can decide if it is a problem to read the entire file in memory.
      But the major point is: he probably does not (or didn't) know that the entire file is first read into memory.

      processing does not start until you are done reading

      I can't think why that would be a problem here. Please could you expand on why this is bad.

      Reasons why this could be(come) a problem:

      • data that is being read from STDIN (or via a pipe): if he decides to add debugging information then he will not see the output (or at least not when he expects it)
      • In a more unlikely scenario: really large files - if the list that holds the data gets swapped out

      most people will not realize that you are reading it into memory first. If you really want to read it all at once then I would suggest reading it into an array first

      This is a difficult topic. To what extent do you balance using the features of Perl, or any language, against making your code accessible to beginners in the language?

      Well, the thing is, this is not my code. And this is not your code either. It will be kevyt's code. He needs to fully understand it.
      In the code I normally write - when not helping people - I do not really care about it and use every feature I need.

      It has to depend on the type of workplace, the experience level of the workforce and the amount of staff churn.

      Exactly. And this is a site that offers help (to beginners?).
      So you should keep it as simple as possible or add enough explanation so that they can understand it (or at least references to the documentation).

      An experienced, stable programming team can perhaps make greater use of language features. However, if you never expose people to new techniques, they will never learn them. This exposure can be via training/mentoring or by encouraging and rewarding self-study. Personally, I am in favour of educating programmers so they can make more informed choices from a larger tool bag in order to solve problems.

      I'm in favour of educating as well. But IMHO he can't educate himself from your post. If you wanted to educate him then you should have (IMHO) started by explaining why he cannot use tr/// to accomplish this task, then moved on to a long version of the code (as in reading into an array) and finally moved to the shorter version.
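
      As a rough illustration of the tr/// point, assuming the task is to strip the leading digits up to (but not including) the first | and using a made-up record:

```perl
use strict;
use warnings;

# Hypothetical record: digits, a pipe, then text that itself contains digits.
my $record = '12345|route 66|2007';

# tr/// works character by character and has no notion of position,
# so it deletes every digit in the string, not just the leading run.
( my $with_tr = $record ) =~ tr/0-9//d;

# An anchored substitution removes only the digits before the first pipe.
( my $with_s = $record ) =~ s/^\d+(?=\|)//;

print "tr/// : $with_tr\n";
print "s///  : $with_s\n";
```

      The tr/// version leaves "|route |" because it has also eaten the 66 and the 2007, while the substitution leaves "|route 66|2007".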

      Basically, what you did was giving him some code and hoping that he would either understand it or look it up.

      The benchmark

      If I run your benchmark I get completely different results, which show that ByLine is the fastest (on Slackware)...

      File: 689K, perl version: 5.6.1

      File: 689K, perl version: 5.8.4

      File: 689K, perl version: 5.8.7

      File: 5.1M, perl version: 5.6.1

      File: 5.1M, perl version: 5.8.4

      File: 5.1M, perl version: 5.8.7

      But as stated before - the difference will hardly be noticed by anyone.

        You have made some interesting points and I have sympathy with some of them. However,

        He needs to fully understand it

        You are falling into the error of assuming how much kevyt does or does not know. He does not say whether he is a beginner or much more advanced and I'm not sure you can divine too much from the question; tr is often misunderstood. I am wary of offending people by replying to their posts at too elementary a level. If they do not understand at first they are at perfect liberty to post a reply asking for a fuller explanation. Only if an OP says "I'm completely new to Perl" do I think it appropriate to reply straight away with what amounts to a mini-tutorial.

        this is a site that offers help (to beginners?)

        This is a site that offers help to all. I have seen Monks as experienced as japhy and Ovid posting questions in recent weeks. I think responses should always be helpful but certain audiences would not find simple code examples with detailed explanations helpful at all but rather offensive. Imagine explaining to Edison how a light bulb worked.

        Basically, what you did was giving him some code and hoping that he would either understand it or look it up.

        Not at all. I gave a basic solution, cleanly laid out. It depends on the individual but, personally, I find I gain more from other Monks' replies if I have a crack at trying to figure out what they are doing for myself rather than just reading the explanatory text. I hope that kevyt understood my post but I hope more that, if not, she or he would post again to say, "Didn't quite get that, could you explain further?" I'd be only too pleased to do so.

        Benchmarking on different platforms can always throw up surprises. There seems to be something about SPARC that favours map. I am guessing (from the law of averages) that you are on an AMD or Intel platform. I wonder if other RISC architectures (Alpha or HP-PA for example) would mirror SPARC.

        I have enjoyed exploring our philosophical differences. It has made me think about how I should answer posts.

        Cheers,

        JohnGG

        Update: Fixed typo
