
Re: Find what characters never appear

by Narveson (Chaplain)
on Sep 06, 2009 at 14:32 UTC


in reply to Find what characters never appear

Final Report

Thanks for the responses, which fell into three groups:

  • Build a histogram.
  • Match a dynamically updated character class.
  • Consider doing something else instead.

The histogram is a classic recipe. When I ran kennethk's implementation against my big file, I added a printout showing all the character counts as well as the unused characters I'd been looking for. Although pipe occurred 43 times and tilde occurred once, there were in fact three printable ASCII characters that were never used.

The job ended up taking 79 minutes. Having heard that hash lookups are expensive, I was attracted by almut's suggestion to put the histogram in an array instead of a hash. That modification ran in 77 minutes.

Either the hash mechanism isn't that expensive after all, or a hash whose keys are single ASCII characters somehow achieves the same performance as an array.
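For readers who haven't seen the recipe, here is a minimal sketch of the array-based histogram. It is not kennethk's or almut's actual code: the sub names and the two sample lines are made up for illustration, and it assumes only ASCII input matters.

```perl
use strict;
use warnings;

# Count every ASCII character in the given lines, using an array
# indexed by ord() rather than a hash keyed by the character.
sub char_histogram {
    my @lines = @_;
    my @count = (0) x 128;                  # one slot per ASCII code point
    for my $line (@lines) {
        for my $ch (split //, $line) {
            my $code = ord($ch);
            $count[$code]++ if $code < 128; # this sketch ignores non-ASCII
        }
    }
    return \@count;
}

# Printable ASCII characters (space through tilde) with a zero count.
sub unused_printable {
    my ($count) = @_;
    return grep { !$count->[ ord($_) ] } map { chr($_) } 32 .. 126;
}

# Hypothetical sample data standing in for the big file.
my $count  = char_histogram("a|b|c\n", "x~y\n");
my @unused = unused_printable($count);
printf "pipe: %d, tilde: %d\n", $count->[ ord('|') ], $count->[ ord('~') ];
printf "%d printable characters never appear\n", scalar @unused;
```

The hash version differs only in keeping a `%count` and incrementing `$count{$ch}`. Either way, every character of every line is visited, which would be consistent with the two timings being so close.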

The way to do this job fast is to quit looking for characters that have already been seen. I ran almut's illustration of how to dynamically generate a character class from a list, with kennethk's correction (using quotemeta), and it took only a couple of minutes, against 77 to 79 minutes for the histograms. (I didn't bother to put it in a harness to get an exact timing.)
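The idea can be sketched roughly as follows. This is a reconstruction under assumptions, not the code posted in the thread: `never_seen` and `build_class` are hypothetical names, and the in-memory sample stands in for the real file.

```perl
use strict;
use warnings;

# Build a regex matching any one of the given characters, or return
# undef when no characters are left. quotemeta escapes characters such
# as ], ^, - and \ that would otherwise be special inside a class.
sub build_class {
    my @chars = @_;
    return unless @chars;
    my $class = join '', map { quotemeta($_) } @chars;
    return qr/([$class])/;
}

# Return the candidate characters that never occur in the file.
sub never_seen {
    my ($fh, @candidates) = @_;
    my %unseen = map { $_ => 1 } @candidates;
    my $re     = build_class(keys %unseen);

    while (defined(my $line = <$fh>)) {
        # Each hit removes one character and shrinks the class, so the
        # regex is rebuilt at most once per candidate over the whole run.
        while (defined $re && $line =~ $re) {
            delete $unseen{$1};
            $re = build_class(keys %unseen);
        }
        last unless defined $re;    # everything has been seen; stop early
    }
    return sort keys %unseen;
}

# Hypothetical sample data read through an in-memory filehandle.
my $sample = "a|b|c\nx~y\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @unused = never_seen($fh, map { chr($_) } 32 .. 126);
close $fh;
printf "%d printable characters never appear\n", scalar @unused;
```

Once the common characters have been struck from the class, most lines cost a single failed match, and that scan happens inside the regex engine rather than in a Perl-level loop over characters.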

Thanks, finally, to all who pointed out that the solution to this puzzle has no business value. What I didn't mention was that we're writing a file to be read by Microsoft SQL Server Integration Services (SSIS). So one of the CSV formats is probably the way to go. My own preference had been to just use pack and generate a fixed-width file, but our SSIS developers think reading fixed-width data is too much trouble. I'm planning to spend the rest of the weekend Googling for ways in which SSIS might learn to read a configuration spec and unpack fixed-width data as easily as I know Perl can.
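For anyone who hasn't used pack for this, a minimal sketch of the fixed-width round trip follows. The three-column layout, the field widths, and the sub names are invented for illustration; they are not the actual file spec.

```perl
use strict;
use warnings;

# Hypothetical layout: name (10 chars), city (12 chars), amount (8 chars).
# "A" pads with spaces on pack and strips trailing spaces on unpack.
my $template = 'A10 A12 A8';

sub to_fixed_width   { return pack $template, @_ }
sub from_fixed_width { return unpack $template, $_[0] }

my $record = to_fixed_width('Smith', 'Chicago', '1234.50');
my @fields = from_fixed_width($record);

printf "record is %d characters wide\n", length $record;
print join('|', @fields), "\n";
```

Note that pack with "A" silently truncates a value longer than its field, so the widths have to be chosen with the data in mind. In exchange, no delimiter is needed, so it no longer matters which characters appear in the data.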
