Text Processing

by vroom (His Eminence)
on May 26, 2000 at 01:24 UTC
Check21/X9.37 text extractor
on Jul 31, 2009 at 19:53 UTC
by delirium
This is a simple script to extract the EBCDIC text from an X9.37 formatted file. For those unfamiliar, this is a file format used in banking that has scanned check images in it, mixed with flatfile data describing the account numbers, dollar amounts, etc. The file format is obfuscated, but straightforward. You have a 4 byte record length field, then that many bytes of EBCDIC text, with one exception: the "52 Record". The first two bytes of data are the record number. Record 52 has 117 bytes of EBCDIC, and the remainder is binary TIFF data. This script has a flag that determines whether or not to ignore the binary TIFF data, or export it to files.
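
The record layout described above can be sketched roughly as follows. This is an illustration in Python, not the posted Perl script. It assumes the 4-byte length is an unsigned big-endian integer and that the text uses EBCDIC code page 037; neither detail is stated in the description, and the name `read_records` is made up for this sketch.

```python
import struct

EBCDIC = "cp037"  # assumption: EBCDIC code page 037


def read_records(data, keep_images=False):
    """Yield (record_type, text, image_bytes) for each length-prefixed record."""
    pos = 0
    while pos + 4 <= len(data):
        # Assumption: the record length is a 4-byte unsigned big-endian integer.
        (length,) = struct.unpack_from(">I", data, pos)
        body = data[pos + 4 : pos + 4 + length]
        pos += 4 + length
        # The first two bytes of every record are the record number.
        rec_type = body[:2].decode(EBCDIC)
        if rec_type == "52":
            # Record 52: 117 bytes of EBCDIC text, the remainder is binary TIFF.
            text = body[:117].decode(EBCDIC)
            image = body[117:] if keep_images else None
        else:
            text, image = body.decode(EBCDIC), None
        yield rec_type, text, image
```

The `keep_images` argument plays the role of the script's flag: when it is false the TIFF bytes are simply dropped.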
LaTeX Abbreviations for Linguists
on Jul 06, 2009 at 18:40 UTC
by Omukawa
This is a script which finds the abbreviations in the glossings and lists them alphabetically. The abbreviations defined by the Leipzig Glossing Rules (LGR) are left out by default. If you want to list the LGR abbreviations too (and their definitions), use the "-lgr" suffix.
Multiple / Mapping Search and Replace
on Mar 17, 2009 at 12:42 UTC
by VinsWorldcom
Ever want to search and replace on many terms without running a SAR routine over and over again for each one? This script searches and replaces text in columns based on a mapfile. The output is a tab-delimited text file.
col-uniq -- remove lines that match on selected column(s)
on Nov 05, 2008 at 23:51 UTC
by graff
This is like the standard unix "uniq" tool to remove lines from an input stream when they match the content of the preceding line, except that the determination of matching content can be limited to specific columns of flat-table data. There's an option to keep just the first matching line or just the last matching line. Note that if the input is not sorted with respect to the column(s) of interest, non-adjacent copies will not be removed (just like with unix "uniq"). (update: note that column delimiters are specifiable by a command-line option)

The code has been updated to avoid the possible "out of memory" condition cited by repellent in the initial reply below.
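
A rough sketch of the idea, in Python rather than the Perl of the posted code: compare only the chosen columns of adjacent lines and keep either the first or the last line of each run. The function name, the zero-based column indexes and the `keep` argument are assumptions made for this illustration, not the script's actual options.

```python
def col_uniq(lines, cols, delim="\t", keep="first"):
    """Yield one line per run of adjacent lines that agree on the given columns."""
    held, held_key = None, None
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        key = tuple(fields[c] for c in cols)  # zero-based column indexes
        if held is not None and key == held_key:
            if keep == "last":
                held = line  # remember the most recent line of the run
            continue
        if held is not None:
            yield held
        held, held_key = line, key
    if held is not None:
        yield held
```

Only one line is held at a time, so memory use stays flat on large inputs. As with unix "uniq", non-adjacent duplicates survive unless the input is sorted on the columns of interest.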

RFC: XML::Pastor v0.52 is released - A REVOLUTIONARY way to deal with XML
on Jun 29, 2008 at 19:23 UTC
by aulusoy

Hello all,

Having just released the first available version (v0.52) of XML::Pastor, I have found this discussion list that might best suit speaking about it.

Let me run the commercial....

Now you don't need to write code in an ugly language like Java in order to be able to get native XML support. Because now, there is XML::Pastor.

XML::Pastor is a revolutionary new way to handle XML documents in Perl.

Give it a try. You won't regret it. Guaranteed (or almost).

In fact, if you are familiar with Java's XML CASTOR, this module (XML::Pastor) will be very familiar to you. XML::Pastor is very similar to Java's Castor, however, as usual with Perl, it's more flexible. On the other hand, full XSD support is not achieved yet (a lot is already supported, see below).

XML::Pastor will actually generate Perl code starting with one or more W3C Schema(s) (XSD). The generated code is as easy, if not easier, to use as XML::Simple. YET (and this is the tricky part), you can also easily read and write to and from an XML document ('instance' of the schema) abiding by the rules of the schema. You can even validate against the original schema before writing the XML document.

However, you don't need the original schema at run-time (unless you are doing run-time code generation). Everything is translated into Perl at code generation time. Your generated classes 'know' about how to read, write, and validate XML data against the rules of the original schema without the actual schema at hand.

Attributes and child elements can be accessed through auto-generated accessors or hash items. You don't need to know or worry about whether or not a child element appears once or multiple times in the XML document. This is automagically taken care of for you. If you access the child element as an array (with a subscript), it is OK. But you don't need to. You might as well access the first such element directly, without needing to know that there are others.

Code can be generated at 'make' time onto disk (the so-called 'offline' mode) or it can be generated and 'eval'ed for you as a big chunk at run-time. Or you can get it as a string ready to be evaled. In 'offline' mode, you can choose to use a 'single' module style (where all code is in one big chunk), or a 'multiple' style, where each class is written to disk in a separate module.

There is also a command line utility, called 'pastorize' that helps you do this from within 'make' files.

Gone with the multiplicity problem of XML::Simple. Gone with the complexity of dealing with XML as XML. Now you can deal with XML data as real Perl objects (with accessors for child elements and attributes). But you can still get XML back out of it at the end of the day.


Most of the W3C XSD Schema structures are supported. A notable exception is substitution sets. Namespace support is currently a bit shaky (at most one namespace per generation). That's why schema 'import' is not yet supported. However, schema 'include' and 'replace' are supported.

All W3C schema builtin types have Perl counterparts with validation. This includes datetime and date types as well (You can get/set these with CPAN's DateTime objects as well).

Internally, XML::Pastor uses XML::LibXML for the actual reading and writing of XML (but not for validation). But you don't need to know anything about XML::LibXML to be able to use XML::Pastor.

Give it a try please. And enjoy...

Note: It's already on CPAN. Just search for XML::Pastor.


Ayhan Ulusoy

on May 30, 2008 at 00:40 UTC
by graff
A basic, no-frills method to extract data from Excel spreadsheet files and print it as tab-delimited lines to STDOUT (e.g. for redirection to *.txt or *.tsv, or for piping to grep, sort or whatever). Works fine on macosx, linux, unix, once you have Spreadsheet::ParseExcel installed. (updated to fix silly tab indenting, and to remove some cruft, and to make my email address less of a spammer target -- also updated the node title and pod to match the name of the script.) Oh, and I should mention: it does the right thing when cells contain unicode characters: output is utf8.
Sol::Parser - a sol file reader
on Dec 18, 2007 at 08:47 UTC
by andreas1234567
A Local Shared Object (LSO), sometimes known as a flash cookie, is a cookie-like data entity used by Adobe Flash Player. LSOs are stored as files on the local file system with the .sol extension. This module reads a Local Shared Object file and returns its content as a list.
Code Typesetter
on Jun 02, 2007 at 23:55 UTC
by Minimiscience
For each filename passed as an argument, this script will produce a PostScript file that neatly typesets the text of the original file, complete with a simple header on each page, page numbers, numbering of every fifth line, & long lines broken up with red hyphens. The primary purpose is for printing source code beautifully (for a given value of "beautiful"), though it will work on any kind of ASCII text document, as long as the file name has an extension & there are no backslashes or parentheses in the filename (otherwise, bad things happen).
Contextualized weighted grammar generator
on Jun 01, 2007 at 03:43 UTC
by raptur
Reads in a directory of corpus files in the format used by the Penn Treebank:

(ADJP (ADV just) (ADJ another))
(NN perl)(NN hacker)))

finds all the grammar rules demonstrated in the corpus files for a given node, contextualized relative to the local left-sister (or mother if -m is passed). Moreover, it counts how many times each rule is observed, and produces a weighted grammar for the given node only. Designed for exploring the sort of information contained in local context in natural language, and whether there are meaningful clusters of context that improve the accuracy of natural language parsing. For earlier work in this area, see:
Johnson, Mark (1998). The effect of alternative tree representations on tree bank grammars. In D.M.W. Powers (ed.) New Methods in Language Processing and Computational Natural Language Learning, ACL, pp. 39-48.

This is a revised title and description for an earlier post, changed in response to others' suggestions. This post is otherwise unchanged.

tlu -- TransLiterate Unicode
on May 06, 2007 at 09:11 UTC
by graff
So many unicode characters (and so many ways to represent them), so little time... So, here's a quick and easy way to convert them all into a consistently readable (or programmable) form of your choice. When faced with really bizarre stuff, the one-character-per-line output can be very instructive.

(I was especially pleased to find that this even works for the "really wide characters" above U+FFFF: the UTF-16 "Surrogate" thing is handled as if by magic, and 'unicore/' covers a lot of that "higher plane" stuff.)

Updated to fix regex error (lines 33 and 34) regarding utf-16be option ("ub"). Also, the "Python escape notation" support was added recently, and now I've added mention of that in the POD.

txt2docbook 2
on Apr 16, 2007 at 19:14 UTC
by Maze
This guesses the semantic structure of a text document, stripping the line endings and guessing where the paragraph breaks and headers should be. Good for processing Project Gutenberg 'plain vanilla ASCII' texts. Version 3 of txt2docbook, modularised and ready for expansion.
txt2docbook 3
on Apr 16, 2007 at 19:08 UTC
by Maze
This guesses the semantic structure of a text document, stripping the line endings and guessing where the paragraph breaks and headers should be. Good for processing Project Gutenberg 'plain vanilla ASCII' texts. This is version 3 of txt2docbook, the obfuscated-beyond-repair one.
Building a data Hierarchy
on Mar 20, 2007 at 11:54 UTC
by SlackBladder
Firstly, most of the code was donated by "Graff"; however, I want to make it available to all as it works very well. This code connects to a SQL Server machine, extracts the data via a query and works out hierarchical data structures (if any exist) within a dataset of two columns of data. It's pretty fast (currently it extracts over 7000 records from a SQL Server system and builds the hierarchy in around 3 to 4 seconds). I have replaced company-specific data.
file-search: File search with minimum horizontal scrolling of output
on Oct 08, 2006 at 04:51 UTC
by parv

(From POD) The main reason for existence of this program is to minimize horizontal scrolling by displaying the file name only once (on a line of its own) before display of the lines matched, and by collapsing tabs and multiple spaces. Other reasons are to strip the current directory from the file name paths, and to have matched text highlighted.

Updated version can be found in as file-search-\d+[.]\d{2}

unichist -- count/summarize characters in data
on Sep 19, 2006 at 07:02 UTC
by graff
A nice tool for getting to know your unicode or other non-ASCII text data. For all files named in @ARGV, or all data piped to STDIN, we count and report how many times each character occurs, and optionally summarize the quantities according to the current standard set of Unicode "charts" (groupings of characters according to language or function). Also lets you convert legacy encoded data to utf8 on input. Check the POD for details.

(updated to fix a minor glitch involving code points gt "\x{2FA1F}")
(second update: added logic to check for malformed utf8 and adjust output if needed) (third update: major revision (thanks to Jim) to use unicore/Blocks.txt instead of __DATA__; also added "--names" option, changed "--chart" option to "--blocks", and altered spacing of output table columns)
on Aug 26, 2006 at 02:10 UTC
by tcf03
I needed to parse logs from a Barracuda spam filter, so I wrote this. It's fairly simple, and so far seems to be doing well. It was written for Barracuda's "new" logging method. The hardest part was typing in the action/reason codes.
Massage the driving directions (in USA)
on Jun 18, 2006 at 17:30 UTC
by parv

This program massages the driving directions (in USA) for plain text printing (preferably in monospace font, w/ a blank line between each non empty line), with these priorities (in no particular order) ...

  • up-case road names, exit numbers, places
  • low-case everything else
  • expand n, w, e, s to north, west, east, south
  • shorten road type (like mailing address), e.g. 'road' to 'rd', 'lane' to 'ln', etc.
  • remove the annoying 'go' from 'go <this much>'

Edit a List of Files
on Apr 10, 2006 at 20:15 UTC
by lev36

With much help from answers provided by fellow monks to my previous questions, I have come up with a script (my first real script, actually!) that will read a list of filenames and then apply a series of edits to the files. In the process, it will back the originals up in a tarball.

If any of you are willing to look over my code, I'd appreciate feedback. Any lines stand out as the wrong way to do things? Are there things that could be written more elegantly? Can error handling be improved? I'm quite new at Perl, so I wouldn't be surprised if there are things I could have done better.

So here's the script, which I've called fled for 'file list edit'.

The reason I'm not simply using perl -pi.bak -e 's|foo|bar|g' files is that these files are not necessarily all in the same directory, and besides, the in-place edit feature of Perl changes the owner of the edited file to whoever runs it. This script will maintain the same owners on the original files.

Since it runs from a list of files, that list can be generated by grep, or tcgrep, or however.

Anyway, here it is:

pi generation
on Jan 29, 2006 at 03:09 UTC
by Pete_I
Calculates pi accurately to $ARGV[0] digits.
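
The description does not say which method the script uses. One common way to produce digits of pi with plain integer arithmetic is Gibbons' unbounded spigot algorithm, sketched here in Python purely as an illustration of the task; it is a swapped-in technique, not the posted code.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: absorb one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def pi_to(count):
    """Return the first `count` digits of pi as a list of ints."""
    return list(islice(pi_digits(), count))
```

Because it uses only arbitrary-precision integers, there is no floating-point rounding to worry about; the cost is that the integers grow as more digits are requested.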
OpenOffice document with all possible fonts
on Jul 20, 2005 at 07:44 UTC
by oakbox
I use OpenOffice a lot in my work. Generating XML files is, if not a snap, at least not too difficult, and from those XML files I get a whole universe of file types to export to.

This bit of code is useful for two reasons: 1) It does something useful. 2) If you have not fiddled around with OO, this is a chance to see how easy it is.

One issue I run into regularly is "What fonts can I use here?" You see, while OpenOffice might render something beautifully to the screen, it's pretty much a crap shoot on how those fonts will look in a PDF export, or how MS Word will render things on screen. For the *most* part, everything works just fine, but errors do creep in.

So, with this in mind, I wrote a little test script. This program:

  1. Looks in your OpenOffice settings (we have to assume you have OO installed).
  2. Opens the pspfontcache file, which holds all of the fonts that OO has found on your system.
  3. Parses pspfontcache and grabs the names of the fonts.
  4. Generates a simple content.xml file showing each font in action.
  5. Zips the results into a .sxw file called 'fonttest.sxw'.

You can open fonttest.sxw in OpenOffice and print it, export it to PDF, email a copy of it to your designer and say "You can only use these fonts", etc.

This code has been tested only in Linux.

UPDATE: add comments, add code that searches for most recent version of OO.

on Jul 16, 2005 at 20:18 UTC
by jdporter

See embedded pod.

This module was inspired by Array - Reading frame problem, and was written with the intent of solving that problem directly.

Understanding how this works is greatly aided by understanding substr.

serialise/unserialising data structures
on May 29, 2005 at 12:51 UTC
by monoxide
These two functions will serialise (bEncode) or unserialise (bDecode) data. They are based on 'Bencoding', which was designed as part of the bittorrent protocol but seems to be a well designed way of doing things. This is a Perl implementation of this encoding. Below is a description of how this encoding works; it covers everything from scalars to hashes very nicely. I know this is _VERY_ poorly commented, but if someone has some spare time and wants to comment these 100 odd lines, it would be appreciated :).

* Strings are length-prefixed base ten followed by a colon and the string. For example 4:spam corresponds to 'spam'.
* Integers are represented by an 'i' followed by the number in base 10 followed by an 'e'. For example i3e corresponds to 3 and i-3e corresponds to -3. Integers have no size limitation. i-0e is invalid. All encodings with a leading zero, such as i03e , are invalid, other than i0e , which of course corresponds to 0.
* Lists are encoded as an 'l' followed by their elements (also bencoded) followed by an 'e'. For example l4:spam4:eggse corresponds to ['spam', 'eggs'].
* Dictionaries are encoded as a 'd' followed by a list of alternating keys and their corresponding values followed by an 'e'. For example, d3:cow3:moo4:spam4:eggse corresponds to {'cow' => 'moo', 'spam' => 'eggs'} and d4:spaml1:a1:bee corresponds to {'spam' => ['a', 'b']} . Keys must be strings and appear in sorted order (sorted as raw strings, not alphanumerics).
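
The four rules above can be sketched as a small encoder/decoder pair. This is an illustration in Python, not the posted bEncode/bDecode functions; the names `bencode` and `bdecode` are chosen for this sketch. It works on text strings and counts characters, whereas bencoding proper counts bytes, so treat it as a sketch for ASCII data only.

```python
def bencode(value):
    """Encode an int, str, list or dict (with str keys) as a bencoded string."""
    if isinstance(value, int):
        return f"i{value}e"
    if isinstance(value, str):
        return f"{len(value)}:{value}"
    if isinstance(value, list):
        return "l" + "".join(bencode(v) for v in value) + "e"
    if isinstance(value, dict):
        # Keys must be strings and appear in sorted order.
        return "d" + "".join(bencode(k) + bencode(value[k]) for k in sorted(value)) + "e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def bdecode(text):
    """Decode a complete bencoded string back into Python data."""
    value, pos = _decode(text, 0)
    if pos != len(text):
        raise ValueError("trailing data after bencoded value")
    return value


def _decode(text, pos):
    ch = text[pos]
    if ch == "i":
        end = text.index("e", pos)
        digits = text[pos + 1 : end]
        bare = digits.lstrip("-")
        # i-0e and leading zeros (other than i0e itself) are invalid.
        if digits == "-0" or (len(bare) > 1 and bare[0] == "0"):
            raise ValueError(f"invalid integer {digits!r}")
        return int(digits), end + 1
    if ch == "l":
        items, pos = [], pos + 1
        while text[pos] != "e":
            item, pos = _decode(text, pos)
            items.append(item)
        return items, pos + 1
    if ch == "d":
        result, pos = {}, pos + 1
        while text[pos] != "e":
            key, pos = _decode(text, pos)
            val, pos = _decode(text, pos)
            result[key] = val
        return result, pos + 1
    # Otherwise it must be a length-prefixed string.
    colon = text.index(":", pos)
    start = colon + 1
    length = int(text[pos:colon])
    return text[start : start + length], start + length
```

The decoder returns the parsed value together with the position where parsing stopped, which is what lets lists and dictionaries nest without any lookahead.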
on Apr 09, 2005 at 20:28 UTC
by Rudif
hugepad is a viewer for huge text files, where huge is up to 500 MB. Advice and help from perlmonks BrowserUk and zentara is gladly acknowledged. See also recent threads Displaying/buffering huge text files and perl Tk::Scrolled : how do I hijack existing subwidget bindings?.
Format text that mixes English and Chinese characters
on Feb 20, 2005 at 04:18 UTC
by Qiang
We translate lots of nice English Perl articles into Chinese. Many articles are submitted badly formatted, especially where Chinese characters and English are squeezed together.

This is a quick hack that adds one space to separate English words/digits from Chinese (before and after Chinese characters), so that instead of {chinese}etc{chinese} there will be {chinese} etc {chinese}.
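
The spacing rule can be sketched with two substitutions. This is an illustration in Python, not the posted hack; it assumes "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and "English words/digits" means ASCII letters and digits, which may be narrower than what the script handles.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
CJK = r"\u4e00-\u9fff"


def space_out(text):
    """Insert one space wherever Chinese and ASCII alphanumerics touch."""
    text = re.sub(rf"([{CJK}])([A-Za-z0-9])", r"\1 \2", text)
    text = re.sub(rf"([A-Za-z0-9])([{CJK}])", r"\1 \2", text)
    return text
```

Text that is already spaced is left alone, because the patterns only match when the two kinds of character are directly adjacent.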

cgrep: Egrep clone with function name display
on Feb 02, 2005 at 12:48 UTC
by ambrus

Important: You can download the actual newest version of this program from That version has recursive search in a directory tree and other useful stuff. I won't update the old version here anymore.

This is an egrep clone that can display the name of the function in which the matching line is in, together with the line. This feature is already present in gnu diff, and I found it very useful for diffing C code, so I wrote this to have it in grep too. Use the -p or -P switch to enable the feature.

This uses Perl regexps instead of POSIX regexps, and is much slower than the real egrep.

Update: there appears to be a bug: cgrep sometimes does not print the -- delimiters in context mode. Update 2: More precisely, there's no delimiter printed between chunks in different files. Also YuckFoo has noted that the -B switch does not work. I'll post a patch asap.

Update: these bugs (and one more undocumented bug) have been fixed. You see the new code here, and I'll post the diff from the old code to the new code as a reply.

Update 2006 dec 17: see also Top 10 reasons to start using ack.

Update 2007 jun 5: see also diotalevi's grep.

Update 2007 aug 29: the -q option is buggy. I will fix it in the version at but not in the older version in this node. The bug is in the found_exit_zero function.

Update 2008 nov 14: see also peg - Perl _expression_ (GNU) grep script.

on Dec 15, 2004 at 14:36 UTC
by spacepony
This is a proposed module that I'd like to submit to CPAN as Spreadsheet::ParseCSV. It works in a similar fashion to Spreadsheet::ParseExcel and is unlike Text::CSV since it takes a file-oriented approach. You instantiate a parser object with a file passed to the constructor and then read it row by row, rather than parsing a line at a time, which allows for rows that contain line breaks.
But... SimpleTemplate really is.
on Dec 15, 2004 at 06:23 UTC
by sauoq

Yeah... sigh... it's another templating system. I know CPAN is a bit overrun with them. (Just try to eke out a reasonable namespace for another templating module.) But, I've found it to be so useful over the years I can't not share it, so I'm sharing it here. I've cleaned it up a bit and fleshed out the documentation for public consumption, but it started as a quickie hack, so forgive me if it still seems like one.

I've always called it Local::SimpleTemplate. I've removed the "Local" for posting here. If you don't like the name for any reason feel free to change it; it's just a matter of changing the package line.

Originally, I just needed a replacement for a brittle little shell script that generated a bunch of config files from here documents and shell variables. I broke it up into a bunch of separate but similar perl scripts, one for each config file with the config's template in the __DATA__ section of the script. Each script read key/value pairs from stdin and substituted corresponding tags in the template with the input values. That little bit of history will explain this module's quirky (but surprisingly useful) default behavior.

The included POD documentation should be sufficient to learn how it works, but if you have questions please let me know. If you find any bugs, please let me know about those too. It has worked for me in a variety of uses but I did make a few tweaks before posting and may not have tested completely.
cli lookup
on Dec 11, 2004 at 06:54 UTC
by diotalevi
Convenient access to from emacs, your shell, and if you like, other programs.
on Sep 18, 2004 at 23:48 UTC
by TheEnigma
This is a program that performs what I think is called Markov Chaining of a text, on a letter by letter basis. This is based on something I read in Byte about 20 years ago; they called their program Travesty.

I've seen other programs that do this on a word basis, for instance one in the Text Processing section of the Code Catacombs, but I haven't seen many that do it on a letter by letter basis.

The comments at the beginning of the program should give good enough instructions on how to use it.

I am planning to keep improving this program, and any and all comments|suggestions|critiques are welcome.
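
Letter-level Markov chaining can be sketched like this. It is an illustration in Python, not the posted program; the `order` parameter (how many preceding letters form the context) and the function names are assumptions made for this sketch.

```python
import random
from collections import defaultdict


def build_table(text, order=2):
    """Map each run of `order` letters to the letters seen right after it."""
    table = defaultdict(list)
    for i in range(len(text) - order):
        table[text[i : i + order]].append(text[i + order])
    return table


def travesty(text, order=2, length=80, rng=random):
    """Generate up to `length` letters that imitate the input's letter statistics."""
    table = build_table(text, order)
    out = text[:order]  # seed with the opening letters of the input
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:
            break  # dead end: this context only occurs at the very end of the input
        out += rng.choice(followers)
    return out
```

Keeping repeated followers in the list means `rng.choice` picks each next letter with the same frequency it has in the source text. A larger `order` gives output that looks more like the original.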

on Sep 06, 2004 at 01:42 UTC
by quartertone
I always look at my Apache server log files from the command line. It always bothered me to see "GET /robots.txt" contaminating the logs. It was frustrating trying to visually determine which were crawlers and which were actual users. So I wrote this little utility, which filters out requests made from IP addresses that grab "robots.txt". I suspect there are GUI log parsers that might provide the same functionality, but 1) I don't need something that heavy, 2) I like to code, 3) imageekwhaddyawant.
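
The filtering idea takes two passes: first collect every address that fetched robots.txt, then drop all lines from those addresses. A minimal sketch in Python (not the posted utility), assuming the client address is the first space-separated field of each line, as in Apache's common and combined log formats:

```python
def drop_robot_hits(lines):
    """Return the log lines whose client address never requested /robots.txt."""
    lines = list(lines)  # two passes are needed, so keep the lines around

    def client(line):
        # Assumption: the client address is the first space-separated field.
        return line.split(" ", 1)[0]

    robots = {client(line) for line in lines if "GET /robots.txt" in line}
    return [line for line in lines if client(line) not in robots]
```

A crawler that never asks for robots.txt will still get through, which is a limitation of the approach rather than of the sketch.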
on Sep 02, 2004 at 07:40 UTC
by rhythmicus

A simple script to create 'linked' text files for viewing on an iPod. To those who don't know, the iPod imposes a 4k limit on text files viewed on-screen. The script will take files that exceed 4k and create new ones with a filename in the form of file.1, file.2, etc. The width of the numbered extension will grow if needed (i.e. file.001 .. file.100). The files are also linked for easy navigation.

DO NOT USE ON ORIGINALS unless you want to piece the file back together.
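
The splitting and numbering can be sketched as below, in Python rather than the script's Perl. It assumes "4k" means 4096 bytes and leaves out the navigation links entirely, since the description does not say how they are written; a real version would also need to reserve room for them in each piece.

```python
def split_for_ipod(name, data, limit=4096):
    """Split bytes into pieces of at most `limit` bytes, keyed by numbered file name."""
    chunks = [data[i : i + limit] for i in range(0, len(data), limit)]
    # Pad the numeric extension so that all names share the same width.
    width = len(str(len(chunks)))
    return {f"{name}.{n:0{width}d}": chunk for n, chunk in enumerate(chunks, 1)}
```

Cutting at a fixed byte offset can split a word or a multi-byte character in two, so a more careful version would back up to the previous whitespace.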

on Apr 29, 2004 at 21:41 UTC
by japhy
Yet Another Anagram Finder.
A little script that combines `head' and `tail' utilities
on Apr 29, 2004 at 12:28 UTC
by cosimo
This little utility extracts the "body" from some text given on STDIN. It does What You Mean(tm), given that you already know the `head' and `tail' command line utilities. I wrote it to parse huge (>1Gb) PostgreSQL database dumps and to extract only a single table's schema from them. Hope you will find it useful...
text munging utility
on Jan 26, 2004 at 16:28 UTC
by robobunny
This is a small tool I use to do some common operations on formatted lines of data, usually in a pipe with sort/uniq/grep/etc. By default, it will split the lines into columns separated by whitespace, but you can provide an alternate column separator or a regex.

Here's an example. I occasionally want to get the host name for all the IP's in my http access log:
ax -u 0 -c '0 nr ^10.0' -e 'host \0'
The "-u 0" says to skip any lines in which the value of column 0 has been seen before. "-c ..." says to only examine lines where column 0 doesn't match the regex ^10. (my private IPs). "-e" says to run the given command for each line that passes the filters. \0 refers to column 0.

Of course, you could do the same thing using cut, grep, sort and xargs. Run with "-h" for a list of arguments.
Oracle Alert Log Monitor
on Oct 15, 2003 at 01:15 UTC
by Lhamo Latso

I looked around for some alert log parsers, but none seemed to be out there for public consumption. If anyone knows of a better way for me to approach this, I will be grateful for any comments. Also, if anyone knows better techniques to code in Perl, I will be happy to hear about it. The code is my own, except for techniques lifted from a few O'Reilly books.

Script to monitor Oracle alert logs. Output is to stdout, or to a mail address, depending on the parms.

log days - defaults to look one day back.
mail address - if not set, output is printed, otherwise mailed.

exclude_list is a list of oracle error numbers to
exclude from printing. But, this exclusion only takes
effect if all oracle errors in the stanza are in the exclude list.

The totals printed at the bottom include all oracle errors found.

dict-compare: a dictionary evaluation script
on Sep 03, 2003 at 18:07 UTC
by allolex

As promised in Constructive criticism of a dictionary / text comparison script, here is a cleaned-up version of the dictionary comparison code that the monks helped me with.

What follows I just copied out of the POD in the script itself. It might make this code easier to find.

A generic script for building dictionaries by comparing them to real-world texts.

This program compares the words in a given text file to a list of words from a dictionary file. It is capable of outputting lists of words that occur or do not occur in a given dictionary file, along with their frequency in the text. Debugging output using token tag marks is also available.

RTF diff
on Aug 28, 2003 at 10:52 UTC
by sheriff
If you're working with RTF, sometimes you'll want to compare two RTF files to see if they're different. Traditional diff falls down here, because RTF can have all sorts of crazy whitespace, some of which is significant, some of which isn't. rtfdiff, below, rasterizes the token streams from two rtf files, and then diffs those, allowing you to easily see if two rtf files are the same :-) Tada!
multiline inplace edit
on Aug 07, 2003 at 01:09 UTC
by meonkeys
Edit files in-place using a single regex (regular expression) that will be applied to the entire file as a string. Useful for applying multiline regexes and avoiding shell redirection and backup files. Also, quite dangerous!

Call like so:  ./ <regex> <files>...


./ 's/your([\n\s]*?)friend/joe${1}${2}/ms' foo.txt

Please let me know if there are easier/more elegant ways to do this, especially as a one-liner. When making suggestions, please keep in mind that my goals were to provide:

  • complete power of search and replace with Perl regexes
  • in-place file editing on multiple files
This script is mainly for Perl-savvy users or those familiar with Perl-compatible regular expressions.
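
The core idea, reading each file whole so that one regex can match across line breaks and then writing the result back, can be sketched as follows. This is an illustration in Python, so it uses Python's `re` syntax rather than Perl's; the function name and its arguments are assumptions for this sketch.

```python
import re
from pathlib import Path


def edit_in_place(pattern, replacement, paths, flags=re.MULTILINE | re.DOTALL):
    """Apply one substitution to the full contents of each file, in place."""
    regex = re.compile(pattern, flags)
    for path in map(Path, paths):
        original = path.read_text()
        updated = regex.sub(replacement, original)
        if updated != original:
            # No backup is kept, which is exactly why this is dangerous.
            path.write_text(updated)
```

For example, `edit_in_place(r"your(\s*?)friend", r"joe\1", ["foo.txt"])` would replace "your", any whitespace including newlines, and "friend" with "joe" plus that same whitespace.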
on Jul 31, 2003 at 18:36 UTC
by hash
Code useful for editing files automatically, removing the last "." in the selected lines.
Plaintext monthly journal generator
on Jul 20, 2003 at 00:31 UTC
by Dragonfly
I've used this script under both Win32 and OpenBSD to generate a plain-text file that I use for keeping a little journal of my thoughts.

I wrote it a long time ago (it's written in baby Perl) but have been using it monthly to make my journals, which are then really easy to edit and archive. I like keeping them in plain text format because then the entries are easy to cut and paste into other applications (emails, HTML forms, word processors, etc) without having to start a gigantic program or be online, etc.

I also like it because it simply writes a date for each day of the current month, like "Monday, July 14, 2003", with a few line breaks thrown in. That way, I can write down what I worked on that day, keep little notes or code snippets, lyrics, and so forth, and easily go back and review my month. And cross-platform date handling is a little trickier than I had initially expected, so I learned some things writing it, too.

Anyway, I know it isn't fancy, but since I use it every month, I figure somebody else out there might.

Output XML::Writer to scalar
on May 02, 2003 at 14:05 UTC
by jeffa
From time to time i have found myself wanting to capture the output from XML::Writer to a scalar instead of a file handle. It is simple enough, but i thought i would share here. Since XML::Writer expects an IO::Handle module for its OUTPUT parameter, pass it an IO::Scalar. At the end, we can get a hold of that handle via XML::Writer::getOutput().
search/browse freebsd ports INDEX*
on Jan 18, 2003 at 04:56 UTC
by parv

Parse-index.perl eases searching & browsing of FreeBSD Ports INDEX* (without make(1) and without the restriction of being in /usr/ports) with help of Perl regular expressions.

this program uses the home made (not included in this post).

one may need to adjust the modules in use lib 'modules' as appropriate.

the program itself is also available from...

Fixed a regex bug, in version 0.025, which would have matched for the wrong reasons. Code below and at above URL has been updated.

eye on procmail log
on Dec 28, 2002 at 09:07 UTC
by parv

From_, Subject:, and folder name (see LOGABSTRACT in procmailrc(5)) are printed to stderr, and interesting messages other than as specified by tell_skip_regex() to current output file descriptor, (stdout is default?).

this program is unsuited to actually debug the recipes. to debug, consult your actual recipes & full verbose procmail log. consult procmail mailing list, and various man & web pages for more help & information.

it is also available from...

Bitwise Operations for Template Toolkit
on Dec 01, 2002 at 09:33 UTC
by rob_au
The following two patches can be used to add bitwise AND and XOR operators for use in Template Toolkit templates using the & and ^ tokens respectively. These patches are to be applied to the parser/ and parser/Parser.yp files in the Template Toolkit source. Following the application of these patches, the parser/yc script will need to be run to merge these changes into the installation files - When doing this, expect an inconsequential shift/reduce conflict and reduce/reduce conflict.

Bitwise OR and NOT operations have not been incorporated, as the corresponding tokens, | and ! respectively, have already been assigned roles within the Template Toolkit grammar.

Alternatively a complete patch for Template Toolkit can be found here for application on the version 2.08 Template Toolkit source.

per - selects every Nth line
on Nov 23, 2002 at 23:17 UTC
by jkahn
For how many tasks have you wanted to use a sampling of every Nth line of a file?
  • selecting a "random" subset before running on all five million lines
  • getting a flavor of what's in a line-oriented database
  • holding out test data

Well, for me, it's nearly every line-based text-processing tool I write -- if it's not a standard requirement, it's usually much more informative to test on every 50th line of my test corpus than it is to use the first 50 lines for test data.

In fact, I find it very frustrating that there's no Unix power tool a la grep or tail that does this.

So, per is an addition to the Unix power-tool library -- it's sort of like head or tail except that it takes every Nth line instead of the first or last N. Save it as ~/bin/per (or /usr/bin/per) and use it every day, like me.

Windows users can run pl2bat on this and put it somewhere in their path -- my NT box happily uses a variant of this.

Usage info is in POD, in the script. But here it is in HTML anyway (I love pod2html):


per - return one line per N lines


  per -oOFFSET -N files
  per -90 -o2 file.txt  # every 90th line starting with line 2
  per -o500 -3 file.txt # every 3rd line starting with line 500
  per -o1 -2 file.txt   # every other line, starting with the first
  per -2 file.txt       # same as above

It can also read from STDIN, for pipelining:

  tail -5000 bigfile.txt | per -100 # show every 100th line for the
                                    # last 5000 in the file


per writes every Nth line, starting with OFFSET, to STDOUT.


The integer value N provided (e.g. -50, -2) is used to decide which lines to return -- every Nth.

The value OFFSET provided says how far down in the input to proceed before beginning. The output will begin at line number OFFSET. Default is 1.


Note that per works on files specified on the commandline, or on STDIN if no files are provided. The special input file - indicates that remaining data should be read from STDIN.
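
The selection rule described above can be sketched in a few lines of Perl. This is only an illustration of the documented behaviour (every Nth line, beginning at line OFFSET), not jkahn's script; the sub name per_filter and its argument order are made up for the example.

```perl
use strict;
use warnings;

# Copy every $n-th line from $in to $out, beginning with line number
# $offset (1-based, default 1). Hypothetical helper, not the real per.
sub per_filter {
    my ($in, $out, $n, $offset) = @_;
    $offset = 1 unless defined $offset;
    die "N must be a positive integer\n" unless $n =~ /^[1-9]\d*$/;

    my $lineno = 0;
    while (my $line = <$in>) {
        $lineno++;
        next if $lineno < $offset;                       # not there yet
        print {$out} $line if ($lineno - $offset) % $n == 0;
    }
    return;
}
```

Called as per_filter(\*STDIN, \*STDOUT, 100), it would behave like the pipeline example in the synopsis.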

Approximate Matching w/o C
on Oct 14, 2002 at 22:34 UTC
by nothingmuch
The Manber-Wu algorithm implemented in Perl. This subroutine generates approximate matchers, pre-initialized for a certain pattern. You then pass the strings to match against to the returned anonymous sub. A Perl-only alternative to String::Approx.
Parse Xerox Metacode printer data
on Oct 07, 2002 at 01:44 UTC
by diotalevi

This provides a text parse of the previously undocumented and proprietary Xerox Metacode printer file format. This isn't complete but it represents what I've been able to glean after reverse engineering some sample documents.

on Oct 04, 2002 at 21:46 UTC
by seattlejohn
Create an object that will pretty-format numbers using SI prefixes, your preferred truncation behavior, and so on. For example, return a number like '12' as '12 bytes' and a number like 1048976 as '1 MB'.
Sendmail pairs
on Sep 14, 2002 at 19:13 UTC
by Limbic~Region
Builds a hash whose index is a "from" "to" pair, and then increments it every time the pair is encountered. Finally, the information is sorted displaying the highest pairs first. Tested on HPUX 11.0 running Sendmail 8.9.3.
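
The counting step can be sketched as below. How the "from" and "to" addresses are pulled out of the Sendmail log is not shown in the description, so this example assumes the pairs have already been extracted; count_pairs is a hypothetical name.

```perl
use strict;
use warnings;

# Count how often each [from, to] pair occurs and return
# [from, to, count] triples, highest counts first.
# Assumes the addresses contain no NUL bytes (used here as key separator).
sub count_pairs {
    my @pairs = @_;
    my %seen;
    $seen{ join "\0", @$_ }++ for @pairs;
    return map  { [ split(/\0/, $_), $seen{$_} ] }
           sort { $seen{$b} <=> $seen{$a} || $a cmp $b }
           keys %seen;
}
```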
Parse::Report - parse Perl format-ed reports.
on Aug 14, 2002 at 00:20 UTC
by osfameron
After reading this question about a "generic report parser", I got interested in the idea. The question itself has been bizarrely (I think) downvoted, as it's an interesting topic. I've gone for the approach of parsing Perl's native format strings.

This is a very early version of this code, and it can probably be done better (e.g. could all the code be incorporated into the regex?!). I've made no attempt to parse number formats, and the piecing together of multiline text is unsophisticated (e.g. no attention to hyphenation), but it's a start.

on Jul 29, 2002 at 08:54 UTC
by DamnDirtyApe
I wrote this module because I wanted a super-simple way to separate my SQL queries from my application code.
on Jul 29, 2002 at 05:44 UTC
by DamnDirtyApe
This program generates a skeleton LaTeX file from a simple text file. This allows a large document to be `prototyped', with the LaTeX tags later generated automatically. See the POD for a more detailed explanation.
on Jul 28, 2002 at 21:51 UTC
by vxp
This is a little something that parses an mbox file and grabs email address out of it (I use it at work to parse a bounce file and grab email addresses out of it for various purposes). Feel free to modify it, use it, whatever. (Credit info: this was actually not written by me, but by the previous network admin)
on Jul 22, 2002 at 23:28 UTC
by ignatz
Takes in a LOH (List of Hashes) and an array of keys to sort by and returns a new, sorted LOH. This module closely relates to Sort::Fields in terms of its interface and how it does things. One of its main differences is that it is OO, so one can create a Sort::LOH object and perform multiple sorts on it.

Comments and harsh criticism are most welcome. I tried to find something here or on CPAN that did this, but the closest that I got was Sort::Fields. Close, but no cigar. Perhaps there is some simple way to do this with a one-liner. Even so, it was fun and educational to write.

Snort IDS signature parser
on Jun 24, 2002 at 00:50 UTC
by semio
I wanted to obtain a list of all enabled signatures on a Snort IDS e.g. a listing of sigs contained in all .rules files as well as some general information for each, such as the signature id and signature revision number. I created one large file on the IDS called allrules and wrote this script to present each signature, in a comma-delimited format, as msg, signature id, signature revision number.
Pod::Tree dump for the stealing
on May 31, 2002 at 15:03 UTC
by crazyinsomniac
ever use Pod::Tree; ?
ever my $tree = new Pod::Tree; ?
ever $tree->load_file(__FILE__); ?
ever print $tree->dump; ?
Wanna do it yourself?
Here is good skeleton code (care of the Pod::Tree authors)
Annotating Files
on Apr 17, 2002 at 20:32 UTC
by stephen

I was writing up a list of comments on someone's code, and got tired of retyping the line number and filename over and over again. Also, I liked to skip around a bit in the files, but wanted to keep my annotations sorted.

So for starters, I wrote a little XEmacs LISP to automatically add my annotations to a buffer called 'Annotations'. It would ask me for the comment in the minibuffer, then write the whole thing, so that I could keep working without even having to switch screens. I bound it to a key so I could do it repeatedly. Pretty basic stuff.


  (defun add-note ()
    "Adds an annotation to the 'annotations' buffer"
    (interactive)
    (save-excursion
      (let ((annotate-comment (read-from-minibuffer "Comment: "))
            (annotate-buffer (buffer-name))
            (annotate-line (number-to-string (line-number))))
        (set-buffer (get-buffer-create "annotations"))
        (goto-char (point-max))
        (insert-string (concat annotate-buffer ":" annotate-line " "
                               annotate-comment "\n")))))

  (global-set-key "\C-ca" `add-note)

This would generate a bunch of annotations like this:

  comment_tmpl.tt2:1 This would be more readable if I turned on Template's space-stripping options.
  More informative error message would probably be good.
  Need a better explanation of data structure.
  annotate.el:1 Should properly be in a mode...
  annotate.el:11 Should be configurable variable
  annotate.el:13 Formatting should be configurable in variable
  annotate.el:11 Should automatically make "annotations" visible if it isn't already
  annotate.el:21 Control-c keys are supposed to be for mode-specifics...

Next, I wanted to format my annotations so I could post them here in some kind of HTML format. So I wrote a little text processor to take my annotations, parse them, and format the result in HTML. This was not difficult, since most of the heavy lifting was done by the Template module.

Here's a standard template file... pretty ugly, really, but you can define your own without changing the code...

  [% FOREACH file = files %][% FOREACH line = file.lines %]
  <dt>[% %]</dt>
  <dd><b>line [% line.number %]</b>
  <ul>[% FOREACH comment = line.comments %]
  <li>[% comment %]</li>
  [% END %]</ul>
  </dd>
  [% END %][% END %]

Alternatively, I could have had my XEmacs function output XML and used XSLT. Six of one, half a dozen of the other... Plus, one could write a template file to translate annotations into an XML format.

The Output

line 1
  • Should properly be in a mode...
line 11
  • Should be configurable variable
  • Should automatically make "annotations" visible if it isn't already
line 13
  • Formatting should be configurable in variable
line 21
  • Control-c keys are supposed to be for mode-specifics...
line 31
  • More informative error message would probably be good.
line 71
  • Need a better explanation of data structure.
line 1
  • This would be more readable if I turned on Template's space-stripping options.
Shortcuts Engine: Packaged Version
on Apr 06, 2002 at 21:42 UTC
by munchie
This is the shortcuts engine I made, now put into packaged format. It is more portable and offers more flexibility. I have never submitted anything to CPAN, so I want fellow monks' opinions on whether or not this module is ready for submission to CPAN.

This module requires Text::xSV by tilly. If you have any suggestions on making this better, please speak up.

UPDATE 1: Took out the 3Arg open statements for slightly longer 2Args, to make sure that older versions of Perl will like my program. (thanks to crazyinsomniac)

UPDATE 2: I just uploaded this on PAUSE, so very soon you'll all be able to get it on CPAN! (My first module!)

Shortcuts engine for note taking
on Apr 02, 2002 at 20:42 UTC
by munchie
This code allows the user to set shortcuts (a character surrounded in brackets) to allow for fast note taking/document writing. (Thank you tilly for the awesome Text::xSV module!)
Yet another code counter
on Mar 01, 2002 at 14:27 UTC
by rinceWind
Here is a code counter that can handle languages other than Perl. It was required for sizing a rewrite project, and gave some useful metrics on code quality as a by-product.

It is easy to add other languages by populating the hashes %open_comment and %close_comment, and/or %line_comment.

The code counts pod as comment, but bear in mind that this script was not primarily designed for counting Perl.

on Jan 07, 2002 at 09:24 UTC
by seattlejohn
When working on large-ish projects, I've sometimes found it gets to be a real pain to manage all the error messages, status messages, and so on that end up getting scattered throughout my code. I wrote MessageLibrary to provide a simple OO way of generating messages from a centralized list of alternatives, so you can keep everything in one easy-to-maintain place.
Matching in huge files
on Dec 02, 2001 at 03:13 UTC
by dws
A demonstration of how to grep through huge files using a sliding window (buffer) technique. The code below has rough edges, but works for simple regular expression fragments. Treat it as a starting point.

I've seen this done somewhere before, but couldn't find a working example, so I whipped this one up. A pointer to a more authoritative version will be appreciated.
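
For readers who want the general shape of the sliding window idea, here is a minimal sketch. It is not dws's code. It assumes no match is longer than $overlap + 1 bytes, and it shares the rough edge that an open-ended pattern (such as /a+/) can be cut short at a chunk boundary. The name grep_window and its parameters are invented for the example.

```perl
use strict;
use warnings;

# Scan $fh in $chunk-byte reads, keeping the last $overlap bytes of the
# buffer between reads so that a match straddling two chunks is still seen.
# Returns a list of [absolute_offset, matched_text] pairs.
sub grep_window {
    my ($fh, $re, $chunk, $overlap) = @_;
    my $buf     = '';
    my $base    = 0;   # file offset of the first byte held in $buf
    my $next_ok = 0;   # matches starting before this offset were already reported
    my @hits;

    while (read($fh, my $block, $chunk)) {
        $buf .= $block;
        while ($buf =~ /$re/g) {
            my $start = $base + $-[0];
            next if $start < $next_ok;          # seen in the previous window
            push @hits, [ $start, substr($buf, $-[0], $+[0] - $-[0]) ];
            $next_ok = $base + $+[0];
        }
        if (length($buf) > $overlap) {          # slide the window forward
            $base += length($buf) - $overlap;
            $buf   = substr($buf, -$overlap);
        }
    }
    return @hits;
}
```

Memory use stays bounded by roughly $chunk + $overlap bytes, which is the point of the technique for files too large to slurp.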

Regexp::Graph 0.01
on Nov 24, 2001 at 23:11 UTC
by converter

Regexp::Graph provides methods for displaying the results of regular expression matches using a "decorated" copy of the original string, with various format-specific display attributes used to indicate the portion of the string matched, substrings captured, and for global pattern matches, the position where the pattern will resume on the next match.

This module encapsulates a regular expression pattern and a string against which the pattern is to be matched.

The `regshell' program (included with this distribution) demonstrates the use of the ANSI formatter module, and provides a handy tool for testing regular expressions and displaying the results. Other format modules are in the works, including a module to output text with HTML/CSS embedded styles.

regshell includes support for readline and history, and can save/load pattern/string pairs to disk files for re-use.

NOTE: I have not been testing this code on win32. The regshell program is not strict about paying attention to which terminal attributes are set, so it may go braindead on win32. I'll pay more attention to win32 on the next revision.
on Oct 02, 2001 at 14:55 UTC
by tfrayner
Here's a little script I wrote as an exercise. My aim was to implement a user-friendly way of substituting text across a hierarchy of files and directories. There is of course a simple way to do this without resorting to this script. However, the development of the script allowed me to add some nice features. Try 'perldoc' for details.

I'd welcome any comments, particularly with regard to efficiency/performance and portability.

on Sep 11, 2001 at 03:06 UTC
by jryan
Someone in the chatterbox the other day wanted an easier way to create a tree construct without embedding hashes up the wazoo, so here is my solution: Tie::HashTree. You can create a new tree from scratch, or convert an old hash into a tree, whatever floats your boat. You can climb up and down the branches, or jump to a level from the base. It's up to you. The pod tells all that you need to know (I think), and I put it at the top for your convenience :) If you have any comments/flames, please feel free to reply.
on Aug 27, 2001 at 10:50 UTC
by tachyon

This module implements the halve the difference algorithm to efficiently (and rapidly) find an element(s) in a sorted file. It provides a number of useful methods and can reduce search times in large files by several orders of magnitude as it uses a geometric search rather than the typical linear approach. Logfiles are a typical example where this is useful.

I have never written anything I considered worthy of submitting to CPAN but thought this might be worthwhile.
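
The halving idea can be sketched as a binary search over byte offsets. The code below is an illustration under stated assumptions, not tachyon's module: the file is assumed to be sorted in plain string (lt/ge) order with one record per line, and first_line_ge is a made-up name.

```perl
use strict;
use warnings;

# Return the first line (chomped) that sorts at or after $key in a file
# sorted in string order, or nothing if every line sorts before $key.
sub first_line_ge {
    my ($fh, $key) = @_;

    seek($fh, 0, 2) or die "seek failed: $!";     # 2 = SEEK_END
    my ($lo, $hi) = (0, tell($fh));

    # Invariant: every line that starts before $lo sorts before $key.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek($fh, $mid, 0) or die "seek failed: $!";
        if ($mid > 0) { my $skipped = <$fh>; }    # drop the line we landed inside
        my $line = <$fh>;
        if (defined $line) {
            chomp $line;
            if ($line lt $key) { $lo = tell($fh); next; }
        }
        $hi = $mid;                               # answer starts at or before here
    }

    # Finish with a short linear scan from the last known-good line start.
    seek($fh, $lo, 0) or die "seek failed: $!";
    while (defined(my $line = <$fh>)) {
        chomp $line;
        return $line if $line ge $key;
    }
    return;
}
```

Each probe halves the remaining byte range, so only a handful of lines are read even in a very large logfile; the caller still has to check whether the returned line actually matches the key.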
on Aug 02, 2001 at 19:19 UTC
by kingman
This module reformats paragraphs (delimited by \n\n) into equal-width or varied-width columns by interpreting a format string. See the synopsis for a couple of examples.

I just read yesterday that formats will be a module in perl 6 so I guess there's already something like this out there?

Erriccsons Dump Eric Data Search and output script
on Jul 16, 2001 at 19:07 UTC
by brassmon_k
First off, this is a menu script with simple options that ties together multiple scripts.

It uses Ericsson's Dump_Eric decrypter for cell traffic call records, together with a tool I developed to search on the encrypted file names (a ksh version and a CGI version; only the ksh one is posted here).
You specify the date first, then the time. This thins out the records you pick, which allows faster processing of the call record files. The tool finds the call record files by using a simple pattern match.
From top to bottom, here is the process:

1. Run the search tool and specify the date and time.
2. The names of the files found are sent to a file.
3. A "sed" statement is then created to put dump_eric in front of all filenames in that file.
4. The output is sent to another file.
5. The awk script is run after the above is done: you put in your MSISDN, and the awk script searches the output in the second file and writes the result to another file.
6. After all that, you can view the results.
7. Lastly (as we all know, the files that dump_eric runs on are rather large), the search results are deleted when you're done with them (you're given the option to delete).
As far as I'm aware there are only 2 flaws. The first is that you can only do one search at a time, or else the files with the output get overwritten if somebody else runs a search after you. (I had my own purposes for that.) You can easily get around this by having the script ask you what you want to name the output files; to solve the unknown factor for other users, just keep a known file extension on them.
The last flaw is not really a flaw on my part but a necessity, because dump_eric is picky. If you run the search tool from a different directory, it includes the full path in the file, so your call record location output would be (for me at least) /home/bgw/AccessBill/TTFILE.3345010602123567, and dump_eric won't take anything but the call record file name without the path. The date&time search tools must therefore be in the same directory as the calltrace records. All the other scripts can go anywhere you wish.
The code listed below consists of multiple scripts, each with its own heading.

NOTE: Don't forget to change the Perl path in "#!/usr/bin/perl", as your path might be different.
NOTE: There are 3 search tools: a date-only, a time-only, and a date&time tool.
NOTE: I only put in the date&time search tool, because it's really easy to change this script to a time-only or date-only one and to change the menu to suit your needs at your leisure (and to save space down here :-).
NOTE: The awk script (except the part where you append or output to your file) can't have any whitespace at the end of a line or it won't work, so cut and paste it, but make sure that you go through it and get rid of any trailing whitespace.

I'll list the code in order.
If you need any help, don't hesitate to contact me at ""
on Jun 29, 2001 at 04:58 UTC
by ton
Data::Dumper is a great utility that converts Perl structures into eval-able strings. These strings can be stored to text files, providing an easy way to save the state of your program.

Unfortunately, evaling strings from a file is usually a giant security hole; imagine if someone replaced your structure with system("rm -R /"), for instance. This code provides a non-eval way of reading in Data::Dumper structures.

Note: This code requires Parse::RecDescent.

Update: Added support for blessed references.
Update: Added support for undef, for structures like [3, undef, 5, [undef]]. Note that the undef support is extremely kludgy; better implementations would be much appreciated!
Update2: Swapped the order of FLOAT and INTEGER in 'term' and 'goodterm' productions. FLOAT must come before INTEGER, otherwise it will never be matched!

on Mar 17, 2001 at 05:50 UTC
by tilly
I am tired of people asking how to handle CSV and not having a good answer that doesn't involve learning DBI first. In particular I don't like Text::CSV. This is called Text::xSV at tye's suggestion since you can choose the character separation. Performance can be improved significantly, but that wasn't the point.

For details you can read the documentation.

Fixed minor bug that resulted in quotes in quoted fields remaining doubled up.

Fixed missing defined test that caused a warning. Thanks TStanley.

Frequency Analyzer
on Mar 16, 2001 at 04:00 UTC
by Big Willy
Updated as of March 16, 2001 at 0120 UTC. Does frequency analysis of a monoalphabetically enciphered message via STDIN. (Thanks to Adam for the $i catch).
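
As a rough illustration of what such an analysis involves (this is not Big Willy's script, and letter_frequencies is a made-up name), the sketch below counts the letters A-Z case-insensitively and ranks them, which is the usual first step against a monoalphabetic cipher:

```perl
use strict;
use warnings;

# Return [letter, count, fraction_of_all_letters] triples for the given
# text, most frequent letter first. Non-letters are ignored.
sub letter_frequencies {
    my ($text) = @_;
    my %count;
    $count{ uc $_ }++ for $text =~ /[A-Za-z]/g;

    my $total = 0;
    $total += $_ for values %count;

    return map  { [ $_, $count{$_}, $count{$_} / $total ] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b }
           keys %count;
}
```

To read the message from STDIN as the original does, one could slurp it first, for example with my $text = do { local $/; <STDIN> };.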
Fast file reader
on Mar 15, 2001 at 22:51 UTC
by Malkavian
Following discussions with a colleague (hoping for the name Dino when he gets round to appearing here) on performance of reading log files, and other large files, we hashed out a method for rapidly reading files, and returning data in a usable fashion.
Here's the code I came up with to implement the idea. This is a definite v1.0 bit of code, so be gentle with me, although constructive criticism is very welcome.
It's not got much in the way of internal documentation yet, tho I'll post that if anyone really feels they want it.
It requires you have the infinitely useful module Compress::Zlib installed, so thank you authors of that gem.

Purpose: The purpose is to have a general purpose object that allows you to read newline-separated logs (in this case from Apache), and return either a scalar block of data or an array of data, which is comprised of full lines, while being faster than using readline/while.

Some quick stats:
Running through a log file fragment, using a while/readline construct and writing back to a comparison file to check integrity of file written took 15.5 seconds.
Running the same log file with a scalar read from the read_block and writing the same output file took 11.3 seconds.
Running the file with an array request to read_block took 11.3 seconds.
Generating the block and using the reference by the get_new_block_ref accessor and writing the block uncopied to the integrity test file took 8.3 seconds.
For those who take a long time reading through long log files, this may be a useful utility.

on Feb 24, 2001 at 09:59 UTC
by damian1301
This is the first script I have ever made, though it could use some improvements. This is about 2 months old out of my 5 month Perl career. Any improvements or suggestions will be, as usual, thankfully accepted.
Perl Source Stats
on Feb 15, 2001 at 01:42 UTC
by spaz
This lil app will give you the following info about your Perl source files
  • Number of subroutines (and their line number)
  • Number of loops (and their line number)
  • Number of lines that are actual code
  • Number of lines that are just comments
Markov Chain Program
on Feb 02, 2001 at 01:38 UTC
by sacked
Taking the suggestion from Kernighan and Pike's The Practice of Programming, I wrote another version of their Markov Chain program (chapter 3) that allows for different length prefixes. It works best with shorter prefixes, as they are more likely to occur in the text than longer ones.

Please offer any tips for improving/enhancing this script. Thanks!
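
For anyone unfamiliar with the technique, a bare-bones sketch of a Markov chain text generator with a configurable prefix length follows. It is not sacked's program; build_chain and generate are names invented for this example, and the prefix handling is simplified compared with the book's version.

```perl
use strict;
use warnings;

# Map every run of $n consecutive words (the prefix) to the list of
# words that followed it somewhere in the input.
sub build_chain {
    my ($n, @words) = @_;
    my %chain;
    for my $i (0 .. $#words - $n) {
        my $prefix = join ' ', @words[ $i .. $i + $n - 1 ];
        push @{ $chain{$prefix} }, $words[ $i + $n ];
    }
    return \%chain;
}

# Starting from the $n-word prefix $start, repeatedly pick a random
# successor of the last $n words until $max_words is reached or the
# current prefix has no known successor.
sub generate {
    my ($chain, $n, $start, $max_words) = @_;
    my @out = split ' ', $start;
    while (@out < $max_words) {
        my $prefix = join ' ', @out[ -$n .. -1 ];
        my $next   = $chain->{$prefix} or last;
        push @out, $next->[ rand @$next ];
    }
    return join ' ', @out;
}
```

A longer prefix makes each key rarer in the input, which is why shorter prefixes tend to give the generator more to work with.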
Code counter
on Feb 01, 2001 at 07:18 UTC
by Falkkin
This program takes in files on the command-line and counts the lines of code in each, printing a total when finished.

My standard for counting lines of code is simple. Every physical line of the file counts as a logical line of code, unless it is composed entirely of comments and punctuation characters. Under this scheme, conditionals count as separate lines of code. Since it is often the case that a decent amount of the code's actual logic takes place within a conditional, I see no reason to exclude conditionals from the line-count.

Usage: [-v] [filenames]

The -v switch makes it output verbosely, with a + or - on each line of code based on whether it counted that line as an actual line of code or not.
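
The counting rule can be approximated in a few lines. This sketch is not Falkkin's program: it assumes Perl-style # comments only, and it naively treats any # as the start of a comment, even inside a string. count_code_lines is a hypothetical name.

```perl
use strict;
use warnings;

# Count lines that still contain at least one word character once any
# trailing "#" comment has been stripped. Lines consisting only of
# comments, punctuation or whitespace are not counted.
sub count_code_lines {
    my @lines = @_;
    my $count = 0;
    for my $line (@lines) {
        (my $code = $line) =~ s/#.*//s;   # naive comment removal
        $count++ if $code =~ /\w/;
    }
    return $count;
}
```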

Totally Simple Templates
on Jan 24, 2001 at 06:14 UTC
by japhy
Using my recently uploaded module, DynScalar, template woes are a thing of the past. By wrapping a closure in an object, we have beautiful Perl code expansion.
on Dec 22, 2000 at 00:11 UTC
by epoptai
coder encodes text and IP addresses in various formats. Text can be encoded to and from uppercase, lowercase, uuencoding, MIME Base64, Zlib compression (binary output is also uuencoded, uncompress expects uuencoded input), urlencoding, entities, ROT13, and Squeeze. IP addresses can have their domain names looked up and vice versa, and IPs can be converted to octal, dword, and hex formats. Query strings can also be decoded or constructed.
SuperSplit code
on Jan 02, 2001 at 18:35 UTC
by jeroenes
Extends split/join to multi-dimensional arrays
IP Address sorting
on Oct 23, 2000 at 16:38 UTC
by japhy
Sorts N IP addresses in O(Nk) time, each and every time. Uses the technique called radix sorting.
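
A minimal sketch of the radix idea, assuming dotted-quad IPv4 input and taking k to be the number of octets (that reading of k is mine, and this is not japhy's code): one stable bucket pass per octet, least significant first.

```perl
use strict;
use warnings;

# Least-significant-digit radix sort on the four octets of IPv4
# addresses. Each pass is a stable distribution into 256 buckets,
# so the total work is four passes over the N addresses.
sub radix_sort_ips {
    my @ips = @_;
    for my $pos (reverse 0 .. 3) {
        my @buckets = map { [] } 0 .. 255;
        for my $ip (@ips) {
            my $octet = (split /\./, $ip)[$pos];
            push @{ $buckets[$octet] }, $ip;
        }
        @ips = map { @$_ } @buckets;
    }
    return @ips;
}
```

Unlike a comparison sort, no two addresses are ever compared with each other; the order falls out of the bucket positions.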
In-Place editing system
on Oct 10, 2000 at 23:13 UTC
by Intrepid

Update made on Fri, 25 Jul 2003 +0000 ...

...mostly for historical interest; if this horrible code has to remain up on PM it might as well be readable (removed the <CITE> tags and so forth).

Generally: Perl with the -i switch (setting $^I) does in-place editing of files.


Passed a set of arguments for the string to match to a regex and the replacement string, this system will find every file in the current working directory which matches the glob argument and replace all instances of the string.
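
For comparison, the same job can be sketched inside a single script by setting $^I and @ARGV locally, which avoids both the one-liner and the hand-written TMP file. This is an assumption-laden illustration, not Intrepid's system: replace_in_files is a made-up name, the search string is treated literally rather than as a regex, and a .bak backup is kept because some platforms cannot edit in place without one.

```perl
use strict;
use warnings;

# Replace every literal occurrence of $search with $replacement in the
# named files, editing them in place and leaving "file.bak" backups.
sub replace_in_files {
    my ($search, $replacement, @files) = @_;
    return unless @files;                      # never fall back to STDIN

    local ($^I, @ARGV) = ('.bak', @files);     # same effect as perl -i.bak
    while (my $line = <>) {
        $line =~ s/\Q$search\E/$replacement/g;
        print $line;                           # goes to the file being rewritten
    }
    return;
}
```

On a shell that does not expand wildcards, the caller could pass glob('*.txt') as the file list.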

There's some ugliness here: the need to make 2 duplicates of the @ARGV array (@Y and @O) and the need to write a TMP file to disk (major YEEECHHH). So I am offering it as a work-in-progress with the additional acknowledgement that it *must* be merely a reinvent of the same wheel built by many before me, yet I never have found a finished script to do this (particularly on Win32 which does not possess a shell that will preglob the filename argument for you).

The system consists of two parts: one is the perl one-liner (note WinDOS -style shell quoting which will look so wrong to UNI* folk) and the other is a script file on-disk (which is called by the one-liner). It's probably a decidedly odd way to do this and I may later decide that I must have been hallucinating or otherwise mentally disabled when I wrote it :-).

The system keeps a fairly thorough log for you of what was done in the current session. If optionflag -t is set it will reset all the timestamps on completion of the replacements, allowing one to make a directory-wide substitution without modifying the lastmod attribute on the files (might be highly desirable in certain situations ... ethical ones of course). The -d switch currently doesn't work (like I said, this is a work in progress; see the next paragraph for an explanation). When it worked it was for debugging, that is, doing a dry run.

The Perl -s switch (option flag to perl itself) wouldn't give me the ability to set custom options in this use (with -e set) nor would the module Getopt::Std. One of my reasons for posting this code is to invite help in coming up with an answer as to "why" (particularly in the case of the latter).

on Aug 08, 2000 at 23:57 UTC
by turnstep

Just a simple hex editor-type program. I actually wrote this back in 1996, so go easy on it! :) I cleaned it up a little to make it strict-compliant, but other than that, it is pretty much the same. I used it a lot when I was learning about how gif files are constructed. Good for looking at files byte by byte.

I have no idea why it was named wanka but it has stuck. :)

Automatic CODE-tag creation (Prototype)
on Jun 21, 2000 at 20:28 UTC
by Corion
Out of a discussion about how we can prevent newbies from posting unreadable rubbish, here is a program that tries to apply some heuristics to make posts more readable. This version isn't the most elegant, so it's called a prototype.
on Aug 01, 2000 at 00:05 UTC
by Tally
PINE is a common text-based email viewer on many UNIX systems. The PINE program stores email in large text files which makes it very handy to archive your old email... except that there's no table of contents at the beginning of the file to let you know what messages are stored there. This script solves that problem by parsing the PINE email store and creating a separate table of contents from the headers of each email. The resulting TOC lists the message number, title, sender info and date in formatted columns. I usually concatenate the TOC and email storage file, and then save the resulting file in my email archives.

Note: This script works very well with version 3.96 of PINE, which I use, but there are other versions that I have not tested it on.

PLEASE comment on this code. I'm a fairly new perl programmer and would appreciate feedback on how to improve my programming.
on Aug 01, 2000 at 19:29 UTC
by mikfire
A down and dirty little script that reads a file, looking for subroutine definitions. It extracts these and then parses through a whole bunch of other files looking for calls to those functions. It isn't perfect, but it works pretty well.

Usage: list_call source file [ ...]
where source is the file from which to extract the calls and file is the file to be searched.
Glossary Maker
on Apr 27, 2000 at 00:35 UTC
by gregorovius
I wrote a set of scripts that will automatically find rare words in a book or text.

1. The first script will FTP a very large number of ASCII-coded classic books from Project Gutenberg.

2. The second one computes a histogram of word frequencies for all those books.

3. The third one takes the text where one wants to find rare words. It will start by showing all the words in it with count 0 in the histogram, then the ones with count 1 and so on. The user chooses manually which words he wants to include in the glossary and then chooses to stop as the script starts showing words with higher counts.

4. The chosen words are looked up automatically on a web dictionary.

5. We have our glossary ready! The next step is unimplemented but what follows is to generate a TeX file for type-setting the ascii book with the dictionary terms as footnotes or as a glossary on the back.

Note: The scripts are short and easy to understand but not too orderly or properly documented. If you want to continue developing them feel free to do so, but please share any improvements you make to them.

Here is a description of their usage:

LIST OF LASTNAMES

Will download all the book titles under each author on the list of names into a local archive. It must be run from the directory where the archive resides.


% mkdir archive
% Conan\ Doyle Conrad Gogol Darwin

After running these commands archive will contain one subdirectory for each author, and each of these will contain all the books for that author on Project Gutenberg.

Will generate a DB database file containing a histogram of word frequencies of the book archive created by the program.

To use it just run it from the directory where the 'archive' directory was created. It will generate two files, one of them called index.db containing the histogram and the other called indexedFiles.db containing the names of the files indexed so far (this last one allows us to add books to the archive and index them without analyzing again the ones we already had).

Note that this script is very inefficient and requires a good deal of free memory on your system to run. A new version should use MySQL instead of DB files to speed it up.

BOOK_FILE

Will take a book from the archive created by the script and will look at the word count for each of its words on the histogram of word frequencies created by Starting with the less frequent words it will prompt the user to choose which ones to include on the glossary. When the user stops choosing words the program will query a web dictionary and print the definition of all the chosen words to STDOUT.

String Buffer
on Apr 25, 2000 at 21:12 UTC
by kayos

Sometimes I encounter a script or program that wants to print directly to STDOUT (like Parse::ePerl), but I want it in a scalar variable. In those cases, I use this StringBuffer module to make a filehandle that is tied to a scalar.


  use StringBuffer;
  my $stdout = tie(*STDOUT,'StringBuffer');
  print STDOUT 'this will magically get put in $stdout';
  undef($stdout);
  untie(*STDOUT);
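
To show what sits behind a tie like this, here is a stripped-down tied-handle class. It is only a sketch of the general mechanism (TIEHANDLE, PRINT and PRINTF), not the StringBuffer module itself; the package name My::ScalarHandle and the contents method are invented for the example, and the output record separator $\ is ignored.

```perl
use strict;
use warnings;

package My::ScalarHandle;   # hypothetical name, not the real StringBuffer

# The tied object is simply a blessed reference to the buffer string.
sub TIEHANDLE {
    my ($class) = @_;
    my $buffer = '';
    return bless \$buffer, $class;
}

# Called for print HANDLE LIST; honours $, like the built-in print.
sub PRINT {
    my $self = shift;
    $$self .= join(defined $, ? $, : '', @_);
    return 1;
}

# Called for printf HANDLE FORMAT, LIST.
sub PRINTF {
    my $self   = shift;
    my $format = shift;
    $$self .= sprintf($format, @_);
    return 1;
}

# Accessor for everything captured so far.
sub contents {
    my $self = shift;
    return $$self;
}

package main;
```

After my $obj = tie *STDOUT, 'My::ScalarHandle', anything printed to STDOUT would be available from $obj->contents until the handle is untied.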
Line Ending Converter
on Apr 25, 2000 at 19:50 UTC
by kayos

This converts the line-endings of a text file (with unknown line-endings). It supports DOS-type, Unix-type, and Mac-type. It converts the files "in place", so be careful.

You call it like:

linendings --unix file1.txt file2.txt ...
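
The core of such a conversion fits in one substitution. The sketch below works on a string rather than editing files in place, and it is not kayos's script; convert_eol and the target names are assumptions made for the example.

```perl
use strict;
use warnings;

# Line terminators for each supported target style.
my %EOL_FOR = (
    unix => "\012",       # LF
    dos  => "\015\012",   # CR LF
    mac  => "\015",       # CR (classic Mac OS)
);

# Convert text with unknown or mixed line endings to the target style.
# The CR LF alternative comes first so that a DOS ending is treated as
# one line break rather than two.
sub convert_eol {
    my ($text, $target) = @_;
    my $eol = $EOL_FOR{$target};
    die "unknown target '$target'\n" unless defined $eol;
    $text =~ s/\015\012|\015|\012/$eol/g;
    return $text;
}
```

When reading and writing real files, binmode should be set on both handles so that the I/O layer does not translate line endings behind the script's back.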