
Text Processing

by vroom (Pope)
on May 26, 2000 at 01:24 UTC
Sendmail pairs
on Sep 14, 2002 at 19:13 UTC
by Limbic~Region
Builds a hash keyed on each "from"/"to" pair, incrementing the count every time the pair is encountered. Finally, the counts are sorted so the most frequent pairs are displayed first. Tested on HP-UX 11.0 running Sendmail 8.9.3.
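The counting idea can be sketched in a few lines. The log lines below are invented stand-ins; real sendmail logs spread from= and to= across entries keyed by queue ID, so a real script has to correlate them first.

```perl
use strict;
use warnings;

# Hypothetical pre-extracted log lines, one from/to pair per line.
my @log = (
    'from=alice@example.com to=bob@example.com',
    'from=alice@example.com to=bob@example.com',
    'from=carol@example.com to=dave@example.com',
);

my %pair;
for my $line (@log) {
    next unless $line =~ /from=(\S+)\s+to=(\S+)/;
    $pair{"$1 -> $2"}++;                # key on the from/to pair
}

# highest counts first
for my $p (sort { $pair{$b} <=> $pair{$a} } keys %pair) {
    print "$pair{$p}\t$p\n";
}
```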
Parse::Report - parse Perl format-ed reports.
on Aug 14, 2002 at 00:20 UTC
by osfameron
After reading this question about a "generic report parser", I got interested in the idea. The question itself has been bizarrely (I think) downvoted, as it's an interesting topic. I've gone for the approach of parsing Perl's native format strings.

This is a very early version of this code, and it could probably be done better (e.g. could all the code be incorporated into the regex?!). I've made no attempt to parse number formats, and the piecing together of multiline text is unsophisticated (e.g. no attention to hyphenation), but it's a start.

on Jul 29, 2002 at 08:54 UTC
by DamnDirtyApe
I wrote this module because I wanted a super-simple way to separate my SQL queries from my application code.
on Jul 29, 2002 at 05:44 UTC
by DamnDirtyApe
This program generates a skeleton LaTeX file from a simple text file. This allows a large document to be `prototyped', with the LaTeX tags later generated automatically. See the POD for a more detailed explanation.
on Jul 28, 2002 at 21:51 UTC
by vxp
This is a little something that parses an mbox file and grabs email addresses out of it (I use it at work to parse a bounce file and grab email addresses out of it for various purposes). Feel free to modify it, use it, whatever. (Credit info: this was actually not written by me, but by the previous network admin.)
on Jul 22, 2002 at 23:28 UTC
by ignatz
Takes in a LOH (List of Hashes) and an array of keys to sort by, and returns a new, sorted LOH. This module closely resembles Sort::Fields in terms of its interface and how it does things. One of its main differences is that it is OO, so one can create a Sort::LOH object and perform multiple sorts on it.

Comments and hash criticism are most welcome. I tried to find something here or on CPAN that did this, but the closest that I got was Sort::Fields. Close, but no cigar. Perhaps there is some simple way to do this with a one liner. Even so, it was fun and educational to write.
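For readers who just want the core trick, here is a minimal, non-OO sketch of a multi-key list-of-hashes sort. The sort_loh helper is my own illustration, not the module's actual interface:

```perl
use strict;
use warnings;

# Sort a list-of-hashes by several keys in priority order: compare on
# the first key, fall through to the next only on ties.
sub sort_loh {
    my ($loh, @keys) = @_;
    return [ sort {
        my $cmp = 0;
        for my $k (@keys) {
            $cmp = $a->{$k} cmp $b->{$k};
            last if $cmp;
        }
        $cmp;
    } @$loh ];
}

my $rows = [
    { last => 'Smith', first => 'Zoe' },
    { last => 'Jones', first => 'Ann' },
    { last => 'Smith', first => 'Abe' },
];
my $sorted = sort_loh($rows, 'last', 'first');
print "$_->{last}, $_->{first}\n" for @$sorted;
```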

Snort IDS signature parser
on Jun 24, 2002 at 00:50 UTC
by semio
I wanted to obtain a list of all enabled signatures on a Snort IDS e.g. a listing of sigs contained in all .rules files as well as some general information for each, such as the signature id and signature revision number. I created one large file on the IDS called allrules and wrote this script to present each signature, in a comma-delimited format, as msg, signature id, signature revision number.
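The extraction itself can be sketched with one regex per rule line. The rule below is an invented example, not taken from a real .rules file:

```perl
use strict;
use warnings;

# Pull msg, sid and rev out of each Snort rule line and emit them
# comma-delimited, as the description above does.
my @rules = (
    'alert tcp any any -> any 80 (msg:"WEB-MISC test"; sid:1000001; rev:2;)',
);

my @out;
for my $rule (@rules) {
    next unless $rule =~ /msg:"([^"]+)".*sid:\s*(\d+).*rev:\s*(\d+)/;
    push @out, "$1,$2,$3";
}
print "$_\n" for @out;
```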
Pod::Tree dump for the stealing
on May 31, 2002 at 15:03 UTC
by crazyinsomniac
ever use Pod::Tree; ?
ever my $tree = new Pod::Tree; ?
ever $tree->load_file(__FILE__); ?
ever print $tree->dump; ?
Wanna do it yourself?
Here is good skeleton code (care of the Pod::Tree authors)
Annotating Files
on Apr 17, 2002 at 20:32 UTC
by stephen

I was writing up a list of comments on someone's code, and got tired of retyping the line number and filename over and over again. Also, I liked to skip around a bit in the files, but wanted to keep my annotations sorted.

So for starters, I wrote a little XEmacs LISP to automatically add my annotations to a buffer called 'Annotations'. It would ask me for the comment in the minibuffer, then write the whole thing, so that I could keep working without even having to switch screens. I bound it to a key so I could do it repeatedly. Pretty basic stuff.


(defun add-note ()
  "Adds an annotation to the 'annotations' buffer"
  (interactive)
  (save-excursion
    (let ((annotate-comment (read-from-minibuffer "Comment: "))
          (annotate-buffer (buffer-name))
          (annotate-line (number-to-string (line-number))))
      (set-buffer (get-buffer-create "annotations"))
      (goto-char (point-max))
      (insert-string
       (concat annotate-buffer ":" annotate-line " " annotate-comment "\n")))))

(global-set-key "\C-ca" `add-note)

This would generate a bunch of annotations like this:

comment_tmpl.tt2:1 This would be more readable if I turned on Template's space-stripping options.
comment_tmpl.tt2:31 More informative error message would probably be good.
comment_tmpl.tt2:71 Need a better explanation of data structure.
annotate.el:1 Should properly be in a mode...
annotate.el:11 Should be configurable variable
annotate.el:13 Formatting should be configurable in variable
annotate.el:11 Should automatically make "annotations" visible if it isn't already
annotate.el:21 Control-c keys are supposed to be for mode-specifics...

Next, I wanted to format my annotations so I could post them here in some kind of HTML format. So I wrote a little text processor to take my annotations, parse them, and format the result in HTML. This was not difficult, since most of the heavy lifting was done by the Template module.

Here's a standard template file... pretty ugly, really, but you can define your own without changing the code...

[% FOREACH file = files %][% FOREACH line = file.lines %]
<dt>[% %]</dt>
<dd><b>line [% line.number %]</b>
  <ul>[% FOREACH comment = line.comments %]
    <li>[% comment %]</li>
  [% END %]</ul>
</dd>
[% END %][% END %]

Alternatively, I could have had my XEmacs function output XML and used XSLT. Six of one, half a dozen of the other... Plus, one could write a template file to translate annotations into an XML format.

The Output

annotate.el
line 1
  • Should properly be in a mode...
line 11
  • Should be configurable variable
  • Should automatically make "annotations" visible if it isn't already
line 13
  • Formatting should be configurable in variable
line 21
  • Control-c keys are supposed to be for mode-specifics...
comment_tmpl.tt2
line 31
  • More informative error message would probably be good.
line 71
  • Need a better explanation of data structure.
line 1
  • This would be more readable if I turned on Template's space-stripping options.
Shortcuts Engine: Packaged Version
on Apr 06, 2002 at 21:42 UTC
by munchie
This is the shortcuts engine I made, put into packaged format. It is now more portable and offers more flexibility. I have never submitted anything to CPAN, so I want fellow monks' opinions on whether or not this module is ready for submission to CPAN.

This module requires Text::xSV by tilly. If you have any suggestions on making this better, please speak up.

UPDATE 1: Replaced the 3-arg open statements with slightly longer 2-arg ones, to make sure that older versions of Perl will like my program. (thanks to crazyinsomniac)

UPDATE 2: I just uploaded this on PAUSE, so very soon you'll all be able to get it on CPAN! (My first module!)

Shortcuts engine for note taking
on Apr 02, 2002 at 20:42 UTC
by munchie
This code allows the user to set shortcuts (a character surrounded in brackets) to allow for fast note taking/document writing. (Thank you tilly for the awesome Text::xSV module!)
Yet another code counter
on Mar 01, 2002 at 14:27 UTC
by rinceWind
Here is a code counter that can handle languages other than Perl. It was required for sizing a rewrite project, and gave some useful metrics on code quality as a by-product.

It is easy to add other languages by populating the hashes %open_comment and %close_comment, and/or %line_comment.

The code counts pod as comment, but bear in mind that this script was not primarily designed for counting Perl.

on Jan 07, 2002 at 09:24 UTC
by seattlejohn
When working on large-ish projects, I've sometimes found it gets to be a real pain to manage all the error messages, status messages, and so on that end up getting scattered throughout my code. I wrote MessageLibrary to provide a simple OO way of generating messages from a centralized list of alternatives, so you can keep everything in one easy-to-maintain place.
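A guessed-at sketch of the idea, not MessageLibrary's real API: keep every message template in one hash, so the wording lives in a single easy-to-maintain place, and format on demand.

```perl
use strict;
use warnings;

# Hypothetical Messages class: one central hash of sprintf templates,
# looked up by symbolic id.
package Messages;

my %msg = (
    file_missing => 'Cannot open %s: %s',
    done         => 'Processed %d records',
);

sub get {
    my ($class, $id, @args) = @_;
    die "unknown message '$id'" unless exists $msg{$id};
    return sprintf $msg{$id}, @args;
}

package main;
print Messages->get('done', 42), "\n";   # Processed 42 records
```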
Matching in huge files
on Dec 02, 2001 at 03:13 UTC
by dws
A demonstration of how to grep through huge files using a sliding window (buffer) technique. The code below has rough edges, but works for simple regular expression fragments. Treat it as a starting point.

I've seen this done somewhere before, but couldn't find a working example, so I whipped this one up. A pointer to a more authoritative version will be appreciated.
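A minimal sketch of the sliding-window technique (names are mine, not the posted code): read fixed-size chunks, and keep an overlap tail so a match that straddles a chunk boundary is still found. It assumes the overlap is longer than the longest possible match.

```perl
use strict;
use warnings;

# Return the absolute byte offset of the first match, or -1.
sub find_in_huge {
    my ($fh, $re, $chunk, $overlap) = @_;
    $chunk   ||= 64 * 1024;
    $overlap ||= 1024;                      # must exceed the longest match
    my ($buf, $offset) = ('', 0);
    while (read($fh, my $block, $chunk)) {
        $buf .= $block;
        if ($buf =~ $re) {
            return $offset + $-[0];         # absolute offset of the match
        }
        if (length($buf) > $overlap) {      # slide the window forward
            $offset += length($buf) - $overlap;
            $buf = substr($buf, -$overlap);
        }
    }
    return -1;                              # no match anywhere
}
```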

Regexp::Graph 0.01
on Nov 24, 2001 at 23:11 UTC
by converter

Regexp::Graph provides methods for displaying the results of regular expression matches using a "decorated" copy of the original string, with various format-specific display attributes used to indicate the portion of the string matched, substrings captured, and, for global pattern matches, the position where the pattern will resume on the next match.

This module encapsulates a regular expression pattern and a string against which the pattern is to be matched.

The `regshell' program (included with this distribution) demonstrates the use of the ANSI formatter module, and provides a handy tool for testing regular expressions and displaying the results. Other format modules are in the works, including a module to output text with HTML/CSS embedded styles.

regshell includes support for readline and history, and can save/load pattern/string pairs to disk files for re-use.

NOTE: I have not been testing this code on win32. The regshell program is not strict about paying attention to which terminal attributes are set, so it may go braindead on win32. I'll pay more attention to win32 on the next revision.
on Oct 02, 2001 at 14:55 UTC
by tfrayner
Here's a little script I wrote as an exercise. My aim was to implement a user-friendly way of substituting text across a hierarchy of files and directories. There is of course a simple way to do this without resorting to this script. However, the development of the script allowed me to add some nice features. Try 'perldoc' for details.

I'd welcome any comments, particularly with regard to efficiency/performance and portability.

on Sep 11, 2001 at 03:06 UTC
by jryan
Someone in the chatterbox the other day wanted an easier way to create a tree construct without embedding hashes up the wazoo, so here is my solution: Tie::HashTree. You can create a new tree from scratch, or convert an old hash into a Tree, whatever floats your boat. You can climb up and down the branches, or jump to a level from the base. It's up to you. The pod tells all that you need to know (I think), and I put it at the top for your convenience :) If you have any comments/flames, please feel free to reply.
on Aug 27, 2001 at 10:50 UTC
by tachyon

This module implements the halve-the-difference algorithm to efficiently (and rapidly) find one or more elements in a sorted file. It provides a number of useful methods and can reduce search times in large files by several orders of magnitude, as it uses a geometric search rather than the typical linear approach. Logfiles are a typical example where this is useful.

I have never written anything I considered worthy of submitting to CPAN but thought this might be worthwhile.
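The halve-the-difference idea can be sketched as follows. This is not the module's API, just the core loop: probe the midpoint with seek(), throw away the partial line, compare the next full line, and halve the window, giving O(log n) probes over a sorted, newline-delimited file.

```perl
use strict;
use warnings;

# Binary search a sorted, newline-delimited file for an exact line.
sub bsearch_file {
    my ($fh, $target) = @_;
    my ($lo, $hi) = (0, -s $fh);
    while ($hi - $lo > 1) {
        my $mid = int(($lo + $hi) / 2);
        seek($fh, $mid, 0);
        <$fh>;                          # discard the partial line
        my $line = <$fh>;
        chomp $line if defined $line;
        if (defined $line && $line lt $target) { $lo = $mid }
        else                                   { $hi = $mid }
    }
    seek($fh, $lo, 0);                  # short linear scan of one region
    <$fh> if $lo > 0;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        return $line if $line eq $target;
        last if $line gt $target;
    }
    return undef;                       # not present
}
```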
on Aug 02, 2001 at 19:19 UTC
by kingman
This module reformats paragraphs (delimited by \n\n) into equal-width or varied-width columns by interpreting a format string. See the synopsis for a couple of examples.

I just read yesterday that formats will be a module in perl 6 so I guess there's already something like this out there?

Ericsson's Dump Eric Data Search and output script
on Jul 16, 2001 at 19:07 UTC
by brassmon_k
First off this is a menu script with simple options that include multiple scripts.

Uses Ericsson's dump_eric decrypter for cell traffic call records, and I've developed a tool to search on the encrypted file names (a ksh and a CGI; posting the ksh though).
You specify the date first, then the time. By doing this, the records you pick are thinned out, allowing for faster processing of the call record files. It finds the call record files by using a simple pattern match.
From top to bottom, here is the process.

Search tool - specify date & time.
Sends the names of the files found to a file.
Then a "sed" statement is created to put dump_eric in front of all filenames in the file.
Then the output is sent to another file.
Then the awk script is run: you put in your MSISDN, and the awk script searches the output in the second file and writes its results to another file.
Then, after all that, you can view the results.
Lastly (as we all know, the files that dump_eric runs on are rather large), we delete the search results when you're done with them (you're given the option to delete).
Only two flaws that I'm aware of. The first is that you can only do one search at a time, or else the output files get overwritten if somebody else runs a search after you. (I had my own purposes for that.) You can easily get around this by having the script ask you what you want to name the output files; to solve the unknown factor for other users, just keep a known file extension on it.
The last flaw (not really a flaw on my part but a necessity, because dump_eric is picky): if you run the search tool from a different directory, it includes the full path in the file, so your call record location output would be (for me at least) /home/bgw/AccessBill/TTFILE.3345010602123567, and dump_eric won't take anything but the call record file name, not the path. The date & time search tools must be in the same directory as the calltrace records... All the other scripts can go anywhere you wish.
Now the code I will list below is multiple scripts each with their own heading.

NOTE: Don't forget to change your PERL path for the "#!/usr/bin/perl" as your path might be different.
NOTE: There are 3 search tools: A dateonly, a timeonly, and a date&time
NOTE: I only put in the date&time search tool because it's really easy to change this script to a timeonly or dateonly and change the menu to suit your needs so you can change it at your leisure(and to save space down here:-).
NOTE: The awk script (except the part where you append or output to your file) can't have any whitespace after each line or it won't work, so cut and paste it, but make sure that you go through it and get rid of any trailing whitespace if there is any.

I'll list the code in order.
If any help is needed, don't hesitate to contact me at ""
on Jun 29, 2001 at 04:58 UTC
by ton
Data::Dumper is a great utility that converts Perl structures into eval-able strings. These strings can be stored to text files, providing an easy way to save the state of your program.

Unfortunately, evaling strings from a file is usually a giant security hole; imagine if someone replaced your structure with system("rm -R /"), for instance. This code provides a non-eval way of reading in Data::Dumper structures.

Note: This code requires Parse::RecDescent.

Update: Added support for blessed references.
Update: Added support for undef, for structures like [3, undef, 5, [undef]]. Note that the undef support is extremely kludgy; better implementations would be much appreciated!
Update2: Swapped the order of FLOAT and INTEGER in 'term' and 'goodterm' productions. FLOAT must come before INTEGER, otherwise it will never be matched!

on Mar 17, 2001 at 05:50 UTC
by tilly
I am tired of people asking how to handle CSV and not having a good answer that doesn't involve learning DBI first. In particular I don't like Text::CSV. This is called Text::xSV at tye's suggestion since you can choose the character separation. Performance can be improved significantly, but that wasn't the point.

For details you can read the documentation.

Fixed minor bug that resulted in quotes in quoted fields remaining doubled up.

Fixed missing defined test that caused a warning. Thanks TStanley.

Frequency Analyzer
on Mar 16, 2001 at 04:00 UTC
by Big Willy
Updated as of March 16, 2001 at 0120 UTC Does frequency analysis of a monoalphabetic enciphered message via STDIN. (Thanks to Adam for the $i catch).
Fast file reader
on Mar 15, 2001 at 22:51 UTC
by Malkavian
Following discussions with a colleague (hoping for the name Dino when he gets round to appearing here) on performance of reading log files, and other large files, we hashed out a method for rapidly reading files, and returning data in a usable fashion.
Here's the code I came up with to implement the idea. This is a definite v1.0 bit of code, so be gentle with me, although constructive criticism is very welcome.
It's not got much in the way of internal documentation yet, though I'll post that if anyone really feels they want it.
It requires you have the infinitely useful module Compress::Zlib installed, so thank you authors of that gem.

Purpose: to have a general-purpose object that allows you to read newline-separated logs (in this case from Apache), and return either a scalar block of data or an array of data comprised of full lines, while being faster than using readline/while.

Some quick stats:
  • Running through a log file fragment using a while/readline construct, and writing back to a comparison file to check the integrity of the file written, took 15.5 seconds.
  • Running the same log file with a scalar read from read_block and writing the same output file took 11.3 seconds.
  • Running the file with an array request to read_block took 11.3 seconds.
  • Generating the block, using the reference from the get_new_block_ref accessor, and writing the block uncopied to the integrity test file took 8.3 seconds.
For those who take a long time reading through long log files, this may be a useful utility.

on Feb 24, 2001 at 09:59 UTC
by damian1301
This is the first script I have ever made, though it could use some improvements. It is about 2 months old, out of my 5-month Perl career. Any improvements or suggestions will be, as usual, thankfully accepted.
Perl Source Stats
on Feb 15, 2001 at 01:42 UTC
by spaz
This lil app will give you the following info about your Perl source files
  • Number of subroutines (and their line number)
  • Number of loops (and their line number)
  • Number of lines that are actual code
  • Number of lines that are just comments
Markov Chain Program
on Feb 02, 2001 at 01:38 UTC
by sacked
Taking the suggestion from Kernighan and Pike's The Practice of Programming, I wrote another version of their Markov Chain program (chapter 3) that allows for different length prefixes. It works best with shorter prefixes, as they are more likely to occur in the text than longer ones.

Please offer any tips for improving/enhancing this script. Thanks!
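The Kernighan & Pike scheme with a configurable prefix length can be sketched like this (the sub names are mine): map each n-word prefix to its observed suffixes, then walk the table emitting a random suffix at each step.

```perl
use strict;
use warnings;

# Build the prefix -> suffixes table for n-word prefixes.
sub build_chain {
    my ($n, @words) = @_;
    my %chain;
    for my $i (0 .. $#words - $n) {
        my $prefix = join ' ', @words[$i .. $i + $n - 1];
        push @{ $chain{$prefix} }, $words[$i + $n];
    }
    return \%chain;
}

# Walk the chain from a starting prefix, up to $max_words words.
sub generate {
    my ($chain, $start, $max_words) = @_;
    my @out = split ' ', $start;
    my $n   = scalar @out;
    while (@out < $max_words) {
        my $prefix   = join ' ', @out[ -$n .. -1 ];
        my $suffixes = $chain->{$prefix} or last;   # dead end: stop
        push @out, $suffixes->[ rand @$suffixes ];
    }
    return join ' ', @out;
}
```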
Code counter
on Feb 01, 2001 at 07:18 UTC
by Falkkin
This program takes in files on the command-line and counts the lines of code in each, printing a total when finished.

My standard for counting lines of code is simple. Every physical line of the file counts as a logical line of code, unless it is composed entirely of comments and punctuation characters. Under this scheme, conditionals count as separate lines of code. Since it is often the case that a decent amount of the code's actual logic takes place within a conditional, I see no reason to exclude conditionals from the line-count.

Usage: [-v] [filenames]

The -v switch makes it output verbosely, with a + or - on each line of code based on whether it counted that line as an actual line of code or not.
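The counting rule described above can be sketched in a few lines. This is a deliberately naive version (it strips everything after a '#', so it miscounts '#' inside strings), not Falkkin's actual code:

```perl
use strict;
use warnings;

# A line counts as code unless, after stripping comments, it holds
# nothing word-like (i.e. only punctuation remains).
sub count_code {
    my @lines = @_;
    my $count = 0;
    for my $line (@lines) {
        $line =~ s/#.*//;               # naive comment stripping
        $count++ if $line =~ /\w/;      # anything word-like left?
    }
    return $count;
}

my @src = (
    'my $x = 1;',
    '# just a comment',
    '}',
    'print $x;',
);
print count_code(@src), "\n";   # prints 2
```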

Totally Simple Templates
on Jan 24, 2001 at 06:14 UTC
by japhy
Using my recently uploaded module, DynScalar, template woes are a thing of the past. By wrapping a closure in an object, we have beautiful Perl code expansion.
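A rough sketch of the closure-in-an-object idea. DynSketch here is a hypothetical stand-in, not the real DynScalar: stringifying the object re-runs its closure, so the "template" always reflects current data.

```perl
use strict;
use warnings;

# An object whose stringification re-evaluates a stored closure.
package DynSketch;
use overload '""' => sub { $_[0]->{code}->() };
sub new {
    my ($class, $code) = @_;
    return bless { code => $code }, $class;
}

package main;
my $name = 'world';
my $tmpl = DynSketch->new(sub { "Hello, $name!" });
print "$tmpl\n";    # Hello, world!
$name = 'monks';
print "$tmpl\n";    # Hello, monks!
```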
on Dec 22, 2000 at 00:11 UTC
by epoptai
coder encodes text and IP addresses in various formats. Text can be encoded to and from uppercase, lowercase, uuencoding, MIME Base64, Zlib compression (binary output is also uuencoded, uncompress expects uuencoded input), urlencoding, entities, ROT13, and Squeeze. IP addresses can have their domain names looked up and vice versa, converts IPs to octal, dword, and hex formats. Query strings can also be decoded or constructed.
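The IP conversions reduce to packing the octets; a sketch of that part (my own helper lines, not coder's code):

```perl
use strict;
use warnings;

# dword is just the address as one 32-bit integer; hex and octal are
# the octets reformatted.
my $ip    = '192.168.0.1';
my @octet = split /\./, $ip;
my $dword = unpack 'N', pack 'C4', @octet;
my $hex   = sprintf '0x%02x%02x%02x%02x', @octet;
my $octal = join '.', map { sprintf '0%o', $_ } @octet;
print "$dword $hex $octal\n";   # 3232235521 0xc0a80001 0300.0250.00.01
```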
SuperSplit code
on Jan 02, 2001 at 18:35 UTC
by jeroenes
Extends split/join to multi-dimensional arrays
IP Address sorting
on Oct 23, 2000 at 16:38 UTC
by japhy
Sorts N IP addresses in O(Nk) time, each and every time. Uses the technique called radix sorting.
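A sketch of the radix technique on dotted quads (my own illustration, not japhy's code): bucket the addresses on each octet, least significant first; k stable passes over N items gives the O(Nk) behaviour.

```perl
use strict;
use warnings;

# Radix sort: four stable bucket passes, one per octet, starting with
# the least significant octet.
sub sort_ips {
    my @ips = @_;
    for my $octet (reverse 0 .. 3) {
        my @buckets;
        for my $ip (@ips) {
            my $key = (split /\./, $ip)[$octet];
            push @{ $buckets[$key] }, $ip;
        }
        @ips = map { @{ $_ || [] } } @buckets;
    }
    return @ips;
}

print join("\n", sort_ips(qw(10.0.0.2 9.255.0.1 10.0.0.10))), "\n";
```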
In-Place editing system
on Oct 10, 2000 at 23:13 UTC
by Intrepid

Update made on Fri, 25 Jul 2003 +0000 ...

...mostly for historical interest; if this horrible code has to remain up on PM it might as well be readable (removed the <CITE> tags and so forth).

Generally: Perl with the -i switch (setting $^I) does in-place editing of files.


Passed a set of arguments for the string to match to a regex and the replacement string, this system will find every file in the current working directory which matches the glob argument and replace all instances of the string.

There's some ugliness here: the need to make 2 duplicates of the @ARGV array (@Y and @O) and the need to write a TMP file to disk (major YEEECHHH). So I am offering it as a work-in-progress with the additional acknowledgement that it *must* be merely a reinvent of the same wheel built by many before me, yet I never have found a finished script to do this (particularly on Win32 which does not possess a shell that will preglob the filename argument for you).

The system consists of two parts: one is the perl one-liner (note WinDOS -style shell quoting which will look so wrong to UNI* folk) and the other is a script file on-disk (which is called by the one-liner). It's probably a decidedly odd way to do this and I may later decide that I must have been hallucinating or otherwise mentally disabled when I wrote it :-).

The system keeps a fairly thorough log for you of what was done in the current session. If option flag -t is set, it will reset all the timestamps on completion of the replacements, allowing one to make a directory-wide substitution without modifying the lastmod attribute on the files (might be highly desirable in certain situations ... ethical ones of course). The -d switch currently doesn't work (like I said, this is a work in progress); when it worked, it was for debugging, that is, doing a dry run.

The Perl -s switch (option flag to perl itself) wouldn't give me the ability to set custom options in this use (with -e set) nor would the module Getopt::Std. One of my reasons for posting this code is to invite help in coming up with an answer as to "why" (particularly in the case of the latter).
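The heart of such a system can be sketched without the one-liner wrapper. edit_in_place below is my own helper, not the posted code: it preglobs the pattern itself (since cmd.exe will not) and lets $^I do the in-place editing with a backup, exactly as perl -i.bak would.

```perl
use strict;
use warnings;

# In-place search and replace across every file matching a glob.
sub edit_in_place {
    my ($glob, $find, $replace) = @_;
    local $^I   = '.bak';           # keep a backup of each edited file
    local @ARGV = glob($glob);      # preglob for shells that do not
    return 0 unless @ARGV;          # nothing matched: avoid reading STDIN
    my $changed = 0;
    while (<>) {
        $changed += s/\Q$find\E/$replace/g;
        print;                      # with $^I set, print rewrites the file
    }
    return $changed;
}
```

Calling edit_in_place('*.txt', 'foo', 'bar') rewrites every matching file in the current directory, leaves *.bak backups, and returns the number of substitutions made.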

on Aug 08, 2000 at 23:57 UTC
by turnstep

Just a simple hex editor-type program. I actually wrote this back in 1996, so go easy on it! :) I cleaned it up a little to make it strict-compliant, but other than that, it is pretty much the same. I used it a lot when I was learning about how gif files are constructed. Good for looking at files byte by byte.

I have no idea why it was named wanka but it has stuck. :)

Automatic CODE-tag creation (Prototype)
on Jun 21, 2000 at 20:28 UTC
by Corion
Out of a discussion about how we can prevent newbies from posting unreadable rubbish, here is a program that tries to apply some heuristics to make posts more readable. This version isn't the most elegant, so it's called a prototype.
on Aug 01, 2000 at 00:05 UTC
by Tally
PINE is a common text-based email viewer on many UNIX systems. The PINE program stores email in large text files, which makes it very handy to archive your old email... except that there's no table of contents at the beginning of the file to let you know what messages are stored there. This script solves that problem by parsing the PINE email store and creating a separate table of contents from the headers of each email. The resulting TOC lists the message number, title, sender info and date in formatted columns. I usually concatenate the TOC and email storage file, and then save the resulting file in my email archives.

Note: This script works very well with version 3.96 of PINE, which I use, but there are other versions that I have not tested it on.

PLEASE comment on this code. I'm a fairly new perl programmer and would appreciate feedback on how to improve my programming.
on Aug 01, 2000 at 19:29 UTC
by mikfire
A down and dirty little script that reads a file, looking for subroutine definitions. It extracts these and then parses through a whole bunch of other files looking for calls to those functions. It isn't perfect, but it works pretty well.

Usage: list_call source file [ ...]
where source is the file from which to extract the calls and file is the file to be searched.
Glossary Maker
on Apr 27, 2000 at 00:35 UTC
by gregorovius
I wrote a set of scripts that will automatically find rare words in a book or text.

1. The first script will FTP a very large number of ascii coded classic books from the gutenberg project (

2. The second one computes a histogram of word frequencies for all those books.

3. The third one takes the text where one wants to find rare words. It will start by showing all the words in it with count 0 in the histogram, then the ones with count 1 and so on. The user chooses manually which words he wants to include in the glossary and then chooses to stop as the scripts starts showing words with higher counts.

4. The chosen words are looked up automatically on web dictionary.

5. We have our glossary ready! The next step is unimplemented but what follows is to generate a TeX file for type-setting the ascii book with the dictionary terms as footnotes or as a glossary on the back.

Note: The scripts are short and easy to understand but not too orderly or properly documented. If you want to continue developing them feel free to do so, but please share any improvements you make to them.

Here is a description of their usage: LIST OF LASTNAMES

Will download all the book titles under each author on the list of names into a local archive. It must be run from the directory where the archive resides.


% mkdir archive
% Conan\ Doyle Conrad Gogol Darwin

After running these commands, archive will contain one subdirectory for each author, and each of these will contain all the books for that author on Project Gutenberg.

Will generate a DB database file containing a histogram of word frequencies of the book archive created by the program.

To use it, just run it from the directory where the 'archive' directory was created. It will generate two files, one of them called index.db containing the histogram and the other called indexedFiles.db containing the names of the files indexed so far (this last one allows us to add books to the archive and index them without analyzing again the ones we already had).

Note that this script is very inefficient and requires a good deal of free memory on your system to run. A new version should use MySQL instead of DB files to speed it up. BOOK_FILE

Will take a book from the archive created by the first script and will look up the count for each of its words in the histogram of word frequencies created by the second. Starting with the less frequent words, it will prompt the user to choose which ones to include in the glossary. When the user stops choosing words, the program will query a web dictionary and print the definition of all the chosen words to STDOUT.

String Buffer
on Apr 25, 2000 at 21:12 UTC
by kayos

Sometimes I encounter a script or program that wants to print directly to STDOUT (like Parse::ePerl), but I want it in a scalar variable. In those cases, I use this StringBuffer module to make a filehandle that is tied to a scalar.


use StringBuffer;

my $stdout = tie(*STDOUT, 'StringBuffer');
print STDOUT 'this will magically get put in $stdout';

undef($stdout);
untie(*STDOUT);
Line Ending Converter
on Apr 25, 2000 at 19:50 UTC
by kayos

This converts the line-endings of a text file (with unknown line-endings). It supports DOS-type, Unix-type, and Mac-type. It converts the files "in place", so be careful.

You call it like:

linendings --unix file1.txt file2.txt ...
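The conversion itself reduces to one substitution. This is a sketch of the idea, not the script's actual code; matching CRLF before bare CR is what makes it safe on files whose current endings are unknown:

```perl
use strict;
use warnings;

# Normalise any mix of CRLF (DOS), CR (Mac) and LF (Unix) endings
# to the requested style.
sub convert_endings {
    my ($text, $style) = @_;
    my %ending = (unix => "\n", dos => "\r\n", mac => "\r");
    my $nl = $ending{$style} or die "unknown style: $style";
    $text =~ s/\r\n|\r|\n/$nl/g;    # CRLF must be tried first
    return $text;
}

print convert_endings("a\r\nb\rc\n", 'unix');   # a, b, c on Unix lines
```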