http://www.perlmonks.org?node_id=177287

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've got an archive of scripts, and I'd like to be able to mark them up in some way, for searching, so that they distinguish between, say, stage directions and speech, or find each character's speeches -- for instance:

<stage-direction>Enter Fred, stage right</stage-direction> <speech>FRED: Hey everyone, what's up?</speech>

or maybe

<stage-direction>Enter Fred, stage right</stage-direction> <speech character="Fred">Hey everyone, what's up?</speech>

and I know that this goes against the "do it properly" ethos, but if I didn't want to use XML, has anybody got any suggestions for an XML-like custom markup which makes flat text files easy to search?

It doesn't have to be case-insensitive or even whitespace-independent, say if it was just:

begin-stage-direction Enter Fred, stage right end-stage-direction

That would be cool. Then it can just be read with some regex/for-loop thing like:

if(/^begin-stage-direction/){ $notice_next_line = 1; }else{ next; }
kind of construct.
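
A minimal sketch of that kind of reader, assuming markers never nest and that every `begin-NAME` has a matching `end-NAME`. The helper name `extract_blocks` and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the text between "begin-NAME" and "end-NAME"
# markers. Assumes markers never nest and that marker names consist of
# word characters and hyphens.
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    # /s lets a block span several lines; \1 forces the end marker to
    # carry the same name as the begin marker.
    while ( $text =~ /\bbegin-([\w-]+)\s+(.*?)\s+end-\1\b/gs ) {
        push @blocks, { type => $1, text => $2 };
    }
    return @blocks;
}

my $sample = <<'END';
begin-stage-direction Enter Fred, stage right end-stage-direction
begin-speech FRED: Hey everyone, what's up? end-speech
END

for my $block ( extract_blocks($sample) ) {
    print "$block->{type}: $block->{text}\n";
}
```

Searching is then a matter of filtering the returned list on `type`. The sketch would misfire if the script text itself contained something like `end-speech`, so the marker words would need to be chosen with that in mind.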

I know there's a big "why not do it properly in a database" hanging over this, but assuming text files with custom markup, is there a non-stupid way to do this?
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: Poor Man's XML?
by mirod (Canon) on Jun 26, 2002 at 04:07 UTC

    You could have a look at yaml, a terse yet powerful non-markup language. There is even a YAML module.

    Your example would look like:

    stage-direction: Enter Fred, stage right
    speech:
      character: Fred
      text: >
        Hey everyone, what's up?
      Along similar lines to YAML, there is also SOX. Unfortunately, the implementation is Java-based.

      -- stefp -- check out TeXmacs wiki

Re: Poor Man's XML?
by lestrrat (Deacon) on Jun 26, 2002 at 05:16 UTC

    I second graff's suggestion. Why not just use XML, if you're going to be marking it up anyways?

    Once you put it in XML, you can do a bunch of cool things, plus you get the bonus of not having to write yet another parser.

    use XML::LibXML;

    my $parser = XML::LibXML->new();
    my $xml    = $parser->parse_file('/path/to/file');

    ## find all <speech> elements
    foreach my $speech ( $xml->findnodes('/script/speech') ) {
        ....
    }

    ## find all speech by Romeo (assuming:
    ##   <speech><character>Romeo</character><text>....</text></speech>
    ## )
    foreach my $romeo_speech ( $xml->findnodes('/script/speech[character = "Romeo"]') ) {
        ....
    }

    ## etc...

Re: Poor Man's XML?
by kvale (Monsignor) on Jun 26, 2002 at 04:17 UTC
    If the tags don't nest, then it seems that you could get by with just start tags:

    <stage right> Enter Juliet
    <speech Juliet> Romeo! Romeo! Where the heck are you!
    <audience> claps.

    The content after each tag is delimited by the next tag or the EOF. You could decompose this format into its respective parts using, e.g.,

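    my ( %part, $tag );
    while (<>) {
        if (/^<(.+?)>(.*)$/) {
            # A new tag: remember it and start collecting its content.
            $tag = $1;
            $part{$tag} .= "$2\n";
        }
        elsif ( defined $tag ) {
            # Continuation line: append to the most recent tag's content.
            $part{$tag} .= $_;
        }
    }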
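
A variation on the same idea, sketched under the assumption that every tag opens a line and looks like `<kind>` or `<kind argument>`. It keeps the records in script order and splits the tag into two parts, so that one character's speeches can be picked out. The name `parse_start_tags` and the record fields are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: split start-tag-only markup into ordered records.
# Assumes every tag opens a line and looks like <kind> or <kind argument>.
sub parse_start_tags {
    my (@lines) = @_;
    my @records;
    for my $line (@lines) {
        if ( $line =~ /^<(\w+)\s*([^>]*)>\s*(.*)$/ ) {
            push @records, { kind => $1, arg => $2, text => $3 };
        }
        elsif (@records) {
            # Continuation line: belongs to the most recent tag.
            $records[-1]{text} .= "\n$line";
        }
    }
    return @records;
}

my @script = (
    '<stage right> Enter Juliet',
    '<speech Juliet> Romeo! Romeo!',
    'Where the heck are you!',
    '<audience> claps.',
);

# Print only Juliet's speeches.
for my $rec ( parse_start_tags(@script) ) {
    next unless $rec->{kind} eq 'speech' && $rec->{arg} eq 'Juliet';
    print "$rec->{text}\n";
}
```

Keeping a list rather than a hash keyed on the tag means repeated tags do not get merged, at the cost of a linear scan when searching.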

    -Mark
Re: Poor Man's XML?
by graff (Chancellor) on Jun 26, 2002 at 05:01 UTC
    Just curious... what might be the compelling reasons for not wanting to use XML? It doesn't seem like you're objecting to the additional bulk in the data, since some of your considered choices were potentially bulkier; and (speaking as one who had to do basic SGML data hacking years ago, before modules) it's not as if XML tags are all that tough or unwieldy for this sort of usage.

    If/when you get around to making this data portable/sharable -- trust me, there are many who could find untold uses for it, if you're willing to make it available -- you'll really want to have it in XML anyway, so why not just do it that way now?

Re: Poor Man's XML?
by peschkaj (Pilgrim) on Jun 26, 2002 at 04:01 UTC
    I would suggest using something like comments (perhaps the standard #). These can be easily ignored by something like perldoc, but can be used as a pre-processing instruction if you want to convert to XML or any other format.

    A database would take up a lot of overhead... A LOT of overhead. And require that you learn SQL. Which seems to go against reasons for not using XML (as they are both languages and have a distinct syntax e.g. DTD, XSLT, XSL-FO for XML).
Re: Poor Man's XML?
by Popcorn Dave (Abbot) on Jun 26, 2002 at 04:03 UTC
    A quick thumb through The Complete HTML Reference came up with using either an id or a class attribute.

    Then with your regex, you could base it on whatever class or id you've assigned to your lines.

    Caveat: I don't know XML or if this would interfere with it, but it would probably do what you want quickly and without a whole lot of hassle.

    Hope that helps!

    Some people fall from grace. I prefer a running start...

Re: (newrisedesigns) Poor Man's XML?
by newrisedesigns (Curate) on Jun 26, 2002 at 04:25 UTC

    If you aren't going to go with a pre-defined format, make sure your format has strict guidelines and your programs adhere to them.

    I suggest using a combination of tabs and newlines (CR or LF) to delimit certain areas.

    I've used ## and \t\t\n as delimiters before. When it comes down to the wire, whatever works well (and reliably) is your best option.

    You can use any kind of delimiter, just as long as the delimiter is stripped/replaced before you add it to your file, or else you'll break the format and your program.
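
As a rough illustration of that last point, here is a sketch of a one-record-per-line, tab-delimited format in which the delimiters are escaped on the way in and restored on the way out. The function names and the escaping scheme (`\t`, `\n`, `\\`) are assumptions for the example, not an established format:

```perl
use strict;
use warnings;

# Hypothetical helpers for a one-record-per-line, tab-delimited format.
# Assumption: tab and newline are the delimiters, so any tab, newline, or
# backslash inside a field is written as \t, \n, or \\ instead.
my %escape   = ( "\\" => '\\\\', "\t" => '\\t', "\n" => '\\n' );
my %unescape = ( '\\' => "\\",   't'  => "\t",  'n'  => "\n" );

sub encode_record {
    my (@fields) = @_;
    return join "\t",
        map { my $f = $_; $f =~ s/([\\\t\n])/$escape{$1}/g; $f } @fields;
}

sub decode_record {
    my ($line) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields.
    return map { my $f = $_; $f =~ s/\\([\\tn])/$unescape{$1}/g; $f }
        split /\t/, $line, -1;
}

my $line   = encode_record( 'speech', 'Fred', "Hey everyone,\twhat's up?" );
my @fields = decode_record($line);
print "$fields[1]: $fields[2]\n";
```

Because the backslash is escaped too, the encoding is reversible: text that already contains a literal `\t` sequence survives the round trip unchanged.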

    In defense of module-driven apps, XML::* mods are worth using. But if you can't, you can't. Good luck.

    John J Reiser
    newrisedesigns.com

Re: Poor Man's XML?
by brianarn (Chaplain) on Jun 26, 2002 at 16:16 UTC
    I'm currently reading through the new Perl & XML book, and I must say that XML is a lot more flexible and useful than I'd have imagined, and it seems like it'd be suited for this task.

    You could even design a simple DTD for it, requiring things such as a <dialogue> tag with a character attribute - you could even split up the acts into their own parts of the tree. Here's a quick ad-hoc sample that comes to mind (just from somewhere in the middle, not including a DTD or the root element):
    <act number="2">
      <stage-direction character="Fred">
        Enter, stage right
      </stage-direction>
      <dialogue character="Fred">
        Man, it's hot out there!
      </dialogue>
      <stage-direction character="Fred">
        Wipe forehead
      </stage-direction>
    </act>
    It could just be that I'm loving this new book, but it seems like XML would be the markup of choice here, and you learn a lot in the process.

    ~Brian
Re: Poor Man's XML?
by Cody Pendant (Prior) on Jun 27, 2002 at 03:20 UTC
    Thank you all for your help.

    I think my aversion to learning XML is really an aversion to learning XSLT -- I read a bit of the O'Reilly book and it just seemed like page after page of "know how to do a 'for' loop? Well it's possible, though incredibly complicated and wordy, to do in XSLT".

    I've been learning Perl and various other things like JavaScript for ages, and I haven't got the room in my brain to start learning a new and forbidding language.

    I might try to do it with XML, but not using XSLT, using the module as suggested by Lestrrat.

    After all, even if I ended up writing my own really bad parser, just because it's not being read by a formal XML parser, doesn't mean it couldn't be valid XML in case someone else wanted it...
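
For what such a home-grown reader might look like, here is a deliberately crude sketch. It is not an XML parser: it assumes `<speech>` elements never nest, that `character="..."` is the only attribute, and that the text contains no entities, comments, or CDATA. The name `speeches_by` is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: pull speeches out of well-behaved markup with a regex.
# Assumptions: <speech> elements never nest, the only attribute is
# character="...", and the text contains no entities, comments, or CDATA.
sub speeches_by {
    my ( $markup, $who ) = @_;
    my @found;
    while ( $markup =~ m{<speech\s+character="([^"]*)"\s*>(.*?)</speech>}gs ) {
        push @found, $2 if $1 eq $who;
    }
    return @found;
}

my $markup = <<'END';
<stage-direction>Enter Fred, stage right</stage-direction>
<speech character="Fred">Hey everyone, what's up?</speech>
END

print "$_\n" for speeches_by( $markup, 'Fred' );
```

The files stay readable by a real XML parser later on, as long as the markup itself is kept well-formed.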
    --

    ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

      XSLT isn't quite as bad as it looks at first glance. I haven't seen the O'Reilly book myself, but I did manage to get a decent start from the tutorial at W3Schools. I learned enough to replace some pretty nasty looping structures in some of my scripts with much more elegant (and faster) XSL transforms using XML::LibXML and XML::LibXSLT.

      <xsl:rattus/>
