PerlMonks  

Transforming strange format to XML

by Tortue (Scribe)
on Nov 17, 2001 at 23:02 UTC

Tortue has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a s///g construct to transform lines like:
    [_CELL_]This is a sentence on one line.
    [_CELL_] Cell 1  [_CELL_] Cell 2  [_CELL_] Cell 3
into:
    <CELL>This is a sentence on one line.</CELL>
    <CELL>Cell 1</CELL><CELL>Cell 2</CELL><CELL>Cell 3</CELL>
and I'm feeling dumb because I can't think of one off the top of my head.

Bonus question: Does anyone recognize that format? All I know is it's probably produced by FrameMaker (buggily) and used by O'Reilly for book publishing. It also includes things like:

    [_Body_] ... a paragraph on one line ...
    ... {_XRef#89948_} ...
    ... [_Fi_]something in italics[_F_] ...
etc. I don't like this format much (at all), so I'm trying to make a tool to convert it into XML and back. And I'd love to hear: "That's FMEZ format! Joe X has a great module for converting it!"

Replies are listed 'Best First'.
Re: Transforming strange format to XML
by chipmunk (Parson) on Nov 17, 2001 at 23:31 UTC
    I don't recognize the format either, so I'll just provide a substitution to transform the cell tags in a line, according to your example.

        s,\[_CELL_\]\s*(.*?)\s*?(?=\[_CELL_\]|$),<CELL>$1</CELL>,g;

    This matches from an opening [_CELL_] up to the next opening [_CELL_] or the end of the line, and sticks in the opening and closing <CELL> tags, swallowing any leading and trailing whitespace.

    This substitution may need to be adjusted based on the details of the actual format.
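To see the substitution in action, here is a minimal sketch that runs it over the two sample lines from the question. The helper name `cells_to_xml` is made up for illustration, and the regex is written with `s{}{}` delimiters rather than commas; otherwise it is the same pattern as above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): wrap each [_CELL_] run
# in <CELL>...</CELL>, trimming whitespace around the cell text.
sub cells_to_xml {
    my ($line) = @_;
    # Lazy (.*?) stops at the next [_CELL_] or at end of line,
    # whichever the lookahead sees first.
    $line =~ s{\[_CELL_\]\s*(.*?)\s*?(?=\[_CELL_\]|$)}{<CELL>$1</CELL>}g;
    return $line;
}

# The two sample lines from the original question.
my @samples = (
    '[_CELL_]This is a sentence on one line.',
    '[_CELL_] Cell 1  [_CELL_] Cell 2  [_CELL_] Cell 3',
);

print cells_to_xml($_), "\n" for @samples;
# Prints:
# <CELL>This is a sentence on one line.</CELL>
# <CELL>Cell 1</CELL><CELL>Cell 2</CELL><CELL>Cell 3</CELL>
```

The lookahead `(?=...)` checks for the next tag without consuming it, which is what lets the `/g` loop pick up the following cell from exactly that position.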

      I obviously hadn't quite mastered combining positive lookahead and non-greediness (selflessness?) yet. Until now my solution was to use two statements:
          s{\[_(CELL|COLHEAD)_\] (.*)}{<$1>$2</$1>}g;
          s{\s*\[_(CELL|COLHEAD)_\]\s*}{</$1><$1>}g;
      Here it's more complicated because there are several of these tags. It's lamer and slower than yours, so I'll gladly change it, thanks!

      Pauses to think for a while... Hm, with that extra twist I just introduced (not fair, I know), maybe the two-step version isn't slower. Lookahead on something it doesn't know yet could get tricky, maybe.

          s,\[_(CELL|COLHEAD)_\]\s*(.*?)\s*?(?=\[_\1_\]|$),<$1>$2</$1>,g;
      But it seems to work fine, and I don't think I care about speed anyway.
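A quick sketch of how the backreference version behaves, for anyone trying it out. Because the lookahead uses `\1`, a cell only ends at the next tag of the same name (or at end of line), so a line that mixes CELL and COLHEAD leaves the inner tag untouched. The second helper is an assumed variant, not something from the thread, whose lookahead stops at either tag. Both helper names and the mixed sample line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The one-step substitution as posted: \1 in the lookahead means
# "the same tag name that opened this cell".
sub tags_to_xml {
    my ($line) = @_;
    $line =~ s{\[_(CELL|COLHEAD)_\]\s*(.*?)\s*?(?=\[_\1_\]|$)}{<$1>$2</$1>}g;
    return $line;
}

# Assumed variant: stop at the next tag of either kind, so lines
# that mix CELL and COLHEAD are split correctly.
sub tags_to_xml_mixed {
    my ($line) = @_;
    $line =~ s{\[_(CELL|COLHEAD)_\]\s*(.*?)\s*?(?=\[_(?:CELL|COLHEAD)_\]|$)}{<$1>$2</$1>}g;
    return $line;
}

my $same  = '[_COLHEAD_] Name  [_COLHEAD_] Value';
my $mixed = '[_COLHEAD_] Name  [_CELL_] Value';   # hypothetical input

print tags_to_xml($same), "\n";
# <COLHEAD>Name</COLHEAD><COLHEAD>Value</COLHEAD>

print tags_to_xml($mixed), "\n";
# <COLHEAD>Name  [_CELL_] Value</COLHEAD>

print tags_to_xml_mixed($mixed), "\n";
# <COLHEAD>Name</COLHEAD><CELL>Value</CELL>
```

Whether mixed lines ever occur in the real files is unknown, so the `\1` form may be all that is needed.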
Re: Transforming strange format to XML
by Hero Zzyzzx (Curate) on Nov 17, 2001 at 23:34 UTC

    I don't know what format it is, but if it's standard you could probably find utilities to convert it to xml, or another format that you could then convert into xml.

    You might look at File::MMagic, or the Unix 'file' command. I believe you can set them to peek into the file (if they don't already do it by default) to identify what type of file it is by looking for certain clues.

    It'd be nice to know what beastie you're dealing with before you go and roll your own xx-to-xml solution, huh?

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

Re: Transforming strange format to XML
by fuzzysteve (Beadle) on Nov 19, 2001 at 19:00 UTC
    Hmmm, you just want the [_ _] converted to < >, right? I can't think of how to do that in one line, but
        s/\[_/</g; s/_\]/>/g;
    should be capable of doing it (I might be wrong on the escaping).
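A small sketch of this delimiter-only idea, assuming the second pattern is meant to match `_]`. The helper name `brackets_to_angles` is made up for illustration; the sample line is the italics example from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite only the delimiters, "[_" to "<"
# and "_]" to ">". No closing tags are generated.
sub brackets_to_angles {
    my ($line) = @_;
    $line =~ s/\[_/</g;
    $line =~ s/_\]/>/g;
    return $line;
}

print brackets_to_angles('... [_Fi_]something in italics[_F_] ...'), "\n";
# ... <Fi>something in italics<F> ...

print brackets_to_angles('[_CELL_] Cell 1  [_CELL_] Cell 2'), "\n";
# <CELL> Cell 1  <CELL> Cell 2
```

Since this only swaps the delimiters, the output has opening tags but no `</CELL>`-style closers, so it would still need a second pass before an XML parser would accept it.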
