Re: Parsing SGML-ish Data Files

by mirod (Canon)
on Aug 02, 2012 at 06:47 UTC (#984953)

in reply to Parsing SGML-ish Data Files

It is pretty difficult to answer you with the little information you give us, but I'll try anyway ;--(

If the data is not SGML, XML or HTML I don't think you should try to use SGML/XML/HTML tools on it. SGML and XML tools will simply not accept the data, and HTML tools will try their best, but their guess may not be what you expect.

It really depends on the format of your data files, but I would probably try to first convert the data to XML, using regexps, and then use XML tools which are usually pretty fast. But that's because I am used to processing XML, and my output is usually either XML or HTML, so an XML transformation gives me the result I want.

Also, the problem with your finite state machine may be the tokenizer. If you scan the input one character at a time, C-style, this may not be optimal; tokenizing using regexps may be faster.
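
To make the tokenizer suggestion concrete, here is a minimal sketch of regexp-based tokenizing in Perl, using the \G anchor with the /gc modifiers so that each match resumes where the previous one stopped. The tag syntax (bare <name> and </name>, no attributes) and the function name tokenize are assumptions for illustration, since the real data format has not been shown.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: splits a buffer into open tags, close tags and text.
# Assumed syntax: <name> and </name> with word-character names, no attributes.
sub tokenize {
    my ($buf) = @_;
    my @tokens;
    pos($buf) = 0;
    while (pos($buf) < length $buf) {
        # \G anchors at the current position; /c keeps pos() on a failed match.
        if    ($buf =~ m{\G</(\w+)>}gc) { push @tokens, [ close => $1 ] }
        elsif ($buf =~ m{\G<(\w+)>}gc)  { push @tokens, [ open  => $1 ] }
        elsif ($buf =~ m{\G([^<]+)}gc)  { push @tokens, [ text  => $1 ] }
        elsif ($buf =~ m{\G(<)}gc)      { push @tokens, [ text  => $1 ] } # stray '<'
    }
    return @tokens;
}

# Print each token as type:value.
print join(' ', map { "$_->[0]:$_->[1]" } tokenize('<a>hi<b></a></b>')), "\n";
```

Each regexp consumes a whole token in one step, so the per-character loop moves from Perl code into the regexp engine. Whether that is actually faster for a given data set would need to be measured.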

Replies are listed 'Best First'.
Re^2: Parsing SGML-ish Data Files
by coolmichael (Deacon) on Aug 15, 2012 at 19:27 UTC

    Well, the end goal is converting to XML. Regular expressions for the conversion aren't going to work very well, as the tags aren't properly nested as they are in XML/HTML/SGML. For example <a><b></a></b> is considered valid.
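
To illustrate the nesting problem, the following sketch tracks open tags on a stack and reports every close tag that does not match the innermost open tag. The helper name misnested_closes and the attribute-free tag syntax are assumptions; the real format may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the names of close tags that do not match
# the innermost open tag, i.e. the places where XML-style nesting breaks.
# Assumes bare <name> / </name> tags with no attributes.
sub misnested_closes {
    my ($data) = @_;
    my (@stack, @bad);
    while ($data =~ m{<(/?)(\w+)>}g) {
        my ($is_close, $name) = ($1, $2);
        if (!$is_close) {
            push @stack, $name;
        }
        elsif (@stack && $stack[-1] eq $name) {
            pop @stack;
        }
        else {
            push @bad, $name;
            # Drop the matching open tag, wherever it sits, so later closes
            # are judged against what is still open.
            for my $i (reverse 0 .. $#stack) {
                if ($stack[$i] eq $name) { splice @stack, $i, 1; last }
            }
        }
    }
    return @bad;
}

print "misnested: @{[ misnested_closes('<a><b></a></b>') ]}\n";
```

For the example above, the close tag for a arrives while b is still the innermost open tag, which is the point where a straight tag-for-tag rewrite would stop producing well-formed XML.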

    I do think the speed problem is in the tokenizer. I am doing the scan one character at a time (from a buffer in memory, at least). I'm not sure how I could do that with regular expressions, but it's a good idea to look into.

      If you can show us enough of the actual structure of the data and describe the constraints on tags, attributes, etc., we should be able to at least sketch a regex-based solution or offer other alternatives for you.

      True laziness is hard work
