PerlMonks  

Re: Parsing SGML-ish Data Files

by mirod (Canon)
on Aug 02, 2012 at 06:47 UTC


in reply to Parsing SGML-ish Data Files

It is pretty difficult to answer with the little information you gave us, but I'll try anyway ;--(

If the data is not SGML, XML or HTML, I don't think you should try to use SGML/XML/HTML tools on it. SGML and XML tools will simply reject the data, and HTML tools will try their best, but their guesses may not be what you expect.

It really depends on the format of your data files, but I would probably try to first convert the data to XML, using regexps, and then use XML tools which are usually pretty fast. But that's because I am used to processing XML, and my output is usually either XML or HTML, so an XML transformation gives me the result I want.

Also, the problem with your finite state machine may be the tokenizer. If you scan the input one character at a time, C-style, that is probably not optimal in Perl; tokenizing with regexps may be faster, since each match consumes a whole token inside the regex engine.
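
To make the tokenizer idea concrete, here is a minimal sketch using \G and the /gc modifiers, so that each match picks up where the previous one stopped. I have not seen your data, so the token shapes (<name ...>, </name>, plain text) and the sub name tokenize are my assumptions; adjust the patterns to the real format.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: splits a buffer into open-tag, close-tag and
# text tokens. The tag syntax matched here is an assumption, not the
# real format of the data files.
sub tokenize {
    my ($buf) = @_;
    my @tokens;
    pos($buf) = 0;
    while (pos($buf) < length $buf) {
        if ($buf =~ /\G<\/([^<>\s]+)\s*>/gc) {          # </name>
            push @tokens, [ close => $1 ];
        }
        elsif ($buf =~ /\G<([^<>\s\/]+)([^<>]*)>/gc) {  # <name ...>
            push @tokens, [ open => $1, $2 ];
        }
        elsif ($buf =~ /\G([^<]+)/gc) {                 # run of text
            push @tokens, [ text => $1 ];
        }
        else {
            # a '<' that does not start a tag: keep it as text
            $buf =~ /\G(.)/gcs;
            push @tokens, [ text => $1 ];
        }
    }
    return @tokens;
}

my @t = tokenize('<a>hi<b></a></b>');
print join(' ', map { "$_->[0]:$_->[1]" } @t), "\n";
# prints: open:a text:hi open:b close:a close:b
```

The /c modifier keeps the match position when a pattern fails, which is what lets the next alternative be tried at the same spot. Your state machine can then work on whole tokens instead of single characters.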


Re^2: Parsing SGML-ish Data Files
by coolmichael (Deacon) on Aug 15, 2012 at 19:27 UTC

    Well, the end goal is converting to XML. Regular expressions alone aren't going to work very well for the conversion, as the tags aren't properly nested the way they are in XML/HTML/SGML. For example, <a><b></a></b> is considered valid.
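
    To show what the overlap means for an XML conversion, here is one possible way to handle it, as a rough sketch only: keep a stack of open elements and, when a closing tag arrives out of order, close and then reopen whatever sits above it. The sub name renest is made up, the code assumes bare <name> and </name> tags with no attributes, it does not escape & or > in text, and whether splitting an element in two is acceptable depends entirely on what the data means.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite overlapping tags as properly nested ones
# by closing and reopening the elements that are in the way.
# Assumes bare tags of the form <name> and </name>, no attributes.
sub renest {
    my ($in) = @_;
    my @open;
    my $out = '';
    while ($in =~ /\G(?:<(\/?)(\w+)>|([^<]+)|(<))/gc) {
        my ($slash, $name, $text, $stray) = ($1, $2, $3, $4);
        if (defined $text)  { $out .= $text;  next }
        if (defined $stray) { $out .= '&lt;'; next }
        if ($slash eq '') {                # opening tag
            push @open, $name;
            $out .= "<$name>";
            next;
        }
        # closing tag: skip it if no element of that name is open
        next unless grep { $_ eq $name } @open;
        my @reopen;
        while ((my $top = pop @open) ne $name) {
            $out .= "</$top>";             # close what sits above it
            unshift @reopen, $top;
        }
        $out .= "</$name>";
        for my $tag (@reopen) {            # reopen in the original order
            $out .= "<$tag>";
            push @open, $tag;
        }
    }
    $out .= "</$_>" for reverse @open;     # close anything left open
    return $out;
}

print renest('<a>x<b>y</a>z</b>'), "\n";
# prints: <a>x<b>y</b></a><b>z</b>
```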

    I do think the speed problem is in the tokenizer. I am doing the scan one character at a time (from a buffer in memory, at least). I'm not sure how I could do that with regular expressions, but it's a good idea to look into.

      If you can show us enough of the actual structure of the data and describe the constraints on tags, attributes, etc., we should be able to at least sketch a regex-based solution or offer other alternatives for you.

      True laziness is hard work
