PerlMonks  

Re^3: Extracting appropriate language text from HTML data

by john_oshea (Priest)
on May 28, 2006 at 15:06 UTC


in reply to Re^2: Extracting appropriate language text from HTML data
in thread Extracting appropriate language text from HTML data

In case you're not aware of this, HTML 4 and later let you add a lang="xx" attribute to both block-level and inline elements, which may or may not make parsing a bit easier.
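As a rough illustration of how lang attributes could be used when parsing, here is a minimal sketch in Python using only the standard library's html.parser. The class name LangChunkParser, the helper extract_chunks and the sample markup are all made up for this example; it assumes text should be attributed to the nearest enclosing element that carries a lang attribute, and it ignores text with no language at all.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class LangChunkParser(HTMLParser):
    """Collect (lang, text) pairs, inheriting lang from enclosing elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # (lang, text) pairs in document order
        self._open = []    # stack of (tag, effective lang) for open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        inherited = self._open[-1][1] if self._open else None
        lang = dict(attrs).get("lang") or inherited
        self._open.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self._open[-1][1]:
            self.chunks.append((self._open[-1][1].lower(), text))


def extract_chunks(html):
    """Hypothetical helper: return all language-tagged text in `html`."""
    parser = LangChunkParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = '<p lang="it">Ciao</p><p lang="en">Hello <span lang="fr">monde</span></p>'
print(extract_chunks(sample))
# [('it', 'Ciao'), ('en', 'Hello'), ('fr', 'monde')]
```

Note that this only tells you which language each piece of text is in. It says nothing about which pieces are translations of each other.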

One question for clarification: what should the system do if your user requests, for example, French, but the source document is Italian in origin, and has more translations for some 'chunks' (for want of a better term) in EN than FR?

i.e. chunk 1 has IT & EN translations, chunk 2 has IT, EN & FR, chunk 3 has IT only - chunk 2 would obviously return the FR version and chunk 3 the IT (as it's the only one available), but what about chunk 1? What would the user expect to see for that?
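To make the chunk 1 question concrete, here is a small Python sketch of per-chunk selection with an ordered preference list. The function pick_text, the dictionary layout and the placeholder strings are assumptions for illustration only. The sketch does not answer what the user would expect; it only shows that the answer comes down to how the fallback order after FR is chosen.

```python
def pick_text(chunk, preferences):
    """Return (lang, text) for the first preferred language the chunk has."""
    for lang in preferences:
        if lang in chunk:
            return lang, chunk[lang]
    raise LookupError("no acceptable language available for this chunk")


# The three chunks from the example above, with placeholder text.
chunks = [
    {"it": "chunk 1 (IT)", "en": "chunk 1 (EN)"},
    {"it": "chunk 2 (IT)", "en": "chunk 2 (EN)", "fr": "chunk 2 (FR)"},
    {"it": "chunk 3 (IT)"},
]

# French first, then English, then the Italian original.
english_first = [pick_text(c, ["fr", "en", "it"])[0] for c in chunks]
print(english_first)   # ['en', 'fr', 'it']

# French first, then straight back to the Italian original.
original_first = [pick_text(c, ["fr", "it", "en"])[0] for c in chunks]
print(original_first)  # ['it', 'fr', 'it']
```

Chunks 2 and 3 come out the same either way; only chunk 1 depends on whether EN or the IT original is ranked higher as a fallback.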

Re^4: Extracting appropriate language text from HTML data
by UnderMine (Friar) on May 28, 2006 at 21:55 UTC
    Thanks for that.

    I am currently treating each paragraph separately, using panic_languages to fall back where no direct translation is available.

    You have raised an interesting point: should there be some overall scheme that balances paragraph readability against document readability? But to do this there has to be a relationship between the alternate parts of the text.

    The current markup does not show how alternate parts relate but just what language that chunk is in. A better markup would indicate alternate parts and group them together.
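If the alternate parts were grouped, one possible document-level scheme could look like the Python sketch below. It is only one guess at what "overall scheme" might mean: pick a single fallback language for the whole document (the acceptable language that covers the most chunks) and use it wherever the requested language is missing. The names document_fallback and render, the dictionary-per-chunk layout and the placeholder text are all hypothetical.

```python
from collections import Counter


def document_fallback(chunks, acceptable):
    """Pick the acceptable language that covers the most chunks.

    Ties go to the language listed first in `acceptable`.
    """
    coverage = Counter(
        lang for chunk in chunks for lang in chunk if lang in acceptable
    )
    return max(acceptable, key=lambda lang: coverage[lang])


def render(chunks, wanted, acceptable):
    """Use `wanted` where available, otherwise one document-wide fallback."""
    fallback = document_fallback(chunks, acceptable)
    result = []
    for chunk in chunks:
        for lang in (wanted, fallback):
            if lang in chunk:
                result.append((lang, chunk[lang]))
                break
        else:
            # Last resort: whatever language this chunk happens to have.
            lang = next(iter(chunk))
            result.append((lang, chunk[lang]))
    return result


# Each dict groups the alternate versions of one chunk (placeholder text).
chunks = [
    {"it": "chunk 1 (IT)", "en": "chunk 1 (EN)"},
    {"it": "chunk 2 (IT)", "en": "chunk 2 (EN)", "fr": "chunk 2 (FR)"},
    {"it": "chunk 3 (IT)"},
]

langs = [lang for lang, _ in render(chunks, "fr", ["en", "it"])]
print(langs)  # ['it', 'fr', 'it']
```

Here the reader sees two languages (FR and IT) instead of three, at the cost of not getting EN for chunk 1 even though it exists. Whether that trade is the right one is exactly the open question.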

    Thanks
    UnderMine

      Given your database constraints, I'm not sure that you're going to come up with a 'better' solution. Given that not every chunk is available in all languages, you're (effectively) going to have to decide at each chunk what's going to be the 'best' piece of text to return at that point, and I can't at the moment see a more elegant way of doing that...

Re^4: Extracting appropriate language text from HTML data
by UnderMine (Friar) on May 29, 2006 at 16:01 UTC
    You have a very good point in relation to different chunks having to be processed separately. I have started a separate thread on Marking up alternatives, as this is more of a meditation on the nature of markups.

    Thanks
    UnderMine
