Re^3: Extracting appropriate language text from HTML data

In case you're not aware of this, you can add lang="xx" attributes to both block-level and inline elements in HTML 4 and later, which may make parsing a bit easier.
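
To illustrate how lang attributes could drive the parsing, here is a rough sketch in Python using the standard library's html.parser. The class name LangCollector, the helper collect_lang_chunks and the sample markup are all made up for this example; it assumes reasonably well-formed HTML and that an element without its own lang attribute inherits the language of its nearest ancestor.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Hypothetical helper: collect (lang, text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, effective_lang) for each open element
        self.chunks = []  # (lang, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        own = dict(attrs).get("lang")
        inherited = self.stack[-1][1] if self.stack else None
        # An element without lang inherits from its parent.
        self.stack.append((tag, own or inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, dropping unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1][1] if self.stack else None
            self.chunks.append((lang, text))


def collect_lang_chunks(html):
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = '<p lang="it">Ciao <span lang="en">hello</span></p><p lang="fr">Bonjour</p>'
print(collect_lang_chunks(sample))
# [('it', 'Ciao'), ('en', 'hello'), ('fr', 'Bonjour')]
```

Note that this only tells you which language each piece of text is in; it does not say which pieces are translations of each other.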

One question for clarification: what should the system do if your user requests, for example, French, but the source document is Italian in origin, and has more translations for some 'chunks' (for want of a better term) in EN than FR?

For example, chunk 1 has IT and EN translations, chunk 2 has IT, EN and FR, and chunk 3 has IT only. Chunk 2 would obviously return the FR version and chunk 3 the IT (as it's the only one available), but what about chunk 1? What would the user expect to see for that?
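
To make the chunk 1 question concrete, here is a small sketch (Python, purely illustrative) of choosing a translation per chunk from an ordered preference list. The function name pick_translation, the dict-per-chunk layout and the sample texts are assumptions for this example, not anything from the actual system. The answer for chunk 1 depends entirely on which fallback order you pass in.

```python
def pick_translation(chunk, preferences):
    """Return (lang, text) for the first preferred language the chunk has.

    chunk is a dict mapping language code to text (a hypothetical layout);
    preferences is an ordered list of language codes, most wanted first.
    Returns None if none of the preferred languages is available.
    """
    for lang in preferences:
        if lang in chunk:
            return lang, chunk[lang]
    return None


# The three chunks from the example above (texts are placeholders).
chunks = [
    {"it": "testo 1", "en": "text 1"},
    {"it": "testo 2", "en": "text 2", "fr": "texte 2"},
    {"it": "testo 3"},
]

# User wants French, falls back to English, then the Italian source.
print([pick_translation(c, ["fr", "en", "it"])[0] for c in chunks])
# ['en', 'fr', 'it']

# User wants French, otherwise go straight to the Italian source.
print([pick_translation(c, ["fr", "it"])[0] for c in chunks])
# ['it', 'fr', 'it']
```

Either order is defensible; the first favours a language the reader may find easier, the second favours the original text.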

Re^4: Extracting appropriate language text from HTML data by UnderMine (Friar) on May 28, 2006 at 21:55 UTC
Thanks for that. I am currently treating each paragraph separately, using panic_languages to fall back where no direct translation is available. You have raised an interesting point about whether there should be some overall scheme that balances paragraph readability against document readability. But to do this there has to be a relationship between alternate parts of the text. The current markup does not show how alternate parts relate, only what language each chunk is in. A better markup would indicate alternate parts and group them together. Thanks UnderMine
Re^5: Extracting appropriate language text from HTML data by john_oshea (Priest) on May 29, 2006 at 12:15 UTC
Given your database constraints, I'm not sure that you're going to come up with a 'better' solution. Given that not every chunk is available in all languages, you're (effectively) going to have to decide at each chunk what's going to be the 'best' piece of text to return at that point, and I can't at the moment see a more elegant way of doing that...
Re^4: Extracting appropriate language text from HTML data by UnderMine (Friar) on May 29, 2006 at 16:01 UTC
You have a very good point about different chunks having to be processed separately. I have started a separate thread on Marking up alternatives, as this is more of a meditation on the nature of markup. Thanks UnderMine

