Unicode combining accents problem
The question
I've read carefully about Unicode but I can't figure something out. Consider the accented European characters such as E WITH ACUTE (é) or U WITH UMLAUT (ü). A recommended way of handling each of these characters in the modern Unicode style is to COMPOSE them from two Unicode characters: first the underlying character (say, e), and second the combining accent. Here is the HTML sequence for E, COMBINING ACUTE:
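The two encodings can be inspected with a short Python sketch. It uses only the standard library; the helper names `describe` and `html_refs` are made up for illustration, and decimal numeric character references are just one of several equivalent ways to write the sequence in HTML.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def html_refs(text):
    """Encode each character as a decimal HTML numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

precomposed = "\u00e9"   # é stored as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(describe(precomposed))
print(describe(decomposed))
print(html_refs(decomposed))
# The two strings look alike when rendered well, but they are different code point sequences.
print(precomposed == decomposed)
```

A renderer that handles combining marks properly should draw both strings identically, even though a plain string comparison reports them as different.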
I have tried to write files in this recommended format, and I find that the resulting files do not display well in any of my Linux text-rendering environments. Specifically,
Am I doing something wrong? The GTK text box works so well with other Unicode alphabets (e.g. Hindi, Korean) that I am surprised to find it is not working with European Unicode.
The Answer.
Markus Kuhn said:
Chances are that you don't do anything wrong, though I haven't done a
lot of experiments with combining characters recently on current
software versions. In general, combining accents are not yet well
supported under Linux/X11 with European fonts, as most people use UTF-8
only in NFC (the combined form) today. With the old pixel core fonts,
xterm implemented combining characters by simple unaligned overstriking
of character-cell
glyphs, which may lead to unsatisfactory results for characters taller
than x. Modern font technologies have a mechanism to represent a
combination of 2 or 3 unicode characters by a single glyph, which is all
that is needed for Indic rendering. Another mechanism is used to place
any accent onto any character (not just those from a small precomposed
set), but most European outline fonts available lack the additional data
necessary, namely the additional reference points in the glyph design
needed for alignment. Instead, most of them contain just a set of
precomposed glyphs from NFC to cover the standard language repertoires.
The only things I can recommend at present are:
- use NFC wherever possible
- search for an OpenType encoded font that has all the necessary
information included (though I don't know which X widget sets
already make correct use of these; best ask on the respective
GTK mailing lists)
- try it with a specialised Unicode editor such as Yudit, which
has its own OpenType-compliant text rendering engine, and which
together with the right font might give you the best chance
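The first recommendation, using NFC wherever possible, can be sketched in Python with the standard `unicodedata` module. The helper names `to_nfc` and `leftover_marks` are hypothetical. The second helper illustrates the limit Markus describes: NFC only composes pairs that have a precomposed character, so any other base-plus-accent combination still depends on the font and renderer positioning the mark.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters with following combining marks where a precomposed character exists."""
    return unicodedata.normalize("NFC", text)

def leftover_marks(text):
    """Name the combining marks that survive NFC because no precomposed form exists."""
    return [unicodedata.name(ch) for ch in to_nfc(text) if unicodedata.combining(ch)]

decomposed = "e\u0301 u\u0308"   # e + acute, u + diaeresis
print(to_nfc(decomposed) == "\u00e9 \u00fc")

# q with acute has no precomposed code point, so the combining mark remains.
print(leftover_marks("q\u0301"))
```

Running text through `to_nfc` before writing a file gives the form that most European fonts of the kind described here cover with precomposed glyphs.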
As Markus said, I am not doing anything wrong. You can see how Yudit renders test.txt in figure 4. The UTF-8 text is rendered correctly. (Dasher's output is also correct UTF-8.) The problem is simply that most text-widget and font authors have not bothered to make European fonts comply with the new Unicode "decompose" convention. It's a shame, because it means we can't yet make Dasher work in the most user-friendly way. (For example, I think French would be more natural in Dasher if one wrote "e" followed by "acute".)
We should ask the makers of the GTK textbox to fix this problem somehow. I guess the problem is with the fonts. I checked many of the fonts available for this text widget, and only one of them (ClearlyU, sadly only available in one size) rendered all European characters like "é" and "ü" correctly. The font "Clean" gets an honorable mention: it renders all the combining marks that I tried correctly, except for the cedilla. "Verdana", "Courier New", "Dingbats", "Newspaper", "Nimbus Roman No9", "Sans", "Times" and "Standard Symbols L" were next best: they rendered the "é" correctly, and the grave too. Many of the fonts make the error of putting the acute or grave on the following character, which is never correct in Unicode.
Oct 2004