While there are efforts underway to effectively kill the Linux virtual terminal (VT) console by punting its functionality off to user-space, it’s not dead yet: a new patch series out on Wednesday aims to modernize the Unicode handling of the Linux VT.
Open-source developer Nicolas Pitre posted a patch series yesterday for improving the Unicode support within the Linux VT. Pitre explained:
“– All new double-width Unicode code points which have been introduced since Unicode 5.0 are not recognized as such (we’re at Unicode 16.0 now).
– Zero-width code points are not recognized at all. If you try to edit files containing a lot of emojis, you will see the rendering issues. When there are a lot of zero-width characters (like “variation selectors”), long lines get wrapped, but any Unicode-aware editor thinks that the content was rendered properly and its rendering logic starts to work in very bad ways. Combine this with tmux or screen, and there is a huge mess going on in the terminal.
– Also, text which uses combining diacritics has the same effect as text with zero-width characters as programs expect the characters to take fewer columns than what they actually do.
Some may argue that the Linux VT console is unmaintained and/or not used much any longer and that one should consider a user space terminal alternative instead. But every such alternative that is not less maintained than the Linux VT console does require a full heavy graphical environment and that is the exact antithesis of what the Linux console is meant to be.
Furthermore, there is a significant Linux console user base represented by blind users (which I’m a member of) for whom the alternatives are way more cumbersome to use, reducing our productivity. So it has to stay and be maintained to the best of our abilities.
That being said…
This patch series is about fixing all the above issues. This is accomplished with some Python scripts leveraging Python’s unicodedata module to generate C code with lookup tables that is suitable for the kernel. In summary:
– The double-width code point table is updated to the latest Unicode version and the table itself is optimized to reduce its size.
– A zero-width code point table is created and the console code is modified to properly use it.
– A table with base character + combining mark pairs is created to convert them into their precomposed equivalents when they’re encountered. By default the generated table contains most commonly used Latin, Greek, and Cyrillic recomposition pairs only, but one can execute the provided script with the --full argument to create a table that covers all possibilities. Combining marks that are not listed in the table are simply treated like zero-width code points and properly ignored.
– All those tables plus related lookup code require about 3500 additional bytes of text which is not very significant these days. Yet, one can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out if need be.”
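To give a rough idea of the approach Pitre describes, here is a minimal Python sketch of how the unicodedata module can classify code points, collapse them into ranges, and print those ranges as a C lookup table. It is not the code from the patch series: the function names, the `ucs_interval` struct name, and the choice of general categories treated as zero-width are all assumptions made for illustration, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def is_double_width(cp: int) -> bool:
    """East Asian Width 'W' (wide) or 'F' (fullwidth) takes two columns."""
    return unicodedata.east_asian_width(chr(cp)) in ("W", "F")


def is_zero_width(cp: int) -> bool:
    """Simplifying assumption: nonspacing marks (Mn), enclosing marks (Me)
    and format characters (Cf) take no column of their own."""
    return unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf")


def build_ranges(predicate, limit=0x110000):
    """Collapse matching code points into inclusive (first, last) ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        if predicate(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def emit_c_table(name, ranges):
    """Format ranges as a C array; 'struct ucs_interval' is a made-up name."""
    lines = [f"static const struct ucs_interval {name}[] = {{"]
    lines += [f"\t{{ 0x{first:04X}, 0x{last:04X} }}," for first, last in ranges]
    lines.append("};")
    return "\n".join(lines)


def recompose(base: str, mark: str):
    """Return the precomposed character for base + mark, or None if
    Unicode defines no single-code-point equivalent."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


if __name__ == "__main__":
    # Restrict to the first 0x400 code points to keep the output short.
    print(emit_c_table("zero_width_sample", build_ranges(is_zero_width, 0x0400)))
    print(recompose("e", "\u0301"))  # e + combining acute accent
```

Storing ranges rather than individual code points keeps such a table compact, since the kernel can then binary-search it for each incoming character. The `recompose` helper mirrors the idea behind the recomposition table: a base character followed by a combining mark is replaced by a single precomposed code point where one exists, and a mark with no match would fall back to being treated as zero-width.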