
Ugh, Unicode has been the bane of my existence while trying to write a text format spec. I started by trying to forbid certain characters to keep files editable and to avoid Unicode rendering exploits (like hiding text, or making structured text behave differently than it looks), but in the end it became so much like herding cats that I had to just settle on https://github.com/kstenerud/concise-encoding/blob/master/ct...

Basically, allow everything except some separators, most control chars, and some lookalike characters (a list that has to be updated as more characters are added to Unicode). It's not as clean as I'd like, but it's at least manageable this way.
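
To make that policy concrete, here is a minimal sketch in Python of an "allow everything except" check built on general categories. The names `REJECTED_CATEGORIES`, `ALLOWED_CONTROLS`, `LOOKALIKES`, `is_text_safe`, and `first_unsafe` are hypothetical, and the sets are illustrative assumptions, not the lists from the linked spec.

```python
import unicodedata

# Assumption: illustrative categories only, not the spec's actual list.
# Cc = control chars, Zl = line separator, Zp = paragraph separator.
REJECTED_CATEGORIES = {"Cc", "Zl", "Zp"}

# Assumption: the few control chars a text format usually still needs.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

# Assumption: a hand-maintained lookalike list (here, two characters that
# render like an ordinary space). It would need review whenever the
# Unicode version changes.
LOOKALIKES = {"\u00a0", "\u3000"}


def is_text_safe(ch: str) -> bool:
    """Return True if a single character passes the sketch policy."""
    if ch in ALLOWED_CONTROLS:
        return True
    if ch in LOOKALIKES:
        return False
    return unicodedata.category(ch) not in REJECTED_CATEGORIES


def first_unsafe(text: str):
    """Return the index of the first rejected character, or None."""
    for index, ch in enumerate(text):
        if not is_text_safe(ch):
            return index
    return None


if __name__ == "__main__":
    # The category data is tied to the interpreter's Unicode version.
    print("Unicode data version:", unicodedata.unidata_version)
    print(first_unsafe("plain text\twith a tab"))
    print(first_unsafe("bell\x07inside"))
```

Because `unicodedata` is pinned to whatever Unicode version the interpreter ships, both the category lookups and any hand-maintained list like `LOOKALIKES` can drift as new characters are added.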



Unfortunately, your "text safe" definition appears to exclude text in languages such as Persian (among others), where Zero Width Non-Joiner is required to write some words correctly.


Oh damn, good catch! Adding category Cf.


Thanks... though note that Cf also includes things like the bidi directional-override code points, which you might still prefer to exclude.


Yup, I've removed a bunch of Cf codes (it really is a grab bag!), but I'm not sure whether removing BIDI could be done without breaking some languages?

https://github.com/kstenerud/concise-encoding/blob/master/ce...
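
For illustration, a rough Python sketch of one way to split Cf: keep format characters such as Zero Width Non-Joiner (U+200C) but flag the explicit bidi embedding, override, and isolate controls (U+202A..U+202E and U+2066..U+2069). The names `BIDI_EXPLICIT` and `find_bidi_controls` are hypothetical, and treating that whole range as rejectable is an assumption. Whether real right-to-left text can live without them is exactly the open question above, so this version only reports them and leaves the decision to the caller.

```python
import unicodedata

# Explicit bidi formatting controls: LRE, RLE, PDF, LRO, RLO
# (U+202A..U+202E) and LRI, RLI, FSI, PDI (U+2066..U+2069).
# Assumption: the implicit marks LRM/RLM/ALM are deliberately left alone.
BIDI_EXPLICIT = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def find_bidi_controls(text: str):
    """Return (index, character name) for each explicit bidi control."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) in BIDI_EXPLICIT:
            hits.append((index, unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    # A Persian word that needs ZWNJ (U+200C, category Cf) to be spelled
    # correctly; it contains no explicit bidi controls.
    persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
    print(unicodedata.category("\u200c"), find_bidi_controls(persian))

    # A string hiding a right-to-left override in the middle.
    print(find_bidi_controls("abc\u202edef"))
```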


At least you finally got to the right conclusion. If you had started there, maybe you wouldn't have considered it hard to write your spec.



