Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?
Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
And that's why there are a hundred different possible values for LC_COLLATE, and it's completely normal that two popular Unix distributions picked different default values for that setting...right?
It would have been reasonable to conclude the article a third of the way through, and say "sorting is locale-dependent; if what you value is consistent behaviour between different OSs (instead of sorting based on the user's preferences), you need to implement the sorting yourself."
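For what it's worth, here is a rough Python sketch of what the "dumb" sort looks like. The word list is made up for illustration. Sorting by code point (or by UTF-8 bytes, which gives the same order) is perfectly consistent across OSs, but it puts every uppercase ASCII letter before every lowercase one and pushes Ñ past the whole ASCII range:

```python
# Hypothetical word list, just for illustration.
# "\u00d1" is a precomposed Ñ and "\u00fa" is ú, written as escapes so the
# example does not depend on how the source file happens to be normalized.
words = ["Zebra", "apple", "\u00d1and\u00fa", "Nube", "nube"]

# Python compares str values code point by code point, ignoring the locale.
by_code_point = sorted(words)

# Sorting on the UTF-8 encoding yields the same order, because UTF-8
# preserves code point order byte-wise.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
# ['Nube', 'Zebra', 'apple', 'nube', 'Ñandú']
```

This is reproducible everywhere, which is the point, but no human reader would call the result alphabetical.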
Before the Danish language adopted the letter "å" (in 1948), the vowel was written as "aa". In the Danish alphabet, "å" is the last letter. Therefore a list of three Danish city names would be correctly sorted as:
* Albertslund
* Odense
* Aarhus
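A minimal sketch of that rule in Python, assuming a simplified Danish alphabet (a to z, then æ, ø, å) and assuming every "aa" stands for "å". The names `DANISH_ALPHABET` and `danish_key` are made up for this example, and real Danish collation has more rules than this (foreign words containing "aa", case tie-breaking, other accented letters):

```python
# Simplified Danish alphabet: æ, ø and å come after z, with å last.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(name):
    """Sort key that treats the old spelling "aa" as the letter "å"."""
    # Assumption: every "aa" is the old spelling of "å".
    folded = name.lower().replace("aa", "å")
    # Characters outside the alphabet (spaces, hyphens, ...) sort last here.
    return [RANK.get(ch, len(RANK)) for ch in folded]

cities = ["Aarhus", "Albertslund", "Odense"]
print(sorted(cities, key=danish_key))
# ['Albertslund', 'Odense', 'Aarhus']
```

A plain `sorted(cities)` would put Aarhus first, which is the mismatch the list above is pointing at.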
This feels like material for another Tom Scott video.
Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).
The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).
There are a lot of sparse arrays and UTF-32 character data in compiled locales.
Incidentally, the command to dump a locale's data is:
LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`
Updated link to the file as https://opensource.apple.com/source/adv_cmds/adv_cmds-118/us... doesn't work anymore: https://github.com/apple-oss-distributions/adv_cmds/blob/adv...
So the ISO way is the right way, right?
I wondered the same. What's the right ordering?
It's not a stable sort?
(2020)
Yet another one of those POSIX and ISO things that most people don't bother to know about.
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1...