
h : 2 June 2012 • 7:54PM -0400

Re: those funny non-ASCII characters
by Xah Lee


On Jun 1, 8:17 pm, rusi <rustompm...@gmai...> wrote:
> On Jun 2, 2:06 am, Xah Lee <xah...@gmai...> wrote:
> > Xah wrote
> > > > 〈Unicode BOM Byte Order Mark Hack〉
> > > >
> > On Jun 1, 9:26 am, rusi <rustompm...@gmai...> wrote:
> > > See
> > > (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> > > but may
> > > be encountered in contexts where UTF-8 data is converted from other
> > > encoding forms..."
> > > More specifically the non-recommendation of bom:
> > > "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> > > Where UTF-8 is used transparently in 8-bit environments, the use of a
> > > BOM will interfere with any protocol or file format that expects
> > > specific ASCII characters at the beginning, such as the use of "#!" of
> > > at the beginning of Unix shell scripts. "
> > didn't i mention these 2 points exactly in the link i gave??
> Yeah your own link says this: (as you know I often use and quote your
> unicode pages :-) )
> - In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
> hack.
> - Many Window software add BOM to utf-8 files, e.g. Notepad.
> But you also say
> > If your lang spec says unicode, you have to support BOM mark
> So I am not clear whats ur stand...
> Let me make my own position clear:
> The de jure unicode standard is set by the unicode consortium (or
> whatever its called)
> The de facto standard is set by microsoft and java
> The two conflict

The BOM is part of the unicode standard. If a tech declares full
support for unicode, it has to support the BOM.

The BOM is a hack, but so is the unix shebang. With the BOM a given,
there wouldn't have been any problem if utf-8 hadn't been invented.
utf-8 was invented by Ken Thompson and unix fanatic Rob Pike, largely
to help the unix world move forward to unicode. As it is, the BOM
conflicts with the spirit of utf-8 (because utf-8 is meant to be ASCII
compatible as is, yet the BOM byte sequence, EF BB BF, isn't in ASCII.)
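
To make the conflict concrete, here is a minimal Python sketch. The
shell script text is made up for illustration; "utf-8-sig" is Python's
name for the codec that writes a BOM when encoding and strips one when
decoding. It only shows the bytes involved, not how any particular
kernel or tool reacts to them.

```python
import codecs

# The BOM is U+FEFF; encoded as utf-8 it becomes three non-ASCII bytes.
bom = "\ufeff".encode("utf-8")
assert bom == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# A made-up shell script, pure ASCII.
script = "#!/bin/sh\necho hi\n"

plain = script.encode("utf-8")         # no BOM: identical to the ASCII bytes
with_bom = script.encode("utf-8-sig")  # same text with a BOM prepended

# Without a BOM the file starts with "#!"; with a BOM it does not.
starts_with_shebang = plain.startswith(b"#!")
bom_breaks_shebang = not with_bom.startswith(b"#!")

# A reader that knows about the BOM recovers the same text either way.
decoded_same = (
    with_bom.decode("utf-8-sig") == plain.decode("utf-8-sig") == script
)
```

So a tool that wants to claim full unicode support can strip the BOM
on read, but anything that looks at the raw first two bytes sees EF BB
and not "#!".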

i read the link Thien-Thin Nguyen posted 〔http://〕. At first i found
it very informative, but in the end i wasn't convinced by its opinion
that we should all adopt utf-8 instead of utf-16. I think if one
switches attitude and sees utf-8 as the hack that introduced all these
problems, then many of their arguments for utf-8 don't stand.

side note... about that site, it's Windows oriented. As such, they
didn't explain many of the terms and Windows tech they use, e.g. i
have little idea what they mean by narrowchar or widechar, nor do i
know the many Windows libraries they mention.

also, the site is decidedly western-minded. They forgot that in
china, the encoding used is GB 18030, which covers the same char set
as unicode but with a different encoding, and is also compatible with
ascii. No utf-8 nor utf-anything whatsoever. Chinese web traffic is
like half of the world's or something.
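
A small Python sketch of the GB 18030 point, using the "gb18030" codec
that ships with Python. The sample strings are arbitrary; the sketch
only checks ASCII compatibility and round-tripping, nothing about how
widely the encoding is deployed.

```python
# ASCII text comes out as the same bytes in ASCII, utf-8 and GB 18030.
text = "hello"
ascii_same = (
    text.encode("gb18030") == text.encode("ascii") == text.encode("utf-8")
)

# A Chinese character gets different bytes in the two encodings.
han = "中"                    # U+4E2D
gb = han.encode("gb18030")    # two bytes in GB 18030
u8 = han.encode("utf-8")      # three bytes in utf-8

# GB 18030 also reaches characters outside the Basic Multilingual
# Plane, using a four-byte form, and round-trips them losslessly.
emoji = "\U0001F600"
gb_emoji = emoji.encode("gb18030")
round_trip_ok = (
    gb.decode("gb18030") == han and gb_emoji.decode("gb18030") == emoji
)
```

Same characters in, same characters out, just a different byte
layout: that is what "same char set, different encoding" means here.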

the site wishes utf-16 to go away. Windows, Mac, the NTFS and HFS+
file systems are all utf-16, plus java, C#, etc. Though the web
(html, xml, css) is mostly utf-8. Neither is likely to go away. If
Java and C# and NTFS disappeared from the face of this earth, then
maybe. lol. :D

