http://csharpindepth.com/Articles/General/Unicode.aspx
Scope of this page
This is a big topic. Don't expect this page to do more than scratch the surface - indeed, if you believe you're already fairly experienced and knowledgeable about character encodings and the like, this page may well not have anything new or useful for you. However, there are still many people who don't understand the difference between binary and text, or know what a character encoding is, etc. It is for these people that this page has been written. It mentions a few advanced topics, but only to make the reader aware of their existence, rather than to give much guidance on them.
Resources
See the original article (linked at the top of this page) for the resource links.
Binary and text - a big distinction
Most modern computer languages (and some older ones) make a big distinction between "binary" content and "character" (or "text") content.
The difference is largely the same as the instinctive one, but for the purposes of clarity, I'll define it here as:
- Binary content is a sequence of octets (bytes in common parlance) with no intrinsic meaning attached. Even though there may be external means of understanding a piece of binary content to be, say, a picture, or an executable file, the content itself is just a sequence of bytes. (Note for pedantic readers: from now on, I won't use the word "octet". I'll use "byte" instead, even though strictly speaking a byte needn't be an octet. There have been architectures with 9-bit bytes, for instance. I don't believe that's a particularly relevant or useful distinction to make in this day and age, and readers are likely to be more comfortable with the word "byte".)
- Character content is a sequence of characters.
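To make "no intrinsic meaning" concrete, here is a minimal sketch in Java. The class and method names are hypothetical, and the two encodings (ISO-8859-1 and UTF-8) are chosen purely for illustration. The same two bytes become different text depending on the external rule used to interpret them:

```java
import java.nio.charset.StandardCharsets;

public class BytesVersusText {
    // Interprets the bytes as ISO-8859-1, where every byte maps to one character.
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Interprets the same bytes as UTF-8, where a character may span several bytes.
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two bytes with no meaning of their own.
        byte[] data = { (byte) 0xC3, (byte) 0xA9 };

        String latin1 = asLatin1(data); // two characters: U+00C3 and U+00A9
        String utf8 = asUtf8(data);     // one character: U+00E9

        System.out.println("ISO-8859-1 length: " + latin1.length());
        System.out.println("UTF-8 length: " + utf8.length());
    }
}
```

Neither interpretation is "the" meaning of the bytes. The meaning only appears once something outside the content says how to read it.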
The Unicode Glossary defines a character as:
- The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
- Synonym for abstract character. (See Definition D3 in Section 3.3, Characters and Coded Representations: http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G2212)
- The basic unit of encoding for the Unicode character encoding.
- The English name for the ideographic written elements of Chinese origin. (See ideograph (2).)
That may or may not be a terribly useful definition to you, but for the most part you can again use your instinctive understanding - a character is something like "the capital letter A", "the digit 1" etc. There are other characters which are less obvious, such as: combining characters such as "an acute accent", control characters such as "newline", and formatting characters (invisible, but affect surrounding characters). The important thing is that these are fundamentally "text" in some form or other. They have some meaning attached to them.
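As a rough illustration of these kinds of character, the following Java sketch asks the standard `Character.getType` method for the Unicode general category of a few samples. The class name, the `describe` helper and the sample characters are my own choices: U+0301 is the combining acute accent and U+200E is the left-to-right mark, an invisible formatting character.

```java
public class CharacterKinds {
    // Maps a UTF-16 code unit to a rough description of its Unicode general category.
    static String describe(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:     return "uppercase letter";
            case Character.DECIMAL_DIGIT_NUMBER: return "decimal digit";
            case Character.NON_SPACING_MARK:     return "combining mark";
            case Character.CONTROL:              return "control character";
            case Character.FORMAT:               return "formatting character";
            default:                             return "other";
        }
    }

    public static void main(String[] args) {
        // Capital A, digit 1, combining acute accent, newline, left-to-right mark.
        char[] samples = { 'A', '1', '\u0301', '\n', '\u200E' };
        for (char c : samples) {
            System.out.printf("U+%04X: %s%n", (int) c, describe(c));
        }

        // An "e" followed by a combining acute accent is displayed as one
        // accented letter, but it is stored as two separate characters.
        String accented = "e\u0301";
        System.out.println("Length of e + combining accent: " + accented.length());
    }
}
```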
Now, unfortunately in the past, this distinction has been very blurred - C programmers are often used to thinking of "byte" and "char" as being interchangeable, to the extent that they will talk about reading a certain number of characters, even when the content is entirely binary. In modern environments such as .NET and Java, where the distinction is clear and present in the IO libraries, this can lead to people attempting to copy binary files by reading and writing characters, resulting in corrupt output.
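The corruption is easy to reproduce. The sketch below uses Java with in-memory streams standing in for files. The class and method names are hypothetical, and UTF-8 is an assumed encoding for the character-based copy. It copies the same bytes once as bytes and once as characters:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CopyDemo {
    // Correct for binary content: bytes in, bytes out, no interpretation.
    static byte[] copyAsBytes(byte[] source) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new ByteArrayInputStream(source)) {
            byte[] buffer = new byte[4];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Wrong for binary content: the bytes are decoded to characters and then
    // re-encoded, so any byte sequence that is not valid UTF-8 is altered.
    static byte[] copyAsChars(byte[] source) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(
                 new ByteArrayInputStream(source), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Arbitrary binary data; 0x89 and 0xFF are not valid UTF-8 on their own.
        byte[] binary = { (byte) 0x89, 0x50, 0x4E, 0x47, (byte) 0xFF, 0x00 };

        System.out.println("Byte copy intact: "
            + Arrays.equals(binary, copyAsBytes(binary)));
        System.out.println("Character copy intact: "
            + Arrays.equals(binary, copyAsChars(binary)));
    }
}
```

The byte-based copy round-trips exactly. The character-based copy does not, because the decoder substitutes a replacement character for each byte it cannot interpret, and the original values are lost.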