Unicode and .NET

http://csharpindepth.com/Articles/General/Unicode.aspx

Scope of this page

This is a big topic. Don't expect this page to do more than scratch the surface - indeed, if you believe you're already fairly experienced and knowledgeable about character encodings and the like, this page may well not have anything new or useful for you. However, there are still many people who don't understand the difference between binary and text, or know what a character encoding is, etc. It is for these people that this page has been written. It mentions a few advanced topics, but only to make the reader aware of their existence, rather than to give much guidance on them.

Resources

See the original article for the links.

Binary and text - a big distinction

Most modern computer languages (and some older ones) make a big distinction between "binary" content and "character" (or "text") content.

The difference is largely the same as the instinctive one, but for the purposes of clarity, I'll define it here as:

  • Binary content is a sequence of octets (bytes in common parlance) with no intrinsic meaning attached. Even though there may be external means of understanding a piece of binary content to be, say, a picture, or an executable file, the content itself is just a sequence of bytes. (Note for pedantic readers: from now on, I won't use the word "octet". I'll use "byte" instead, even though strictly speaking a byte needn't be an octet. There have been architectures with 9-bit bytes, for instance. I don't believe that's a particularly relevant or useful distinction to make in this day and age, and readers are likely to be more comfortable with the word "byte".)
  • Character content is a sequence of characters.
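
As a rough illustration of the two definitions above, here is a minimal Python sketch. Python is used only as a stand-in; in .NET the corresponding types are byte[] and string. The variable names are made up for the example. It shows that the same two bytes can be read as text or as a number, depending entirely on an interpretation supplied from outside:

```python
# Binary content: two bytes with no meaning of their own.
raw = bytes([0x48, 0x69])

# One external interpretation: the bytes are ASCII-encoded text.
as_text = raw.decode("ascii")

# Another external interpretation: the bytes are a big-endian integer.
as_number = int.from_bytes(raw, "big")

# Character content: a sequence of characters, independent of any byte layout.
characters = list(as_text)

print(raw, as_text, as_number, characters)
```

Neither reading is "the" meaning of the bytes. The interpretation comes from whoever consumes them, which is exactly why the two kinds of content need to be kept apart.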

The Unicode Glossary defines a character as:

  1. The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
  2. Synonym for abstract character. (See Definition D3 in Section 3.3, Characters and Coded Representations, at http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G2212 )
  3. The basic unit of encoding for the Unicode character encoding.
  4. The English name for the ideographic written elements of Chinese origin. (See ideograph (2).)

That may or may not be a terribly useful definition to you, but for the most part you can again use your instinctive understanding - a character is something like "the capital letter A", "the digit 1" etc. Other characters are less obvious: combining characters such as "an acute accent", control characters such as "newline", and formatting characters (invisible, but they affect surrounding characters). The important thing is that these are fundamentally "text" in some form or other. They have some meaning attached to them.
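
The less obvious kinds of character mentioned above can be inspected with Python's standard unicodedata module. This is only a sketch: the sample code points (U+0301 for the acute accent, U+200D as an example of an invisible formatting character) are chosen here for illustration and do not come from the article.

```python
import unicodedata

# A letter, a digit, a combining accent, a control character,
# and an invisible formatting character (zero width joiner).
samples = ["A", "1", "\u0301", "\n", "\u200d"]

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch in samples:
    # Control characters have no formal name, so a default is supplied.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X}  {categories[ch]}  {name}")

# A base letter followed by a combining accent is two characters,
# even though it is usually displayed as a single accented letter.
accented = "e" + "\u0301"
print(len(accented))
```

The general categories reported here (letter, digit, non-spacing mark, control, format) are one way of seeing that all five samples are still "text", even when nothing visible is drawn for them.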

Now, unfortunately in the past, this distinction has been very blurred - C programmers are often used to thinking of "byte" and "char" as being interchangeable, to the extent that they will talk about reading a certain number of characters, even when the content is entirely binary. In modern environments such as .NET and Java, where the distinction is clear and present in the IO libraries, this can lead to people attempting to copy binary files by reading and writing characters, resulting in corrupt output.
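
The corruption described above can be reproduced in a few lines. The sketch below uses Python rather than .NET, works on an in-memory buffer instead of real files, and uses arbitrary byte values invented for the example. The errors="replace" setting is an assumption, chosen to imitate a decoder that silently substitutes a replacement character for bytes it cannot decode.

```python
import io

# Arbitrary binary data; several of these bytes are not valid UTF-8.
original = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00, 0x10])

# Wrong: "copy" the data by decoding it to characters and encoding it again.
as_text = original.decode("utf-8", errors="replace")
corrupted_copy = as_text.encode("utf-8")

# Right: copy the bytes as bytes, never passing through characters.
source = io.BytesIO(original)
destination = io.BytesIO()
destination.write(source.read())
faithful_copy = destination.getvalue()

print(corrupted_copy == original)  # the character-based copy has changed
print(faithful_copy == original)   # the byte-based copy is identical
```

The undecodable bytes are lost for good once they have been turned into replacement characters, so no later step can repair the copy. Binary data should be read and written only through byte-oriented APIs.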

Time: 2024-12-28 17:54:28
