Unicode, UTF, ASCII, ANSI format differences

Going down your list:

  • "Unicode" isn't an encoding, although unfortunately a lot of documentation uses the term imprecisely to mean whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Code points outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs, i.e. two code units (4 bytes). (These are relatively rarely used - which is a good job, as very few developers get them right, I suspect. I very much doubt that I do.)
  • UTF-8: Variable-length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed-width encoding using 4 bytes per code point. This isn't very space-efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single-byte encoding using only the bottom 7 bits (Unicode code points 0-127). No accents etc.
  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/code page for my system", which is obtained via Encoding.Default on .NET Framework (on .NET Core and later, Encoding.Default is always UTF-8). It is often Windows-1252 but can be another code page.
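
To make the size differences in the list above concrete, here is a minimal Python sketch (Python rather than .NET, purely for brevity). The helper names `encoded_sizes` and `surrogate_pair` are made up for this illustration, and Windows-1252 (`cp1252`) stands in for "ANSI" on the assumption that it is the system code page, which it may well not be on your machine.

```python
# Sketch: how many bytes each encoding needs for the same text.
# Helper names are illustrative only; cp1252 is just one possible "ANSI".

def encoded_sizes(text):
    """Return {encoding name: byte count}, or None where text is unencodable."""
    sizes = {}
    for name in ("ascii", "cp1252", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # the text has a character this encoding lacks
    return sizes


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # ASCII letter, accented letter, euro sign, and one non-BMP code point.
    for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(sample):04X}", encoded_sizes(sample))
    high, low = surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
```

The little-endian codec names are used so that no byte order mark is added to the counts; plain `"utf-16"` and `"utf-32"` would each prepend one.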

There's more on my Unicode page and tips for debugging Unicode problems.

The other big resource of course is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

Time: 2024-10-10 05:28:48

Articles related to "Unicode, UTF, ASCII, ANSI format differences"

Unicode and ASCII

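In Unicode, the Chinese character "字" corresponds to the number 23383 (decimal), which is 5B57 in hexadecimal. There are several ways to represent the number 23383 as data in a program, including UTF-8, UTF-16 and UTF-32. UTF stands for "UCS Transformation Format", which can be read as "Unicode character set transformation format", i.e. how the numbers defined by Unicode are converted into program data. For example, "汉字" corresponds to the numbers 0x6c49 and 0x5b57, and the encoded program data is: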
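
The excerpt is cut off before it shows the encoded data, so the following is only a hedged sketch of how one might compute it in Python, not a reconstruction of the original listing. `hex_bytes` is an illustrative helper, and the big-endian codecs are an arbitrary choice made so the bytes read in the same order as the code points.

```python
# Sketch: the bytes behind "汉字" in three Unicode encodings.
# hex_bytes is an illustrative helper, not part of any library.

def hex_bytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


text = "汉字"
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Big-endian variants are used so the bytes read in the same order as
# the code point; these codecs do not write a byte order mark.
encoded = {
    name: hex_bytes(text.encode(name))
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(code_points)
for name, value in encoded.items():
    print(f"{name:10} {value}")
```

Each character takes 3 bytes in UTF-8, 2 in UTF-16 and 4 in UTF-32, even though the underlying code points are the same in all three cases.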

UNICODE and ASCII

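1. Characteristics of ASCII: ASCII is an encoding standard for representing English characters. Each ASCII character occupies 1 byte, so a one-byte encoding can represent at most 256 characters (00H-FFH). This is not a problem for English, which generally uses only the first 128 (00H-7FH, high bit 0). The other 128 characters, with the high bit set to 1 (80H-FFH), are known as "extended ASCII" and are generally used for box-drawing characters, some phonetic symbols and other symbols. For more complex languages such as Chinese, however, 256 characters are clearly not enough. So countries each defined their own text encoding standards, among them Chi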

Conversion functions between Unicode strings and ANSI strings, and code for getting the program's running path

#pragma once
#include <string>

namespace stds {
class tool {
public:
    std::string ws2s(const std::wstring& ws) {
        std::string curLocale = setlocale(LC_ALL, NULL); // curLocale = "C";
        setlocale(LC_ALL, "chs");
        const wchar_t* _Sou

Creating a folder and solving the Unicode/ASCII conversion problem

# -*- coding: UTF-8 -*-
import sys
import time
import os

# Work around Unicode/ASCII conversion problems (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf8')

context = '''hello world'''
f = open("hello.txt", 'a+')
f.write(context)
f.close()
da
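
The `reload(sys)` / `sys.setdefaultencoding` trick above exists only in Python 2. As a hedged sketch of a Python 3 alternative (not the excerpt author's code), one can name the encoding explicitly when opening the file. The helper names `append_text` and `read_text`, and the `logs` folder, are assumptions made for the illustration.

```python
import os
import tempfile


def append_text(path, text, encoding="utf-8"):
    """Create the parent folder if needed, then append text to the file."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # Naming the encoding avoids depending on the platform default.
    with open(path, "a", encoding=encoding) as handle:
        handle.write(text)


def read_text(path, encoding="utf-8"):
    """Read the whole file back using the same explicit encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from leaving files behind.
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "logs", "hello.txt")
        append_text(target, "hello world\n")
        append_text(target, "汉字\n")
        print(read_text(target))
```

Passing `encoding=` on every `open` call is more verbose than changing a global default, but it keeps the behaviour the same regardless of the system's locale or code page.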

The story of Unicode and ASCII

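The origin of ASCII: After the computer was invented, an encoding called ASCII was defined to represent characters in computers. ASCII uses 7 bits of a byte, covering 0x00 - 0x7F, 128 characters in total. Later it was found that box-drawing characters were missing when these characters needed to be printed in tabular form. So the definition was extended to use all 8 bits of a byte, which is called extended ASCII, covering 0x00 - 0xFF, 256 characters in total. Unicode in detail: 1. The easily misunderstood two bytes: the first version of Unicode used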

The difference between Unicode and ASCII

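① ASCII encodes the 26 English letters and some common symbols, and was later extended with a second half. In short, it encodes with one byte, and the part above 128 holds special symbols. ASCII cannot encode anything else; for example, there is no such thing as "the ASCII code of a Chinese character needs 2 characters". ASCII is only ever one byte. ② Unicode is large enough to encode all the languages on Earth, so everything ASCII can represent is of course included in Unicode. Unicode code points were originally designed to fit in 2 bytes (the range has since grown to U+10FFFF); UTF-8, UTF-16 and so on exist to improve overall encoding efficiency in different application environments. For example, if a certain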

Writing your own Unicode-to-ASCII conversion, wchar* to char*

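For characters in the ASCII range, the char is simply the first byte of the Unicode wchar (on a little-endian system with 2-byte wchar_t, such as Windows). For example, wchar[20] = "qqqq"; is laid out in memory as the chars 'q' '\0' repeated, so to convert Unicode to an ASCII char ourselves we only need to take the first byte. Below is a wchar-to-char conversion function I wrote. Since the code is simple, a memory leak check is included.
#include <stdio.h>
#ifdef _DEBUG
#define DEBUG_CLIENTBLOCK new( _CLI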
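
The excerpt's code is cut off, so what follows is not its function but a small Python sketch of the same "take the first byte" idea. It assumes 2-byte little-endian code units (UTF-16-LE, as on Windows), and the names `first_bytes` and `to_ascii` are hypothetical. It also shows where the trick breaks down: anything outside the ASCII range is silently turned into an unrelated character.

```python
def first_bytes(text):
    """Keep byte 0 of every 2-byte UTF-16-LE code unit (the 'first byte' trick)."""
    raw = text.encode("utf-16-le")
    return raw[0::2]  # every second byte, starting with the low byte


def to_ascii(text):
    """Convert to ASCII, marking unencodable characters with '?' instead."""
    return text.encode("ascii", errors="replace")


if __name__ == "__main__":
    print(first_bytes("qqqq"))   # ASCII text survives the trick unchanged
    print(first_bytes("汉"))     # U+6C49 is stored as 49 6C, so only 0x49 is kept
    print(to_ascii("q汉"))       # the lossy step is at least visible here
```

Letting the codec do the conversion means the lossy cases are flagged (or raise an error with the default `errors="strict"`) rather than producing plausible-looking but wrong output.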

Unicode support across systems and languages: Character sets and encodings, a deep dive into Unicode (UTF & UCS)

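http://www.cnblogs.com/Johness/p/3322445.html Unicode support across systems and languages: Windows NT supports Unicode from the ground up (unfortunately, Windows 98 supports only a small part of Unicode). The C programming language, inherently bound by ANSI, supports Unicode through its support for wide character sets. Windows uses UTF-16 internally, and Linux uses UTF-32 for wide characters (unverified). C# and Java support UTF-16 by default (for example, strings are natively arrays of UTF-16 characters, and Java can also use '\uxx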

Unicode, UTF-8, ASCII: differences and relationships

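Unicode is a standard, UTF-8 is one encoding of that standard, and ASCII is also an encoding. A Chinese character in this range takes two bytes in UTF-16. Unicode range of Chinese characters: 4E00~9FA5. A Chinese character takes three bytes in UTF-8. UTF-8 range of Chinese characters: E4B880~E9BEA5. Computers transmit everything as byte streams, so to decide whether some part of a byte stream is a Chinese character, first establish that the stream is UTF-8, then check that the sequence falls within E4B880~E9BEA5.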
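
A hedged sketch of the range check described above, in Python. The function names are made up for the illustration, the byte-level check assumes its input is already a valid UTF-8 encoding of exactly one character, and the UTF-8 bounds are computed from the code points U+4E00 and U+9FA5 rather than hard-coded.

```python
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FA5

# Derive the UTF-8 bounds from the code points instead of typing them in.
UTF8_FIRST = chr(CJK_FIRST).encode("utf-8")  # E4 B8 80
UTF8_LAST = chr(CJK_LAST).encode("utf-8")    # E9 BE A5


def is_cjk_char(ch):
    """Check a single decoded character against the code point range."""
    return CJK_FIRST <= ord(ch) <= CJK_LAST


def is_cjk_utf8(seq):
    """Check one character's UTF-8 bytes against the byte range.

    Assumes seq is a valid UTF-8 encoding of a single character; bytes
    compare lexicographically, which matches code point order in UTF-8.
    """
    return len(seq) == 3 and UTF8_FIRST <= seq <= UTF8_LAST


if __name__ == "__main__":
    for ch in ("汉", "字", "A", "\u20ac"):
        print(repr(ch), is_cjk_char(ch), is_cjk_utf8(ch.encode("utf-8")))
```

When the data is already decoded to a string, comparing code points as in `is_cjk_char` is usually simpler than working on raw bytes.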