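In SQL Server, the character types are the non-Unicode types (char, varchar) and the Unicode types (nchar, nvarchar). A collation determines how character data is sorted and compared. For the non-Unicode types, the collation also determines the code page, and therefore which characters can be represented.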
Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type.
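I. Three attributes associated with character data: Locale, Code Page, and Sort Order
1. Locale: tied to a region or to a language and culture; different languages use different character sets.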
A locale is a set of information that is associated with a location or a culture. This can include the name and identifier of the spoken language, the script that is used to write the language, and cultural conventions. Collations can be associated with one or more locales.
2. Code Page: the character set supported by the OS
A code page is an ordered set of characters of a given script in which a numeric index, or code point value, is associated with each character. A Windows code page is typically referred to as a character set or charset. Code pages are used to provide support for the character sets and keyboard layouts that are used by different Windows system locales.
The code page is determined by OS settings: changing the Windows regional settings changes the code page that the OS uses.
The code pages that a client uses are determined by the operating system settings. To set client code pages on the Windows operating system, use Regional Settings in Control Panel.
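To make the idea of a code page as an indexed set of characters concrete, here is a minimal Python sketch (Python rather than T-SQL so that it runs without a server). It assumes Python's `cp1252` and `cp1251` codecs as stand-ins for the Windows Western European and Cyrillic code pages; the variable names are illustrative only.

```python
# One stored byte, value 0xE9 (233). Its meaning depends on the code page.
raw = bytes([0xE9])

# Interpret the same byte under two different Windows code pages.
as_western = raw.decode("cp1252")   # Western European code page
as_cyrillic = raw.decode("cp1251")  # Cyrillic code page

# ascii() keeps the output printable on any console.
print(ascii(as_western))   # LATIN SMALL LETTER E WITH ACUTE
print(ascii(as_cyrillic))  # CYRILLIC SMALL LETTER SHORT I
```

The byte never changes; only the table used to look it up does, which is why non-Unicode data is ambiguous without its code page.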
3. Sort Order: how character data is sorted
Sort order specifies how data values are sorted. This affects the results of data comparison. Data is sorted by using collations, and it can be optimized by using indexes.
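As a rough analogy for how the sort order changes results, the Python sketch below sorts the same values twice: once by raw code point (similar in spirit to a binary collation) and once ignoring case (similar in spirit to a case-insensitive collation). This is only an illustration; it does not reproduce SQL Server's actual sorting rules.

```python
words = ["b", "A", "a", "B"]

# Code point order: all uppercase ASCII letters sort before lowercase ones.
binary_order = sorted(words)

# Case-insensitive order: compare on a case-folded key instead.
# Python's sort is stable, so ties keep their original relative order.
case_insensitive_order = sorted(words, key=str.casefold)

print(binary_order)
print(case_insensitive_order)
```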
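II. Collation
In SQL Server, a collation consists of two parts: a sorting rule (which, for non-Unicode data, also includes a code page) and a comparison style. Together they govern how characters are represented, sorted, and compared. Non-Unicode data must always have a code page, so moving character data between different code pages requires converting from the source code page to the destination code page. Unicode data needs no code page, so it can be moved between machines without code page conversion, which can improve data-processing performance.
The comparison style mainly covers case sensitivity, accent sensitivity, kana sensitivity, and width sensitivity.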
A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects that have different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page.
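The effect of a comparison style can be sketched outside SQL Server. The Python below is a simplified illustration of case and accent sensitivity only, not SQL Server's collation algorithm; `comparison_key` and `equal_under` are hypothetical helper names introduced for this example.

```python
import unicodedata


def comparison_key(text, case_sensitive=True, accent_sensitive=True):
    """Reduce text to the form that the chosen comparison style compares."""
    if not accent_sensitive:
        # Decompose accented letters, then drop the combining accent marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        text = text.casefold()
    return text


def equal_under(a, b, case_sensitive=True, accent_sensitive=True):
    """Compare two strings under the given sensitivity settings."""
    return (comparison_key(a, case_sensitive, accent_sensitive)
            == comparison_key(b, case_sensitive, accent_sensitive))


# Case- and accent-sensitive: the strings differ.
print(equal_under("resume", "Résumé"))
# Case-insensitive, accent-sensitive: still differ because of the accents.
print(equal_under("resume", "Résumé", case_sensitive=False))
# Case- and accent-insensitive: treated as equal.
print(equal_under("resume", "Résumé", case_sensitive=False, accent_sensitive=False))
```

The same pair of strings is equal or unequal depending only on the style, which is what the case-sensitivity and accent-sensitivity options of a collation control.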
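III. Unicode Support
Non-Unicode types store most characters in a single byte (double-byte code pages such as 936 use one or two bytes per character). Because so few bytes can represent only a limited number of characters, each language or region uses its own code page to identify its character set, and every such character set has a code page. A non-Unicode computer can use only one code page, so moving character data between machines with different code pages requires a code page conversion. The Unicode types (nchar, nvarchar) instead store UTF-16, which uses two bytes for most characters (supplementary characters take four bytes, as a surrogate pair) and can represent the character sets of all the world's languages, so no code page is needed.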
Unicode is a standard for mapping code points to characters. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).
Significant limitations are associated with non-Unicode data types. This is because a non-Unicode computer will be limited to use of a single code page. You might experience performance gain by using Unicode because fewer code-page conversions are required. Unicode collations must be selected individually at the database, column or expression level because they are not supported at the server level.
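The cost of code page conversion, and why Unicode avoids it, can be sketched in Python. The example assumes Python's `cp1251`, `cp1252`, and `utf-16-le` codecs as stand-ins for a Cyrillic code page, a Western European code page, and the UTF-16 storage used by nchar/nvarchar; the variable names are illustrative.

```python
russian = "Привет"

# Stored as non-Unicode data under the Cyrillic code page: one byte per letter.
source_bytes = russian.encode("cp1251")

# Moving it to a column that uses code page 1252 means decoding with the
# source code page and re-encoding with the destination code page.
# Characters missing from the destination are replaced with "?".
converted = source_bytes.decode("cp1251").encode("cp1252", errors="replace")
print(converted)

# Unicode storage holds mixed-language text and round-trips without loss.
mixed = russian + " 中文"
stored = mixed.encode("utf-16-le")
round_trip_ok = stored.decode("utf-16-le") == mixed
print(round_trip_ok)

# UTF-16 uses 2 bytes for most characters, 4 for supplementary ones.
basic_size = len("A".encode("utf-16-le"))
supplementary_size = len("\U0001F600".encode("utf-16-le"))
print(basic_size, supplementary_size)
```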
IV. Listing the collations that SQL Server supports and their associated code pages
select *, collationproperty(name, 'codepage') as CodePage from sys.fn_helpcollations()
Three common code pages:
936: Simplified Chinese (GBK)
1252: Latin1, ASCII-compatible, single-byte encoding
65001: UTF-8 (Unicode)
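To see how these three code pages differ in what they can represent, the following Python sketch encodes the same characters under each one. The mapping of code page numbers to Python codec names (`cp936`, `cp1252`, `utf-8`) and the helper `encode_or_none` are assumptions made for this illustration.

```python
# Assumed mapping from Windows code page number to a Python codec name.
CODE_PAGES = {936: "cp936", 1252: "cp1252", 65001: "utf-8"}


def encode_or_none(text, code_page):
    """Return the bytes for text under a code page, or None if not representable."""
    try:
        return text.encode(CODE_PAGES[code_page])
    except UnicodeEncodeError:
        return None


for code_page in CODE_PAGES:
    # ASCII letters encode identically everywhere; the Chinese character does not.
    print(code_page, encode_or_none("A", code_page), encode_or_none("中", code_page))
```

The ASCII letter is a single identical byte in all three, while the Chinese character takes two bytes in 936, three bytes in UTF-8, and cannot be represented at all in 1252.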
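A computer can only store binary data such as 101010, so how do the characters shown on a page get displayed?
I. Character set (Charset)
charset = char + set: char is a character and set is a collection, so a charset is a collection of characters.
A character set defines which characters are covered, and gives each character a numeric index.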
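For the Unicode character set, the numeric index of a character is its code point. A small Python sketch of the mapping in both directions, using only the standard `ord`, `chr`, and `unicodedata.name` functions:

```python
import unicodedata

# Character -> numeric index (code point).
latin_index = ord("A")
cjk_index = ord("中")

# Numeric index -> character.
cjk_char = chr(0x4E2D)

# Each index also has an official character name in the Unicode database.
latin_name = unicodedata.name("A")
cjk_name = unicodedata.name("中")

print(latin_index, hex(cjk_index))
print(latin_name)
print(cjk_name)
```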
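II. Encoding
An encoding defines how a character is turned into a sequence of bytes, and how those bytes are parsed back. In other words, given a numeric index, it defines how many bytes to produce, in what byte order, and any other special rules.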
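The difference between a numeric index and its byte sequence can be shown with one character under several encodings. This Python sketch uses the standard codecs `utf-8`, `utf-16-le`, `utf-16-be`, and `utf-32-be`; the variable names are illustrative.

```python
char = "中"  # numeric index (code point) U+4E2D

# Same code point, different byte counts and byte orders.
utf8_bytes = char.encode("utf-8")         # 3 bytes
utf16_le_bytes = char.encode("utf-16-le")  # 2 bytes, low byte first
utf16_be_bytes = char.encode("utf-16-be")  # 2 bytes, high byte first
utf32_be_bytes = char.encode("utf-32-be")  # 4 bytes, zero padded

for label, data in [("utf-8", utf8_bytes), ("utf-16-le", utf16_le_bytes),
                    ("utf-16-be", utf16_be_bytes), ("utf-32-be", utf32_be_bytes)]:
    print(label, data.hex())

# Decoding with the matching encoding recovers the original character.
decoded = utf16_le_bytes.decode("utf-16-le")
```

Decoding with the wrong encoding either fails or yields different characters, which is why the encoding must travel with the bytes.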
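III. Glyphs and fonts (Font)
The numeric index is used to look up the glyph stored in a font, which can then be drawn on the page. What a displayed character looks like therefore depends on the font file.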
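To sum up, Unicode by itself is only a character set; it does not say how code points are written as bytes. UTF-8 is one encoding of the Unicode character set; others include UTF-16 and UTF-32.
Even with a character set and an encoding, a character still cannot be displayed if the system fonts have no glyph for it.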
References:
COLLATIONPROPERTY (Transact-SQL)
sys.fn_helpcollations (Transact-SQL)