CIA Keyword Extraction: Unicode

Dataset

The goal of this article is to learn how a computer stores a value in memory. The dataset, sentences_cia.csv, is an excerpt from CIA memos describing details of torture and other covert activities. The data looks like this:

year,statement,,,

1997,"The FBI information included that al-Mairi's brother ""traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.""",,,

  • The whole CSV file is one long string. We have worked with strings before, but never looked at how they are stored inside a computer. Files live on a hard drive, which is typically magnetic storage: each position on the medium can hold only one of two states (up or down, read as high or low), which the computer treats as 0 and 1. So before our string can be written to disk, it has to be converted to binary.
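
As a small illustration (not part of the original lesson), we can make that idea concrete by converting each character of a short string into the 8-bit pattern a computer would store for it:

# A minimal sketch: the bit pattern for each character of a short ASCII string
text = "CIA"
# ord() gives each character's integer code; format(..., "08b") pads it to 8 bits
bits = " ".join(format(ord(character), "08b") for character in text)
print(bits)
# 01000011 01001001 01000001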

Intro To Binary

  • In the code below, b is a binary number stored as a string; int(b, 2) interprets it as base 2 and converts it to a base-10 integer, which is then printed. The same is done for the binary string "100".
# Let's say b is a binary number.  In Python, we have to store binary numbers as strings
# Trying to write b = 10 directly would be treated as base 10, so strings are needed
b = "10"

# We can convert b to a binary number from a string using the int function -- the optional second argument base is set to 2 (binary is base two)
print(int(b, 2))
‘‘‘
2
‘‘‘
base_10_100 = int("100", 2)
‘‘‘
base_10_100 :4
‘‘‘
  • The printed results are shown in decimal (base 10).

Binary Addition

  • Binary addition works much like decimal addition.
b = "1"

# We'll add binary values using a binary_add function that was made just for this exercise
# It's not extremely important to know how it works right this second
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

c = binary_add(b, "1")

# We now see that c equals "10", which is exactly what happens in base 10 when we reach the highest possible digit.
print(c)
‘‘‘
10
‘‘‘
  • The bin() function returns the number as a binary string, and because it prepends "0b", binary_add slices the result from index 2 (the third character) so that only the digits remain.
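
A quick check of that prefix and of the [2:] slice used in binary_add (a small illustration, not from the original lesson):

# bin() returns a string that starts with the "0b" prefix
print(bin(2))
# 0b10
# Slicing from index 2 drops the prefix, leaving only the binary digits
print(bin(2)[2:])
# 10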

Converting Binary Values

  • We now know that int() with a base argument converts a string written in another base into a base-10 integer.
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

# Start both at 0
a = 0
b = "0"

# Loop 10 times
for i in range(0, 10):
    # Add 1 to each
    a += 1
    b = binary_add(b, "1")

    # Check if they are equal
    print(int(b, 2) == a)
‘‘‘
True
True
True
True
True
True
True
True
True
True
‘‘‘
  • The example above shows that if we start from the same value and add the same amount, the results stay equal no matter which base we use to write them down.

Characters To Binary

A string is first split into individual characters; each character is mapped to an integer, and that integer is converted to binary for storage.

# We can use the ord() function to get the integer associated with an ascii character.
ord('a')

# Then we use the bin() function to convert to binary
# The bin function adds "0b" to the start of strings to indicate that they contain binary values
bin(ord('a'))
print(bin(ord('a')))
‘‘‘
0b1100001
‘‘‘

# ÿ is the "last" ascii character -- it has the highest integer value of any ascii character
# This is because 255 is the highest value that can be represented with 8 binary digits
ord('ÿ')
# As you can see, we get 8 1's, which shows that this is the highest possible 8 digit value
bin(ord('ÿ'))
print(bin(ord('ÿ')))
‘‘‘
0b11111111
‘‘‘

# Why is this?  It's because a single binary digit is called a bit, and computers store values in sequences of bytes, which are 8 bits together.
# You might be more familiar with kilobytes or megabytes -- a kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes.
# There are 256 different ascii symbols, because the largest amount of storage any single ascii character can take up is one byte.
binary_w = bin(ord("w"))
‘‘‘
str (<class ‘str‘>)
‘0b1110111‘
‘‘‘
  • The code above shows that an ASCII character is stored in one byte (8 bits): the character is first mapped to its ASCII integer with ord(), and that integer is converted to binary with bin() for storage. bin() prepends "0b" to mark the result as a binary string, and 8 bits can represent at most 256 distinct values (0 through 255).

Intro To Unicode

We know that inside a computer, all information is ultimately represented as a sequence of binary digits. Each bit has two states, 0 and 1, so eight bits can form 256 different combinations; eight bits together are called a byte. In other words, one byte can represent 256 distinct values, from 00000000 to 11111111, and each value can stand for one symbol. In the 1960s, the United States standardized the mapping between English characters and binary values. That standard is the ASCII code, still in use today. ASCII defines encodings for 128 characters: for example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) use only the lower 7 bits of a byte; the highest bit is always 0.
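
Those two values are easy to verify in Python (a quick check, not part of the original text):

# The ASCII codes quoted above: space is 32, uppercase A is 65
print(ord(" "), format(ord(" "), "08b"))
# 32 00100000
print(ord("A"), format(ord("A"), "08b"))
# 65 01000001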

  • Unicode

As the previous section suggests, many encodings exist in the world, and the same binary value can be interpreted as different symbols. To open a text file you therefore have to know its encoding; decode it with the wrong one and you get garbled text. This is why emails so often arrive garbled: the sender and receiver are using different encodings. Now imagine a single encoding that includes every symbol in the world and gives each one a unique code; the garbling problem would disappear. That is Unicode: as its name suggests, an encoding for all symbols. Unicode is of course a very large set, currently able to hold more than a million symbols, each with its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. The full mapping tables can be looked up at unicode.org, or in dedicated CJK tables.
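
We can confirm those code points directly with ord() and chr() (a small check, not from the original text):

# The code points quoted above
print(hex(ord("A")))
# 0x41
print(hex(ord("严")))
# 0x4e25
print(chr(0x0639))
# ع (the Arabic letter Ain)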

  • The problem with Unicode

Note that Unicode is only a character set: it specifies the binary code for each symbol, but not how that code should be stored. This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that a single byte is enough for an English letter; if Unicode required three or four bytes for every symbol, each English letter would be padded with two or three bytes of zeros, an enormous waste that would make text files two or three times larger. That would be unacceptable.

  • UTF-8

UTF-8 is one way of implementing Unicode. Its defining feature is that it is a variable-length encoding: a symbol may take 1 to 4 bytes, and the length varies with the symbol. The encoding rules are simple; there are only two:

  • For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits are the symbol's Unicode code. For English letters, UTF-8 is therefore identical to ASCII.
  • For an n-byte symbol (n > 1), the first n bits of the first byte are all 1, bit n+1 is 0, and the first two bits of each following byte are 10. The remaining bits are filled with the symbol's Unicode code.

Given these rules, decoding UTF-8 is straightforward: if a byte's first bit is 0, that byte is a character on its own; if the first bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies.

The Unicode code of 严 is 4E25 (binary 100111000100101). That value falls in the three-byte range (0000 0800 to 0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the pattern 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last binary digit of 严 and working backwards, we fill in the x positions and pad any remaining positions with 0. The result is 11100100 10111000 10100101, or E4B8A5 in hexadecimal. So the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different.
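
That derivation is easy to verify in Python (a small check, not part of the original text):

# "严" has code point U+4E25 but UTF-8 bytes E4 B8 A5
char = "严"
print(hex(ord(char)))
# 0x4e25 -- the Unicode code point
print(char.encode("utf-8"))
# b'\xe4\xb8\xa5' -- the UTF-8 byte sequence
print(char.encode("utf-8").hex())
# e4b8a5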

# We can initialize unicode code points (the value for this code point is \u27F6, but you see it as a character because it is being automatically converted)
code_point = "⟶"

# This particular code point maps to a right arrow character
print(code_point)

# We can get the base 10 integer value of the code point with the ord function
print(ord(code_point))

# As you can see, this takes up a lot more than 1 byte
print(bin(ord(code_point)))
‘‘‘
→
10230
0b10011111110110
‘‘‘

Strings With Unicode

Because ASCII is a subset of Unicode, all strings in Python 3 are Unicode by default (UTF-8 is used when they are encoded to bytes), so we can use Unicode code points and characters directly.

s1 = "café"
# The \u prefix means "the next 4 digits are a unicode code point"
# It doesn't change the value at all (the last character in the string below is written as \u00e9)
s2 = "caf\u00e9"

# These strings are the same, because code points are equal to their corresponding unicode character.
# \u00e9 and é are equivalent.
print(s1 == s2)
‘‘‘
True
‘‘‘

The Bytes Type

  • encode("utf-8") encodes a string, converting it into a bytes object.
# We can make a string with some unicode values
superman = "Clark Kent␦"
# This tells python to encode the string superman into unicode using the utf-8 encoding
# We end up with a sequence of bytes instead of a string
superman_bytes = superman.encode("utf-8")
print(superman_bytes)
‘‘‘
b‘Clark Kent\xe2\x90\xa6‘
‘‘‘

batman = "Bruce Wayne␦"
batman_bytes = batman.encode("utf-8")
print(batman_bytes)
‘‘‘
bytes (<class ‘bytes‘>)
b‘Bruce Wayne\xe2\x90\xa6‘
‘‘‘

Hexadecimal Intro

\u is the prefix for a Unicode code point, indicating that what follows is a Unicode value. \x is the hexadecimal prefix, meaning the next two digits are hexadecimal. Two hexadecimal digits correspond to 8 binary digits, i.e. one byte.
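
A quick illustration of that two-hex-digits-per-byte relationship (not from the original lesson):

# "\xe2" is a single character whose code is written with two hex digits
print(len("\xe2"))
# 1
print(format(0xe2, "08b"))
# 11100010 -- two hex digits span exactly 8 bits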

# F is the highest single digit in hexadecimal (base 16)
# Its value is 15 in base 10
print(int("F", 16))

# A in base 16 has the value 10 in base 10
print(int("A", 16))

# Just like the earlier binary_add function, this adds two hex numbers
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]

# When we add 1 to 9 in hexadecimal, it becomes "a"
value = "9"
value = hexadecimal_add(value, "1")
print(value)
hex_ea = hexadecimal_add("2", "ea")
‘‘‘
hex_ea :str (<class ‘str‘>)
‘ec‘
‘‘‘
hex_ef = hexadecimal_add("e", "f")
‘‘‘
hex_ef :str (<class ‘str‘>)
‘1d‘
‘‘‘
‘‘‘
15
10
a
‘‘‘

Hex To Binary

# One byte (8 bits) in hexadecimal (the value of the byte below is \xe2)
hex_byte = "â"

# Print the base 10 integer value for the hex byte
print(ord(hex_byte))

# This gives the exact same value -- remember that \x is just a prefix, and doesn't affect the value
print(int("e2", 16))

# Convert the base 10 integer to binary
print(bin(ord(hex_byte)))
binary_aa = bin(ord("\xaa"))
‘‘‘
str (<class ‘str‘>)
‘0b10101010‘
‘‘‘
binary_ab = bin(ord("\xab"))
‘‘‘
str (<class ‘str‘>)
‘0b10101011‘
‘‘‘
‘‘‘
226
226
0b11100010
‘‘‘

Bytes And Strings

  • bytes and str objects cannot be mixed. Calling encode("utf-8") turns a str into a bytes object; after that you can no longer pass str arguments to its methods, but you can pass bytes literals written as b"...".
hulk_bytes = "Bruce Banner␦".encode("utf-8")
print(type(hulk_bytes))
# We can't mix strings and bytes
# For instance, if we try to pass the str "Banner" to .replace(), it won't work, because hulk_bytes has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except Exception:
    print("TypeError with replacement")

# We can create objects of the bytes datatype by putting a b in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"
# Now, instead of mixing strings and bytes, we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")
thor_bytes = b"Thor"
‘‘‘
<class ‘bytes‘>
TypeError with replacement
‘‘‘

Decode Bytes To Strings

  • decode("utf-8") decodes a bytes object back into a str.
# Make a bytes object with aquaman's secret identity
aquaman_bytes = b"Who knows?"

# Now, we can use the decode method, along with the encoding (utf-8) to turn it into a string.
aquaman = aquaman_bytes.decode("utf-8")

# We can print the value and type out to verify that it is a string.
print(aquaman)
print(type(aquaman))
‘‘‘
Who knows?
<class ‘str‘>
‘‘‘

Read In File Data

  • With a working understanding of Unicode, we can now start processing the data.
  • The first row of sentences_cia.csv holds the column labels: ['year', 'statement', '', '', '']
  • The second row: ['1997', 'The FBI information included that al-Mairi\'s brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."', '', '', '']
# We can read our data in using csvreader
import csv
# When we open a file, we can specify the encoding that it's in.  In this case, utf-8
f = open("sentences_cia.csv", "r", encoding="utf-8")
csvreader = csv.reader(f)
sentences_cia = list(csvreader)

# The data is two columns
# First column is year, second is a sentence from a CIA report written that year
# Print the first column of the second row
print(sentences_cia[1][0])

# Print the second column of the second row
print(sentences_cia[1][1])
‘‘‘
1997
The FBI information included that al-Mairi‘s brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."
‘‘‘

Convert To A Dataframe

  • Convert sentences_cia into a DataFrame, and do the same for the legislators data.
import csv
# Let's read in the legislators data from a few missions ago
f = open("legislators.csv", "r", encoding="utf-8")
csvreader = csv.reader(f)
legislators = list(csvreader)

# Now, we can import pandas and use the DataFrame class to convert the list of lists to a dataframe
import pandas as pd

legislators_df = pd.DataFrame(legislators)

# As you can see, the first row is the headers, which we don't want (it's not actually data, it's just headers)
print(legislators_df.iloc[0,:])

# In order to remove the headers, we'll subset the df and pass them in separately
# This code removes the headers from legislators, and instead passes them into the columns argument
# The columns argument specifies column names
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])
# We now have the right data in the first row, and the proper headers
print(legislators_df.iloc[0,:])

# The sentences_cia data from last screen is available.
sentences_cia_df = pd.DataFrame(sentences_cia[1:], columns=sentences_cia[0])
‘‘‘
0     last_name
1    first_name
2      birthday
3        gender
4          type
5         state
6         party
Name: 0, dtype: object
last_name                 Bassett
first_name                Richard
birthday               1745-04-02
gender                          M
type                          sen
state                          DE
party         Anti-Administration
Name: 0, dtype: object
‘‘‘

Clean Up Sentences

  • The "statement" column holds a sentence that has to be cleaned before we can analyze it. First we strip out irrelevant symbols: we only care about letters, digits, and spaces. Using ord() we look up the integer code of each of those characters; good_characters lists the codes worth keeping. We keep only those characters and then join them back together (with no separator) into a cleaned string.
def clean_statement(row):
    # The integer codes for all the characters we want to keep: digits 0-9, uppercase and lowercase letters, and the space (32)
    good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 32]
    statement = row["statement"]
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    # Join the list together, separated by "" (no space), which creates a string again
    return "".join(clean_statement_list)

# sentences_cia here is the DataFrame version of this data (the same content as sentences_cia_df above), so apply() runs clean_statement on each row
sentences_cia["cleaned_statement"] = sentences_cia.apply(clean_statement, axis=1)

Tokenize Statements

  • With the unwanted characters removed, we need to split the text into individual words and count how often each one appears. First we join all the cleaned statements into one string with join(), then split that string on spaces to get a list of tokens:
# We can use the .join() method on strings to join lists together.
# The string we use the method on will be used as the separator -- the character(s) between each string when they are joined.
combined_statements = " ".join(sentences_cia["cleaned_statement"])
statement_tokens = combined_statements.split(" ")
‘‘‘
list (<class ‘list‘>)
[‘The‘,
 ‘FBI‘,
 ‘information‘,
 ‘included‘,
 ‘that‘,
 ‘alMairis‘,
 ‘brother‘,
 ‘traveled‘,
 ...
‘‘‘

Filter The Tokens

  • What we have now is a bag of words, and we want to count the useful ones. The most common words in English are connective words such as "that", "or", and "and"; these are called stop words and would normally be filtered out. For simplicity, here we just drop every token shorter than 5 characters:
# statement_tokens has been loaded in.
filtered_tokens = [s for s in statement_tokens if len(s) > 4]

Count The Tokens

  • Now we bring in a new module, collections; its Counter class returns a dictionary-like object whose values are the counts of each element.
from collections import Counter

# filtered_tokens has been loaded in
filtered_token_counts = Counter(filtered_tokens)
‘‘‘
Counter({‘interrogation‘: 391, ‘REDACTED‘: 375, ‘information‘: 375, ‘Zubaydah‘: 328, ‘Committee‘: 327, ...
‘‘‘
  • Then we take the 3 most common tokens:
common_tokens = filtered_token_counts.most_common(3)
‘‘‘
[(‘interrogation‘, 391), (‘REDACTED‘, 375), (‘information‘, 375)]
‘‘‘

Finding The Most Common Tokens By Year

# sentences_cia has been loaded in.
# It already has the cleaned_statement column.
from collections import Counter
def find_most_common_by_year(year, sentences_cia):
    data = sentences_cia[sentences_cia["year"] == year]
    combined_statement = " ".join(data["cleaned_statement"])
    statement_split = combined_statement.split(" ")
    counter = Counter([s for s in statement_split if len(s) > 4])
    return counter.most_common(2)

common_2000 = find_most_common_by_year("2000", sentences_cia)
‘‘‘
[(‘terrorist‘, 9), (‘Ahmad‘, 9)]
‘‘‘
common_2002 = find_most_common_by_year("2002", sentences_cia)
‘‘‘
[(‘interrogation‘, 275), (‘Zubaydah‘, 252)]
‘‘‘
common_2013 = find_most_common_by_year("2013", sentences_cia)
‘‘‘
[(‘Response‘, 196), (‘states‘, 111)]
‘‘‘