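Data cleaning: using Python to clean a CSV file that contains Chinese text. The idea is to map the Chinese strings through a dictionary, where the key is the Chinese string and the value is an auto-incrementing index. Because a dictionary cannot hold duplicate keys, each distinct Chinese string ends up mapped to a single numeric index.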
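The mapping idea can be sketched in isolation on an in-memory list. This is only an illustration under assumptions: the function name `encode_values` and the sample city names are made up, and the index here is simply the current size of the dictionary rather than a separate counter.

```python
def encode_values(values):
    """Map each distinct string to an integer index, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        # setdefault stores len(mapping) only when the key is new,
        # so a repeated string keeps the index it received first.
        codes.append(mapping.setdefault(value, len(mapping)))
    return codes, mapping


# Hypothetical sample values, not taken from the real data set.
codes, mapping = encode_values(["北京", "上海", "北京", "广州"])
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'北京': 0, '上海': 1, '广州': 2}
```

Keeping the returned `mapping` around is useful if the numeric codes later need to be translated back to the original strings.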
The Python code is as follows (the data is in CSV format):
    import csv

    # Zero-based indices of the columns that hold Chinese text
    # (spreadsheet columns C, E, Z, AA, AB, AL, AM, AO, AP, AQ, AT, AX).
    columns = [2, 4, 25, 26, 27, 37, 38, 40, 41, 42, 45, 49]

    # One dictionary per column: Chinese string -> numeric index.
    mappings = {col: {} for col in columns}
    index = 0

    with open("E:/yuce/test.csv", "w", newline="") as csv_file_write, \
            open("E:/yuce/b.csv", "r", newline="") as csv_file_read:
        writer = csv.writer(csv_file_write)
        reader = csv.reader(csv_file_read)

        # Copy the header row through unchanged.
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)

        for row in reader:
            for col in columns:
                mapping = mappings[col]
                # Store the index only the first time a value is seen, so that
                # a repeated value always maps to the same number.
                if row[col] not in mapping:
                    mapping[row[col]] = index
                row[col] = mapping[row[col]]
            index = index + 1
            writer.writerow(row)

    # Both files are closed automatically when the with block ends.
    print("done!")
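The example above comes from a real data-processing task: the raw data has about two hundred attribute columns and thirty thousand rows. It contains Chinese text as well as missing values, and has to be cleaned step by step.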
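One possible first step for the missing values is simply to count them per column. The sketch below assumes that a missing value shows up as an empty cell, which the original data may or may not do; the helper name `count_missing` and the sample rows are hypothetical.

```python
import csv
import io


def count_missing(reader):
    """Return {column name: number of empty cells} for a csv.reader with a header row."""
    header = next(reader)
    counts = {name: 0 for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            # Assumption: a cell that is empty after stripping whitespace is missing.
            if cell.strip() == "":
                counts[name] += 1
    return counts


# Made-up sample data standing in for the real file.
sample = io.StringIO("city,level,score\n北京,高,1\n,低,2\n上海,,\n")
missing_counts = count_missing(csv.reader(sample))
print(missing_counts)  # {'city': 1, 'level': 1, 'score': 1}
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `newline=""`, as in the script above.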
Time: 2024-07-28 21:27:52