Basic File Reading and Simple Data Processing in Python
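These are notes from the free courses on DataQuest (this post covers the Python basics part). The material is quite elementary (reading csv files, string preprocessing, and so on), and I am posting it here as a record. It covers the following six cases: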
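- Find the lowest crime rate (reading a csv file, splitting strings, filtering data with a for loop and if tests)
- Discover weather pattern in LA (frequency counting with a for loop and if tests)
- Building a Spell Checker (word frequency counting, string preprocessing, checking a text against a dictionary, tallying correctly and incorrectly spelled words)
- Analyze NFL data (importing a file with the csv module, classes, functions, simple statistics with dictionaries and lists)
- What should you name your kid if you want them to be a US Congressperson? (data preprocessing, type conversion with int(), try-except statements, counting with a dictionary, saving the data needed)
- Which airline is delayed the most?
- Appendix: reading a txt file line by line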
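Case 1 Find the lowest crime rate
(reading a csv file, splitting strings, filtering data with a for loop and if tests)
crime_rates.csv is a single-sheet file with 73 rows and 2 columns. The first column is the city name (a string) and the second is the number of crimes (an integer). Everything read into Python starts out as a string, so the crime count is converted to an integer with int() afterwards. Each split and converted row is stored in the list full_data. A for loop with an if test (the lowest crime count is known to be 130) then finds the city with the lowest count and stores its name in the variable city.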
# We know that the lowest crime rate is 130.
# This is the second column of the data.
# We need to find the corresponding value in the first column -- the city with the lowest crime rate.

# Let's load the csv file
with open('crime_rates.csv', 'r') as f:
    data = f.read()
rows = data.split('\n')

full_data = []
for row in rows:
    if not row:  # skip blank lines, e.g. the one left by a trailing newline
        continue
    split_row = row.split(",")
    split_row[1] = int(split_row[1])  # crime count: str -> int
    full_data.append(split_row)

# Pick out the city whose crime count equals the known minimum
city = ""
for item in full_data:
    if item[1] == 130:
        city = item[0]
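The listing above relies on knowing the minimum (130) in advance. If that value were not known, the same list could be scanned once while tracking the smallest count seen so far. This is only a sketch: the sample rows below are made up and stand in for crime_rates.csv, and the placeholder city names are not from the real file.

```python
# Made-up sample text standing in for the contents of crime_rates.csv.
data = "City A,749\nCity B,371\nCity C,130\nCity D,828\n"

full_data = []
for row in data.split("\n"):
    if not row:  # skip the empty string left by the trailing newline
        continue
    name, count = row.split(",")
    full_data.append([name, int(count)])

# Track the smallest count seen so far instead of hard-coding 130.
city = ""
lowest_crime_rate = None
for name, count in full_data:
    if lowest_crime_rate is None or count < lowest_crime_rate:
        lowest_crime_rate = count
        city = name

print(city, lowest_crime_rate)
```

Starting from None rather than a large number such as 10000 avoids having to guess an upper bound for the data.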
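Case 2 Discover weather pattern in LA
(frequency counting with a for loop and if tests)
The input is a two-column text file with a header row. Load la_weather.csv, split it, store the weather values in the variable weather_data, and drop the header. Then use a dictionary to count how often each type of weather occurs.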
weather_data = []
with open("la_weather.csv", 'r') as f:
    data = f.read()
rows = data.split('\n')
for row in rows:
    if not row:  # skip blank lines, e.g. the one left by a trailing newline
        continue
    split_row = row.split(",")
    # Keep only the weather type, assumed here to be the second column.
    # Appending the whole list would fail below: lists cannot be dict keys.
    weather_data.append(split_row[1])
print(weather_data)

# Drop the header
weather = weather_data[1:]
weather_counts = {}
for item in weather:
    if item in weather_counts:
        weather_counts[item] = weather_counts[item] + 1
    else:
        weather_counts[item] = 1
print(weather_counts)
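The if/else counting pattern above can be shortened with dict.get, and the standard library's collections.Counter does the same tally in one call. The sketch below uses a made-up five-row sample in place of la_weather.csv; the header text and the weather values are assumptions for illustration only.

```python
from collections import Counter

# Made-up sample standing in for la_weather.csv: a header line plus "day,weather" rows.
data = "Day,Type of Weather\n1,Sunny\n2,Sunny\n3,Fog\n4,Rain\n5,Sunny\n"

weather = []
for row in data.split("\n")[1:]:  # [1:] drops the header line
    if row:                       # skip the blank entry after the last newline
        weather.append(row.split(",")[1])

# dict.get supplies 0 for a key that has not been seen yet.
weather_counts = {}
for item in weather:
    weather_counts[item] = weather_counts.get(item, 0) + 1

# Counter builds the same mapping directly from the list.
counter_counts = Counter(weather)

print(weather_counts)
print(counter_counts.most_common(1))
```

Counter is a dict subclass, so the two results compare equal; most_common(1) additionally returns the most frequent value without a manual loop.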
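Case 3 Building a Spell Checker
(word frequency counting, string preprocessing, checking a text against a dictionary, tallying correctly and incorrectly spelled words)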
# Normalize a token: strip special characters and lowercase it
def normalize(token):
    token = token.replace(".", "")
    token = token.replace(",", "")
    token = token.replace("'", "")
    token = token.replace(";", "")
    token = token.replace("\n", "")
    token = token.lower()
    return token

# List that will hold the normalized dictionary words
normalized_dictionary_tokens = []

# Open the file read-only
with open("dictionary.txt", "r") as f:
    raw_data = f.read()

# Split the string on spaces into individual words
data = raw_data.split(" ")

# Normalize every word (normalize() removes the special characters)
for token in data:
    normalized_dictionary_tokens.append(normalize(token))
print(normalized_dictionary_tokens)

# The original listing never defines normalized_story_tokens. It is built here
# the same way as the dictionary; the file name "story.txt" is an assumption.
normalized_story_tokens = []
with open("story.txt", "r") as f:
    story_data = f.read()
for token in story_data.split(" "):
    normalized_story_tokens.append(normalize(token))

# Sort the story's words into correctly spelled and possibly misspelled,
# by checking each one against the dictionary word list
potential_misspellings = []
correctly_spelled = []
for token in normalized_story_tokens:
    if token in normalized_dictionary_tokens:
        correctly_spelled.append(token)
    else:
        potential_misspellings.append(token)

print(correctly_spelled)
print(potential_misspellings)
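Checking `token in some_list` scans the whole list for every word. Storing the dictionary words in a set makes each lookup much cheaper. The sketch below shows the same spell check on two short made-up strings instead of dictionary.txt and a story file; both strings are illustrative assumptions.

```python
def normalize(token):
    """Strip a few punctuation marks and a newline, then lowercase the token."""
    for ch in ".,';\n":
        token = token.replace(ch, "")
    return token.lower()

# Made-up stand-ins for the dictionary file and the text to check.
dictionary_text = "the quick brown fox jumps over lazy dog"
story_text = "The quick brwn fox, jumps over the lazy dog.\n"

# A set gives fast membership tests and removes duplicates.
dictionary_tokens = {normalize(t) for t in dictionary_text.split(" ")}
story_tokens = [normalize(t) for t in story_text.split(" ")]

correctly_spelled = [t for t in story_tokens if t in dictionary_tokens]
potential_misspellings = [t for t in story_tokens if t not in dictionary_tokens]

print(correctly_spelled)
print(potential_misspellings)
```

The story tokens stay in a list so that word order and repeated words are preserved in the two result lists.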
Case 4 Analyze NFL data
(importing a file with the csv module, classes, functions, simple statistics with dictionaries and lists)
import csv

class Team:
    def __init__(self, name):
        self.name = name
        # Read every row of nfl.csv into a list of lists
        with open("nfl.csv", 'r', newline='') as f:
            csvreader = csv.reader(f)
            self.nfl = list(csvreader)

    # Count the rows whose third column (row[2]) matches this team's name
    def count_total_wins(self):
        count = 0
        for row in self.nfl:
            if row[2] == self.name:
                count = count + 1
        return count

    # Same count, broken down by the year in the first column (row[0])
    def wins_by_years(self):
        wins = {}
        years = ["2009", "2010", "2011", "2012", "2013"]
        for year in years:
            count = 0
            for row in self.nfl:
                if row[2] == self.name and row[0] == year:
                    count += 1
            wins[year] = count
        return wins

niners = Team("San Francisco 49ers")
niners_wins_by_year = niners.wins_by_years()
print("Niners_wins_by_year: ", niners_wins_by_year)
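One reason to use the csv module instead of split(",") is that it handles quoted fields containing commas. The sketch below parses a small in-memory string with csv.reader; the column layout and the team names are made up for illustration and are not taken from nfl.csv.

```python
import csv
import io

# Made-up sample: the third field of the first data row contains a comma inside quotes.
text = 'year,week,winner,loser\n2009,1,"Team A, Inc.",Team B\n2010,1,Team B,Team A\n'

# Naive splitting breaks the quoted field into two pieces.
naive = text.split("\n")[1].split(",")

# csv.reader accepts any iterable of lines, including a file-like StringIO object.
rows = list(csv.reader(io.StringIO(text)))

print(len(naive), naive)
print(len(rows[1]), rows[1])
```

With split(",") the first data row comes out with five pieces, while csv.reader returns the expected four fields and removes the surrounding quotes.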
Case 5 What should you name your kid if you want them to be a US Congressperson?
(data preprocessing, type conversion with int(), try-except statements, counting with a dictionary, saving the data needed)
# legislators is a 2-D list, assumed to be loaded already (it is not defined in
# this listing). Each inner list (one entry) has 7 elements; the first four are
# last name, first name, birthday (year-month-day) and gender, and the other
# three are not used here. The data is preprocessed first, then names are counted.
genders_list = []
unique_genders = set()
unique_genders_list = []

# Append each row's gender to the list genders_list
for row in legislators:
    genders_list.append(row[3])

# Deduplicate genders_list with set() (the result is a set, not a dictionary),
# store it in unique_genders, then store it again as a list in unique_genders_list
unique_genders = set(genders_list)
unique_genders_list = list(unique_genders)
print(genders_list)

# The known bad value in the gender data is ""; reassign it to "M"
for row in legislators:
    if row[3] == "":
        row[3] = "M"

# Collect birth years in the list birth_years: split the birthday on "-",
# take the first piece (the year) and append it
birth_years = []
for row in legislators:
    birth_list = row[2].split("-")
    birth_years.append(birth_list[0])

# enumerate() on a list yields each index together with the current row,
# much as .items() on a dictionary yields each key with its value.
# Append the year to each row of legislators as its eighth column
for i, row in enumerate(legislators):
    row.append(birth_years[i])

# Convert the eighth column (birth year) from str to int.
# If the conversion fails, set the birth year to 0
for row in legislators:
    try:
        row[7] = int(row[7])
    except ValueError:
        row[7] = 0

# Count names with a dictionary (key: name, value: number of occurrences)
# in male_name_counts, then collect the most frequent names in top_male_names
top_male_names = []
male_name_counts = {}

# Count only rows with a birth year after 1940 and gender "M" (male)
for row in legislators:
    if row[7] > 1940 and row[3] == "M":
        if row[1] in male_name_counts:
            male_name_counts[row[1]] += 1
        else:
            male_name_counts[row[1]] = 1

# Find the highest number of occurrences, highest_value
highest_value = None
for key, value in male_name_counts.items():
    if highest_value is None or value > highest_value:
        highest_value = value

# Several names may share the highest count; append each of them to top_male_names
for key, value in male_name_counts.items():
    if value == highest_value:
        top_male_names.append(key)
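The two ideas that carry this case, converting with int() inside try-except and collecting every name tied for the highest count, can be tried on a small list. The rows below are made up and shortened to four columns; the helper name birth_year is hypothetical and is not part of the course code.

```python
# Made-up rows: last name, first name, birthday, gender.
legislators = [
    ["Smith", "John", "1950-03-01", "M"],
    ["Jones", "Mary", "1962-07-15", "F"],
    ["Brown", "John", "1945-11-30", ""],       # missing gender, treated as "M"
    ["Davis", "Robert", "1955-02-02", "M"],
    ["Clark", "Robert", "1971-01-09", "M"],
    ["Lewis", "James", "1930-05-05", "M"],     # born before 1940, excluded
    ["Hall", "William", "unknown", "M"],       # bad birthday, year becomes 0
]

def birth_year(date_text):
    """Return the year as an int, or 0 when the text cannot be converted."""
    try:
        return int(date_text.split("-")[0])
    except ValueError:
        return 0

male_name_counts = {}
for last, first, birthday, gender in legislators:
    if gender == "":
        gender = "M"
    if birth_year(birthday) > 1940 and gender == "M":
        male_name_counts[first] = male_name_counts.get(first, 0) + 1

# max() over the values replaces the manual highest_value loop.
highest_value = max(male_name_counts.values())
top_male_names = [name for name, count in male_name_counts.items()
                  if count == highest_value]

print(male_name_counts)
print(top_male_names)
```

Catching ValueError specifically, rather than every Exception, keeps unrelated bugs from being silently turned into a birth year of 0.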
Case 6 Which airline is delayed the most?
I went back and forth on this case for several days and, in the end, mostly followed the reference answer.
# column_names (the header row) and flight_delays (the data rows) are not
# defined in this listing; they are assumed to be loaded already.

# Return the index of column_name in column_names, or None if it is absent
def column_number_from_name(column_name):
    column_number = None
    for i, column in enumerate(column_names):
        if column == column_name:
            column_number = i
    return column_number

# Average arrival delay per delayed flight, for one carrier or for all (None)
def find_average_delay(carrier_name=None):
    total_delayed_flights = 0
    total_delay_time = 0
    delay_time_column = column_number_from_name("arr_delay")
    delay_number_column = column_number_from_name("arr_del15")
    carrier_column = column_number_from_name("carrier")
    for row in flight_delays:
        if carrier_name is None or row[carrier_column] == carrier_name:
            total_delayed_flights += float(row[delay_number_column])
            total_delay_time += float(row[delay_time_column])
    return total_delay_time / total_delayed_flights

# Compute the average delay for every distinct carrier
delays_by_carrier = {}
carrier_column = column_number_from_name("carrier")
carriers = [row[carrier_column] for row in flight_delays]
unique_carriers = list(set(carriers))
for carrier in unique_carriers:
    delays_by_carrier[carrier] = find_average_delay(carrier)
print(delays_by_carrier)
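Looking up column positions by name is something csv.DictReader does automatically: each row becomes a dictionary keyed by the header. The sketch below reworks the same average under that approach. The column names come from the listing above, but the numbers and carrier codes are made up, and returning None for a carrier with no delayed flights is my own choice rather than the course's.

```python
import csv
import io

# Made-up sample with the three columns used above.
text = "carrier,arr_del15,arr_delay\nAA,10,500\nAA,5,100\nDL,4,100\nDL,0,0\n"
rows = list(csv.DictReader(io.StringIO(text)))

def find_average_delay(rows, carrier_name=None):
    """Total delay time divided by the number of delayed flights."""
    total_delayed_flights = 0.0
    total_delay_time = 0.0
    for row in rows:
        if carrier_name is None or row["carrier"] == carrier_name:
            total_delayed_flights += float(row["arr_del15"])
            total_delay_time += float(row["arr_delay"])
    if total_delayed_flights == 0:
        return None  # avoid dividing by zero when nothing was delayed
    return total_delay_time / total_delayed_flights

unique_carriers = sorted({row["carrier"] for row in rows})
delays_by_carrier = {c: find_average_delay(rows, c) for c in unique_carriers}

print(delays_by_carrier)
```

Passing rows as a parameter instead of reading a global makes the function easier to test on small samples like this one.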
Appendix 1 Reading a txt file line by line
# Method 1
f = open("foo.txt")          # returns a file object
line = f.readline()          # call the file's readline() method
while line:
    print(line, end='')      # end='' avoids a second newline (Python 3)
    line = f.readline()
f.close()

# Method 2
for line in open("foo.txt"):
    print(line, end='')

# Method 3
f = open("c:\\1.txt", "r")
lines = f.readlines()        # read the whole file into a list of lines
for line in lines:
    print(line, end='')
f.close()
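A fourth variant, which is what I would reach for today, combines iteration over the file object with a with block so that the file is closed automatically. To keep the sketch runnable anywhere, it first writes a throwaway temporary file; the file contents are made up.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on foo.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

lines = []
with open(path, "r") as f:   # the file is closed when the block exits
    for line in f:           # a file object yields one line per iteration
        lines.append(line.rstrip("\n"))

os.remove(path)              # clean up the temporary file
print(lines)
```

Iterating over the file reads one line at a time, so this also works for files too large to load with read() or readlines().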