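The first step in quantitative finance: data statistics and analysis.
The textbook I chose is Python for Data Analysis, published by O'Reilly.
Practical examples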
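1. Processing the 1.usa.gov data from bit.ly.
1) Data: http://www.usa.gov/About/developer-resources/1usagov.shtml
The data is in the common JSON format, with one JSON object per line.
2) Convert the JSON into dictionaries
Note: I saved the data locally as a TXT file for processing. The newline delimiter at the end of each line has to be stripped, and because the file contains a BOM character, that has to be removed as well. Each resulting dictionary is then appended to a list.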
import json
import os
import pickle
from collections import defaultdict
from collections import Counter

records = []
with open("haha6.txt", encoding="utf8") as f:
    for line in f:
        line = line.strip("\n")
        if line.startswith("\ufeff"):
            line = line.encode("utf8")[3:].decode("utf8")  # strip the BOM character
        # json.loads no longer accepts an encoding argument (removed in Python 3.9)
        line = json.loads(line)
        records.append(line)
print(records[0])
# output: the first row of data is:
# {'u': 'http://today.lbl.gov/2016/06/24/saudi-minister-of-energy-visits-lab-on-june-20/#main',
# '_id': '27e6808c-3750-e5ac-002a-cfb577e72a48', 'r': 'direct', 'sl': '2963Ceb', 'h': '2963Ceb',
# 'k': '', 'a': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML', 'c': 'FR',
# 'hc': 1466804416, 'nk': 0, 'll': [48.8582, 2.3387], 'g': '2963Fqo',
# 't': 1467187377, 'hh': '1.usa.gov', 'l': 'anonymous', 'i': '', 'tz': 'Europe/Paris'}
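As an alternative to stripping the BOM by hand, Python's built-in "utf-8-sig" codec drops a leading BOM while decoding. The following is only a minimal sketch, assuming one JSON object per line; load_records and the temporary sample file are hypothetical names made up for illustration. Note that this codec only handles a BOM at the very start of the file, whereas the loop above checks every line.

```python
import json
import os
import tempfile


def load_records(path):
    """Read one JSON object per line, skipping blank lines (hypothetical helper)."""
    records = []
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Build a tiny made-up file that starts with a BOM, then load it.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + b'{"tz": "Europe/Paris"}\n\n{"tz": ""}\n')
    sample_path = tmp.name

sample_records = load_records(sample_path)
os.remove(sample_path)
print(sample_records)
```

Skipping blank lines is an extra safeguard added in this sketch; the original loop assumes every line holds valid JSON.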
3) Find all the time zones and count them
time_zones = [rec["tz"] for rec in records if "tz" in rec]

# Time zone statistics: count how often each "tz" value occurs in the list of dicts
# Method 1: plain dict
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

counts = get_counts(time_zones)
print(counts["America/New_York"])

# Method 2: defaultdict, so missing keys start at 0
def get_counts1(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts

counts = get_counts1(time_zones)
print(counts["America/New_York"])
# output: 353
4) Extract the top ten time zones and their counts
# Method 1: sort (count, tz) pairs and take the last n
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

print(top_counts(counts))

# Method 2: collections.Counter
counts = Counter(time_zones)
print(counts.most_common(10))
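The two methods return their results in different shapes: the sort-based one yields (count, zone) pairs in ascending order, while Counter.most_common yields (zone, count) pairs in descending order. A small sketch on a made-up list (the zone names "A", "B", "C" are placeholders, not values from the bit.ly data) shows the difference:

```python
from collections import Counter

# Made-up sample, not taken from the bit.ly data.
sample_zones = ["A", "B", "A", "C", "A", "B"]

counts = Counter(sample_zones)

# Sort-based approach: (count, zone) pairs, smallest count first.
pairs = sorted((count, tz) for tz, count in counts.items())
top_sorted = pairs[-2:]

# Counter approach: (zone, count) pairs, largest count first.
top_common = counts.most_common(2)

print(top_sorted)  # [(2, 'B'), (3, 'A')]
print(top_common)  # [('A', 3), ('B', 2)]
```

Reversing the sorted list and swapping each pair gives the same ranking as most_common for this sample.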
5) Simplify with pandas: count the time zones and plot the top ten as a bar chart
# Count the time zones with pandas
from pandas import DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

frame = DataFrame(records)
# print(frame)
# tz_counts = frame["tz"].value_counts()
# print(tz_counts[:10])
clean_tz = frame["tz"].fillna("missing")  # handle missing values
clean_tz[clean_tz == ""] = "unknown"  # handle empty strings
tz_counts = clean_tz.value_counts()
tz_counts[:10].plot(kind="barh", rot=0)
plt.show()
# output: a horizontal bar chart
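To see what the two cleaning steps do, here is a minimal sketch on a few made-up records (they are not from the bit.ly file). A record with no "tz" key becomes NaN in the DataFrame and is labeled "missing", while an empty string is labeled "unknown". The sketch uses Series.replace for the empty strings, which is a swapped-in equivalent of the boolean-mask assignment above.

```python
import pandas as pd

# Made-up records for illustration: one has no "tz" key, one has an empty string.
sample = [
    {"tz": "America/New_York"},
    {"tz": ""},
    {"c": "FR"},
    {"tz": "America/New_York"},
]

frame = pd.DataFrame(sample)
clean_tz = frame["tz"].fillna("missing")    # the record without "tz" is NaN here
clean_tz = clean_tz.replace("", "unknown")  # label empty strings explicitly
tz_counts = clean_tz.value_counts()
print(tz_counts)
```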
Time: 2024-10-05 04:33:18