Purpose
I've been house-hunting recently and have combed through all sorts of listing sites. The data on lianjia seems reasonably trustworthy (at the very least the listing photos match reality fairly well), so on a whim I decided to build a few charts to visualize the housing data, and to get some coding practice along the way.
Requirements
1. Fetch the key housing data from the lianjia site, then display it in charts according to my own needs.
2. Fetch the data from the lianjia site once a day.
3. Focus on the Shanghai area (I live in Shanghai).
4. Final charts: total housing transactions, average second-hand home price, listings on sale, transactions in the last 90 days, and yesterday's viewing count.
Analyzing and Fetching the Site Data
1 Data sources
The data comes mainly from two places:
http://sh.lianjia.com/chengjiao/ // transaction-volume statistics
Data on the page (the figures below are from before logging in; after logging in they seem to be a bit higher):
http://sh.lianjia.com/ershoufang/ // second-hand housing data
Data on the page:
2 Fetching method
For scraping the pages, scrapy is the first thing that comes to mind, but the data to fetch here is neither large nor complex, so urllib.request is enough. Later on, to take advantage of Tornado's async support, it gets replaced by httpclient.AsyncHTTPClient().fetch().
3 Fetching the data with urllib.request
First, a basic function, obtain_page_data(), crawls a page:
def obtain_page_data(target_url):
    with urllib.request.urlopen(target_url) as f:
        data = f.read().decode('utf8')
    return data
obtain_page_data() simply requests the given URL and returns the page body.
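The bare version above has no timeout, headers, or error handling. If the site ever starts rejecting the default urllib user agent, a slightly hardened variant could look like this (a sketch only; the User-Agent string and timeout value are arbitrary assumptions, not what the original code uses):

def obtain_page_data(target_url, timeout=10):
    # present a browser-like User-Agent; some sites serve different
    # (or no) content to urllib's default agent
    req = urllib.request.Request(target_url,
                                 headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as f:
        return f.read().decode('utf8')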
Then, with the page data in hand, extract the figures we need. There are two main parts:
1) Total housing transactions (http://sh.lianjia.com/chengjiao/)
Define get_total_dealed_house(), which returns the total transaction count shown on the page. After calling obtain_page_data() to fetch the page, work out where that figure sits in the HTML.
The figure lives under a div, so after parsing the fetched HTML with BeautifulSoup, the text can be pulled out with:
dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
With the text in hand, a regular expression filters out the non-digit characters, leaving the figure we want:
def get_total_dealed_house(target_url):
    # get the total number of housing transactions
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]

    return int(dealed_house_num)
2) Other live data (http://sh.lianjia.com/ershoufang/)
Similarly, work out where the needed figures sit in the page, then extract and filter them:
def get_online_data(target_url):
    # get the city's average listing price, number on sale,
    # 90-day transaction count, and yesterday's viewing count
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)

    return {'avg_price': avg_price, 'on_sale': on_sale,
            'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}
3) Aggregating the data / breaking it down by district
shanghai_data_process() ties together the data from 1) and 2). Lianjia's Shanghai pages can also be queried per district, so that is handled here as well:
def shanghai_data_process():
    '''
    Get the data for each Shanghai district
    :return:
    '''
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    dealed_house_num = get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = get_online_data(ershoufang_page + sh_area_dict[key])
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)
4) Full code and output
import urllib.request
import re
from bs4 import BeautifulSoup
import time

def obtain_page_data(target_url):
    with urllib.request.urlopen(target_url) as f:
        data = f.read().decode('utf8')
    return data

def get_total_dealed_house(target_url):
    # get the total number of housing transactions
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]

    return int(dealed_house_num)

def get_online_data(target_url):
    # get the city's average listing price, number on sale,
    # 90-day transaction count, and yesterday's viewing count
    page_data = obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)

    return {'avg_price': avg_price, 'on_sale': on_sale,
            'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num}

def shanghai_data_process():
    '''
    Get the data for each Shanghai district
    :return:
    '''
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    dealed_house_num = get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = get_online_data(ershoufang_page + sh_area_dict[key])
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)

def main():
    start_time = time.time()
    shanghai_data_process()
    print("time cost: %s" % (time.time() - start_time))


if __name__ == '__main__':
    main()
collect_data.py, first version
Result:
dealed_house_num 51691
zhabei {'yesterday_check_num': '1050', 'sold_in_90': '533', 'avg_price': '67179', 'on_sale': '1674'}
changning {'yesterday_check_num': '1861', 'sold_in_90': '768', 'avg_price': '77977', 'on_sale': '2473'}
baoshan {'yesterday_check_num': '2232', 'sold_in_90': '1410', 'avg_price': '48622', 'on_sale': '4655'}
putuo {'yesterday_check_num': '1695', 'sold_in_90': '910', 'avg_price': '64942', 'on_sale': '3051'}
qingpu {'yesterday_check_num': '463', 'sold_in_90': '253', 'avg_price': '40801', 'on_sale': '1382'}
jinshan {'yesterday_check_num': '0', 'sold_in_90': '8', 'avg_price': '20370', 'on_sale': '11'}
chongming {'yesterday_check_num': '0', 'sold_in_90': '3', 'avg_price': '26755', 'on_sale': '9'}
all {'yesterday_check_num': '28682', 'sold_in_90': '14550', 'avg_price': '59987', 'on_sale': '49396'}
jingan {'yesterday_check_num': '643', 'sold_in_90': '277', 'avg_price': '91689', 'on_sale': '896'}
xuhui {'yesterday_check_num': '2526', 'sold_in_90': '878', 'avg_price': '80623', 'on_sale': '3254'}
songjiang {'yesterday_check_num': '1571', 'sold_in_90': '930', 'avg_price': '44367', 'on_sale': '3294'}
yangpu {'yesterday_check_num': '2774', 'sold_in_90': '981', 'avg_price': '67976', 'on_sale': '2886'}
pudongxinqu {'yesterday_check_num': '7293', 'sold_in_90': '3417', 'avg_price': '62101', 'on_sale': '12767'}
shanghaizhoubian {'yesterday_check_num': '0', 'sold_in_90': '2', 'avg_price': '24909', 'on_sale': '15'}
minhang {'yesterday_check_num': '3271', 'sold_in_90': '1989', 'avg_price': '54968', 'on_sale': '5862'}
hongkou {'yesterday_check_num': '936', 'sold_in_90': '444', 'avg_price': '71654', 'on_sale': '1605'}
fengxian {'yesterday_check_num': '346', 'sold_in_90': '557', 'avg_price': '30423', 'on_sale': '1279'}
jiading {'yesterday_check_num': '875', 'sold_in_90': '767', 'avg_price': '41609', 'on_sale': '2846'}
huangpu {'yesterday_check_num': '1146', 'sold_in_90': '423', 'avg_price': '93880', 'on_sale': '1437'}
time cost: 12.94211196899414
Porting to Tornado
1 Why Tornado
Tornado is a compact asynchronous Python framework. It is used here because fetching pages (IO-bound work) can be done asynchronously for better throughput, which will matter more as the request volume grows.
2 Porting the data fetching to Tornado
The key points here:
1) Fetch the pages asynchronously
Use httpclient.AsyncHTTPClient().fetch() to request the pages, combined with gen.coroutine + yield for the asynchronous flow.
2) Return data from a coroutine with raise gen.Return(data) (on Python 3.3+ a plain return data also works; raise gen.Return is the portable form)
3) The first ported version and its output:
import re
from bs4 import BeautifulSoup
import time
from tornado import httpclient, gen, ioloop

@gen.coroutine
def obtain_page_data(target_url):
    response = yield httpclient.AsyncHTTPClient().fetch(target_url)
    data = response.body.decode('utf8')
    print("start %s %s" % (target_url, time.time()))

    raise gen.Return(data)

@gen.coroutine
def get_total_dealed_house(target_url):
    # get the total number of housing transactions
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]

    raise gen.Return(int(dealed_house_num))

@gen.coroutine
def get_online_data(target_url):
    # get the city's average listing price, number on sale,
    # 90-day transaction count, and yesterday's viewing count
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)

    raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale,
                      'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})

@gen.coroutine
def shanghai_data_process():
    '''
    Get the data for each Shanghai district
    :return:
    '''
    start_time = time.time()
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    dealed_house_num = yield get_total_dealed_house(chenjiao_page)
    sh_area_dict = {
        "all": "",
        "pudongxinqu": "pudongxinqu/",
        "minhang": "minhang/",
        "baoshan": "baoshan/",
        "xuhui": "xuhui/",
        "putuo": "putuo/",
        "yangpu": "yangpu/",
        "changning": "changning/",
        "songjiang": "songjiang/",
        "jiading": "jiading/",
        "huangpu": "huangpu/",
        "jingan": "jingan/",
        "zhabei": "zhabei/",
        "hongkou": "hongkou/",
        "qingpu": "qingpu/",
        "fengxian": "fengxian/",
        "jinshan": "jinshan/",
        "chongming": "chongming/",
        "shanghaizhoubian": "shanghaizhoubian/",
    }
    sh_online_data = {}
    for key, value in sh_area_dict.items():
        sh_online_data[key] = yield get_online_data(ershoufang_page + sh_area_dict[key])
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)

    print("tornado time cost: %s" % (time.time() - start_time))


if __name__ == '__main__':
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(shanghai_data_process)
Tornado port, first version
start http://sh.lianjia.com/chengjiao/ 1480320585.879013
start http://sh.lianjia.com/ershoufang/jinshan/ 1480320586.575354
start http://sh.lianjia.com/ershoufang/chongming/ 1480320587.017322
start http://sh.lianjia.com/ershoufang/yangpu/ 1480320587.515317
start http://sh.lianjia.com/ershoufang/hongkou/ 1480320588.051793
start http://sh.lianjia.com/ershoufang/fengxian/ 1480320588.593865
start http://sh.lianjia.com/ershoufang/jiading/ 1480320589.134367
start http://sh.lianjia.com/ershoufang/qingpu/ 1480320589.6134
start http://sh.lianjia.com/ershoufang/pudongxinqu/ 1480320590.215136
start http://sh.lianjia.com/ershoufang/putuo/ 1480320590.696576
start http://sh.lianjia.com/ershoufang/zhabei/ 1480320591.34218
start http://sh.lianjia.com/ershoufang/changning/ 1480320591.935762
start http://sh.lianjia.com/ershoufang/xuhui/ 1480320592.5159
start http://sh.lianjia.com/ershoufang/minhang/ 1480320593.096085
start http://sh.lianjia.com/ershoufang/songjiang/ 1480320593.749226
start http://sh.lianjia.com/ershoufang/ 1480320594.306287
start http://sh.lianjia.com/ershoufang/shanghaizhoubian/ 1480320594.807418
start http://sh.lianjia.com/ershoufang/huangpu/ 1480320595.2744
start http://sh.lianjia.com/ershoufang/jingan/ 1480320595.850909
start http://sh.lianjia.com/ershoufang/baoshan/ 1480320596.368479
dealed_house_num 51691
jinshan {'yesterday_check_num': '0', 'on_sale': '11', 'avg_price': '20370', 'sold_in_90': '8'}
yangpu {'yesterday_check_num': '2774', 'on_sale': '2886', 'avg_price': '67976', 'sold_in_90': '981'}
hongkou {'yesterday_check_num': '936', 'on_sale': '1605', 'avg_price': '71654', 'sold_in_90': '444'}
fengxian {'yesterday_check_num': '346', 'on_sale': '1279', 'avg_price': '30423', 'sold_in_90': '557'}
chongming {'yesterday_check_num': '0', 'on_sale': '9', 'avg_price': '26755', 'sold_in_90': '3'}
pudongxinqu {'yesterday_check_num': '7293', 'on_sale': '12767', 'avg_price': '62101', 'sold_in_90': '3417'}
putuo {'yesterday_check_num': '1695', 'on_sale': '3051', 'avg_price': '64942', 'sold_in_90': '910'}
zhabei {'yesterday_check_num': '1050', 'on_sale': '1674', 'avg_price': '67179', 'sold_in_90': '533'}
changning {'yesterday_check_num': '1861', 'on_sale': '2473', 'avg_price': '77977', 'sold_in_90': '768'}
baoshan {'yesterday_check_num': '2232', 'on_sale': '4655', 'avg_price': '48622', 'sold_in_90': '1410'}
xuhui {'yesterday_check_num': '2526', 'on_sale': '3254', 'avg_price': '80623', 'sold_in_90': '878'}
minhang {'yesterday_check_num': '3271', 'on_sale': '5862', 'avg_price': '54968', 'sold_in_90': '1989'}
songjiang {'yesterday_check_num': '1571', 'on_sale': '3294', 'avg_price': '44367', 'sold_in_90': '930'}
all {'yesterday_check_num': '28682', 'on_sale': '49396', 'avg_price': '59987', 'sold_in_90': '14550'}
shanghaizhoubian {'yesterday_check_num': '0', 'on_sale': '15', 'avg_price': '24909', 'sold_in_90': '2'}
jingan {'yesterday_check_num': '643', 'on_sale': '896', 'avg_price': '91689', 'sold_in_90': '277'}
jiading {'yesterday_check_num': '875', 'on_sale': '2846', 'avg_price': '41609', 'sold_in_90': '767'}
qingpu {'yesterday_check_num': '463', 'on_sale': '1382', 'avg_price': '40801', 'sold_in_90': '253'}
huangpu {'yesterday_check_num': '1146', 'on_sale': '1437', 'avg_price': '93880', 'sold_in_90': '423'}
tornado time cost: 10.953541040420532
Output of the first Tornado version
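Worth noting: this port still fetches sequentially, since each yield get_online_data(...) inside the loop waits for that request to complete before the next one starts (visible in the "start ..." timestamps above, roughly 0.5 s apart), which is why the time cost only drops from ~12.9 s to ~11 s. Tornado coroutines can also yield a dict of futures, which runs the requests concurrently. A sketch of what that change could look like (assuming sh_area_dict from the version above is lifted to module scope; this is not the code actually used in this post):

@gen.coroutine
def shanghai_data_process_concurrent():
    start_time = time.time()
    dealed_house_num = yield get_total_dealed_house("http://sh.lianjia.com/chengjiao/")
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    # start every district request first, then yield the dict of futures;
    # Tornado resolves them in parallel (bounded by AsyncHTTPClient's
    # default limit of 10 concurrent requests)
    futures = {key: get_online_data(ershoufang_page + path)
               for key, path in sh_area_dict.items()}
    sh_online_data = yield futures
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)
    print("tornado concurrent time cost: %s" % (time.time() - start_time))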
Storing the Data in a Database
I use a MySQL database here. In Tornado, pymysql works as the connector, and SQLAlchemy handles the program's DML.
For more on the SQLAlchemy side, see here.
1) Table structure
Not many tables are needed:
sh_area // Shanghai districts, one row per district
sh_total_city_dealed // total second-hand transactions for the whole city
online_data // per-district second-hand housing data
2) Initializing the tables with SQLAlchemy
settings holds the database-connection configuration.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DB = {
    'connector': 'mysql+pymysql://root:[email protected]:3306/devdb1',
    'max_session': 5
}

engine = create_engine(DB['connector'], max_overflow=DB['max_session'], echo=False)
SessionCls = sessionmaker(bind=engine)
session = SessionCls()

# (the full settings module also defines sh_area_dict, the district
# mapping referenced as settings.sh_area_dict in the code below)
settings.py
Initialization script
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, ForeignKey, DateTime

import os, sys
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(BASE_DIR)

from conf import settings

Base = declarative_base()

class SH_Area(Base):
    __tablename__ = 'sh_area'  # table name
    id = Column(Integer, primary_key=True)
    name = Column(String(64))

class Online_Data(Base):
    __tablename__ = 'online_data'  # table name
    id = Column(Integer, primary_key=True)
    sold_in_90 = Column(Integer)
    avg_price = Column(Integer)
    yesterday_check_num = Column(Integer)
    on_sale = Column(Integer)
    date = Column(DateTime)
    belong_area = Column(Integer, ForeignKey('sh_area.id'))

class SH_Total_city_dealed(Base):
    __tablename__ = 'sh_total_city_dealed'  # table name
    id = Column(Integer, primary_key=True)
    dealed_house_num = Column(Integer)
    date = Column(DateTime)
    memo = Column(String(64), nullable=True)

def db_init():
    Base.metadata.create_all(settings.engine)  # create the table schema
    for district in settings.sh_area_dict.keys():
        item_obj = SH_Area(name=district)
        settings.session.add(item_obj)
    settings.session.commit()


if __name__ == '__main__':
    db_init()
database_init
Drawing the Charts
1 Frontend rendering
For the charts I use Highcharts. The charts look good, and all you have to supply is the data.
I use the basic line chart, which needs a few JS files included on the frontend: jquery.min.js, highcharts.js, exporting.js. Then add a div identified by an id; the example uses id="container".
The official JS part looks like this:
$(function () {
    $('#container').highcharts({
        title: {
            text: 'Monthly Average Temperature',
            x: -20 //center
        },
        subtitle: {
            text: 'Source: WorldClimate.com',
            x: -20
        },
        xAxis: {
            categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        },
        yAxis: {
            title: {
                text: 'Temperature (°C)'
            },
            plotLines: [{
                value: 0,
                width: 1,
                color: '#808080'
            }]
        },
        tooltip: {
            valueSuffix: '°C'
        },
        legend: {
            layout: 'vertical',
            align: 'right',
            verticalAlign: 'middle',
            borderWidth: 0
        },
        series: [{
            name: 'Tokyo',
            data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
        }, {
            name: 'New York',
            data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
        }, {
            name: 'Berlin',
            data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
        }, {
            name: 'London',
            data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
        }]
    });
});
Official JS
My work builds on this: modifying the JS to draw the charts I need.
For the specific changes, see the code on GitHub; the resulting charts look like this.
2 Backend: fetching the data and passing it to the frontend
The frontend charts basically take one- or two-dimensional arrays, e.g. an x-axis time array [time1, time2, time3] and a y-axis data array [data1, data2, data3].
A few things to note:
1) On the Tornado side, return the data by rendering it into the target page with render().
2) In the JS, read the data with {{ data_rendered }} (Tornado's template interpolation); a small sketch of both sides follows this list.
3) The backend sends time values as Unix timestamps, which need formatting for display:
function formatDate(timestamp_v) {
    var now = new Date(parseFloat(timestamp_v) * 1000);
    var year = now.getFullYear();
    var month = now.getMonth() + 1;
    var date = now.getDate();
    var hour = now.getHours();
    var minute = now.getMinutes();
    var second = now.getSeconds();
    return year + "-" + month + "-" + date + " " + hour + ":" + minute + ":" + second;
};
formatDate
4) Mind how two-dimensional arrays are defined and handled on the JS side.
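To make 1) and 2) concrete, here is a minimal sketch of both sides. The handler and template names are illustrative, not the project's actual files; json_encode is part of Tornado's default template namespace:

from tornado import web

class ChartHandler(web.RequestHandler):
    def get(self):
        # illustrative values; the real handler builds these from the DB
        cata_list = [1480320585.0, 1480406985.0]
        data_list = [51691, 51800]
        self.render("chart.html", cata_list=cata_list, data_list=data_list)

# In chart.html the rendered values land directly in the JS, e.g.:
#     var categories = {% raw json_encode(cata_list) %};
#     var seriesData = {% raw json_encode(data_list) %};
# ({% raw %} skips Tornado's HTML autoescaping, which would otherwise
# escape the quotes inside the encoded JSON)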
3 Passing parameters from the frontend to the backend
Since the requirement includes charts for each Shanghai district, the URL can be designed as r'/view/(\w+)/(\w+)': the first capture is the city (e.g. sh, bj) and the second is the specific district (area). The backend takes these two parameters, looks the data up in the database, and returns it.
Putting It All Together
Once the database has data, the rest is the frontend/backend exchange: where to draw which charts, what data they need, and the backend returning it. The main code ends up like this:
import re
from bs4 import BeautifulSoup
import datetime
import time
from tornado import httpclient, gen, ioloop, httpserver
from tornado import web
import tornado.options
import json

import os, sys
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(BASE_DIR)

from conf import settings
from database_init import Online_Data, SH_Total_city_dealed, SH_Area
from tornado.options import define, options

define("port", default=8888, type=int)


@gen.coroutine
def obtain_page_data(target_url):
    response = yield httpclient.AsyncHTTPClient().fetch(target_url)
    data = response.body.decode('utf8')
    print("start %s %s" % (target_url, time.time()))

    raise gen.Return(data)

@gen.coroutine
def get_total_dealed_house(target_url):
    # get the total number of housing transactions
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    dealed_house = soup_obj.html.body.find('div', {'class': 'list-head'}).text
    dealed_house_num = re.findall(r'\d+', dealed_house)[0]

    raise gen.Return(int(dealed_house_num))

@gen.coroutine
def get_online_data(target_url):
    # get the city's average listing price, number on sale,
    # 90-day transaction count, and yesterday's viewing count
    page_data = yield obtain_page_data(target_url)
    soup_obj = BeautifulSoup(page_data, "html.parser")
    online_data_str = soup_obj.html.body.find('div', {'class': 'secondcon'}).text
    online_data = online_data_str.replace('\n', '')
    avg_price, on_sale, _, sold_in_90, yesterday_check_num = re.findall(r'\d+', online_data)

    raise gen.Return({'avg_price': avg_price, 'on_sale': on_sale,
                      'sold_in_90': sold_in_90, 'yesterday_check_num': yesterday_check_num})

@gen.coroutine
def shanghai_data_process():
    '''
    Get the data for each Shanghai district and store it
    :return:
    '''
    start_time = time.time()
    chenjiao_page = "http://sh.lianjia.com/chengjiao/"
    ershoufang_page = "http://sh.lianjia.com/ershoufang/"
    dealed_house_num = yield get_total_dealed_house(chenjiao_page)
    sh_online_data = {}
    for key, value in settings.sh_area_dict.items():
        sh_online_data[key] = yield get_online_data(ershoufang_page + settings.sh_area_dict[key])
    print("dealed_house_num %s" % dealed_house_num)
    for key, value in sh_online_data.items():
        print(key, value)

    print("tornado time cost: %s" % (time.time() - start_time))

    # write the collected data to the DB via settings.session
    update_date = datetime.datetime.now()
    dealed_house_num_obj = SH_Total_city_dealed(dealed_house_num=dealed_house_num,
                                                date=update_date)
    settings.session.add(dealed_house_num_obj)

    for key, value in sh_online_data.items():
        area_obj = settings.session.query(SH_Area).filter_by(name=key).first()
        online_data_obj = Online_Data(sold_in_90=value['sold_in_90'],
                                      avg_price=value['avg_price'],
                                      yesterday_check_num=value['yesterday_check_num'],
                                      on_sale=value['on_sale'],
                                      date=update_date,
                                      belong_area=area_obj.id)
        settings.session.add(online_data_obj)
    settings.session.commit()

class IndexHandler(web.RequestHandler):
    def get(self, *args, **kwargs):
        total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()
        cata_list = []
        data_list = []
        for item in total_dealed_house_num:
            cata_list.append(time.mktime(item.date.timetuple()))
            data_list.append(item.dealed_house_num)

        area_id = settings.session.query(SH_Area).filter_by(name='all').first()
        area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
        area_date_list = []
        area_data_list = []
        area_on_sale_list = []
        area_sold_in_90_list = []
        area_yesterday_check_num = []
        for item in area_avg_price:
            area_date_list.append(time.mktime(item.date.timetuple()))
            area_data_list.append(item.avg_price)
            area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
            area_sold_in_90_list.append(item.sold_in_90)
            area_yesterday_check_num.append(item.yesterday_check_num)
        self.render("index.html", cata_list=cata_list,
                    data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                    area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                    area_yesterday_check_num=area_yesterday_check_num, city="sh", area="all")

class QueryHandler(web.RequestHandler):
    def get(self, city, area):

        if city == "sh":
            total_dealed_house_num = settings.session.query(SH_Total_city_dealed).all()

            cata_list = []
            data_list = []
            for item in total_dealed_house_num:
                cata_list.append(time.mktime(item.date.timetuple()))
                data_list.append(item.dealed_house_num)

            area_id = settings.session.query(SH_Area).filter_by(name=area).first()
            area_avg_price = settings.session.query(Online_Data).filter_by(belong_area=area_id.id).all()
            area_date_list = []
            area_data_list = []
            area_on_sale_list = []
            area_sold_in_90_list = []
            area_yesterday_check_num = []
            for item in area_avg_price:
                area_date_list.append(time.mktime(item.date.timetuple()))
                area_data_list.append(item.avg_price)
                area_on_sale_list.append([time.mktime(item.date.timetuple()), item.on_sale])
                area_sold_in_90_list.append(item.sold_in_90)
                area_yesterday_check_num.append(item.yesterday_check_num)

            self.render("index.html", cata_list=cata_list,
                        data_list=data_list, area_date_list=area_date_list, area_data_list=area_data_list,
                        area_on_sale_list=area_on_sale_list, area_sold_in_90_list=area_sold_in_90_list,
                        area_yesterday_check_num=area_yesterday_check_num, city=city, area=area)
        else:
            self.redirect("/")


class MyApplication(web.Application):
    def __init__(self):
        handlers = [
            (r'/', IndexHandler),
            (r'/view/(\w+)/(\w+)', QueryHandler),
        ]

        settings = {
            'static_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "static"),
            'template_path': os.path.join(os.path.dirname(os.path.dirname(__file__)), "templates"),
        }

        super(MyApplication, self).__init__(handlers, **settings)


if __name__ == '__main__':
    http_server = httpserver.HTTPServer(MyApplication())
    http_server.listen(options.port)
    ioloop.PeriodicCallback(shanghai_data_process, 86400000).start()  # interval in ms: 86400000 = 24 h
    ioloop.IOLoop.instance().start()
data_collect
A few notes:
1 Since the data has to be fetched from the site periodically, ioloop.PeriodicCallback() is used to schedule the job.
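One caveat: PeriodicCallback fires its first callback only after the full interval has elapsed, so with a 24-hour interval the first collection happens a day after startup. If data is wanted immediately (an assumption about the desired behavior, not something the original code does), one extra line schedules a run at startup:

if __name__ == '__main__':
    http_server = httpserver.HTTPServer(MyApplication())
    http_server.listen(options.port)
    # collect once right away on the next IOLoop iteration...
    ioloop.IOLoop.instance().add_callback(shanghai_data_process)
    # ...then every 24 hours (interval is in milliseconds)
    ioloop.PeriodicCallback(shanghai_data_process, 86400000).start()
    ioloop.IOLoop.instance().start()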
Deploying Behind nginx
I have an AWS EC2 instance running CentOS 7, and that is where the program ultimately runs.
1 Installing nginx
For lack of time I didn't dig deep; I just looked up the basics online:
1. Download the nginx package with wget (nginx-1.11.6.tar.gz) and extract it
2. cd into nginx-1.11.6
3. ./configure
4. make
5. make install
Edit the config file at /usr/local/nginx/conf/nginx.conf.
Reload nginx with /usr/local/nginx/sbin/nginx -s reload.
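The relevant change in nginx.conf is a reverse proxy from port 80 to the Tornado process. A minimal sketch of the server block, inside the http { } section (assuming Tornado listens on port 8888 as defined earlier; this is not the exact config I used):

server {
    listen 80;

    location / {
        # forward everything to the local Tornado process
        proxy_pass http://127.0.0.1:8888;
        # pass the original host and client address through
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}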
2 Adjust the instance's inbound firewall rules; I opened port 80 (the nginx config listens on port 80 as well).
1. Log in to the AWS console
2. INSTANCES -> Instances on the left
3. Security group on the right
4. Inbound section below
5. Edit
6. Add your rule on the "edit inbound rules" page
3 Test access to nginx
If everything is fine, you'll see the "Welcome to nginx" page.
4 Run the Tornado code, then reload nginx.
Screenshots and Code
1 A few screenshots of the result:
2 The code is on GitHub.
Fixing the SQLAlchemy Session Problem
A few days after the code went live, I noticed that about every half day the program, while not crashing, would return a 500 error in the browser, with matching errors in the backend log.
Studying the log, the errors came from serving requests with old session state that had already expired inside the program. Reviewing the code confirmed it: a single session is created in the settings file and then reused for every DB operation afterwards (the classic failure mode of one long-lived connection, likely compounded by MySQL closing idle connections after its wait_timeout). Clearly a problem.
The fix is simple: tie the DB session's lifetime to the lifetime of each HTTP request, i.e. open a DB session when a request starts and close it when the request finishes. The Flask framework's docs describe this same pattern.
1 The SQLAlchemy side
To implement the above, SQLAlchemy provides a dedicated object, scoped_session. The official example:
>>> from sqlalchemy.orm import scoped_session
>>> from sqlalchemy.orm import sessionmaker

# create the session
>>> session_factory = sessionmaker(bind=some_engine)
>>> Session = scoped_session(session_factory)

# close the session
>>> Session.remove()
See here for more details.
2 The Tornado side
Override two methods in the RequestHandler: initialize() opens the DB session, and on_finish() closes it. BaseHandler is a base handler; the other request handlers just need to inherit from it.
class BaseHandler(web.RequestHandler):
    def initialize(self):
        self.db_session = scoped_session(sessionmaker(bind=settings.engine))
        self.db_query = self.db_session().query

    def on_finish(self):
        self.db_session.remove()
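With BaseHandler in place, the existing handlers just switch their base class and query through the per-request session. A trimmed-down sketch of how IndexHandler changes (only the total-transactions series is shown; the real handler also builds the per-district lists, with the same imports as in data_collect above):

class IndexHandler(BaseHandler):
    def get(self, *args, **kwargs):
        # self.db_query is created in initialize() and the underlying
        # session is removed in on_finish(), so nothing outlives the request
        rows = self.db_query(SH_Total_city_dealed).all()
        cata_list = [time.mktime(item.date.timetuple()) for item in rows]
        data_list = [item.dealed_house_num for item in rows]
        self.render("index.html", cata_list=cata_list, data_list=data_list)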