Scraping with BeautifulSoup

Just before Chinese New Year a painful requirement landed on my desk: grab all the merchant data from a certain review site (Dianping) and store it in a database. It actually turned out to be fairly simple.

First, copy the city data down from Dianping. You could also write a script to fetch it, but I didn't bother. Below are the scripts I used for the scraping; sharing them with everyone:

City list (one entry per line, in the form pinyin|city id|Chinese name):

alashan|57|阿拉善
anshan|58|鞍山
anqing|117|安庆
anhuisuzhou|121|宿州
anyang|164|安阳
aba|255|阿坝
anshun|261|安顺
ali|288|阿里
ankang|297|安康
akesudiqu|332|阿克苏地区
aletaidiqu|338|阿勒泰地区
macau|342|澳门
alaer|389|阿拉尔
australia|2318|澳大利亚其他
auckland|2384|奥克兰
orlando|2401|奥兰多
agra|2410|阿格拉
antwerp|2422|安特卫普
amsterdam|2428|阿姆斯特丹
antalya|2445|安塔丽亚
ankara|2446|安卡拉
athens|2455|雅典
edinburgh|2465|爱丁堡
alexandria|2473|亚历山大
aswan|2474|亚斯文
ethiopia|2496|埃塞俄比亚
alishan|2503|阿里山
beijing|2|北京
baoding|29|保定
baotou|47|包头
bayannaoer|56|巴彦淖尔
benxi|60|本溪
baishan|75|白山
baicheng|77|白城
bengbu|112|蚌埠
bozhou|124|亳州
binzhou|158|滨州
beihai|228|北海
baise|233|百色
bazhong|253|巴中
bijiediqu|264|毕节地区
baoshan|270|保山
baoji|291|宝鸡
baiyin|302|白银
boertala|330|博尔塔拉
bayinguoleng|331|巴音郭楞
beitun|346|北屯
baisha|390|白沙
baoting|391|保亭
bangkok|2342|曼谷
pattaya|2344|芭堤雅
pai|2349|拜县
bali|2351|巴厘岛
bandung|2352|万隆
boracay|2355|长滩岛
palawan|2357|巴拉望岛
bohol|2358|薄荷岛
busan|2370|釜山
hokkaido|2375|北海道
brisbane|2381|布里斯班
paris|2388|巴黎
boston|2403|波士顿
brussels|2420|布鲁塞尔
bruges|2421|布鲁日
berlin|2423|柏林
prague|2431|布拉格
brno|2433|布尔诺
porto|2435|波尔图
bern|2443|伯尔尼
barcelona|2449|巴塞罗纳
budapest|2457|布达佩斯
pisa|2462|比萨
pretoria|2480|比勒陀利亚
buenosaires|2485|布宜诺斯艾利斯
brunei|2492|文莱
chengdu|8|成都
chongqing|9|重庆
chengde|31|承德
cangzhou|32|沧州
changzhi|38|长治
chifeng|49|赤峰
chaoyang|68|朝阳
changchun|70|长春
changzhou|93|常州
chuzhou|119|滁州
chaohu|122|巢湖
chizhou|125|池州
changde|197|常德
chenzhou|200|郴州
chaozhou|221|潮州
chuxiongzhou|272|楚雄州
changdudiqu|284|昌都地区
changjizhou|329|昌吉州
changsha|344|长沙
changjiang|392|昌江
chengmai|393|澄迈县
chongzuo|394|崇左
cixi|421|慈溪
cangnan|911|苍南
changle|981|长乐
cambodia|2316|柬埔寨其他
chiangmai|2345|清迈
chiangrai|2348|清莱
boracay|2355|长滩岛
cebu|2356|宿雾
okinawa|2377|冲绳
canberra|2382|堪培拉
cairns|2383|凯恩斯
christchurch|2387|基督城
cannes|2391|戛纳
chicago|2400|芝加哥
cologne|2425|科隆
creteisland|2453|克里特
cambridge|2466|剑桥
cairo|2472|开罗
casablanca|2477|卡萨布兰卡
capetown|2478|开普敦
cancun|2482|坎坤
cuzco|2487|库斯科
costarica|2500|哥斯达黎加
dalian|19|大连
datong|36|大同
dandong|61|丹东
daqing|84|大庆
daxinganling|91|大兴安岭
dongying|147|东营
dezhou|156|德州
dongguan|219|东莞
deyang|241|德阳
dazhou|251|达州
dali|277|大理
dehong|278|德宏
diqing|281|迪庆
dingxi|309|定西
danzhou|358|儋州
dingan|395|定安县
dongfang|396|东方
tokyo|2372|东京
osaka|2374|大阪
dijon|2394|第戎
tahiti|2405|大溪地
delhi|2407|新德里
toronto|2413|多伦多
turin|2463|都灵
dublin|2470|都柏林
eerduosi|51|鄂尔多斯
ezhou|181|鄂州
enshizhou|188|恩施州
edinburgh|2465|爱丁堡
ethiopia|2496|埃塞俄比亚
fuzhou|14|福州
fushun|59|抚顺
fuxin|64|阜新
fuyang|120|阜阳
jiangxifuzhou|143|抚州
foshan|208|佛山
fangchenggang|229|防城港
fenghua|422|奉化
fuqing|433|福清
fuyangfy|869|富阳
fuding|1031|福鼎
philippines|2327|菲律宾其他
fiji|2328|斐济
france|2331|法国其他
busan|2370|釜山
fujisan|2376|富士山
frankfurt|2426|法兰克福
florence|2459|佛罗伦萨
fukuoka|2505|福冈
guangzhou|4|广州
ganzhou|140|赣州
guilin|226|桂林
guigang|231|贵港
guangxiyulin|232|玉林
guangyuan|243|广元
guangan|250|广安
ganzi|256|甘孜
guiyang|258|贵阳
gannanzhou|312|甘南
guoluo|318|果洛
guyuan|324|固原
guowai|343|国外其他
kaohsiung|2337|高雄
goldcoast|2380|黄金海岸
gothenburg|2437|哥德堡
geneva|2440|日内瓦
costarica|2500|哥斯达黎加
hangzhou|3|杭州
haikou|23|海口
handan|27|邯郸
hengshui|34|衡水
huhehaote|46|呼和浩特
hulunbeier|52|呼伦贝尔
huludao|69|葫芦岛
haerbin|79|哈尔滨
hegang|82|鹤岗
heihe|89|黑河
huaian|96|淮安
huzhou|103|湖州
hefei|110|合肥
huainan|113|淮南
huaibei|115|淮北
huangshan|118|黄山
heze|159|菏泽
hebi|165|鹤壁
huangshi|177|黄石
huanggang|185|黄冈
hengyang|194|衡阳
huaihua|202|怀化
huizhou|213|惠州
heyuan|216|河源
hezhou|234|贺州
hechi|235|河池
honghe|273|红河
hanzhong|295|汉中
haidong|314|海东
haibei|315|海北
huangnan|316|黄南
haixi|320|海西
hamidiqu|328|哈密地区
hetiandiqu|335|和田地区
hongkong|341|香港
hainanzhou|411|海南州
korea|2314|韩国其他
hualien|2336|花莲
hochiminh|2366|胡志明市
hanoi|2367|河内
haiphong|2368|海防市
hokkaido|2375|北海道
hakone|2378|箱根
goldcoast|2380|黄金海岸
queenstown|2385|皇后镇
wellington|2386|惠灵顿
hawaii|2404|夏威夷
hamburg|2424|汉堡
thehague|2430|海牙
jinan|22|济南
jincheng|39|晋城
jinzhong|41|晋中
jinzhou|62|锦州
jilin|71|吉林
jixi|81|鸡西
jiamusi|86|佳木斯
jiaxing|102|嘉兴
jinhua|105|金华
jingdezhen|135|景德镇
jiujiang|137|九江
jian|141|吉安
jiangxiyichun|142|宜春
jiangxifuzhou|143|抚州
jining|150|济宁
jiaozuo|167|焦作
jinmen|182|荆门
jingzhou|184|荆州
jiangmen|209|江门
jieyang|222|揭阳
jiayuguan|300|嘉峪关
jinchang|301|金昌
jiuquan|307|酒泉
jiyuan|397|济源
jingjian|853|靖江
jinjiang|1009|晋江
japan|2315|日本其他
cambodia|2316|柬埔寨其他
jakarta|2353|雅加达
kualalumpur|2359|吉隆坡
phnompenh|2364|金边
jeju|2371|济州岛
kyoto|2373|京都
christchurch|2387|基督城
sanfrancisco|2396|旧金山
jaipur|2411|斋浦尔
cambridge|2466|剑桥
killarney|2471|基拉尼
johannesburg|2479|约翰内斯堡
jordan|2495|约旦
jamaica|2501|牙买加
kaifeng|161|开封
kunming|267|昆明
kelamayi|326|克拉玛依
kezilesu|333|克孜勒苏
kashidiqu|334|喀什地区
kunshan|416|昆山
korea|2314|韩国其他
kaohsiung|2337|高雄
kohphiphi|2347|皮皮岛
kohsamet|2350|沙美岛
kualalumpur|2359|吉隆坡
kyoto|2373|京都
canberra|2382|堪培拉
cairns|2383|凯恩斯
kenting|2406|垦丁
cologne|2425|科隆
karlovyvary|2432|卡罗维瓦立
creteisland|2453|克里特
killarney|2471|基拉尼
cairo|2472|开罗
casablanca|2477|卡萨布兰卡
capetown|2478|开普敦
cancun|2482|坎坤
cuzco|2487|库斯科
kenya|2497|肯尼亚
langfang|33|廊坊
linfen|44|临汾
lvliang|45|吕梁
liaoyang|65|辽阳
liaoyuan|73|辽源
lianyuangang|95|连云港
lishui|109|丽水
liuan|123|六安
longyan|132|龙岩
laiwu|154|莱芜
linyi|155|临沂
liaocheng|157|聊城
luoyang|162|洛阳
luohe|170|漯河
loudi|203|娄底
liuzhou|225|柳州
luzhou|240|泸州
leshan|246|乐山
liangshan|257|凉山
liupanshui|259|六盘水
lijiang|279|丽江
linchang|282|临沧
lasa|283|拉萨
linzhi|289|林芝地区
lanzhou|299|兰州
longnan|310|陇南
linxiazhou|311|临夏州
laibin|398|来宾
ledong|399|乐东
lingao|400|临高县
lingshui|401|陵水
liyang|867|溧阳
linan|868|临安
yueqing|905|乐清
liuhai|1015|龙海
liuyang|1376|浏阳
langkawi|2361|兰卡威
lyon|2393|里昂
losangeles|2397|洛杉矶
lasvegas|2398|拉斯维加斯
rotterdam|2429|鹿特丹
lisbon|2434|里斯本
luzern|2442|卢塞恩
rome|2458|罗马
london|2464|伦敦
liverpool|2469|利物浦
luxor|2475|卢克索
riodejaneiro|2483|里约热内卢
lima|2488|利马
laos|2490|老挝
lebanon|2491|黎巴嫩
mudanjiang|88|牡丹江
maanshan|114|马鞍山
maoming|211|茂名
meizhou|214|梅州
mianyang|242|绵阳
meishan|248|眉山
macau|342|澳门
malaysia|2312|马来西亚其他
melbourne|2322|墨尔本
maldives|2324|马尔代夫
mauritius|2329|毛里求斯
unitedstates|2332|美国其他
bangkok|2342|曼谷
manila|2354|马尼拉
marseille|2392|马赛
miami|2402|迈阿密
mumbai|2409|孟买
montreal|2412|蒙特娄
munich|2427|慕尼黑
malmo|2438|马尔默
pamukkale|2448|棉花堡
madrid|2450|马德里
majorca|2452|马略卡岛
mykonos|2456|米科诺斯
manchester|2467|曼彻斯特
marrakech|2476|马拉喀什
mexicocity|2481|墨西哥城
machupicchu|2489|马丘比丘
myanmar|2493|缅甸
madagascar|2498|马达加斯加
nanjing|5|南京
ningbo|11|宁波
nantong|94|南通
nanping|131|南平
ningde|133|宁德
nanchang|134|南昌
nanyang|172|南阳
nanning|224|南宁
neijiang|245|内江
nanchong|247|南充
nujiang|280|怒江
naqu|287|那曲
ninghai|2308|宁海
newzealand|2319|新西兰其他
nepal|2333|尼泊尔
newtaipei|2340|新北
nice|2390|尼斯
newyork|2395|纽约
niagarafalls|2414|尼亚加拉瀑布
naples|2461|那不勒斯
oxford|2468|牛津
riyuetan|2504|南投
osaka|2374|大阪
okinawa|2377|冲绳
orlando|2401|奥兰多
ottawa|2416|渥太华
oxford|2468|牛津
panjin|66|盘锦
putian|127|莆田
pingxiang|136|萍乡
pingdingshan|163|平顶山
puyang|168|濮阳
panzhihua|239|攀枝花
puer|275|普洱
pingliang|306|平凉
pingyang|908|平阳
philippines|2327|菲律宾其他
phuket|2343|普吉岛
pattaya|2344|芭堤雅
kohphiphi|2347|皮皮岛
pai|2349|拜县
palawan|2357|巴拉望岛
penang|2360|槟城
phnompenh|2364|金边
paris|2388|巴黎
provence|2389|普罗旺斯
prague|2431|布拉格
porto|2435|波尔图
pamukkale|2448|棉花堡
pisa|2462|比萨
pretoria|2480|比勒陀利亚
palau|2502|帕劳
qingdao|21|青岛
qinghuangdao|26|秦皇岛
qiqihaer|80|齐齐哈尔
qitaihe|87|七台河
quzhou|106|衢州
quanzhou|129|泉州
qianjiang|190|潜江
qingyuan|218|清远
qinzhou|230|钦州
qianxinan|263|黔西南
qiandongnan|265|黔东南
qiannan|266|黔南
qujing|268|曲靖
qingyang|308|庆阳
qionghai|402|琼海
qiongzhong|403|琼中
chiangmai|2345|清迈
chiangrai|2348|清莱
queenstown|2385|皇后镇
rizhao|153|日照
rikazediqu|286|日喀则地区
ruian|904|瑞安
rongcheng|1161|荣成
japan|2315|日本其他
rotterdam|2429|鹿特丹
geneva|2440|日内瓦
rome|2458|罗马
riodejaneiro|2483|里约热内卢
riyuetan|2504|南投
shanghai|1|上海
suzhou|6|苏州
shenzhen|7|深圳
shenyang|18|沈阳
shijiazhuang|24|石家庄
shuozhou|40|朔州
siping|72|四平
songyuan|76|松原
shuangyashan|83|双鸭山
suihua|90|绥化
suqian|100|宿迁
shaoxing|104|绍兴
anhuisuzhou|121|宿州
sanming|128|三明
shangrao|144|上饶
sanmenxia|171|三门峡
shangqiu|173|商丘
shiyan|178|十堰
suizhou|187|随州
shaoyang|195|邵阳
shaoguan|205|韶关
shantou|207|汕头
shanwei|215|汕尾
suining|244|遂宁
shannan|285|山南
shangluo|298|商洛
shizuishan|322|石嘴山
shihezi|339|石河子
sanya|345|三亚
shennongjia|404|神农架林区
shishi|1008|石狮
sansha|2310|三沙
singapore|2311|新加坡
saipan|2326|塞班岛
srilanka|2330|斯里兰卡
seychelles|2334|塞舌尔
samui|2346|苏梅岛
kohsamet|2350|沙美岛
cebu|2356|宿雾
sabah|2362|沙巴
siemreap|2363|暹粒
sihanoukville|2365|西哈努克
seoul|2369|首尔
sydney|2379|悉尼
sanfrancisco|2396|旧金山
seattle|2399|西雅图
salzburg|2418|萨尔兹堡
stockholm|2436|斯德哥尔摩
zurich|2439|苏黎世
seville|2451|塞维利亚
santorini|2454|圣托里尼
saopaulo|2484|圣保罗
santiago|2486|圣地亚哥
tianjin|10|天津
tangshan|25|唐山
taiyuan|35|太原
tongliao|50|通辽
tieling|67|铁岭
tonghua|74|通化
taizhou|99|泰州
zhejiangtaizhou|108|台州
tongling|116|铜陵
taian|151|泰安
tianmen|191|天门
tongrendiqu|262|铜仁地区
tongchuan|290|铜川
tianshui|303|天水
tulufandiqu|327|吐鲁番地区
tachengdiqu|337|塔城地区
taiwan|340|台湾其他
tumushuke|405|图木舒克
tunchang|406|屯昌县
thailand|2313|泰国其他
taipei|2335|台北
tainan|2338|台南
taoyuan|2339|桃园
taichung|2341|台中
tokyo|2372|东京
tahiti|2405|大溪地
toronto|2413|多伦多
thehague|2430|海牙
turin|2463|都灵
tanzania|2499|坦桑尼亚
vietnam|2317|越南其他
varanasi|2408|瓦拉纳西
vancouver|2415|温哥华
vienna|2417|维也纳
venice|2460|威尼斯
wuxi|13|无锡
wuhan|16|武汉
wuhai|48|乌海
wulanchabu|55|乌兰察布
wenzhou|101|温州
wuhu|111|芜湖
weifang|149|潍坊
weihai|152|威海
wuzhou|227|梧州
wenshan|274|文山州
weinan|293|渭南
wuwei|304|武威
wuzhong|323|吴忠
wulumuqi|325|乌鲁木齐
wanning|407|万宁
wenchang|408|文昌
wujiaqu|409|五家渠
wuzhishan|410|五指山
wendeng|1163|文登
bandung|2352|万隆
wellington|2386|惠灵顿
varanasi|2408|瓦拉纳西
vancouver|2415|温哥华
vienna|2417|维也纳
venice|2460|威尼斯
brunei|2492|文莱
xiamen|15|厦门
xian|17|西安
xingtai|28|邢台
xinzhou|43|忻州
xingan|53|兴安盟
xilinguole|54|锡林郭勒
xuzhou|92|徐州
xuancheng|126|宣城
xinyu|138|新余
xinxiang|166|新乡
xuchang|169|许昌
xinyang|174|信阳
xiangyang|180|襄阳
xiaogan|183|孝感
xianning|186|咸宁
xiantao|189|仙桃
xiangtan|193|湘潭
xiangxi|204|湘西
xishuangbanna|276|西双版纳
xianyang|292|咸阳
xining|313|西宁
hongkong|341|香港
singapore|2311|新加坡
newzealand|2319|新西兰其他
newtaipei|2340|新北
sihanoukville|2365|西哈努克
hakone|2378|箱根
sydney|2379|悉尼
seattle|2399|西雅图
hawaii|2404|夏威夷
delhi|2407|新德里
yangzhou|12|扬州
yangquan|37|阳泉
yuncheng|42|运城
yingkou|63|营口
yanbian|78|延边
yichun|85|伊春
yancheng|97|盐城
yingtan|139|鹰潭
jiangxiyichun|142|宜春
yantai|148|烟台
yichang|179|宜昌
yueyang|196|岳阳
yiyang|199|益阳
yongzhou|201|永州
yangjiang|217|阳江
yunfu|223|云浮
guangxiyulin|232|玉林
yibin|249|宜宾
yaan|252|雅安
yuxi|269|玉溪
yanan|294|延安
yulin|296|榆林
yushu|319|玉树
yinchuan|321|银川
yili|336|伊犁
yiwu|385|义乌
yuyao|423|余姚
yongkang|893|永康
yueqing|905|乐清
vietnam|2317|越南其他
indonesia|2325|印度尼西亚其他
jakarta|2353|雅加达
innsbruck|2419|因斯布鲁克
interlaken|2441|因特拉肯
istanbul|2444|伊斯坦布尔
izmir|2447|伊兹密尔
athens|2455|雅典
alexandria|2473|亚历山大
aswan|2474|亚斯文
johannesburg|2479|约翰内斯堡
israel|2494|以色列
jordan|2495|约旦
jamaica|2501|牙买加
chongqing|9|重庆
zhangjiakou|30|张家口
zhengjiang|98|镇江
zhoushan|107|舟山
zhejiangtaizhou|108|台州
zhangzhou|130|漳州
zibo|145|淄博
zaozhuang|146|枣庄
zhengzhou|160|郑州
zhoukou|175|周口
zhumadian|176|驻马店
zhuzhou|192|株洲
zhangjiajie|198|张家界
zhuhai|206|珠海
zhanjiang|210|湛江
zhaoqing|212|肇庆
zhongshan|220|中山
zigong|238|自贡
ziyang|254|资阳
zunyi|260|遵义
zhaotong|271|昭通
zhangye|305|张掖
zhongwei|351|中卫
zhuji|883|诸暨
zhangqiu|1118|章丘
chicago|2400|芝加哥
jaipur|2411|斋浦尔
zurich|2439|苏黎世
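
Each entry above is pinyin|city id|Chinese name, and the list scraper below only needs the numeric id in the middle field. As a minimal sketch (assuming the list is saved to a file named cities, the same file the main script reads), it can be loaded into an id-to-name lookup like this:

# -*- coding: utf-8 -*-
# Minimal sketch: load the "cities" file (one "pinyin|id|name" entry per line)
# into a dict keyed by the numeric city id. The file name "cities" matches the
# one used by the list scraper below.
import codecs

city_names = {}
for line in codecs.open("cities", 'r', 'utf-8'):
    parts = line.strip().split("|")
    if len(parts) == 3:
        pinyin, city_id, name = parts
        city_names[city_id] = name

print city_names.get("2")  # prints 北京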

Script for scraping the list pages:

# -*- coding: utf-8 -*-
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list page template: city id, category id, page number
RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # detail page template: business id

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()

def deal(city_id, category_id, p):
    url = URL_LIST % (city_id, category_id, p)
    print url
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'), ('Accept', 'application/json, text/javascript'), ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]
    urlopen = opener.open(url, timeout=100)
    rsp = urlopen.read()

    if "404" in rsp:
        return 404
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    soup = soup.find("div", { "id" : "shop-all-list" })
    # print soup
    row = soup.find_all("li")
    for so in row:
        # print so
        get_business(so)
    print ''

INSERT_BUSINESS = "INSERT INTO tb_dianping_business_zx (businessID,NAME,Url,BranchName,Address,Regions,Categories,City,AvgRating,AvgPrice,ReviewCount,PhotoUrl,SPhotoUrl,HasCoupon,HasDeal,DealCount,Deals) VALUES (%s,'%s','%s','%s','%s','%s','%s','%s',%s,%s,%s,'%s','%s',%s,%s,%s,'%s');"
db_interest = MySQLdb.connect(host="ip", port=3306, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()

def save(business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals):
    sql = INSERT_BUSINESS % (business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
    print "========================sql=========================="
    print sql
    try:
        cur_interest.execute(sql)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % sql

    print ';'

def get_business(soup):
#     print soup

    business_id = get_business_id(soup)
    NAME = get_business_name(soup)
    Url = RUL_DETAIL % business_id
    BranchName = ''
    if "(" in NAME:
        BranchName = NAME[NAME.find("(") + 1:NAME.find(")")]
    Address = get_Address(soup)
    Regions = get_Regions(soup)
    Categories = get_Categories(soup)
    City = '北京'  # placeholder; the detail scraper fills in the real city later
    AvgRating = get_AvgRating(soup)
    AvgPrice = get_AvgPrice(soup)
    ReviewCount = get_ReviewCount(soup)
    PhotoUrl = get_PhotoUrl(soup)
    SPhotoUrl = PhotoUrl
    DealCount = get_DealCount(soup)
    HasCoupon = DealCount > 0 and 1 or 0
    HasDeal = HasCoupon
    Deals = get_Deals(soup)

    print business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, DealCount, Deals
    save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)

def get_business_id(soup):
    return soup.find("div", {"class":"tit"}).find("a")["href"].strip().replace("/shop/", "")
def get_business_name(soup):
    return soup.find("div", {"class":"tit"}).find("a")["title"].strip()
def get_Address(soup):
    if soup.find("span", {"class":"addr"}):
        return soup.find("span", {"class":"addr"}).get_text().strip()
    else:
        return ""
def get_Regions(soup):
    if soup.find("div", {"class":"tag-addr"}):
        return soup.find("div", {"class":"tag-addr"}).find_all("a")[0].find("span", {"class":"tag"}).get_text().strip()
    else:
        return ""
def get_Categories(soup):
    if soup.find("div", {"class":"tag-addr"}):
        return soup.find("div", {"class":"tag-addr"}).find_all("a")[1].find("span", {"class":"tag"}).get_text().strip()
    else:
        return ""
def get_AvgRating(soup):
    return soup.find("span", {"class":"sml-rank-stars"})["class"][1].strip().replace("sml-str", "")
def get_AvgPrice(soup):
    b = soup.find("a", {"class":"mean-price"}).find("b")
    if b:
        return b.get_text().strip().replace("¥", "")
    return 0

def get_ReviewCount(soup):
    b = soup.find("a", {"class":"review-num"})
    if b:
        return soup.find("a", {"class":"review-num"}).find("b").get_text().strip()
    return 0

def get_PhotoUrl(soup):
    return soup.find("div", {"class":"pic"}).find("img")["data-src"].strip()
def get_DealCount(soup):
    soup = soup.find("div", {"class":"si-deal"})
    if soup :
        return len(soup.find_all("a", {"class":"J_dinfo"}))  # .count("a", {"class":"J_dinfo"})
    return 0
def get_Deals(soup):
    soup = soup.find("div", {"class":"si-deal"})
    if soup :
        data_deal_id = ''
        rows = soup.find_all("a", {"class":"J_dinfo"})
        for so in rows:
            data_deal_id = '%s,%s' % (data_deal_id, so["data-deal-id"])
        return data_deal_id
    return ''

if __name__ == "__main__":
    cities = []
    cas = [10, 20, 25, 30, 45, 50, 60, 70]  # category ids to crawl
    cas = [30, 45, 50, 60, 70]  # this reassignment overrides the full list above (handy for resuming a partial run)
    ct = codecs.open("cities", 'r', 'utf-8')
    lines = ct.readlines()
    for word in lines:
        # each line is "pinyin|id|name"; keep only the numeric city id (middle field)
        word = word[word.find("|") + 1:]
        word = word[0:word.find("|")]
        cities.append(word.strip())

    for city in cities:
        for ca in cas:
            p = 0
            while p <= 50:
                try:
                    print 'deal(%s,%s,%s)' % (city, ca, p)
                    p = p + 1
                    code = deal(city, ca, p)
#                     if 404==code:
#                         break
#                      2 25 12
                except Exception:
                    traceback.print_exc()
#                     print "*********************** duplicate business_id: %s" % sql
                print "sleeping 1 second ... "
                time.sleep(1)

#     f = codecs.open("li", ‘r‘, ‘utf-8‘)
#         soup = BeautifulSoup(f.read())
#     soup = BeautifulSoup(f.read())
#     get_business(soup)

    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
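
One fragile spot in the list scraper is the end-of-list check: `if "404" in rsp` searches the page body for the literal string "404", which can misfire if that string happens to appear in a shop id or phone number. A more robust sketch (still on urllib2, as above, and assuming Dianping returns a real HTTP 404 for missing pages, which the original script does not rely on) is to inspect the status code instead:

# Sketch: stop paging based on the HTTP status code instead of searching the
# response body for the string "404".
import urllib2

def fetch(url, timeout=100):
    try:
        return urllib2.urlopen(url, timeout=timeout).read()
    except urllib2.HTTPError, e:
        if e.code == 404:
            return None  # no such page: the caller can stop paging here
        raise

# inside the paging loop one could then write:
#     rsp = fetch(URL_LIST % (city_id, category_id, p))
#     if rsp is None:
#         break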

Script for scraping the detail pages:

# -*- coding: utf-8 -*-
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list page template: city id, category id, page number
RUL_DETAIL = 'http://www.dianping.com/shop/%s'  # detail page template: business id

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()

def deal(businessID, url):
    print url
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'), ('Accept', 'application/json, text/javascript'), ('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')]
    urlopen = opener.open(url, timeout=30)
    rsp = urlopen.read()
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    str = soup.find("div", {"class":"breadcrumb"})
    str2 = soup.find("div", {"id":"basic-info"})
    str3 = soup.find("div", {"id":"sales"})
    str4 = soup.find("div", {"id":"aside"})
    if str4:
        str4 = soup.find("div", {"id":"aside"}).find("script")
    else:
        str4=""
    print "----------------"
    print '%s%s%s%s' % (str, str2, str3, str4)
    soup = BeautifulSoup('%s%s%s%s' % (str, str2, str3, str4))
#     print soup
    get_business(businessID, soup)

UPDATE_BUSINESS = "UPDATE tb_dianping_business_zx SET Address='%s',Regions='%s',Categories='%s',City='%s',lat=%s,lng=%s,Deals='%s' where businessID= %s "
SELECT_BUSINESS = "SELECT businessID,url FROM tb_dianping_business_zx WHERE lat=0 and businessID > %d order by businessID asc LIMIT 100 "
db_interest = MySQLdb.connect(host="xxx", port=xxx, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()

def save(Address, Regions, Categories, City, lat, lng, Deals, business_id):
    sql = UPDATE_BUSINESS % (Address, Regions, Categories, City, lat, lng, Deals, business_id)
    try:
        print sql
        cur_interest.execute(sql)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % sql
    print ';'

def fetchall(cur, sql):
    cur.execute(sql)
    return cur.fetchall()

def fetchone(cur, sql):
    cur.execute(sql)
    return cur.fetchone()

def get_business(business_id, soup):

    cs = get_Regions_Categories(soup)
    City = cs[0]
    Regions = cs[1]
    Categories = cs[2]

#     print '%s , %s , %s' % (City, Regions, Categories)
    Address = get_Address(soup)
    point = get_point(soup)
    lat = point[0]
    lng = point[1]
    Deals = get_Deals(soup)
#     print Address, Regions, Categories, City, lat, lng, Deals, business_id
    save(Address, Regions, Categories, City, lat, lng, Deals, business_id)

def get_Regions_Categories(soup):
    rows = soup.find("div", {"class":"breadcrumb"}).find_all("a")
    City = ''
    RegionsCs = []
    CategoriesCs = []
    i = 0
    length = len(rows)
    for row in rows :
        if i == 0:
            City = row.get_text().strip()
        elif length % 2 == 0 and i < length / 2:
            RegionsCs.append(row.get_text().strip())
        elif length % 2 == 0 and i >= length / 2:
            CategoriesCs.append(row.get_text().strip())
        elif length % 2 == 1 and i < length / 2 + 1:
            RegionsCs.append(row.get_text().strip())
        else:
            CategoriesCs.append(row.get_text().strip())
        i = i + 1

    Regions = ""
    for c in RegionsCs:
        Regions = ‘%s,"%s"‘ % (Regions , c)
    Regions = ‘[%s]‘ % Regions
    Regions = Regions.replace("[,", "[")

    Categories = ""
    for c in CategoriesCs:
        Categories = ‘%s,"%s"‘ % (Categories , c)
    Categories = ‘[%s]‘ % Categories
    Categories = Categories.replace("[,", "[")

    return City, Regions, Categories

def get_Address(soup):
    return '%s %s' % (soup.find("div", {"class":"address"}).find("a").find("span").get_text().strip(), soup.find("div", {"itemprop":"street-address"}).find("span", {"class":"item"}).get_text().strip())
def get_point(soup):
    # the coordinates are embedded in an inline script as "({lng:<value>,lat:<value>})"
    str = soup.find("script").get_text().strip()
    str = str[str.find("({lng:") + 6:]
    lng = str[:str.find(",lat:")]
    lat = str[str.find(",lat:") + 5:str.find("}")]

    la = int(float(lat) * 1000000)  # store as integer micro-degrees
    ln = int(float(lng) * 1000000)
    return la, ln

def get_Deals(soup):
    soup = soup.find("div", {"id":"sales"})
    if soup:
        Deals = []
        rows = soup.find_all("div", {"class":"item"})
        for row in rows:
            if row.find("span", {"class":"price"}):
                deal = {}
                title = row.find("p", {"class":"title"})
                url = ""
                if title:
                    deal["name"] = title.get_text().strip()
                    url = row.find("a", {"class":"block-link"})["href"]
                else:
                    deal["name"] = row.get_text().strip()
                    url = row["href"]
                deal["url"] = url
                deal["id"] = url.replace("http://t.dianping.com/deal/", "")
                deal["h5_url"] = url
                Deals.append(deal)

        deals = ""
        for c in Deals:
            deals = '%s,{"url":"%s", "name": "%s", "h5_url": "%s", "id": "%s"}' % (deals , c.get("url"), c.get("name"), c.get("h5_url"), c.get("id"))
        deals = '[%s]' % deals
        deals = deals.replace("[,", "[")
        return deals
    return ''

if __name__ == "__main__":
#     deal('http://www.dianping.com/shop/11566327')
    maxId = 0
    SELECT_BUSINESS_NEXT = ""
    while True:
        try:
            SELECT_BUSINESS_NEXT = SELECT_BUSINESS % maxId
            print SELECT_BUSINESS_NEXT
            rows = fetchall(cur_interest, SELECT_BUSINESS_NEXT)
            for row in rows:
                print row
                deal(row[0], row[1])
                maxId = row[0]
        except Exception:
            traceback.print_exc()
        print "sleeping 5 seconds ... "
        time.sleep(5)

#     f = codecs.open("detail", ‘r‘, ‘utf-8‘)
#     soup = BeautifulSoup(f.read())
#     get_business(soup, 11566327)

    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
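
The get_point helper above recovers the coordinates by slicing the inline script text around the "({lng:" marker. A regular expression is a little more tolerant of whitespace changes; here is a minimal sketch of that variant (it assumes the page still embeds the coordinates as {lng:...,lat:...}, which is exactly what the slicing code assumes too):

# Sketch: extract the embedded coordinates with a regular expression instead of
# string slicing, then convert to integer micro-degrees like get_point does.
import re

POINT_RE = re.compile(r"lng:\s*(-?\d+(?:\.\d+)?)\s*,\s*lat:\s*(-?\d+(?:\.\d+)?)")

def parse_point(script_text):
    m = POINT_RE.search(script_text)
    if not m:
        return 0, 0
    lng, lat = float(m.group(1)), float(m.group(2))
    return int(lat * 1000000), int(lng * 1000000)  # (lat, lng) in micro-degrees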

Note: try to keep the scraping rate under control. Well, it's already the 16th, so I can finally take my long holiday. Once everyone's overtime is done, go home and have a good New Year too.
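
If you want something a bit gentler than a fixed time.sleep, a small helper that adds a randomized pause between requests is usually enough for this kind of crawl; this is just a sketch, not something the scripts above use:

# Sketch: sleep a randomized amount between requests so the crawl does not hit
# the site at a perfectly regular interval.
import random
import time

def polite_sleep(base=3, jitter=3):
    delay = base + random.uniform(0, jitter)
    print "sleeping %.1f seconds ... " % delay
    time.sleep(delay)

# call polite_sleep() in place of time.sleep(1) / time.sleep(5) in the loops above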

Parsing library used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
