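I've been learning Python recently. I had previously written a simple image scraper in Python, and today I wanted to scrape movie information from Douban, which led to the program below: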
    #coding:utf-8
    # Python 2 script: look up a movie on Douban and print its details.
    import re
    import sys
    import urllib
    from bs4 import BeautifulSoup

    def movieSearch(movieName):
        douBanSearchurl = "http://movie.douban.com/subject_search?search_text="
        data = urllib.urlopen(douBanSearchurl+movieName).read()
        # The first search result links to the movie's detail page.
        r = re.findall(r'<a class="nbg" href=(.*?) onclick',data)
        realy_url = re.sub('"','',r[0])
        movieData = urllib.urlopen(realy_url).read()
        soup = BeautifulSoup(movieData, "html.parser")
        movieSummary = soup.find_all("span",{'property':'v:summary'})
        #movieSummaryText = re.findall(r'<span property="v:summary" class="">(\W*.*\W*.*?)</span>',movieData)
        movie = re.findall(r'name="title" value="(.*?)"',movieData)
        people = re.findall(r'name="desc" value="(.*?)"',movieData)
        imdb = re.findall(r'</span> <a href="(.*?)" target=',movieData)
        # Match any runtime value instead of the hard-coded content="109".
        Time = re.findall(r'<span property="v:runtime" content="\d+">(.*?)</span>',movieData)
        print u"IMDB电影网链接"   # "IMDB link"
        print imdb[0]
        print u"豆瓣电影链接"     # "Douban movie link"
        print realy_url
        print '*'*100
        # The page is UTF-8; re-encode to GBK for the cmd console.
        print movie[0].decode('utf-8').encode('gbk')
        print people[0].decode('utf-8')
        print u"电影简介"         # "Movie summary"
        print '*'*100
        print movieSummary[0].encode('utf-8').decode('utf-8').encode('gbk')

    if __name__=='__main__':
        while True:
            # Menu prompt: "Choose a function: 1: movie search, 2: exit"
            arg = raw_input("请选择功能:\n1:电影搜索\n2:退出\n".decode('utf-8').encode('gb2312'))
            if arg=='1':
                # Prompt: "Enter the movie name: "
                movieName = raw_input("请输入电影名: ".decode('utf-8').encode('gb2312')).strip()
                print u"开始搜索"  # "Starting search"
                movieSearch(movieName)
            else:
                print u"退出程序"  # "Exiting"
                break
While debugging, I ran into two frustrating problems:
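1. BeautifulSoup's encoding does not match the CMD encoding. Everything BeautifulSoup extracts from the page is Unicode, but the cmd console does not support Unicode and only handles GBK, so displaying Chinese became a real problem. I fell back on the crudest approach, converting between encodings by hand. I haven't found a better way yet; if you know one, please point me to it.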
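2. When writing the regular expressions, I didn't know how to match <br/>, especially when the page text is split across lines and paragraphs. How should text like that be matched?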
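For the first problem, one possible approach is to ask Python which encoding the console uses instead of hard-coding gbk, and to replace any character the console cannot show rather than letting the print fail. The sketch below is written in Python 3 (the program above is Python 2), and the helper names to_console_text and safe_print are made up for illustration; treat it as a starting point, not a tested fix for every console.

```python
import sys


def to_console_text(text, encoding=None):
    """Return text with characters the console cannot encode replaced by '?'."""
    # Assumption: sys.stdout.encoding reflects the console code page
    # (for example a GBK code page in a Chinese Windows cmd window).
    # Fall back to UTF-8 when it is missing, e.g. when output is piped.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Round-trip through the console encoding; errors="replace" swaps each
    # unencodable character for '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


def safe_print(text, encoding=None):
    """Print text after making it safe for the console encoding."""
    print(to_console_text(text, encoding))
```

Calling safe_print(u"电影简介") then prints the text unchanged on a GBK console and degrades to question marks on a console that cannot represent it. In Python 2 the same idea would be to encode the unicode string with sys.stdout.encoding and the 'replace' error handler before printing.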
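For the second problem, a common pattern is to capture the whole summary span with the re.DOTALL flag, so that "." also matches newlines, and then split the captured text on the line-break tag with a pattern that tolerates its variants (<br>, <br/>, <br />). The sketch below uses Python 3 and a made-up HTML fragment, not markup copied from Douban; SAMPLE_HTML and extract_summary are hypothetical names.

```python
import re

# Hypothetical fragment that imitates a multi-paragraph summary.
SAMPLE_HTML = (
    '<span property="v:summary" class="">\n'
    '    First paragraph of the summary.<br />\n'
    '    Second paragraph of the summary.<br/>\n'
    '    Third paragraph.\n'
    '</span>'
)

# re.DOTALL lets "." match newlines, so the capture can span several lines.
SUMMARY_RE = re.compile(
    r'<span property="v:summary"[^>]*>(.*?)</span>', re.DOTALL
)
# Matches <br>, <br/>, <br /> in any letter case.
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)


def extract_summary(html):
    """Return the summary as a list of paragraphs, one per <br>-separated part."""
    match = SUMMARY_RE.search(html)
    if match is None:
        return []
    # Split on the line-break tags, then drop surrounding whitespace
    # and any empty pieces.
    parts = BR_RE.split(match.group(1))
    return [part.strip() for part in parts if part.strip()]
```

Since the program already parses the page with BeautifulSoup, calling get_text() on the tag found for v:summary is another option that avoids handling the tags by hand.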
Time: 2024-10-13 21:12:50