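I've been learning XPath recently. While looking for material online, I noticed that a project beginners often practice on is scraping the Maoyan Movies top-100 board, and that many of the write-ups closely resemble Cui Qingcai's version, more or less copied outright. So I wrote my own program with XPath. It scrapes the same Maoyan board and extracts the same information, just with a different approach.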
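To be honest, for matching information in web pages I'd still recommend XPath. Regular expressions can do the job too, but the patterns get too verbose, and one small slip means nothing matches. Beginners in particular, who aren't familiar with regular expressions to begin with, can't find their mistake and easily give up. I mostly use regular expressions for processing files, where they are a superb tool.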
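To make the comparison concrete, here is a minimal sketch that pulls the same movie names out of a small fragment with both approaches. Everything in it is an assumption for illustration: the HTML fragment is made up, the helper names `names_with_xpath` and `names_with_regex` are hypothetical, and it uses the standard library's `xml.etree.ElementTree` (which supports only a limited XPath subset) as a stand-in for lxml so it runs without extra packages. With lxml the query would be `//p[@class='name']/a/text()`.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment loosely shaped like a movie board (illustrative only).
SAMPLE = """
<dl>
  <dd>
    <i>1</i>
    <p class="name"><a href="/films/1">Movie A</a></p>
  </dd>
  <dd>
    <i>2</i>
    <p class="name"><a href="/films/2">Movie B</a></p>
  </dd>
</dl>
"""

# The same fragment with one extra attribute on the <p> tag.
VARIANT = SAMPLE.replace('<p class="name">', '<p class="name" title="t">')


def names_with_xpath(markup):
    """Select the <a> under every <p class="name"> via a path expression."""
    root = ET.fromstring(markup.strip())
    return [a.text for a in root.findall(".//p[@class='name']/a")]


def names_with_regex(markup):
    """Match the literal tag text; any change in the markup can break it."""
    pattern = re.compile(r'<p class="name">\s*<a[^>]*>(.*?)</a>', re.S)
    return pattern.findall(markup)


if __name__ == "__main__":
    for label, markup in (("sample", SAMPLE), ("variant", VARIANT)):
        print(label, names_with_xpath(markup), names_with_regex(markup))
```

The path expression keeps working when an unrelated attribute is added to the tag, while the literal regex pattern silently matches nothing, which is exactly the kind of failure that is hard for a beginner to track down.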
The full program is below.
import csv
import re

import requests
from lxml import etree
from requests.exceptions import RequestException


def get_page(url):
    """
    Fetch the page source.
    :param url: page URL
    :return: HTML text, or None if the request fails
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                          '76.0.3809.100 Safari/537.36',
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_page(text):
    """
    Parse the page source.
    :param text: HTML text
    :return: iterator of (ranking, movie_name, actor, time, score) tuples
    """
    html = etree.HTML(text)
    movie_name = html.xpath("//p[@class='name']/a/text()")
    actor = html.xpath("//p[@class='star']/text()")
    # Strip all whitespace from the cast text
    actor = list(map(lambda item: re.sub(r'\s+', '', item), actor))
    time = html.xpath("//p[@class='releasetime']/text()")
    # The score is split across two <i> tags: integer part and fraction part
    grade1 = html.xpath("//p[@class='score']/i[@class='integer']/text()")
    grade2 = html.xpath("//p[@class='score']/i[@class='fraction']/text()")
    new = [grade1[i] + grade2[i] for i in range(min(len(grade1), len(grade2)))]
    ranking = html.xpath("//dd/i/text()")
    return zip(ranking, movie_name, actor, time, new)


def change_page(number):
    """
    Build the URL for a given page.
    :param number: offset of the first movie on the page
    :return: page URL
    """
    base_url = 'https://maoyan.com/board/4'
    url = base_url + '?offset=%s' % number
    return url


def save_to_csv(result, filename):
    """
    Append one row to the CSV file.
    :param result: one (ranking, movie_name, actor, time, score) tuple
    :param filename: output file path
    :return:
    """
    # newline='' avoids blank lines between rows on Windows
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(result)


def main():
    """
    Main function: walk the 10 pages of the board and save every row.
    :return:
    """
    for i in range(0, 100, 10):
        url = change_page(i)
        text = get_page(url)
        if text is None:
            # Request failed or returned a non-200 status; skip this page
            continue
        result = parse_page(text)
        for j in result:
            save_to_csv(j, filename='message.csv')


if __name__ == '__main__':
    main()
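One possible refinement, sketched under assumptions: `save_to_csv` reopens the file for every row, and the output has no header line. The sketch below joins the two score fragments and writes a whole page of rows in one pass. The names `merge_scores` and `save_rows`, the header labels, and the sample values are all hypothetical. It also assumes the integer fragment already carries the decimal point (for example `9.`), as the plain concatenation in `parse_page` suggests.

```python
import csv
import os
import tempfile


def merge_scores(integer_parts, fraction_parts):
    """Join fragments such as '9.' and '5' into '9.5', pairwise.

    zip stops at the shorter list, like the min(len(...)) loop in parse_page.
    """
    return [whole + frac for whole, frac in zip(integer_parts, fraction_parts)]


def save_rows(rows, filename, header=None):
    """Append all rows in one pass; write the header only to a new or empty file."""
    is_new = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, dialect="excel")
        if header and is_new:
            writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample values standing in for one parsed page.
    scores = merge_scores(["9.", "9."], ["5", "6"])
    rows = list(zip(["1", "2"], ["Movie A", "Movie B"], scores))
    path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    save_rows(rows, path, header=["ranking", "name", "score"])
    with open(path, newline="", encoding="utf-8") as f:
        print(list(csv.reader(f)))
```

In `main`, this would mean calling `save_rows(parse_page(text), 'message.csv', header=[...])` once per page instead of looping over the rows. Whether the header is wanted depends on how the file is used later.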
Original post: https://www.cnblogs.com/lattesea/p/11746488.html
Time: 2024-11-07 22:09:51