Python XPath example: scraping the Douban movie chart (desktop site)

from lxml import etree
import requests

url = 'https://movie.douban.com/chart'

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36" }

response = requests.get(url,headers=headers)

html_str = response.content.decode()

# print(html_str)

# Parse the HTML string with etree
html = etree.HTML(html_str)

# Get the URL of each movie
url_list = html.xpath("//div[@class='indent']/div/table//div[@class='pl2']/a/@href")
# print(url_list)

# Get the poster image URL of each movie
img_list = html.xpath("//div[@class='indent']/div/table//a[@class='nbg']/img/@src")
# print(img_list)

# Build one dict per movie that holds that movie's data:
# 1. Group: select one <table> element per movie
# 2. Extract the fields from each group

rets = html.xpath("//div[@class='indent']/div/table")

for table in rets:
    item = {}
    # Relative paths (starting with ".") search only inside the current table
    item['title'] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/", "").strip()
    item['href'] = table.xpath(".//div[@class='pl2']/a/@href")[0]
    item['img'] = table.xpath(".//a[@class='nbg']/img/@src")[0]
    item['comment_num'] = table.xpath(".//div[@class='pl2']/div//span[@class='pl']/text()")[0]
    item['rating_num'] = table.xpath(".//div[@class='pl2']/div//span[@class='rating_nums']/text()")[0]
    print(item)
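
The loop above indexes every XPath result with [0], which raises IndexError as soon as one movie lacks a field (for example, an entry with no rating yet). Below is a minimal sketch of the same "group first, then extract from each group" idea with a fallback value. It runs offline against SAMPLE_HTML, a made-up fragment that only imitates the structure the XPath expressions expect, not real Douban markup. The helper names first and parse_movies are hypothetical and not part of lxml.

```python
from lxml import etree

# Hypothetical, simplified markup (an assumption, not copied from Douban).
SAMPLE_HTML = """
<div class="indent"><div>
  <table><tr>
    <td><a class="nbg" href="https://example.com/subject/1/"><img src="https://example.com/1.jpg"/></a></td>
    <td><div class="pl2">
      <a href="https://example.com/subject/1/">Movie One / <span>Alt title</span></a>
      <div><span class="rating_nums">8.5</span><span class="pl">(1000 ratings)</span></div>
    </div></td>
  </tr></table>
  <table><tr>
    <td><a class="nbg" href="https://example.com/subject/2/"><img src="https://example.com/2.jpg"/></a></td>
    <td><div class="pl2">
      <a href="https://example.com/subject/2/">Movie Two</a>
      <div><span class="pl">(not enough ratings)</span></div>
    </div></td>
  </tr></table>
</div></div>
"""


def first(node, path, default=None):
    """Return the first XPath result with whitespace stripped, or default if nothing matches."""
    results = node.xpath(path)
    return results[0].strip() if results else default


def parse_movies(html_str):
    """Group the page into one <table> per movie, then extract fields from each group."""
    html = etree.HTML(html_str)
    movies = []
    for table in html.xpath("//div[@class='indent']/div/table"):
        title = first(table, ".//div[@class='pl2']/a/text()", default="")
        movies.append({
            "title": title.replace("/", "").strip(),
            "href": first(table, ".//div[@class='pl2']/a/@href"),
            "img": first(table, ".//a[@class='nbg']/img/@src"),
            "comment_num": first(table, ".//div[@class='pl2']/div//span[@class='pl']/text()"),
            "rating_num": first(table, ".//div[@class='pl2']/div//span[@class='rating_nums']/text()"),
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

With this approach a missing field shows up as None in the dict instead of aborting the whole run, so incomplete entries can be filtered or logged afterwards.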

Original article: https://www.cnblogs.com/zqrios/p/9017480.html

Date: 2024-10-01 01:04:11
