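The Scrapy framework
Installing Scrapy (Windows)
1.pip install wheel
2.Download the Twisted build that matches your Python version and architecture: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
3.Install Twisted: change to the directory that contains the downloaded file, then run pip install on that file
4.pip install pywin32
5.pip install scrapy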
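For step 2, the "matching" build is the one whose file name carries the tags of the interpreter you are running. The sketch below prints those tags for the current interpreter. It assumes CPython on 32-bit or 64-bit x86 Windows; `cpython_tag` and `windows_platform_tag` are hypothetical helper names, not part of pip or Scrapy.

```python
import struct
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the CPython tag used in wheel file names, e.g. 'cp37'."""
    return f"cp{version_info[0]}{version_info[1]}"


def windows_platform_tag(pointer_bits=struct.calcsize("P") * 8):
    """Return the Windows platform tag for a 32-bit or 64-bit x86 interpreter."""
    return "win_amd64" if pointer_bits == 64 else "win32"


# Look for a Twisted file whose name contains both of these tags.
print(cpython_tag(), windows_platform_tag())
```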
If typing scrapy in a terminal runs without errors, the installation succeeded.
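The same check can be made from Python. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+) to report which of the packages from the steps above are installed. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The names are the ones passed to pip in the installation steps.
for name, version in installed_versions(["wheel", "Twisted", "pywin32", "Scrapy"]).items():
    print(f"{name}: {version or 'not installed'}")
```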
Running the project
scrapy crawl <spider name> (the value of the spider's name attribute, for example: scrapy crawl zx)
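If you want to start the crawl from a script rather than typing the command, one option is to build the same command line and hand it to `subprocess`. This is only a sketch: `crawl_command` and `run_crawl` are hypothetical helpers, and it assumes the `scrapy` executable is on PATH and that `project_dir` is the directory containing scrapy.cfg. The `-s NAME=VALUE` option overrides a setting for that run.

```python
import subprocess


def crawl_command(spider_name, settings=None):
    """Build the argument list for 'scrapy crawl', with optional -s overrides."""
    cmd = ["scrapy", "crawl", spider_name]
    for key, value in (settings or {}).items():
        cmd += ["-s", f"{key}={value}"]
    return cmd


def run_crawl(spider_name, project_dir=".", settings=None):
    """Run the spider inside the project directory and return the exit code."""
    completed = subprocess.run(crawl_command(spider_name, settings), cwd=project_dir)
    return completed.returncode


# Example (not executed here): run_crawl("zx", settings={"LOG_LEVEL": "ERROR"})
```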
The spider file
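# -*- coding: utf-8 -*-
import scrapy


class ZxSpider(scrapy.Spider):
    # Spider name; it is the unique identifier used by scrapy crawl
    name = 'zx'
    # Domains the spider is allowed to crawl (usually left unused)
    # allowed_domains = ['www.baidu.com']
    # URLs to start crawling from; there can be more than one
    start_urls = ['http://www.baidu.com/', 'https://docs.python.org/zh-cn/3/library/index.html#library-index']

    # Callback that receives the response for each request
    def parse(self, response):
        print(response)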
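To make the allowed_domains comment more concrete, here is a simplified approximation of the idea: a URL passes if its host is one of the listed domains or a subdomain of one. This is not Scrapy's own code, and `is_allowed` is a hypothetical helper written only for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    if not allowed_domains:
        # No restriction configured: every host passes.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


print(is_allowed("http://www.baidu.com/", ["www.baidu.com"]))
print(is_allowed("https://docs.python.org/zh-cn/3/library/index.html", ["www.baidu.com"]))
```

With allowed_domains set to ['www.baidu.com'], the second start URL in the spider above would not count as an allowed domain under this rule, which is one reason the line is commented out there.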
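Modifying the configuration file (settings.py)
Change the User-Agent (UA), decide whether to obey the robots.txt protocol, and set the log level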
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zx_spider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
LOG_LEVEL='ERROR'
Finally, run the spider again to check that the configuration took effect.
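Scrapy's logging is built on Python's standard logging module, so the effect of LOG_LEVEL='ERROR' can be illustrated without Scrapy. The sketch below logs three messages at different levels and records which ones get through. `ListHandler` and `demo` are hypothetical names, and the message texts are made up for illustration rather than copied from real Scrapy output.

```python
import logging


class ListHandler(logging.Handler):
    """Collect emitted records in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


def demo(level_name):
    """Log at DEBUG, INFO and ERROR; return what passes the given level."""
    logger = logging.getLogger(f"demo.{level_name}")
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    logger.debug("fetched a page")
    logger.info("spider started")
    logger.error("callback raised an exception")
    logger.removeHandler(handler)
    return handler.messages


print(demo("ERROR"))  # only the ERROR message is kept
print(demo("DEBUG"))  # all three messages are kept
```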
A simple example (scraping jokes)
# -*- coding: utf-8 -*-
import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://duanziwang.com/']

    def parse(self, response):
        div_list = response.xpath('//main/article')
        for i in div_list:
            # xpath() returns a list of Selector objects; call extract() to get the
            # contents as strings, or extract_first() when only the first match is
            # needed (for example, when the list has length 1)
            title = i.xpath('.//h1/a/text()').extract_first()
            content = i.xpath('./div[@class="post-content"]/p/text()').extract_first()
            print(title)
            print(content)
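The extraction loop can be tried offline on a small sample. The sketch below swaps in the standard library's `xml.etree.ElementTree` for Scrapy's selectors, so it only handles well-formed markup and a subset of XPath. The sample markup is invented, and `first_text` is a hypothetical helper that mimics how extract_first() returns None when nothing matches.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the structure the spider's XPath expects.
SAMPLE = """
<main>
  <article>
    <h1><a href="/1.html">First title</a></h1>
    <div class="post-content"><p>First body</p></div>
  </article>
  <article>
    <h1><a href="/2.html">Second title</a></h1>
    <div class="post-content"></div>
  </article>
</main>
"""


def first_text(node, path):
    """Return the text of the first match, or None if the path matches nothing."""
    found = node.find(path)
    return found.text if found is not None else None


def parse_articles(markup):
    """Collect the title and content of every article under the root <main>."""
    root = ET.fromstring(markup.strip())
    results = []
    for article in root.findall("./article"):
        results.append({
            "title": first_text(article, ".//h1/a"),
            "content": first_text(article, "./div[@class='post-content']/p"),
        })
    return results


for item in parse_articles(SAMPLE):
    print(item)
```

The second article has no paragraph, so its content comes back as None instead of raising an error, which is the same reason extract_first() is convenient inside the spider's loop.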
Original article: https://www.cnblogs.com/zx125/p/11502713.html
Time: 2024-11-08 15:50:40