[python] Copying an expert's Qiushibaike scraper code

I am learning by following the 静觅 (Jingmi) blog; the original post is here: http://cuiqingcai.com/990.html

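Key points:

1. str.strip() removes the extra whitespace characters at the start and end of a string.

2. response.read().decode('utf-8', 'ignore'): pass 'ignore' so that invalid bytes are skipped; without it I kept getting decode errors.

3. In Python 3.x, raw_input has been renamed to input.

4. It is best to write the code in Notepad++ first. The formatting is clearer, which makes mistakes easier to spot, especially indentation errors and Chinese (full-width) punctuation.

5. .*? is a common combination; the trailing ? makes the match non-greedy.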
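Points 1, 2 and 5 are easy to try in isolation. The snippet below is only a small sketch: the sample strings and the byte sequence are made up for illustration and do not come from the real site.

```python
import re

# 1. str.strip() removes leading and trailing whitespace only
raw_author = "  \n some_user \t"
author = raw_author.strip()

# 2. Strict decoding fails on invalid bytes; 'ignore' silently drops them.
#    (0xff is never valid in UTF-8, so it stands in for a "bad" byte here.)
raw_bytes = "段子".encode("utf-8") + b"\xff" + b" ok"
try:
    raw_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
decoded = raw_bytes.decode("utf-8", "ignore")

# 5. .* is greedy (takes as much as possible), .*? is non-greedy
html = "<b>one</b><b>two</b>"
greedy = re.findall("<b>(.*)</b>", html)
lazy = re.findall("<b>(.*?)</b>", html)

print(repr(author))    # 'some_user'
print(strict_failed)   # True
print(decoded)         # 段子 ok
print(greedy)          # ['one</b><b>two']
print(lazy)            # ['one', 'two']
```

Note that 'ignore' discards data without warning, so it is a convenience for scraping text rather than a general fix for encoding problems.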

The Qiushibaike scraper implemented with Python 3.4.3 is below (it is simply copied from the expert's post, with the 2.x parts changed):

import urllib.request
import urllib.parse
import urllib.error
import re
import time

# Qiushibaike spider class
class QSBK:
    # Initializer: set up some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36'
        self.headers = {'User-Agent': self.user_agent}
        # Holds the stories; each element is one page of stories
        self.stories = []
        # Flag that controls whether the program keeps running
        self.enable = False
    # Fetch the HTML of the page with the given index
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib.request.Request(url, headers=self.headers)
            response = urllib.request.urlopen(request)
            # 'ignore' skips invalid bytes; without it decoding kept raising errors
            pageCode = response.read().decode('utf-8', 'ignore')
            return pageCode
        except urllib.error.URLError as e:
            if hasattr(e, "reason"):
                print("Failed to connect to Qiushibaike, reason:", e.reason)
            return None
    # Given a page index, return the list of stories on that page that have no images
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Failed to load the page....")
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                             'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Stores the stories of this page
        pageStories = []
        # Compile once, outside the loop
        replaceBR = re.compile('<br/>')
        for item in items:
            # item[3] is the markup between the content and the stats; skip stories with an image there
            haveImg = re.search("img", item[3])
            if not haveImg:
                text = re.sub(replaceBR, "\n", item[1])
                # .strip() removes surrounding whitespace
                pageStories.append([item[0].strip(), text.strip(), item[4].strip()])
        return pageStories
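    # Load a page, extract its content, and append it to the list
    def loadPage(self):
        # If fewer than 2 unread pages are buffered, load a new page
        if self.enable:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store this page's stories in the shared list
                if pageStories:
                    self.stories.append(pageStories)
                    # Increment the page index so the next call reads the next page
                    self.pageIndex += 1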
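    # Print one story each time Enter is pressed
    def getOneStory(self, pageStories, page):
        # Iterate over the stories of one page
        for story in pageStories:
            # Wait for user input
            input_v = input()
            # On every Enter, check whether a new page needs to be loaded
            self.loadPage()
            # Entering Q ends the program
            if input_v == "Q":
                self.enable = False
                return
            print("Page %d\tAuthor: %s\tLikes: %s\n%s" % (page, story[0], story[2], story[1]))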
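    # Entry point
    def start(self):
        print("Reading Qiushibaike: press Enter for a new story, Q to quit")
        # Set the flag to True so the program can run
        self.enable = True
        # Load one page of content first
        self.loadPage()
        # Local variable tracking which page is currently being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of stories from the shared list
                pageStories = self.stories[0]
                # Increment the number of the page being read
                nowPage += 1
                # Delete the element that has just been taken out
                del self.stories[0]
                # Print the stories of this page
                self.getOneStory(pageStories, nowPage)
            else:
                # Nothing is buffered (for example the page failed to load),
                # so stop instead of spinning in an empty loop
                print("No stories available, exiting")
                self.enable = False

spider = QSBK()
spider.start()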
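The extraction step in getPageItems can be tried offline, without any network access. The sketch below reuses the same regular expression on a hand-written HTML fragment. The fragment and the helper name parse_stories are assumptions made for illustration: the fragment was invented to fit the pattern and is not a copy of the real page markup, which may well differ today.

```python
import re

# Same pattern as in getPageItems above
PATTERN = re.compile(
    '<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'
    'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',
    re.S,
)

# Hypothetical markup, written by hand so that it fits the pattern
SAMPLE_HTML = '''
<div class="author">
<a href="/users/1"><img src="avatar.jpg" alt="x"/>
 some_user
</a>
</div>
<div class="content">
first line<br/>second line
<!--1442000000-->
</div>
<div class="stats">
<span class="stats-vote"><i class="number">123</i> likes</span>
</div>
<div class="author">
<a href="/users/2"><img src="b.jpg"/>other_user</a>
</div>
<div class="content">
look at this
<!--1442000001-->
</div>
<div class="thumb"><img src="pic.jpg"/></div>
<div class="stats">
<i class="number">7</i>
</div>
'''


def parse_stories(page_code):
    """Return [author, text, likes] for every story that has no image."""
    stories = []
    for item in PATTERN.findall(page_code):
        # item[3] is whatever sits between the content and the stats block
        if re.search("img", item[3]):
            continue
        text = item[1].replace("<br/>", "\n")
        stories.append([item[0].strip(), text.strip(), item[4].strip()])
    return stories


stories = parse_stories(SAMPLE_HTML)
print(stories)  # [['some_user', 'first line\nsecond line', '123']]
```

The second story in the fragment is dropped because an img tag appears between its content and its stats block, which is the same rule the spider uses to skip stories with pictures.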

Time: 2025-01-14 01:10:48
