Python crawler, beta version: scraping the answers on a single Zhihu page (crude edition)

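I was bored and browsing Zhihu, and noticed that finding interesting answers is not that easy. So I thought: why not write a crawler that pulls down the most-upvoted answers for easier reading, and maybe even do a bit of data analysis on them (daydreaming~~).

I had already used Python to write a crawler that scraped JD.com product brands and categories for our operations staff, so this time I am again using Python for a simple single-page version. I will extend it later.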

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# ------ Zhihu answer collector ------

# Fetch a page and return the content of its <body>
def get_content(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        # 'sdch' is left out: requests cannot decode it
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }

    req = requests.get(url, headers=headers)
    req.encoding = 'utf-8'
    bs = BeautifulSoup(req.text, "html.parser")  # build the BeautifulSoup object
    body = bs.body  # keep only the <body> part
    return body

# Return the text of a tag. A tag with nested children has .string == None,
# so in that case join all of its text fragments instead.
def tag_text(tag, sep=''):
    if tag is None:
        return ''
    if tag.string is None:
        return sep.join(tag.strings)
    return tag.string

# Get the question title
def get_title(html_text):
    data = html_text.find('span', {'class': 'zm-editable-content'})
    return tag_text(data)

# Print the question body
def get_question_content(html_text):
    data = html_text.find('div', {'class': 'zm-editable-content'})
    print('Content:\n' + tag_text(data))

# Print the upvote count of an answer
def get_answer_agree(body):
    agree = body.find('span', {'class': 'count'})
    print('Upvotes: ' + tag_text(agree) + '\n')

# Follow each answer summary to the full answer page and print the answer
def get_response(html_text):
    response = html_text.find_all('div', {'class': 'zh-summary summary clearfix'})
    for item in response:
        # The "expand" link points at the full answer
        answerhref = item.find('a', {'class': 'toggle-expand'})
        if answerhref is None:
            continue
        href = answerhref.get('href', '')
        if not href.startswith('javascript'):
            # urljoin avoids the double slash that plain concatenation produces
            url = urljoin('https://www.zhihu.com/', href)
            print(url)
            body = get_content(url)
            get_answer_agree(body)
            answer = body.find('div', {'class': 'zm-editable-content clearfix'})
            print(tag_text(answer, sep='\n'))

if __name__ == '__main__':
    html_text = get_content('https://www.zhihu.com/question/43879769')
    print('Title:\n' + get_title(html_text) + '\n')
    get_question_content(html_text)
    print('\n')
    get_response(html_text)
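
One detail in the answer loop is worth isolating: turning the `href` of an answer link into a full URL. Plain string concatenation with the site root yields a double slash whenever the href already starts with `/`. Below is a minimal sketch of that step using `urljoin` from the standard library. The sample hrefs are made up for illustration, and `build_answer_url` is a hypothetical helper name rather than part of the script.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.zhihu.com/'

def build_answer_url(href):
    """Return an absolute URL for an answer link, or None for javascript: pseudo-links."""
    if href.startswith('javascript'):
        return None
    return urljoin(BASE_URL, href)

# Made-up hrefs, for illustration only
for href in ['/question/43879769/answer/1', 'question/43879769/answer/2', 'javascript:;']:
    print(href, '->', build_answer_url(href))
```

Both the rooted and the relative form end up as a clean `https://www.zhihu.com/question/...` address, so the loop does not need to care which form the page uses.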

Output:

Time: 2024-12-29 11:31:40
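
Since the goal is to read the most-upvoted answers first, the scraped upvote counts eventually have to be turned into numbers and sorted. Here is a rough sketch under two assumptions: the count text is either plain digits or ends in a `K` suffix (the post does not show the actual format), and the answers are held as `(count_text, answer_text)` pairs. `parse_count` and `top_answers` are hypothetical helper names.

```python
def parse_count(text):
    """Convert an upvote string such as '321' or '12K' to an int (format is assumed)."""
    text = text.strip().upper()
    if text.endswith('K'):
        # round() guards against float artifacts such as 2.3 * 1000 = 2299.99...
        return int(round(float(text[:-1]) * 1000))
    return int(text)

def top_answers(answers, n=3):
    """Return the n answers with the highest upvote counts, highest first."""
    return sorted(answers, key=lambda pair: parse_count(pair[0]), reverse=True)[:n]

# Made-up sample data, for illustration only
sample = [('321', 'answer A'), ('12K', 'answer B'), ('45', 'answer C')]
for count, text in top_answers(sample, n=2):
    print(count, text)
```

To use something like this with the script, the functions that currently print the upvote count and the answer text would need to return those values instead, so they can be collected into a list.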
