My Web Scraping Notes (1)

The simplest case first: download a page's HTML source.

from urllib.request import Request, urlopen

# If the site blocks scrapers, send a request header that imitates a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# Address of the site to scrape
url = "https://www.zhihu.com/"
# Timeout in seconds, so an unreachable or slow URL does not waste time
req_timeout = 5
req = Request(url=url, headers=headers)
f = urlopen(req, None, req_timeout)
s = f.read()
# Decode the raw bytes as UTF-8 so Chinese text on the page is not garbled
s = s.decode('utf-8')
ss = str(s)
print(ss)
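
The code above assumes the request always succeeds. Here is a minimal sketch of the same fetch with error handling. The function name fetch_html and the choice to return None on failure are my own assumptions, not part of the original notes.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, headers=None, timeout=5):
    """Return the raw response body as bytes, or None if the request fails."""
    req = Request(url=url, headers=headers or {})
    try:
        # The with-block closes the connection even if read() raises
        with urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except HTTPError as exc:
        # The server answered, but with an error status such as 403 or 404
        print("HTTP error:", exc.code, exc.reason)
    except (URLError, socket.timeout) as exc:
        # DNS failure, refused connection, unsupported scheme, or timeout
        print("Request failed:", exc)
    return None
```

HTTPError is caught first because it is a subclass of URLError. Calling fetch_html(url, headers, req_timeout) with the values from the code above would replace the Request/urlopen/read lines.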

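Problems encountered:

1. Most sites have anti-scraping measures, so we need to add one line of code:

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

This adds a request header that imitates a browser, which was enough to solve the problem here. In my own tests, the HTML of both Baidu and Zhihu could be scraped this way.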
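
Without this header, urllib identifies itself as "Python-urllib/3.x", which is easy for a site to reject. The sketch below shows how to confirm the header is actually attached to the request before sending it. The helper name build_request is hypothetical.

```python
from urllib.request import Request

# Same browser string as in the notes; any realistic User-Agent works the same way
BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url=url, headers={"User-Agent": user_agent})


req = build_request("https://www.zhihu.com/")
# Request normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
print(req.header_items())
```

No network access happens here. The request is only constructed, so this is a cheap way to check the headers.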

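2. The scraped HTML contains garbled characters

s = s.decode('utf-8')

Decoding the response bytes as UTF-8 fixes this, provided the page is actually served as UTF-8 (the Zhihu page declares <meta charset="utf-8">).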
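
Hard-coding 'utf-8' fails on pages served in another encoding such as GBK. One possible approach, sketched below, reads the charset from the Content-Type header and falls back to UTF-8. The function names detect_charset and decode_body are hypothetical, and the fallback policy is an assumption.

```python
from email.message import Message


def detect_charset(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type=None):
    """Decode response bytes using the declared charset, falling back to UTF-8."""
    charset = detect_charset(content_type)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or mislabeled page: keep going, mark bad bytes
        return raw.decode("utf-8", errors="replace")


print(decode_body("知乎".encode("gbk"), "text/html; charset=GBK"))
print(decode_body("知乎".encode("utf-8"), "text/html"))
```

With a real HTTP response, f.headers.get_content_charset() returns the same value directly, because the response headers object is a subclass of email.message.Message.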

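3. The output needs to be of type str

Convert it to str. Note that decode() already returns a str, so the extra str(s) call in the code above changes nothing; it is the decode() step that does the conversion.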
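
A short illustration of the difference between bytes and str, using a literal as a stand-in for the bytes that f.read() returns:

```python
raw = "知乎".encode("utf-8")  # stand-in for the bytes returned by f.read()

text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)

# str() on an already-decoded string returns an equal string
print(str(text) == text)

# str() on undecoded bytes does NOT decode; it yields the repr, "b'...'"
print(str(raw))
```

So calling str() directly on the undecoded bytes is not a substitute for decode(); it would print escape sequences instead of the Chinese text.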

Result of the code above (HTML of the Zhihu home page):

<!DOCTYPE html>
<html lang="zh-CN" class="">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta http-equiv="X-ZA-Response-Id" content="1b244bb1a32b4315">
<meta http-equiv="X-ZA-Experiment" content="default:None,ge3:ge3_9,ge2:ge2_1,nweb_sticky_sidebar:sticky,live_review_buy_bar:live_review_buy_bar_2,is_office:false,home_ui2:default,is_show_unicom_free_entry:unicom_free_entry_off,app_store_rate_dialog:close,qa_sticky_sidebar:sticky_sidebar,android_profile_panel:panel_b,live_store:ls_a2_b2_c1_f2,search_hybrid_tabs:without-tabs,answer_related_readings:qa_recommend_with_ads_and_article,asdfadsf:asdfad,new_mobile_column_appheader:new_header,fav_act:default,remix_one_key_play_button:headerButton,mobile_qa_page_proxy_heifetz:m_qa_page_nweb,nweb_write_answer:default,android_pass_through_push:getui,new_more:new,new_buy_bar:livenewbuy3,zcm-lighting:zcm,iOS_newest_version:4.2.0,qrcode_login:qrcode,wechat_share_modal:wechat_share_modal_show">
<meta name="renderer" content="webkit" />
<meta name="description" content="中文互联网最大的知识平台,帮助人们便捷地分享彼此的知识、经验和见解。"/>
<meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
<title>知乎 - 发现更大的世界</title>

<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png" sizes="60x60">

<link rel="shortcut icon" href="https://static.zhihu.com/static/favicon.ico" type="image/x-icon" />
<link rel="dns-prefetch" href="p1.zhimg.com"/>
<link rel="dns-prefetch" href="p2.zhimg.com"/>
<link rel="dns-prefetch" href="p3.zhimg.com"/>
<link rel="dns-prefetch" href="p4.zhimg.com"/>
<link rel="dns-prefetch" href="comet.zhihu.com"/>
<link rel="dns-prefetch" href="static.zhihu.com"/>
<link rel="dns-prefetch" href="upload.zhihu.com"/>
<link rel="stylesheet" href="https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.f214513a.css">
<meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg" />
<meta name="baidu-site-verification" content="KPFppAFoYF4Kkdv9" />
<meta property="qc:admins" content="00544670776201056375" />
<link rel="canonical" href="http://www.zhihu.com" />
<meta id="znonce" name="znonce" content="d5e581328572473aad8501685dae174f">
<!--[if lt IE 9]>
<script src="https://static.zhihu.com/static/components/respond/dest/respond.min.js"></script>
<link href="https://static.zhihu.com/static/components/respond/cross-domain/respond-proxy.html" id="respond-proxy" rel="respond-proxy" />
<link href="/static/components/respond/cross-domain/respond.proxy.gif" id="respond-redirect" rel="respond-redirect" />
<script src="/static/components/respond/cross-domain/respond.proxy.js"></script>
<![endif]-->
<script src="https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js"></script>
</head>
<body class="zhi ">

<div class="index-main">
<div class="index-main-body">
<div class="index-header">
<h1 class="logo hide-text">知乎</h1>

<h2 class="subtitle">与世界分享你的知识、经验和见解</h2>

</div>

<div class="desk-front sign-flow sign-flow clearfix sign-flow-simple">

<div class="index-tab-navs">
<div class="navs-slider">
<a href="#signup" class="active">注册</a>
<a href="#signin">登录</a>
<span class="navs-slider-bar"></span>
</div>
</div>

<div class="view view-signin" data-za-module="SignInForm">
<form method="POST">
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
<div class="group-inputs">

<div class="account input-wrapper">

<input type="text" name="account" aria-label="手机号或邮箱" placeholder="手机号或邮箱" required>
</div>
<div class="verification input-wrapper">
<input type="password" name="password" aria-label="密码" placeholder="密码" required /><button type="button" class="send-code-button">获取验证码</button>
</div>

<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="请点击图中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">请点击图中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="验证码" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">登录</button>
</div>
<div class="signin-misc-wrapper clearfix">

<button type="button" class="signin-switch-button">手机验证码登录</button>

<a class="unable-login" href="#">无法登录?</a>
</div>

<div class="other-signup-wrapper" data-za-module="SNSSignIn">

<span class="name signin-switch-qrcode-buttons">二维码登录</span>
<span class="signup-footer-separate signup-footer-se"> · </span>

<span class="name signup-social-buttons js-toggle-sns-buttons">社交帐号登录</span>

<div class="sns-buttons">
<a title="微信登录" class="js-bindwechat" href="#"><i class="sprite-index-icon-wechat"></i></a>
<a title="微博登录" class="js-bindweibo" href="#"><i class="sprite-index-icon-weibo"></i></a>
<a title="QQ 登录" class="js-bindqq" href="#"><i class="sprite-index-icon-qq"></i></a>
</div>

</div>

</form>

<div class="qrcode-signin-container">
<div class="qrcode-signin-step1">
<div class="qrcode-signin-img-wrapper">
<img src="/static/img/spinner/grey-loading.gif" class="qrcode-signin-loading"/>
</div>
<p>打开最新 <a href="https://www.zhihu.com/app/" target="_blank">知乎 App</a></p>
<p>在「更多」页面右上角打开扫一扫</p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密码登录</span>
</div>
</div>
<div class="qrcode-signin-step2">
<div class="qrcode-signin-scan-status"></div>
<p class="qrcode-signin-scan-tips">扫描成功</p>
<p>请在手机上「确认登录」</p>
<div class="qrcode-signin-cut-button">
<span class="qrcode-goto-scan">返回二维码</span>
</div>
</div>
<div class="qrcode-signin-failure">
<div class="qrcode-signin-failure-icon"></div>
<p class="qrcode-signin-failure-message"></p>
<div class="qrcode-signin-cut-button">
<span class="signin-switch-password">使用密码登录</span>
</div>
</div>
<div class="qrcode-signin-guide"></div>
</div>

<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下载知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>

</div>
<div class="view  view-signup selected" data-za-module="SignUpForm">

<form class="zu-side-login-box" action="/register/email" id="sign-form-1" autocomplete="off" method="POST">
<input type="password" hidden>
<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>

<div class="group-inputs">

<div class="name input-wrapper">
<input required type="text" name="fullname" aria-label="姓名" placeholder="姓名">
</div>
<div class="email input-wrapper">

<input required type="text" class="account" name="phone_num" aria-label="手机号" placeholder="手机号">

</div>
<div class="input-wrapper">
<input required type="password" name="password" aria-label="密码" placeholder="密码(不少于 6 位)" autocomplete="off">
</div>

<div class="Captcha input-wrapper" data-type="cn" data-za-module="Captcha">
<div class="Captcha-operate">
<input type="hidden" name="captcha" required data-rule-required="true" data-msg-required="请点击图中所有倒立的文字">
<input type="hidden" name="captcha_type" value="cn" required>
<label class="Captcha-prompt">请点击图中所有倒立的文字</label>
<span class="Captcha-refresh js-refreshCaptcha sprite-index-icon-refresh"></span>
</div>
<div class="Captcha-imageConatiner">
<img class="Captcha-image" alt="验证码" >
</div>
</div>

</div>
<div class="button-wrapper command">
<button class="sign-button submit" type="submit">注册知乎</button>
</div>

</form>

<p class="agreement-tip">点击「注册」按钮,即代表你同意<a href="/terms" target="_blank">《知乎协议》</a></p>
<a class="signup-entry--org" href="/org/signup">注册机构号</a>

<div class="QRCode">
<button class="QRCode-toggleButton">
<span class="sprite-global-icon-qrcode"></span>
<span class="QRCode-toggleButtonText ">下载知乎 App</span>
</button>
<div class="QRCode-card">
<div class="QRCode-image"></div>
<div class="sprite-index-icon-arrow"></div>
</div>
</div>

</div>
</div>
</div>

</div>

<div class="footer">
<a target="_blank" href="https://zhuanlan.zhihu.com">知乎专栏</a>
<span class="dot">·</span>
<a target="_blank" href="/roundtable">知乎圆桌</a>
<span class="dot">·</span>
<a target="_blank" href="/explore" data-za-c="explore" data-za-a="visit_explore" data-za-l="home_bottom_explore">发现</a>
<span class="dot">·</span>
<a target="_blank" href="/app">移动应用</a>
<span class="dot">·</span>
<a href="/contact" class="footer-mobile-show">联系我们</a>
<span class="dot">·</span>
<a target="_blank" href="/careers">来知乎工作</a>
<br />
<span>&copy; 2017 知乎</span>
<span class="dot">·</span>
<a href="http://www.miibeian.gov.cn/" target="_blank">京 ICP 证 110745 号</a>
<span class="dot">·</span>
<span>京公网安备 11010802010035 号</span>
<span class="dot">·</span>
<a href="http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg" target="_blank">出版物经营许可证</a>
<br />
<a target="_blank" href="https://zhuanlan.zhihu.com/p/28852607">侵权举报</a>
<span class="dot">·</span>
<a target="_blank" href="http://www.12377.cn">网上有害信息举报专区</a>
<span class="dot">·</span>
<a target="_blank" href="/jubao">儿童色情信息举报专区</a>
<span class="dot">·</span>
<span>违法和不良信息举报:010-82716601</span>
<div class="chengxing">
<a id='___szfw_logo___' href='https://credit.szfw.org/CX20170607038331320388.html' target='_blank'>
<img src="https://static.zhihu.com/static/revved/img/index/[email protected]" border='0' />
</a>
<script type='text/javascript'>(function(){document.getElementById('___szfw_logo___').oncontextmenu = function(){return false;}})();</script>
</div>
</div>

<script type="text/json" class="json-inline" data-name="disabled_components">["back_to_top"]</script>
<script type="text/json" class="json-inline" data-name="current_user">["","","","-1","",0,0]</script>
<script type="text/json" class="json-inline" data-name="env">["zhihu.com","comet.zhihu.com",false,null,false,false]</script>

<script type="text/json" class="json-inline" data-name="ga_vars">{"user_created":0,"now":1509713487000,"abtest_mask":"------------------------------","user_attr":[0,0,0,"-","-"],"user_hash":0}</script>

<script src="https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/base.41bb3b24.js"></script>

<script src="https://static.zhihu.com/static/revved/-/js/closure/common.ef6c9c27.js"></script>
<script src="https://static.zhihu.com/static/revved/-/js/closure/page-index.f17f3a40.js"></script>
<meta name="entry" content="ZH.entrySignPage" data-module-id="page-index">

<input type="hidden" name="_xsrf" value="060f3dedb5b35e2ac6fe354bed716d04"/>
</body>
</html>
Time: 2024-08-02 02:48:12
