Python3 Crawler 04 (other examples, such as processing fetched page content)

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import os
import re
import requests
from bs4 import NavigableString
from bs4 import BeautifulSoup

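# Fetch the page and print the text of every element with class "content"
res = requests.get("https://www.qiushibaike.com/")
qiushi = res.content
soup = BeautifulSoup(qiushi, "html.parser")
duanzis = soup.find_all(class_="content")
for i in duanzis:
    duanzi = i.span.contents[0]   # first child node of the <span>
    # duanzi = i.span.string
    print(duanzi)
    # print(i.span.string)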

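# Fetch the search results page and collect the lazy-loaded image tags
res = requests.get("http://699pic.com/sousuo-218808-13-1-0-0-0.html")
image = res.content
soup = BeautifulSoup(image, "html.parser")
images = soup.find_all(class_="lazy")

for i in images:
    original = i["data-original"]   # real image URL
    title = i["title"]              # image title, used as the file name
    # print(title)
    # print(original)
    # print("")
    try:
        # The "jpg" directory must already exist under the current directory
        with open(os.path.join(os.getcwd(), "jpg", title + '.jpg'), 'wb') as file:
            file.write(requests.get(original).content)
    except Exception:
        # Skip images that fail to download or save
        pass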
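The loop above silently skips any image whose save fails, and one likely cause is a title containing characters that are not allowed in file names. Here is a minimal sketch of cleaning the title first. The helpers `safe_filename` and `target_path` are hypothetical names introduced for illustration and are not part of the original script.

```python
import re
from pathlib import Path


def safe_filename(title, default="untitled"):
    """Hypothetical helper: replace characters that Windows forbids in file names."""
    # Collapse each run of forbidden characters into a single underscore
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title)
    # Drop leading/trailing spaces, dots and underscores
    cleaned = cleaned.strip(" ._")
    return cleaned or default


def target_path(directory, title, suffix=".jpg"):
    """Hypothetical helper: create the directory if needed and build the file path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / (safe_filename(title) + suffix)
```

In the download loop, `open(target_path("jpg", title), "wb")` could then replace the hand-built path, assuming the images should still go into a `jpg` folder under the current directory.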

r = requests.get("http://699pic.com/sousuo-218808-13-1.html")
fengjing = r.content
soup = BeautifulSoup(fengjing, "html.parser")
# Find all tags with class "lazy"
images = soup.find_all(class_="lazy")
# print(images)  # returns a list object

for i in images:
    jpg_rl = i["data-original"]  # get the image URL
    title = i["title"]           # get the title
    print(title)
    print(jpg_rl)
    print("")

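# Each repeated assignment below replaces the previous one; keep whichever you need
r = requests.get("https://www.qiushibaike.com/")
r = requests.get("http://www.cnblogs.com/nicetime/")
blog = r.content
soup = BeautifulSoup(blog, "html.parser")
soup = BeautifulSoup(blog, features="lxml")   # requires the lxml package
print(soup.contents[0].contents)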

# Three ways to locate a tag: by tag name, by class, by id (the last assignment wins)
tag = soup.find('div')
tag = soup.find(class_="menu-bar menu clearfix")
tag = soup.find(id="menu")
print(list(tag))

tag01=soup.find(class_="c_b_p_desc")

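# .contents is a list of direct children, .children iterates over the same nodes,
# .descendants walks all nested nodes recursively
print(len(list(tag01.contents)))
print(len(list(tag01.children)))
print(len(list(tag01.descendants)))

print(tag01.contents)
print(tag01.children)
for i in tag01.children:
    print(i)

print(len(tag01.contents))

# Iterating over a tag directly yields its direct children
for i in tag01:
    print(i)

print(tag01.contents[0].string)
print(tag01.contents[1])
print(tag01.contents[1].string)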
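The difference between `.contents`, `.children`, `.descendants`, and `.string` is easier to see on a small, fixed document than on a live page. The following is a minimal sketch. The HTML snippet is made up for illustration and only reuses the `c_b_p_desc` class name from the code above.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the live blog page will look different
html = '<div class="c_b_p_desc">summary text<a href="/p/1.html">read <b>more</b></a></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(class_="c_b_p_desc")

direct = tag.contents                  # list of direct children: a text node and <a>
same_nodes = list(tag.children)        # .children iterates over the same direct children
everything = list(tag.descendants)     # direct children plus all nested nodes

print(len(direct), len(same_nodes), len(everything))

# A text node's .string is the text itself
print(tag.contents[0].string)

# <a> has two children ("read " and <b>), so .string cannot pick one and returns None
print(tag.contents[1].string)

# get_text() joins all nested text instead
print(tag.contents[1].get_text())
```

When a tag may hold more than one child node, `get_text()` is usually the safer choice than `.string`.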

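# The page is GBK-encoded but is decoded as ISO-8859-1 by default,
# so re-encode the text to bytes and decode it again as GBK
url = "http://www.dygod.net/html/tv/oumeitv/109673.html"
s = requests.get(url)
print(s.text.encode("iso-8859-1").decode('gbk'))
# Extract every href value whose link text starts with "ftp"
res = re.findall('href="(.*?)">ftp', s.text)
for resi in res:
    a = resi.encode("iso-8859-1").decode('gbk')
    print(a)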
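The `encode("iso-8859-1").decode('gbk')` round trip works because ISO-8859-1 maps every byte to exactly one character, so encoding the garbled text back recovers the original bytes unchanged. A minimal offline sketch of the same repair follows. The sample string is an assumption chosen for illustration, not text taken from the page.

```python
# Sample text chosen for illustration only
original = "美剧下载"

# Bytes as a GBK-encoded page would send them
raw = original.encode("gbk")

# What you get when those bytes are wrongly decoded as ISO-8859-1
wrong = raw.decode("iso-8859-1")

# Undo the wrong decoding, then decode with the real encoding
fixed = wrong.encode("iso-8859-1").decode("gbk")

print(wrong != original)   # the wrongly decoded text is garbled
print(fixed == original)   # the round trip restores the text
```

With requests, an alternative is to set `s.encoding = "gbk"` before reading `s.text`, which avoids the round trip altogether.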

Original article: https://www.cnblogs.com/NiceTime/p/10125289.html

Time: 2024-10-08 18:52:01
