Garbled Chinese text when scraping HTML with Python

Environment:

Python 3.6

Target URL: https://www.dygod.net/html/tv/hytv/

Scraping code:

import requests

url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
print(req.text)

Result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<META http-equiv=Content-Type content="text/html; charset=gb2312">
<title>µçÊÓ¾ç / »ªÓïµçÊÓ¾ç_µçÓ°ÌìÌÃ-Ñ¸À×µçÓ°ÏÂÔØ</title>
<meta name="keywords" content="Ñ¸À×µçÓ°£¬Ñ¸À×ÏÂÔØ£¬Ãâ·ÑµçÓ°">
<meta name=description content="Ãâ·ÑÑ¸À×µçÓ°ÏÂÔØ,Ñ¸À×ÏÂÔØ£¬×îºÃµÄÑ¸À×ÏÂÔØÕ¾£¬ÊÇÓ°ÃÔµÄÊ×Ñ¡">
<link href="/css/dygod.css" rel="stylesheet" type="text/css" />

As shown above, the content of the title is garbled. I suspected an encoding problem but did not know how to fix it, so I searched online.

Reference:

https://www.cnblogs.com/bw13/p/6549248.html

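That revealed the cause: the response header specifies only the content type and not the encoding (these days the page encoding is usually declared inside the HTML page itself). Inspecting the response for the original page confirms this.

The Content-Type header carries no charset parameter. A header that does declare one normally looks like, for example, Content-Type: text/html; charset=utf-8.

So a default encoding is used instead.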

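Chapter 16 (Internationalization) of "HTTP: The Definitive Guide" notes that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as 'ISO-8859-1'.

That is fine for English pages, but a Chinese page comes out garbled!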
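The effect can be reproduced without any network access. The sketch below assumes the page's title begins with the characters 电视剧 (which is what the garbled title above decodes to) and shows how gb2312 bytes read as ISO-8859-1 turn into the mojibake seen in the result:

```python
# Bytes a gb2312 page would send for the first three characters of the title
raw = "电视剧".encode("gb2312")

# Decoding those bytes with the wrong (default) charset produces mojibake
garbled = raw.decode("iso-8859-1")
print(garbled)   # µçÊÓ¾ç

# No information is lost: re-encode with the wrong charset,
# then decode with the right one to recover the original text
repaired = garbled.encode("iso-8859-1").decode("gb2312")
print(repaired)  # 电视剧
```

Because ISO-8859-1 maps every byte to exactly one character, the round trip is lossless, which is why simply telling requests the correct encoding is enough to fix the output.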

print(req.apparent_encoding)

Output: GB2312

So all that is needed is to add

req.encoding = req.apparent_encoding

and that is it!

Code:

import requests

url = 'https://www.dygod.net/html/tv/hytv/'
req = requests.get(url)
req.encoding = req.apparent_encoding
print(req.text)

The Chinese text in the result is no longer garbled.
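req.apparent_encoding guesses the encoding from the body bytes. Since this page also declares charset=gb2312 in its META tag, another option is to read that declaration directly. The following is only a minimal sketch: pick_encoding is a hypothetical helper written for illustration (it is not part of requests), and the sample body is made up to resemble the page above.

```python
import re

def pick_encoding(content_type, body, default="ISO-8859-1"):
    """Hypothetical helper: choose an encoding for an HTML response."""
    # 1. Prefer a charset declared in the Content-Type header, if present
    match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    if match:
        return match.group(1)
    # 2. Otherwise look for a charset declared near the top of the HTML
    match = re.search(rb"charset=[\"']?([\w-]+)", body[:2048], re.I)
    if match:
        return match.group(1).decode("ascii")
    # 3. Fall back to the default discussed above
    return default

# Made-up sample body imitating the page: header without charset, META with one
body = (
    '<META http-equiv=Content-Type content="text/html; charset=gb2312">'
    "<title>电视剧</title>"
).encode("gb2312")

encoding = pick_encoding("text/html", body)
text = body.decode(encoding)
print(encoding)
print(text)
```

With requests, the same idea would presumably be applied by passing req.headers.get('Content-Type') and req.content to the helper and assigning the result to req.encoding before reading req.text.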

Original article: https://www.cnblogs.com/bingchuan-study/p/11487164.html

Time: 2024-09-30 15:12:18
