[Weibo crawler] Login + crawling + MySQL storage + ECharts visualization

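Sina Weibo has changed how the login request is encrypted: the password is now encrypted with RSA.

Below is how I implemented the login. Getting the cookie was the troublesome part: urllib2 fetches pages without issue, but httplib2 is painful with cookies and they have to be handled by hand. I eventually got httplib2 working against Sina Weibo; I do not know whether the same approach works for other microblog sites.

The login process is described below: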

1. Install the rsa module

Download: https://pypi.python.org/pypi/rsa/3.1.1

rsa module documentation: http://stuvel.eu/files/python-rsa-doc/index.html

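2. Get and inspect the Sina Weibo login JS file

View the source of the Sina passport page at http://login.sina.com.cn/signup/signin.php; it contains the address of the script, http://login.sina.com.cn/js/sso/ssologin.js. The file is minified, so run it through an online JS formatter before reading it.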

3. Log in

Step one of the login: request prelogin_url = 'http://login.sina.com.cn/sso/prelogin.php?entry=sso&callback=sinaSSOController.preloginCallBack&su=%s&rsakt=mod&client=ssologin.js(v1.4.2)' % username

Substitute your own account name and send the request with GET.

The response looks like this:

sinaSSOController.preloginCallBack({"retcode":0,"servertime":1362041092,"pcid":"gz-6664c3dea2bfdaa3c94e8734c9ec2c9e6a1f","nonce":"IRYP4N","pubkey":"EB2A38568661887FA180BDDB5CABD5F21C7BFD59C090CB2D245A87AC253062882729293E5506350508E7F9AA3BB77F4333231490F915F6D63C55FE2F08A49B353F444AD3993CACC02DB784ABBB8E42A9B1BBFFFB38BE18D78E87A0E41B9B8F73A928EE0CCEE1F6739884B9777E4FE9E88A1BBE495927AC4A799B3181D6442443","rsakv":"1330428213","exectime":1})

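We need the values of servertime, nonce, pubkey and rsakv. pubkey and rsakv can also be hard-coded, since they are fixed values.

pubkey is the first parameter (the modulus) of the RSA public key, and rsakv is sent as one of the POST fields in the next login step.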
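The four values can be pulled out of the JSONP response with the standard library alone. The following is a minimal sketch, not the original author's code: the helper name parse_prelogin is made up for illustration, and the sample string uses a shortened, fake pubkey.

```python
import json
import re


def parse_prelogin(body):
    """Extract the JSON object wrapped in the JSONP callback and return the four fields."""
    match = re.search(r"\((\{.*\})\)", body, re.S)
    if match is None:
        raise ValueError("no JSON payload found in prelogin response")
    data = json.loads(match.group(1))
    return {key: data[key] for key in ("servertime", "nonce", "pubkey", "rsakv")}


# Sample response in the same shape as the one above (pubkey shortened, hypothetical)
sample = ('sinaSSOController.preloginCallBack({"retcode":0,"servertime":1362041092,'
          '"nonce":"IRYP4N","pubkey":"EB2A38","rsakv":"1330428213","exectime":1})')
fields = parse_prelogin(sample)
print(fields["servertime"], fields["nonce"], fields["rsakv"])
```

Reading the values from each response, rather than hard-coding them, keeps the login working if the server ever rotates pubkey or rsakv.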

Encoding the username works the same way as before: URL-quote it, then Base64-encode the result.

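import base64, urllib.parse
username_ = urllib.parse.quote(username)  # URL-quote the account name
username = base64.b64encode(username_.encode('utf-8')).decode('ascii')  # Base64-encode the quoted name, not the raw one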
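As a worked example of that encoding, the sketch below wraps the two steps in a function and adds the reverse operation, which can be handy for checking an su value captured from the browser. Both function names and the sample address are made up for illustration.

```python
import base64
from urllib.parse import quote, unquote


def encode_username(username):
    """URL-quote the account name, then Base64-encode the quoted text."""
    return base64.b64encode(quote(username).encode("utf-8")).decode("ascii")


def decode_username(su):
    """Reverse encode_username: Base64-decode, then URL-unquote."""
    return unquote(base64.b64decode(su).decode("utf-8"))


su = encode_username("user@example.com")  # hypothetical account name
print(su)
print(decode_username(su))
```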

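Encrypting the password is the most important part.

1. First build an RSA public key. Sina Weibo supplies both parameters as fixed values, but as hexadecimal strings: the first is the pubkey from step one of the login, the second is the '10001' found in the JS encryption file.

Both values have to be converted from hexadecimal to decimal first, although they can also be hard-coded. I simply hard-coded '10001' as 65537.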

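rsaPublickey = int(pubkey, 16)  # hex modulus to integer
key = rsa.PublicKey(rsaPublickey, 65537)  # build the public key
message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)  # plaintext layout, taken from the JS encryption file
passwd = rsa.encrypt(message.encode('utf-8'), key)  # encrypt; rsa.encrypt expects bytes
passwd = binascii.b2a_hex(passwd)  # convert the ciphertext to hexadecimal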
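The two preparation steps, converting the hexadecimal key parameters and assembling the plaintext, can be checked without the rsa module. This is a small sketch under assumptions: the function names are invented here, and the modulus and password are fake sample values.

```python
def parse_public_key(pubkey_hex, exponent_hex="10001"):
    """Convert the hexadecimal modulus and exponent strings to integers."""
    return int(pubkey_hex, 16), int(exponent_hex, 16)


def build_plaintext(servertime, nonce, password):
    """Join the fields as servertime TAB nonce NEWLINE password and return bytes."""
    return (str(servertime) + "\t" + str(nonce) + "\n" + str(password)).encode("utf-8")


modulus, exponent = parse_public_key("EB2A38")  # shortened, made-up modulus
print(exponent)
plaintext = build_plaintext(1362041092, "IRYP4N", "secret")  # hypothetical password
print(plaintext)
```

The resulting integers are what rsa.PublicKey takes, and the bytes are what rsa.encrypt takes, as in the lines above.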

2. Request the passport URL: login_url = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.2)"

The form data to POST:

formdata = {
    "entry": 'weibo', "gateway": '1', "from": '', "savestate": '7',
    "useticket": '1', "ssosimplelogin": '1',
    "su": username,  # the Base64-encoded username from above
    "service": 'miniblog', "servertime": servertime, "nonce": nonce,
    "pwencode": 'rsa2', "sp": passwd, "encoding": 'UTF-8', "rsakv": '1330428213',
    "url": 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
    "returntype": 'META',
}

Compared with the old login, the request adds rsakv and changes pwencode to rsa2; everything else is unchanged.
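Before it can be sent, the dictionary has to be turned into a URL-encoded request body. A minimal sketch with a cut-down dictionary (the su value is a placeholder, not a real account):

```python
from urllib import parse

# Cut-down form data; "dXNlcg==" is a placeholder su value
formdata = {"entry": "weibo", "su": "dXNlcg==", "pwencode": "rsa2", "rsakv": "1330428213"}

# urlencode percent-escapes reserved characters such as "=" in the Base64 text
body = parse.urlencode(formdata).encode("utf-8")
print(body)

# Decoding the body again gives back the original values
decoded = parse.parse_qs(body.decode("utf-8"))
print(decoded["su"])
```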

    def login(self, username, password, code):
        # GetMixUser (defined elsewhere) returns the encoded username and the encrypted password
        mix = self.GetMixUser(username, password)
        uname = mix['uname']
        upass = mix['upass']
        url = "https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)"
        print("Logging in...")
        postData = {
            "door": code,  # verification code passed in by the caller
            "encoding": "utf-8",
            "entry": "weibo",
            "from": "null",
            "gateway": 1,
            "nonce": self.pre['nonce'],  # self.pre holds the prelogin response
            "prelt": 72,
            "pwencode": "rsa2",
            "qrcode_flag": False,
            "returntype": "META",
            "savestate": 7,
            "servertime": self.pre['servertime'],
            "service": "miniblog",
            "rsakv": self.pre['rsakv'],
            "su": uname,
            "sp": upass,
            "url": "https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack",
            "useticket": 1,
            "vsnf": 1
        }
        postData = parse.urlencode(postData).encode('utf-8')
        result = self.opener.open(url, postData).read().decode('gbk')
        # The response redirects through location.replace('...'); cut the URL out of it
        url1 = result[result.find("replace") + 9:result.find(')') - 1]
        result = self.opener.open(url1).read().decode("gbk")
        # "身份" ("identity") is literal page text; if it appears, treat the login as failed
        if result.find("身份") != -1:
            return False
        result = result[result.find('location') + 18:]
        url2 = result[:result.find(')') - 1]
        self.opener.open(url2).read().decode("gbk")
        return True
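The fixed offsets (+9, -1, +18) used to slice the redirect URL out of the response break as soon as the page changes its quoting or spacing. One possible alternative, sketched here as an assumption rather than the author's method, is a regular expression; extract_redirect is a made-up helper name and the sample markup is invented.

```python
import re


def extract_redirect(html):
    """Return the URL inside location.replace('...'), or None if there is none."""
    match = re.search(r"""location\.replace\(\s*['"](.*?)['"]\s*\)""", html)
    return match.group(1) if match else None


# Invented sample in the shape the login code above expects
sample = "<script>location.replace('https://example.com/next?ticket=abc');</script>"
print(extract_redirect(sample))
print(extract_redirect("<html>no redirect here</html>"))
```

Returning None when nothing matches lets the caller report a failed login instead of opening a garbage URL.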

Crawling

    def GetUserList(self, uid, pageNum):
        url = "https://weibo.com/" + str(uid) + "/follow?page=" + str(pageNum)
        try:
            result = self.opener.open(url).read().decode('utf-8')
            # The list is embedded as an escaped string inside a script; strip the escapes
            html = result.replace('\\n', '').replace('\\t', '').replace('\\r', '').replace('\\', '')
            # Keep only the part between the two marker comments (literal page text, left in Chinese)
            html = html[html.find("<!--关注/粉丝列表-->"):html.find("<!--关欧盟隐私协议弹窗-->")]
            soup = BeautifulSoup(html, "html.parser")
            list_a = soup.find_all(name='div', attrs={"class": "info_name W_fb W_f14"})
            names = []
            uids = []
            for a in list_a:
                try:
                    b = a.find(name="a")
                    b = b['usercard']
                    b = b[3:13]  # characters 3..12 of the usercard attribute hold the user id
                    uids.append(b)
                    names.append(a.text)
                    print("Added user: " + a.text)
                except Exception:
                    print("No Data")
            dic = {"name": names, "uid": uids}
            return dic
        except Exception:
            pass  # returns None on failure; the caller treats that as "no more pages"
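The slice [3:13] only works while every user id has exactly ten digits. Assuming the usercard attribute is a query string of the form "id=<uid>&..." (an assumption inferred from that slice, not confirmed by the post), the id can be read by key instead. uid_from_usercard is a hypothetical helper name.

```python
from urllib.parse import parse_qs


def uid_from_usercard(usercard):
    """Read the id field from a usercard attribute shaped like 'id=123&key=value'."""
    values = parse_qs(usercard).get("id")
    return values[0] if values else None


print(uid_from_usercard("id=1234567890&refer_flag=abc"))  # invented sample value
print(uid_from_usercard("id=123456"))
```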

    def GetTalks(self, uid):
        rlist = []
        i = 0
        while True:
            try:
                result = self.opener.open("https://weibo.com/u/" + str(uid) + "?page=" + str(i)).read().decode("utf-8")
                html = result.replace("\\t", "").replace("\\n", "").replace("\\r", "").replace("\\", "")
                html = html[html.find("<div class=\"WB_feed WB_feed_v3 WB_feed_v4\""):]
            except Exception:
                break  # stop on a failed request instead of re-parsing the previous page forever
            soup = BeautifulSoup(html, "html.parser")
            list_a = soup.find_all(name="div", attrs={"class": "WB_text W_f14"})
            i = i + 1
            if list_a:
                print("Page " + str(i))
                for a in list_a:
                    at = a.text
                    at = at.replace(" ", "")
                    if at:
                        rlist.append(at)
                        print("Stored post: " + at)
            else:
                break  # an empty page means there are no more posts
        return rlist
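The paging logic in GetTalks can be separated from the fetching and parsing, which makes the stop conditions easy to test without a network connection. The sketch below is an illustration under assumptions: collect_pages and its max_pages cap are not part of the original code, and the fetcher is a stand-in.

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty, fails, or the cap is hit."""
    items = []
    for page in range(1, max_pages + 1):
        try:
            batch = fetch_page(page)
        except OSError:  # urllib's URLError is a subclass of OSError
            break
        if not batch:
            break
        items.extend(batch)
    return items


# Stand-in for a real fetcher: two pages with content, then nothing
fake_pages = {1: ["post a", "post b"], 2: ["post c"]}
print(collect_pages(lambda page: fake_pages.get(page, [])))
```

The cap guards against a site that keeps returning the same non-empty page for every page number.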

Storage

    def sqllogin(self):
        db = pymysql.connect(host='localhost', user='root', db='weibouser', passwd='root', charset='utf8mb4')
        return db

    def sqlProcess(self, db):
        while True:
            cursor = db.cursor()
            # TAG: 1 = not processed, 2 = being processed, 3 = finished
            cursor.execute("SELECT * FROM USERS WHERE TAG = 1")
            result = cursor.fetchone()
            if result:
                # As used below: result[1] is NAME, result[2] is USERID, result[4] is CLASS
                cursor.execute("UPDATE USERS SET TAG=2 WHERE USERID=%s", (result[2],))
                talks = self.GetTalks(uid=result[2])
                for i in range(1, 4):  # first three pages of the follow list
                    userlist = self.GetUserList(uid=result[2], pageNum=i)
                    try:
                        uids = userlist['uid']
                        names = userlist['name']
                    except Exception:
                        break
                    if int(result[4]) != 3:  # users with CLASS 3 are not expanded further
                        for t in range(len(uids)):
                            try:
                                if not self.IfExist(db, "users", "name", names[t]):
                                    # Parameterized queries let the driver escape quotes in names and posts
                                    cursor.execute("INSERT INTO USERS (NAME,USERID,TAG,CLASS) VALUES (%s,%s,%s,%s)", (names[t], uids[t], 1, int(result[4]) + 1))  # write the user list
                                    cursor.execute("INSERT INTO FOLLOWS (USERID,FUID,FUNAME) VALUES (%s,%s,%s)", (result[2], uids[t], names[t]))
                            except Exception:
                                print("Error")
                for talk in talks:
                    try:
                        cursor.execute("INSERT INTO USERTALKS (USERID,NAME,TALK) VALUES (%s,%s,%s)", (result[2], result[1], talk))  # write the posts
                    except Exception:
                        print("Error")
                cursor.execute("UPDATE USERS SET TAG=3 WHERE USERID=%s", (result[2],))
                db.commit()  # pymysql does not autocommit by default
            else:
                break

    def AnotherProcess(self, db):
        cursor = db.cursor()
        cursor.execute("SELECT * FROM USERS")
        results = cursor.fetchall()
        for result in results:
            sex = "女"  # "female"; stored as in the original
            try:
                r = self.opener.open("https://weibo.com/u/" + result[2]).read().decode("utf-8")
                html = r.replace("\\t", "").replace("\\n", "").replace("\\r", "").replace("\\", "")
                if html.find("female") == -1:
                    sex = "男"  # "male"
                soup = BeautifulSoup(html, "html.parser")
                keywords = soup.find(attrs={"name": "keywords"})['content']
                description = soup.find(attrs={"name": "description"})['content']
            except Exception:
                continue  # skip users whose profile page could not be fetched or parsed
            cursor.execute("INSERT INTO USERDETAILS (NAME,DESCRIPTION,KEYWORDS,SEX) VALUES (%s,%s,%s,%s)", (result[1], description, keywords, sex))
        db.commit()
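The TAG column works as a small state machine (1 pending, 2 in progress, 3 done). The sketch below illustrates that flow with parameterized queries. It is not the author's code: it uses the standard-library sqlite3 module in place of MySQL so that it runs without a server, the table definitions are cut down, and the helper names are invented. Note that sqlite3 uses ? as the placeholder where pymysql uses %s.

```python
import sqlite3


def claim_next_user(db):
    """Pick one pending user (TAG = 1) and mark it as in progress (TAG = 2)."""
    row = db.execute("SELECT USERID, NAME FROM USERS WHERE TAG = 1 LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE USERS SET TAG = 2 WHERE USERID = ?", (row[0],))
    return row


def finish_user(db, userid, talks):
    """Store the user's posts and mark the user as done (TAG = 3)."""
    db.executemany("INSERT INTO USERTALKS (USERID, TALK) VALUES (?, ?)",
                   [(userid, talk) for talk in talks])
    db.execute("UPDATE USERS SET TAG = 3 WHERE USERID = ?", (userid,))
    db.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE USERS (NAME TEXT, USERID TEXT, TAG INTEGER)")
db.execute("CREATE TABLE USERTALKS (USERID TEXT, TALK TEXT)")
db.execute("INSERT INTO USERS VALUES (?, ?, 1)", ("alice", "1234567890"))

user = claim_next_user(db)
finish_user(db, user[0], ["a post with a ' quote in it"])
print(db.execute("SELECT TAG FROM USERS").fetchone())
print(db.execute("SELECT COUNT(*) FROM USERTALKS").fetchone())
```

The post text contains a single quote on purpose: with string formatting it would break the SQL statement, while the placeholder version stores it unchanged.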

Front end

ECharts front-end configuration: relation graph

Example

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Relation graph example</title>
<!-- Load the ECharts file -->
<script src="js/echarts4.0.js" type="text/javascript" charset="utf-8"></script>
</head>
<body>
<!-- A container with an explicit size (width and height) for ECharts -->
<div id="chart1" style="width: 80%;height: 400px;top: 50px;left: 10%;border: 3px solid #FF0000;"></div>
<script type="text/javascript">
// Initialize the ECharts instance on the prepared container (the div with id chart1)
var chart1 = echarts.init(document.getElementById("chart1"));
var option = {
    backgroundColor: '#ccc',    // background color
    title: {                    // chart title
        text: "Income and expense analysis",    // title text
        left: '3%',             // distance from the left edge
        top: '3%',              // distance from the top edge
        textStyle: {            // title style
            color: '#000',      // title font color
            fontSize: 30        // title font size
        }
    },
    tooltip: {                  // tooltip configuration
        formatter: function(param) {
            if (param.dataType === 'edge') {
                //return param.data.category + ': ' + param.data.target;
                return param.data.target;
            }
            //return param.data.category + ': ' + param.data.name;
            return param.data.name;
        }
    },
    series: [{
        type: "graph",          // series type: relation graph
        top: '10%',             // distance between the chart and the top of the container
        roam: true,             // mouse zoom and pan; off by default. Use 'scale' or 'move' to enable only one, true enables both
        focusNodeAdjacency: true,   // highlight a node together with its edges and adjacent nodes on hover. [ default: false ]
        // Force-directed layout options. The layout simulates a spring/charge model: a repulsive force
        // acts between every pair of nodes and an attractive force along every edge. On each iteration
        // the nodes move under these forces, and after many iterations they settle at an equilibrium
        // that minimizes the energy of the model. The result has good symmetry and local clustering
        // and also looks good.
        force: {
            repulsion: 1000,            // [ default: 50 ] repulsion factor between nodes (the distance between objects). May be an array giving a range, in which case values of different sizes map linearly to different repulsion. The larger the value, the stronger the repulsion
            edgeLength: [150, 100]      // [ default: 30 ] distance between the two nodes of an edge; it is also affected by repulsion.
                                        // May be an array giving a range, in which case values map linearly to lengths; the smaller the value, the longer the edge. For example:
                                        // with edgeLength: [10, 50] the edge with the largest value tends toward length 10 and the one with the smallest value toward 50
        },
        layout: "force",            // graph layout. [ default: 'none' ]
                                    // 'none' applies no layout and uses the x, y supplied in each node as its position.
                                    // 'circular' is the circular layout; 'force' is the force-directed layout.
        // Marker shape
        //symbol: "path://M19.300,3.300 L253.300,3.300 C262.136,3.300 269.300,10.463 269.300,19.300 L269.300,21.300 C269.300,30.137 262.136,37.300 253.300,37.300 L19.300,37.300 C10.463,37.300 3.300,30.137 3.300,21.300 L3.300,19.300 C3.300,10.463 10.463,3.300 19.300,3.300 Z",
        symbol: 'circle',
        lineStyle: {            // shared line style of the edges. lineStyle.color also accepts the special values 'source' or 'target', in which case the edge takes the color of its source or target node
            normal: {
                color: '#000',          // line color [ default: '#aaa' ]
                width: 1,               // line width [ default: 1 ]
                type: 'solid',          // line type [ default: solid ]; also 'dashed', 'dotted'
                opacity: 0.5,           // opacity from 0 to 1; at 0 the shape is not drawn. [ default: 0.5 ]
                curveness: 0            // edge curvature from 0 to 1; the larger the value, the stronger the curve. [ default: 0 ]
            }
        },
        label: {                // labels on the nodes
            normal: {
                show: true,                 // whether to show the label
                position: "inside",         // label position: 'top' 'left' 'right' 'bottom' 'inside' 'insideLeft' 'insideRight' 'insideTop' 'insideBottom' 'insideTopLeft' 'insideBottomLeft' 'insideTopRight' 'insideBottomRight'
                textStyle: {                // text style
                    fontSize: 16
                }
            }
        },
        edgeLabel: {                // labels on the lines connecting two nodes
            normal: {
                show: true,
                textStyle: {
                    fontSize: 14
                },
                formatter: function(param) {        // label content
                    return param.data.category;
                }
            }
        },
        data: [{
            name: "An IT guy",
            draggable: true,                // whether the node can be dragged; only has an effect with the force-directed layout
            symbolSize: [100, 100],         // size of the node marker; either a single number such as 10, or an array for width and height, e.g. [20, 10] means width 20 and height 10
            itemStyle: {
                color: '#000'               // color of the node marker
            },
            category: "Income and expense analysis"    // category the data item belongs to
        }, {
            name: "Salary\n6000",
            draggable: true,
            symbolSize: [80, 80],
            itemStyle: {
                color: '#0000ff'
            },
            category: "Income +"
        }, {
            name: "Rent\n600",
            draggable: true,
            symbolSize: [80, 80],
            itemStyle: {
                color: '#ff0000'
            },
            category: "Expense -"
        }, {
            name: "Living expenses\n1400",
            draggable: true,
            symbolSize: [80, 80],
            itemStyle: {
                color: '#ff0000'
            },
            category: "Expense -"
        }, {
            name: "Savings\n4000",
            draggable: true,
            symbolSize: [80, 80],
            itemStyle: {
                color: '#00ff00'
            },
            category: "Remainder ="
        }],
        categories: [{              // node categories, optional. If the nodes are categorized, data[i].category selects the category of each node and the category style is applied to the node. The legend can also display and filter by category name
            name: "Income and expense analysis"    // category name, matched against the legend and used when formatting the tooltip
        }, {
            name: "Income +"
        }, {
            name: "Expense -"
        }, {
            name: "Expense -"
        }, {
            name: "Remainder ="
        }],
        links: [{                   // relations between the nodes
            target: "Salary\n6000",
            source: "An IT guy",
            category: "Income +"            // label text shown on the connecting line
        }, {
            target: "Rent\n600",
            source: "An IT guy",
            category: "Expense -"
        }, {
            target: "Living expenses\n1400",
            source: "An IT guy",
            category: "Expense -"
        }, {
            target: "Savings\n4000",
            source: "An IT guy",
            category: "Remainder ="
        }]
    }],
    animationEasingUpdate: "quinticInOut",          // easing of the data-update animation. [ default: cubicOut ]
    animationDurationUpdate: 100                    // duration of the data-update animation. [ default: 300 ]
};
// Render the chart with the option and data defined above
chart1.setOption(option);
</script>
</body>
</html>
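The post does not show how the crawled data reaches the chart. One possible bridge, sketched here purely as an assumption, is to turn rows of the FOLLOWS table into the data and links arrays that the graph series expects and hand them to the page as JSON. build_graph, the "follows" label and the sample names are all made up for illustration.

```python
import json


def build_graph(follows):
    """Build ECharts-style data and links from (user_name, followed_name) pairs."""
    names = []
    links = []
    for source, target in follows:
        for name in (source, target):
            if name not in names:  # each node must appear only once in data
                names.append(name)
        links.append({"source": source, "target": target, "category": "follows"})
    data = [{"name": name, "draggable": True, "symbolSize": [60, 60]} for name in names]
    return {"data": data, "links": links}


# Invented sample rows standing in for a query on the FOLLOWS table
graph = build_graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

The resulting data and links values can replace the hand-written arrays in the option object above.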

Vue + Spring Boot + Python

The code for this part is not posted.

The MySQL back-end code is not posted either.

Original post: https://www.cnblogs.com/aoru45/p/9748769.html

Time: 2024-08-29 08:34:02
