Designing a Python crawler to boost blog page views (page views, likes, and image scraping)

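Tools you will need:

Python. Download it from https://www.python.org/

The Fiddler packet-capture tool: http://blog.csdn.net/qq_21792169/article/details/51628123

The principle behind boosting page views: every time the page is opened, the blog's view count goes up by one. (Blogs on Sina, Sohu, and similar sites behave this way.)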

count.py

import webbrowser as web
import time
import os
import random

count = random.randint(1, 2)  # number of rounds: 1 or 2
for j in range(count):
    # Open the page 9 times in new browser tabs
    for i in range(9):
        web.open_new_tab('http://blog.sina.com.cn/s/blog_552d7c620100aguu.html')  # replace the URL here
        time.sleep(3)  # in seconds; adjust to how fast your machine is
    time.sleep(10)  # in seconds; adjust to how fast your machine is
    os.system('taskkill /F /IM chrome.exe')  # Windows with Google Chrome; change the process name for other browsers
    # print('browser closed')

Boosting likes requires Fiddler to capture the request header data, such as Cookie, Host, Referer, and User-Agent.

sina.py

import urllib.request
import sys

points = 2  # how many requests to send; can be overridden by the first command-line argument
if len(sys.argv) > 1:
    points = int(sys.argv[1])

articleUrl = ''  # fill in the request URL captured with Fiddler
point_header = {
    'Accept': '*/*',
    'Cookie': '',   # fill in your cookie
    'Host': '',     # host name
    'Referer': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
}

for i in range(points):
    point_request = urllib.request.Request(articleUrl, headers=point_header)
    point_response = urllib.request.urlopen(point_request)

The header values above can be read from the captured traffic; the script is only meant to convey the idea.
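As a rough sketch of how captured header text could be turned into a dictionary like point_header: the function name parse_raw_headers and the sample request below are made up for illustration, and the sample only assumes the usual "Name: value" layout of a raw HTTP request.

```python
def parse_raw_headers(raw):
    """Turn raw 'Name: value' header lines into a dict (hypothetical helper)."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(':')
        # Skip lines without a colon, and the request line ("GET http://... HTTP/1.1"),
        # whose text before the first colon contains a space.
        if not sep or ' ' in name.strip():
            continue
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample in the shape of a captured request
SAMPLE_REQUEST = """
GET http://blog.example.com/post HTTP/1.1
Host: blog.example.com
User-Agent: Mozilla/5.0
Referer: http://blog.example.com/
Cookie: sid=abc123; theme=dark
"""

if __name__ == '__main__':
    for name, value in parse_raw_headers(SAMPLE_REQUEST).items():
        print('%s -> %s' % (name, value))
```

Splitting on the first colon only keeps values such as URLs intact, since they contain colons of their own.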

Scraping the images on a web page:

getimg.py

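# coding=utf-8
import re
import urllib.request

def getHtml(url):
    # Send a browser-like User-Agent with the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib.request.Request(url, headers=headers)

    page = urllib.request.urlopen(req)
    # The page encoding is assumed to be UTF-8; undecodable bytes are dropped
    html = page.read().decode('utf-8', errors='ignore')
    return html

def getImg(html):
    # Capture every src="..." value that starts with 'h' and ends with 'g'
    reg = r'src="(h.*?g)"'
    # reg = r'<img src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    print(imglist)
    # Save the images as 0.jpg, 1.jpg, ...
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    return imglist

html = getHtml("http://pic.yxdown.com/list/0_0_1.html")

getImg(html)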

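1. The three symbols .*? together match any number of arbitrary characters, taking as few as possible.

2. \. escapes the '.', so it matches a literal dot in the HTML.

3. ( ) means we keep only the part inside the parentheses and discard everything outside.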
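The three points above can be tried on a small string without touching the network. This is only a sketch: SAMPLE_HTML and the function names are invented for the demonstration, and the greedy variant is included purely for contrast.

```python
import re

# Made-up HTML fragment with one .jpg and one .png image
SAMPLE_HTML = '<img src="http://example.com/a.jpg"><img src="http://example.com/b.png">'


def find_image_urls(html):
    """Non-greedy: each match stops at the first 'g' that is followed by a quote."""
    return re.findall(r'src="(h.*?g)"', html)


def find_jpg_urls(html):
    """The escaped dot matches a literal '.', so only .jpg sources are captured."""
    return re.findall(r'<img src="(.+?\.jpg)"', html)


def find_greedy(html):
    """Greedy .* runs to the last possible 'g"', swallowing both tags at once."""
    return re.findall(r'src="(h.*g)"', html)


if __name__ == '__main__':
    print(find_image_urls(SAMPLE_HTML))
    print(find_jpg_urls(SAMPLE_HTML))
    print(find_greedy(SAMPLE_HTML))
```

In every case only the text inside the parentheses is returned, which is why the src=" prefix and the closing quote never appear in the results.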

Scraping CSDN view counts: csdn.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import urllib.request

# Current page number of the blog's article list
page_num = 1
# Truthy while the current page is not the last list page
notLast = 1

fs = open('blogs.txt', 'w', encoding='utf-8')
account = input('Input csdn Account:')

while notLast:

    # Blog home page
    baseUrl = 'http://blog.csdn.net/' + account
    # Append the page number to form the URL of the list page to scrape
    myUrl = baseUrl + '/article/list/' + str(page_num)

    # Pose as a browser; CSDN rejects requests that come without a browser User-Agent
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    # Build the request
    req = urllib.request.Request(myUrl, headers=headers)

    # Fetch the page (assumed to be UTF-8 encoded)
    myResponse = urllib.request.urlopen(req)
    myPage = myResponse.read().decode('utf-8')

    # Look for the "尾页" (last page) link to decide whether this is the last page
    notLast = re.findall('<a href=".*?">尾页</a>', myPage, re.S)

    print('-----------------------------Page %d---------------------------------' % page_num)
    fs.write('--------------------------------Page %d--------------------------------\n' % page_num)
    # Use a regular expression to get each post's href
    title_href = re.findall('<span class="link_title"><a href="(.*?)">', myPage, re.S)
    titleListhref = [items.strip() for items in title_href]

    # Use a regular expression to get each post's title
    title = re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>', myPage, re.S)
    titleList = [items.strip() for items in title]

    # Use a regular expression to get each post's view count
    view = re.findall(r'<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', myPage, re.S)
    viewList = [items.strip() for items in view]

    # Output the results
    for n in range(len(titleList)):
        print('views:%s href:%s title:%s' % (viewList[n].zfill(4), titleListhref[n], titleList[n]))
        fs.write('views:%s\t\thref:%s\t\ttitle:%s\n' % (viewList[n].zfill(4), titleListhref[n], titleList[n]))
    # Move on to the next page
    page_num = page_num + 1

fs.close()

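This regular expression is not quite complete: if the blog has pinned posts, the captured title contains an extra <font color="red">[置顶]</font>, so a check should be added here. Readers can try this themselves.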
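One possible way to handle the pinned marker is to strip it from each captured title before storing it. This is a sketch that assumes the marker appears exactly as quoted above; the names PINNED_TAG and clean_title are made up for the example.

```python
import re

# Assumed form of the pinned-post marker, as quoted in the text above
PINNED_TAG = re.compile(r'<font color="red">\[置顶\]</font>')


def clean_title(raw_title):
    """Remove the pinned marker (if present) and surrounding whitespace."""
    return PINNED_TAG.sub('', raw_title).strip()


if __name__ == '__main__':
    print(clean_title('<font color="red">[置顶]</font>\n  My pinned post  '))
    print(clean_title('  An ordinary post '))
```

Applying clean_title to every item when building titleList would leave ordinary titles unchanged apart from the whitespace trimming the script already does.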

Generating an IP list by hand, creat_ip:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import time

def get_ip(number=10, start='1.1.1.1'):
    # Write `number` consecutive IPv4 addresses, beginning at `start`, to ip_list.txt
    A, B, C, D = [int(part) for part in start.split('.')]
    with open('ip_list.txt', 'w') as file:
        for a in range(A, 256):
            for b in range(B, 256):
                for c in range(C, 256):
                    for d in range(D, 256):
                        if number <= 0:
                            return
                        ip = '%d.%d.%d.%d' % (a, b, c, d)
                        if number > 1:
                            file.write(ip + '\n')
                        else:
                            # Last address: no trailing newline
                            file.write(ip)
                        number -= 1
                    # Once an octet rolls over, the lower octets restart from 0
                    D = 0
                C = 0
            B = 0

time_start = time.time()
get_ip(100000, '101.23.228.102')
time_end = time.time()
print('Elapsed: %s seconds' % (time_end - time_start))
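The same list of consecutive addresses can also be produced with the standard-library ipaddress module, which handles the octet rollover itself. This is an alternative sketch rather than the script above; the function name consecutive_ips is made up for the example.

```python
import ipaddress


def consecutive_ips(start, count):
    """Return `count` consecutive IPv4 addresses as strings, beginning at `start`."""
    first = ipaddress.IPv4Address(start)
    # Adding an integer to an IPv4Address yields the address that many steps later
    return [str(first + offset) for offset in range(count)]


if __name__ == '__main__':
    # Joining with '\n' avoids a trailing newline after the last address
    print('\n'.join(consecutive_ips('1.1.1.254', 4)))
```

Writing the joined string to ip_list.txt would give the same file layout as get_ip, without the four nested loops.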

grab_ip.py scrapes a proxy-IP site and reads out the IPs and port numbers; how you use those IPs and ports depends on your own situation.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import logging
import os
import re
import urllib.request

url = 'http://www.xicidaili.com/'
csdn_url = 'http://blog.csdn.net/qq_21792169/article/details/51628142'
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]

def getProxyHtml(url):
    req = urllib.request.Request(url, headers=header)
    page = urllib.request.urlopen(req)
    # The page encoding is assumed to be UTF-8; undecodable bytes are dropped
    html = page.read().decode('utf-8', errors='ignore')
    return html

def ipPortGain(html):
    # An IP address on one line, followed by the port inside a tag on the next line
    ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+\n.+>(\d{1,5})<')
    ip_port = re.findall(ip_re, html)
    return ip_port

def proxyIP(ip_port):
    # Convert the (ip, port) pairs to the form ['221.238.28.158:8081', '183.62.62.188:9999']
    proxy_strs = []
    for i in range(0, len(ip_port)):
        proxy_strs.append(':'.join(ip_port[i]))
        logging.info(proxy_strs[i])
    # Then to the form [{'http': 'http://221.238.28.158:8081'}, {'http': 'http://183.62.62.188:9999'}]
    proxy_list = []
    for i in range(0, len(proxy_strs)):
        a0 = 'http://%s' % proxy_strs[i]
        a1 = {'http': a0}
        proxy_list.append(a1)
    return proxy_list

def csdn_Brush(ip):
    print(ip)

# Use ping to check whether an IP is alive (the options are those of Linux ping)
def ping_ip(ip):
    ping_cmd = 'ping -c 2 -w 5 %s' % ip
    ping_result = os.popen(ping_cmd).read()
    print('ping_cmd : %s, ping_result : %r' % (ping_cmd, ping_result))

    if ping_result.find('100% packet loss') < 0:
        print('ping %s ok' % ip)
        return True
    else:
        print('ping %s fail' % ip)
        return False

fh = open('proxy_ip.txt', 'w')
html = getProxyHtml(url)
ip_port = ipPortGain(html)
proxy_list = proxyIP(ip_port)

for (ip, port), proxy_ip in zip(ip_port, proxy_list):
    ping_ip(ip)
    fh.write('%s\n' % (proxy_ip,))
    # A for loop could be added here to request every post on the blog once through this IP;
    # the view count then equals IPs * number of posts * number of processes
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy_ip))
    res = opener.open(csdn_url).read()

# (There is a time interval of roughly half an hour: CSDN applies a time-based check, so we pair this with C.)
fh.close()
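The extraction and formatting steps (ipPortGain and proxyIP) can be checked offline on a small HTML fragment. The sketch below reuses the same regular expression; SAMPLE_TABLE is a made-up fragment that merely assumes the IP and the port sit in table cells on consecutive lines, and the function names are invented for the example.

```python
import re

# Same pattern as ipPortGain: IP on one line, port inside a tag on the next line
IP_PORT_RE = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+\n.+>(\d{1,5})<')

# Made-up fragment in the assumed layout of a proxy listing
SAMPLE_TABLE = """<tr>
  <td>221.238.28.158</td>
  <td>8081</td>
</tr>
<tr>
  <td>183.62.62.188</td>
  <td>9999</td>
</tr>
"""


def extract_ip_ports(html):
    """Return a list of (ip, port) string tuples found in the HTML."""
    return IP_PORT_RE.findall(html)


def to_proxy_dicts(ip_ports):
    """Format (ip, port) tuples as {'http': 'http://ip:port'} dictionaries."""
    return [{'http': 'http://%s' % ':'.join(pair)} for pair in ip_ports]


if __name__ == '__main__':
    pairs = extract_ip_ports(SAMPLE_TABLE)
    print(pairs)
    print(to_proxy_dicts(pairs))
```

If the real page puts other cells between the IP and the port, the pattern would need adjusting, so it is worth testing against a saved copy of the page first.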

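That completes the view-boosting script. Each run of the script is a single process, and if that process hits a problem the whole program stops, so here is a small C program to go with it.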

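#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *cmd = "python /home/book/csdn.py";  /* path to the Python view-boosting script for CSDN */

	while (1)
	{
		/* system() runs the script as one process and returns when it exits.
		 * If that process fails, the loop starts a new one right away. One run
		 * takes roughly half an hour, so CSDN's time check no longer matters:
		 * views per day = IPs * number of posts * 24 * 2 */
		system(cmd);
	}
	return 0;
}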
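For comparison, the same rerun-on-exit loop can be written in Python with the standard subprocess module. This is a sketch, not the author's program: the function name run_with_restarts and the max_runs bound are additions for illustration, and the bound means the loop ends instead of running forever.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_runs=3):
    """Run `cmd` up to `max_runs` times in a row and return the exit codes."""
    exit_codes = []
    for _ in range(max_runs):
        # subprocess.run blocks until the child process exits, like system() in C
        completed = subprocess.run(cmd)
        exit_codes.append(completed.returncode)
    return exit_codes


if __name__ == '__main__':
    # Stand-in child process that exits with status 3 each time
    child = [sys.executable, '-c', 'import sys; sys.exit(3)']
    print(run_with_restarts(child, max_runs=2))
```

Collecting the exit codes makes it easy to see afterwards whether the runs ended normally or failed.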

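One last, fairly reliable approach: get hold of compromised machines (肉鸡) and have them run our script, which is safe and dependable.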

Time: 2024-10-07 18:34:29
