A Small Python Web Crawler

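  I recently injured my right arm and have been resting at home with it in a cast. To follow through on an idea I had earlier, I typed, wrote code, and looked things up with my left hand and finished this small resource crawler. For a web crawler, the most important part is protocol analysis (you must be clear about what you are after); beyond that, you need to think about how to classify and store the crawled data. This is a crawler for an online music site. I won't name the site; this post is for technical exchange only, so please do not use it for any other purpose!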

Related techniques: HTTP, JS, AES, file and folder operations, regular expressions, databases, SQL

-------------------------------------------Divider: design approach below---------------------------------------------

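Approach:

1. Open Fiddler, open the site, and play a track online

2. Observe how the web player loads its resources

3. Search the response bodies of the requests for the resource information

4. Analyze the data format of the resource information and find the pattern

5. Filter out the data that is needed, then have the crawler download and store it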

-------------------------------------------Divider: problems encountered and how they were solved, recorded for later use---------------------------------------------

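Problem 1: the source of the resource data could not be traced

  While following the traffic I found the path from which the player loads its resources, but searching every response returned by the server turned up nothing that matched. With no other option, I traced the site's HTML and JS source, combining static analysis with in-browser debugging, and finally got the answer: the playlist the page uses is AES-encrypted, and the key is sent to the client (the browser) along with the page.

  ps: eval("var playlist = " + GibberishAES.dec(pl, $.xxxx.xxx.aes))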
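
As a minimal sketch of that finding: once it is known that the page carries both the encrypted playlist and the key, the two values can be pulled out of the HTML with regular expressions. The page fragment below is made up for illustration. Only the variable name `pl` and the `"aes"` field follow the snippets quoted in this post; the surrounding markup, the `cfg` object, and the sample values are assumptions. The sketch targets Python 3.

```python
import re

# Hypothetical page fragment: only the `pl` variable and the "aes" field
# mirror what the post describes, everything else is invented.
SAMPLE_HTML = '''
<script>
var pl = "U2FsdGVkX19GtgRcaq8vxaWN8HjO+tJWZXmLBBbQv8c=";
var cfg = {"player":{"aes":"12345678"}};
</script>
'''

PL_RE = re.compile(r'\bpl\s*=\s*"([^"]*)"')
KEY_RE = re.compile(r'"aes"\s*:\s*"([^"]*)"')


def extract_playlist_and_key(html):
    """Return (ciphertext, key), or (None, None) if either one is missing."""
    pl_match = PL_RE.search(html)
    key_match = KEY_RE.search(html)
    if pl_match is None or key_match is None:
        return None, None
    return pl_match.group(1), key_match.group(1)


if __name__ == "__main__":
    print(extract_playlist_and_key(SAMPLE_HTML))
```

Matching up to the next double quote (`[^"]*`) rather than using a greedy `.*` keeps each capture inside its own string literal even when several values sit on one line.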

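Problem 2: with both the ciphertext and the key known, decryption still failed in Python, Java, and C#

  The AES ciphertext and key taken from the page can be decrypted directly with an online AES tool (http://tool.chinaz.com/Tools/TextEncrypt.aspx). Since this is AES, I installed the Crypto package and tried encrypting and decrypting with it. In practice, CBC mode with a 16-character (16-byte) key and a 16-byte IV identical to the key could not reproduce the web page's results (my own code could not decrypt the playlist ciphertext). I then set Java and C# to the same mode and the same key, and C#, Java, and Python could all encrypt and decrypt each other's output, but the results still differed from the JS side (CryptoJS). I was stuck here for more than a day.

With no alternative, I gritted my teeth and traced the online decryption site's JS, and found that when CryptoJS encrypts or decrypts, it derives a new key and IV from the original key through some algorithm (OpenSSL's KDF?), with a salt mixed in. Unfortunately my JS is not good enough to follow that algorithm, so I could not reimplement it in another language. While analyzing the CryptoJS encryption process I noticed something else: the same plaintext and key give a different result on every encryption (random salting?). Although the ciphertext differs each time, the same key always decrypts it, and every ciphertext (Base64) starts with U2FsdGVkX1. (Quite magical. From what I found online, the salt, a random value, is probably embedded in the ciphertext. A friend has implemented a similar algorithm in C#, see http://www.cnblogs.com/stone_w/p/4229275.html)

  ps:

    Same plaintext and key, different ciphertexts (random IV, salt?):

    Online encryption/decryption: http://tool.chinaz.com/Tools/TextEncrypt.aspx

    text:  abcdefgcnblogs

    key:  12345678

    Result 1: U2FsdGVkX19GtgRcaq8vxaWN8HjO+tJWZXmLBBbQv8c=

    Result 2: U2FsdGVkX18y6MsDEIxKqT0gS0AsoFsa9trUwnFzH5c=

    Result 3: U2FsdGVkX1/2rk/eD0MuFfJEe076aogxPnOacIDpqNs=

    Result N: ...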
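
    The common prefix has a simple explanation: U2FsdGVkX1 is what the ASCII header "Salted__" looks like in Base64, and in OpenSSL's salted format the 8 bytes after that header are the random salt. The sketch below (Python 3, standard library only) splits such a blob and derives a key and IV in the style of OpenSSL's EVP_BytesToKey with a single MD5 iteration. The sizes used, a 32-byte key and a 16-byte IV, are read off the CryptoJS v3.0.2 source listed at the end of this post (`keySize:8` and `ivSize:4`, in 32-bit words, with `EvpKDF` defaulting to MD5 and one iteration); treat them as an assumption for any other build. The function names are illustrative, and the final AES-CBC step is deliberately left out so the sketch needs no third-party package.

```python
import base64
import hashlib

OPENSSL_MAGIC = b"Salted__"  # Base64-encodes to the "U2FsdGVkX1" prefix


def split_openssl_blob(b64_text):
    """Split an OpenSSL-style salted blob into (salt, ciphertext)."""
    raw = base64.b64decode(b64_text)
    if raw[:8] != OPENSSL_MAGIC:
        raise ValueError("missing 'Salted__' header")
    return raw[8:16], raw[16:]


def evp_bytes_to_key(password, salt, key_len=32, iv_len=16):
    """Derive (key, iv) by chaining MD5(previous_block + password + salt)."""
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        block = hashlib.md5(block + password + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]


if __name__ == "__main__":
    # "Result 1" from the example above, with its passphrase "12345678".
    salt, ciphertext = split_openssl_blob(
        "U2FsdGVkX19GtgRcaq8vxaWN8HjO+tJWZXmLBBbQv8c=")
    key, iv = evp_bytes_to_key(b"12345678", salt)
    print(len(salt), len(ciphertext), len(key), len(iv))
```

    With the derived key and IV, the remaining step would be an ordinary AES-CBC decryption of `ciphertext` followed by removal of the padding. This also suggests why a fixed key with IV = key could not match the page: the key actually fed to AES is the derived one, not the passphrase itself.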

    

    Java version of AES:

/**
 *
 * @author Jacker
 */

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import sun.misc.BASE64Decoder;

public class Encryption
{
    public static void main(String args[]) throws Exception {
        System.out.println(encrypt());
        System.out.println(desEncrypt());
    }

    public static String encrypt() throws Exception {
        try {
            String data = "123456abcdefgyhn123456abcdefgyhn";
            String key = "b6ce159334e155d8";
            String iv = "b6ce159334e155d8";

            Cipher cipher = Cipher.getInstance("AES/CBC/NoPadding");
            int blockSize = cipher.getBlockSize();

            byte[] dataBytes = data.getBytes();
            int plaintextLength = dataBytes.length;
            if (plaintextLength % blockSize != 0) {
                plaintextLength = plaintextLength + (blockSize - (plaintextLength % blockSize));
            }

            byte[] plaintext = new byte[plaintextLength];
            System.arraycopy(dataBytes, 0, plaintext, 0, dataBytes.length);

            SecretKeySpec keyspec = new SecretKeySpec(key.getBytes(), "AES");
            IvParameterSpec ivspec = new IvParameterSpec(iv.getBytes());

            cipher.init(Cipher.ENCRYPT_MODE, keyspec, ivspec);
            byte[] encrypted = cipher.doFinal(plaintext);

            return new sun.misc.BASE64Encoder().encode(encrypted);

        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public static String desEncrypt() throws Exception {
        try
        {
            String data = "JooiPOsR21TbUksOLu21kZcR15RbEFtAhYn6VKdRoJw=";
            String key = "b6ce159334e155d8";
            String iv = "b6ce159334e155d8";

            byte[] encrypted1 = new BASE64Decoder().decodeBuffer(data);

            Cipher cipher = Cipher.getInstance("AES/CBC/NoPadding");
            SecretKeySpec keyspec = new SecretKeySpec(key.getBytes(), "AES");
            IvParameterSpec ivspec = new IvParameterSpec(iv.getBytes());

            cipher.init(Cipher.DECRYPT_MODE, keyspec, ivspec);

            byte[] original = cipher.doFinal(encrypted1);
            String originalString = new String(original);
            return originalString;
        }
        catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }
}

  C# version of AES:

using System;
using System.Text;
using System.Security.Cryptography;
using System.IO;

namespace ConsoleAppAESEncrypt
{
    /// <summary>
    /// DES encryption and decryption
    /// </summary>
    public class DES
    {
        /// <summary>
        /// Gets the key
        /// </summary>
        private static string Key
        {
            get { return @"[email protected]+#wG+Z"; }
        }

        /// <summary>
        /// Gets the initialization vector (IV)
        /// </summary>
        private static string IV
        {
            get { return @"L%n67}G\[email protected]%:~Y"; }
        }

        /// <summary>
        /// DES encryption
        /// </summary>
        /// <param name="plainStr">Plaintext string</param>
        /// <returns>Ciphertext</returns>
        public static string DESEncrypt(string plainStr)
        {
            byte[] bKey = Encoding.UTF8.GetBytes(Key);
            byte[] bIV = Encoding.UTF8.GetBytes(IV);
            byte[] byteArray = Encoding.UTF8.GetBytes(plainStr);

            string encrypt = null;
            DESCryptoServiceProvider des = new DESCryptoServiceProvider();
            try
            {
                using (MemoryStream mStream = new MemoryStream())
                {
                    using (CryptoStream cStream = new CryptoStream(mStream, des.CreateEncryptor(bKey, bIV), CryptoStreamMode.Write))
                    {
                        cStream.Write(byteArray, 0, byteArray.Length);
                        cStream.FlushFinalBlock();
                        encrypt = Convert.ToBase64String(mStream.ToArray());
                    }
                }
            }
            catch { }
            des.Clear();

            return encrypt;
        }

        /// <summary>
        /// DES decryption
        /// </summary>
        /// <param name="encryptStr">Ciphertext string</param>
        /// <returns>Plaintext</returns>
        public static string DESDecrypt(string encryptStr)
        {
            byte[] bKey = Encoding.UTF8.GetBytes(Key);
            byte[] bIV = Encoding.UTF8.GetBytes(IV);
            byte[] byteArray = Convert.FromBase64String(encryptStr);

            string decrypt = null;
            DESCryptoServiceProvider des = new DESCryptoServiceProvider();
            try
            {
                using (MemoryStream mStream = new MemoryStream())
                {
                    using (CryptoStream cStream = new CryptoStream(mStream, des.CreateDecryptor(bKey, bIV), CryptoStreamMode.Write))
                    {
                        cStream.Write(byteArray, 0, byteArray.Length);
                        cStream.FlushFinalBlock();
                        decrypt = Encoding.UTF8.GetString(mStream.ToArray());
                    }
                }
            }
            catch { }
            des.Clear();

            return decrypt;
        }
    }

    /// <summary>
    /// AES encryption and decryption
    /// </summary>
    public class AES
    {
        /// <summary>
        /// Gets the key
        /// </summary>
        private static string Key
        {
          //  get { return @")O[NB]6,YF}+efcaj{+oESb9d8>Z‘e9M"; }
            get { return @"b6ce159334e155d8"; }
        }

        /// <summary>
        /// Gets the initialization vector (IV)
        /// </summary>
        private static string IV
        {
           // get { return @"L+\~f4,Ir)b$=pkf"; }
            get { return @"b6ce159334e155d8"; }
        }

        /// <summary>
        /// AES encryption
        /// </summary>
        /// <param name="plainStr">Plaintext string</param>
        /// <returns>Ciphertext</returns>
        public static string AESEncrypt(string plainStr)
        {
            byte[] bKey = Encoding.UTF8.GetBytes(Key);
            byte[] bIV = Encoding.UTF8.GetBytes(IV);
            byte[] byteArray = Encoding.UTF8.GetBytes(plainStr);

            string encrypt = null;
            Rijndael aes = Rijndael.Create();
            aes.Mode = CipherMode.CBC;
            aes.Padding = PaddingMode.Zeros;
            try
            {
                using (MemoryStream mStream = new MemoryStream())
                {
                    using (CryptoStream cStream = new CryptoStream(mStream, aes.CreateEncryptor(bKey, bIV), CryptoStreamMode.Write))
                    {
                        cStream.Write(byteArray, 0, byteArray.Length);
                        cStream.FlushFinalBlock();
                        encrypt = Convert.ToBase64String(mStream.ToArray());
                    }
                }
            }
            catch { }
            aes.Clear();

            return encrypt;
        }

        /// <summary>
        /// AES encryption
        /// </summary>
        /// <param name="plainStr">Plaintext string</param>
        /// <param name="returnNull">Whether to return null when encryption fails; if false, String.Empty is returned</param>
        /// <returns>Ciphertext</returns>
        public static string AESEncrypt(string plainStr, bool returnNull)
        {
            string encrypt = AESEncrypt(plainStr);
            return returnNull ? encrypt : (encrypt == null ? String.Empty : encrypt);
        }

        /// <summary>
        /// AES decryption
        /// </summary>
        /// <param name="encryptStr">Ciphertext string</param>
        /// <returns>Plaintext</returns>
        public static string AESDecrypt(string encryptStr)
        {
            byte[] bKey = Encoding.UTF8.GetBytes(Key);
            byte[] bIV = Encoding.UTF8.GetBytes(IV);
            byte[] byteArray = Convert.FromBase64String(encryptStr);

            string decrypt = null;
            Rijndael aes = Rijndael.Create();
            aes.Mode = CipherMode.CBC;
            aes.Padding = PaddingMode.Zeros;
            try
            {
                using (MemoryStream mStream = new MemoryStream())
                {
                    using (CryptoStream cStream = new CryptoStream(mStream, aes.CreateDecryptor(bKey, bIV), CryptoStreamMode.Write))
                    {
                        cStream.Write(byteArray, 0, byteArray.Length);
                        cStream.FlushFinalBlock();
                        decrypt = Encoding.UTF8.GetString(mStream.ToArray());
                    }
                }
            }
            catch { }
            aes.Clear();

            return decrypt;
        }

        /// <summary>
        /// AES decryption
        /// </summary>
        /// <param name="encryptStr">Ciphertext string</param>
        /// <param name="returnNull">Whether to return null when decryption fails; if false, String.Empty is returned</param>
        /// <returns>Plaintext</returns>
        public static string AESDecrypt(string encryptStr, bool returnNull)
        {
            string decrypt = AESDecrypt(encryptStr);
            return returnNull ? decrypt : (decrypt == null ? String.Empty : decrypt);
        }
    }
}

  Python version of AES:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Crypto.Cipher import AES
import base64
# PADDING = '\0'
PADDING = ' '
# pad_it = lambda s: s + (16 - len(s) % 16) * PADDING
# pad_it = lambda s: s# + (32 - len(s) % 32) * PADDING

def pad_it(text):
    length = 16
    # count = text.count('')
    count = len(text)  # .count('')
    padcount = (count % length)
    if padcount == 0:
        return text

    add = length - padcount
    text = text + (PADDING * add)
    return text

key = 'b6ce159334e155d8'
iv = 'b6ce159334e155d8'
source = '123456abcdefgyhn123456abcdefgyhn'

print 'source len:', len(source), source

generator = AES.new(key, AES.MODE_CBC, iv)
source = pad_it(source)

crypt = generator.encrypt(source)
cryptedStr = base64.b64encode(crypt)
print cryptedStr
generator = AES.new(key, AES.MODE_CBC, iv)
recovery = generator.decrypt(crypt)

print len(recovery), len(recovery.rstrip(PADDING)), recovery.rstrip(PADDING)
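
The three listings above pad the plaintext with zero bytes or spaces, whereas the CryptoJS build reproduced in aes.js at the end of this post sets PKCS#7 as the default padding for block ciphers (`pad.Pkcs7`). That is one more reason the outputs differ even with identical keys. As a rough illustration, PKCS#7 can be written in a few lines of Python 3; the function names here are made up for the sketch.

```python
def pkcs7_pad(data, block_size=16):
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(data, block_size=16):
    """Strip PKCS#7 padding, raising ValueError if it is malformed."""
    if not data or len(data) % block_size != 0:
        raise ValueError("invalid padded length")
    pad_len = data[-1]
    if pad_len < 1 or pad_len > block_size:
        raise ValueError("invalid PKCS#7 padding")
    if data[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid PKCS#7 padding")
    return data[:-pad_len]


if __name__ == "__main__":
    padded = pkcs7_pad(b"abcdefgcnblogs")
    print(padded, pkcs7_unpad(padded))
```

Unlike space or zero padding, PKCS#7 always adds at least one byte (a full block when the input is already aligned), so it can be removed unambiguously even if the plaintext itself ends in spaces or zero bytes.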

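Problem 3: how to call JS from within Python to do the decryption

  After a day stuck on AES decryption with no solution, I asked some friends in a chat group. Thanks to @寻找和谐 for the idea: load and execute the JS in a virtual machine or by some other means, and let it do the decryption. That was a real wake-up call. From what I found online, there are at least three ways to execute JS from Python. Since my system is Windows, I took the simpler route of calling the MSScriptControl.ScriptControl COM control (this requires installing pywin32-214.win32-py2.7). References:

http://www.360doc.com/content/13/0318/15/7492958_272244611.shtml

http://blog.chinaunix.net/uid-9407860-id-2423996.html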

-------------------------------------------Divider: code for this crawler below---------------------------------------------

Note: besides fetching the audio files, the crawler also stores the parsed data in SQLite (the table-creation script is in the .py code)
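
The storage step follows a delete-then-insert pattern, so re-running the crawler for the same album replaces the old row instead of failing on the primary key. A minimal sketch of that pattern in Python 3, using an in-memory database: the table definition follows the `AlbumInfo` schema quoted in luomusic.py, while the `save_album` helper and the sample values are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS "AlbumInfo" (
    "albumindex" INTEGER PRIMARY KEY NOT NULL,
    "title" TEXT, "desc" TEXT, "playlist" TEXT, "dir" TEXT
);
"""


def save_album(conn, albumindex, title, desc, playlist, folder):
    """Replace any existing row for this album, then commit."""
    with conn:  # commits on success, rolls back on error
        conn.execute('DELETE FROM "AlbumInfo" WHERE albumindex = ?',
                     (albumindex,))
        conn.execute('INSERT INTO "AlbumInfo" VALUES (?, ?, ?, ?, ?)',
                     (albumindex, title, desc, playlist, folder))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    save_album(conn, 23, "first title", "desc", "[]", "LuooMusic/AudioFile/23")
    save_album(conn, 23, "second title", "desc", "[]", "LuooMusic/AudioFile/23")
    print(conn.execute('SELECT albumindex, title FROM "AlbumInfo"').fetchall())
```

Passing values through `?` placeholders rather than formatting them into the SQL string keeps titles and descriptions containing quotes from breaking the statement.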

Code structure: pywinjsaes.py, luomusic.py, jsfiles/aesutil.js, jsfiles/aes.js

Related code:

pywinjsaes.py

#! /usr/bin/env python
# coding=utf-8

import win32com.server.util, win32com.client

# The following works around garbled (mis-encoded) output
import sys
# print sys.getdefaultencoding()
reload(sys)
sys.setdefaultencoding('utf8')
# print sys.getdefaultencoding()

class __PyWinJsAes:

    def __init__(self):

        # Host the JS engine through the Windows Script Control (COM)
        js = win32com.client.Dispatch('MSScriptControl.ScriptControl')
        js.Language = 'JavaScript'
        js.AllowUI = False
        js.AddCode(self.__readJsFile("jsfiles/aes.js"))
        js.AddCode(self.__readJsFile("jsfiles/aesutil.js"))
        self.jsengine = js

    def __readJsFile(self, filename):

        # Read the whole JS file and make sure it gets closed
        with open(filename, 'r') as fp:
            return fp.read()

    def __driveJsCode(self, func, paras):

        if paras:
            return self.jsengine.Run(func, paras[0], paras[1])
        else:
            return self.jsengine.Run(func)

    def encrypt(self, text, key):
        return self.__driveJsCode("DoAesEncrypt", [text, key])

    def decrypt(self, text, key):
#         print text,key
        return self.__driveJsCode("DoAesDecrypt", [text, key])

JsAes = __PyWinJsAes()

if __name__ == '__main__':

    p = JsAes.decrypt("U2FsdGVkX19FDZhhIeMCH9SHfLg8B34NUbWxnuRFtc++fkhyKov9urtLuG7qatqm TP2/LEy+g35Jarbm5KoGCg==",
                    "456")
    print  '*' * 20
    print p

# js.run

luomusic.py

#! /usr/bin/env python
# coding=utf-8

import requests
import json
import re
import struct
import base64
import sqlite3
import os
import urllib
import urllib2
import time
import random

from pywinjsaes import JsAes

# The following works around garbled (mis-encoded) output
import sys
# print sys.getdefaultencoding()
reload(sys)
sys.setdefaultencoding('utf8')
# print sys.getdefaultencoding()

# Characters that are not allowed in Windows file names
re_win_filename_pattern = '\\\|/|:|\*|\?|\"|<|>|\|'
re_win_filename = re.compile(re_win_filename_pattern)

class LuoMusic:

    def __init__(self):
        self.RootDir = "LuooMusic/"
        self.AudioDir = self.RootDir + "AudioFile/"
        self.ReqHeader = self.__initReqHeader()
        self.ReqRootUrl = "http://www.luoo.net/music/"

        self.dbcx = sqlite3.connect(self.RootDir + "luomusic.sqlite")
        pass

    def __initReqHeader(self):

        # Mozilla/5.0 (Windows NT 6.3; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0
        headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0'
                    }
        return headers

    def getAlbumInfoByIndex(self, index):

        errRet = (None, None, None)

        # Send the request
        r = requests.get(self.ReqRootUrl + str(index), headers=self.ReqHeader)
        r.encoding = 'utf-8'
#         print r.encoding, r.text
        luohtml = r.text

        # Parse the album information out of the HTML with regular expressions

        # Title: <span class="vol-title">关于失眠和夜晚的世界</span>
        titlepattern = '"vol-title">(.*)</span>'
        titlelist = re.compile(titlepattern).findall(luohtml)
        if len(titlelist) == 0:
            return errRet
        title = titlelist[0]

        # Description: "vol-desc">本期音乐为台湾后摇音乐专题。<br></div><div class="clearfix vol-meta">
        descpattern = '<div class="vol-desc">([\s\S]*)</div>\s*<div class="clearfix vol-meta">'
        desclist = re.compile(descpattern).findall(luohtml)
        if len(desclist) == 0:
            return errRet
        desc = desclist[0]

        re_br = re.compile('<br\s*?/?>')  # turn <br> tags into line breaks
        desc = re_br.sub('\n', desc)
        # re_br=re.compile('<br\s*?/?>|\s')  # remove all line breaks, spaces and <br/>
#         desc=re_br.sub('',desc)
#         print "desc:", len(desc), desc

        # Encrypted playlist: var pl = "U2FsdGVkX1/";
        playListPattern = 'pl\s*=\s*\"(\S*)";'
        playListStrs = re.compile(playListPattern).findall(luohtml)
        if len(playListStrs) == 0:
            return errRet
        playList = playListStrs[0]

        # Encryption key: "aes":"b6ce159334e155d8"}}']
        aeskeypattern = '"aes":"(.*)"\}\}'
        aeskeylist = re.compile(aeskeypattern).findall(luohtml)
        if len(aeskeylist) == 0:
            return errRet
        aeskey = aeskeylist[0]

        # Decrypt and return the result
        playList = JsAes.decrypt(playList, aeskey)
        return title, desc, playList

    def saveAlbumInfo(self, ablumindex, title, desc , playList):
        """
        CREATE TABLE "AlbumInfo" ("albumindex" INTEGER  PRIMARY KEY  NOT NULL  DEFAULT (null) ,"title" TEXT,"desc" TEXT,"playlist" TEXT, "dir" TEXT)
        CREATE TABLE "MusicInfo" ("albumindex" INTEGER NOT NULL , "musicindex" INTEGER NOT NULL ,"filename" TEXT,"dir" TEXT, "id" TEXT, "title" TEXT, "artist" TEXT, "album" TEXT, "mp3" TEXT, "poster" TEXT, "poster_small" TEXT, "is_fav" INTEGER, PRIMARY KEY ("albumindex", "musicindex"))

        """

        # Create the album folder, which also holds the album information
        foldername = '%s%s-%s' % (self.AudioDir, ablumindex, title)
        if not os.path.exists(foldername) and not os.path.isdir(foldername):
            print 'Creating folder:', foldername
            os.mkdir(foldername)

        # Also write the album information to a text file in the album folder
        abluminftxt = foldername + '/0-%s.txt' % title
        print 'Creating file:', abluminftxt
        file_object = open(abluminftxt, 'w')
        try:
            file_object.writelines('%s<$>%s<$>%s<$>%s' % (ablumindex, title, desc , playList))
        finally:
            file_object.close()

        # Loop over the songs and save each file
        songinsvalues = []
        songdelvalues = []
        songlistjson = json.loads(playList)
        songindex = 0

        for s in songlistjson:
            songindex = songindex + 1
            url = s["mp3"]
            # Use a separate variable so the album title is not overwritten
            songtitle = re_win_filename.sub('', s["title"])
            filename = songtitle + os.path.splitext(url)[-1]
            localfilepath = foldername + '/' + filename
            print "Downloading-%s:%s => %s  " % (songindex, url, localfilepath)

            # Pause between downloads; take a longer break every fifth song
            if songindex % 5 == 0:
                time.sleep(12)
            else:
                time.sleep(random.randint(4, 7))

            try:

#                 # Download with urllib
#                 urllib.urlretrieve(url, localfilepath)

#                 # Download with urllib2
#                 f = urllib2.urlopen(url)
#                 data = f.read()
#                 with open(localfilepath, "wb") as code:
#                     code.write(data)

                # Download with requests
                r = requests.get(url, timeout=20, headers=self.ReqHeader)
                with open(localfilepath, "wb") as code:
                    code.write(r.content)

            except Exception as e:
                print  e
                print 'Download failed, moving on to the next one'
                filename = ''
                time.sleep(15)

            # Collect the song information in lists for the database step
            songdelvalues.append((ablumindex, songindex))
            songinsvalues.append((ablumindex, songindex, filename, foldername, s["id"], songtitle,
                                  s["artist"], s["album"], s["mp3"], s["poster"], s["poster_small"], s["is_fav"]))

        # Record the information in the database
        cur = self.dbcx.cursor()

        # Album information
        delSql = "delete from AlbumInfo where albumindex=?;"
        cur.execute(delSql, (ablumindex,))

        insSql = 'insert into AlbumInfo values(?,?,?,?,?);'
        cur.execute(insSql, (ablumindex, title, desc , playList, foldername))

        print 'Committing album information to the database...'
        self.dbcx.commit()

        # Song information
        delSql = "delete from MusicInfo where albumindex=? and musicindex=?;"
        cur.executemany(delSql, songdelvalues)

        insSql = 'insert into MusicInfo values(?,?,?,?,?,?,?,?,?,?,?,?);'
        cur.executemany(insSql, songinsvalues)

        print 'Committing song list information to the database...'
        self.dbcx.commit()

        # Close the cursor
        cur.close()

if __name__ == "__main__":

    luo = LuoMusic()

    for ablumindex in range(23, 100):

        print '*' * 20, 'Processing album %s' % ablumindex, '*' * 20

        title, desc, playList = luo.getAlbumInfoByIndex(ablumindex)
        if title is None:
            print 'Processing failed, the parse result is None!'
            continue

        title = re_win_filename.sub('', title)
        print 'Parsed successfully, album title:', title

        print 'Saving album information...'
        luo.saveAlbumInfo(ablumindex, title, desc , playList)
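
The crawler above builds each local file name from the song title and the extension of its mp3 URL, after removing the characters Windows forbids in file names. Pulled out on its own and written for Python 3, that step might look like the sketch below. The character-class form of the pattern and the `local_filename` helper are illustrative rewrites, not the code used above, and the example URL is invented.

```python
import os
import re

# Characters Windows does not allow in file names: \ / : * ? " < > |
WIN_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def local_filename(title, url):
    """Build a local file name from a song title and the extension of its URL."""
    safe_title = WIN_FORBIDDEN.sub("", title)
    extension = os.path.splitext(url)[-1]
    return safe_title + extension


if __name__ == "__main__":
    print(local_filename('What? A/B: "Test"', "http://example.com/audio/01.mp3"))
```

One limitation of this approach: `os.path.splitext` works on the raw URL, so a URL with a query string would yield an extension that includes the query part. Parsing the URL first would be needed in that case.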

aesutil.js

function DoAesDecrypt(text, key) {

	return CryptoJS.AES.decrypt(text, key).toString(CryptoJS.enc.Utf8);
};

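function DoAesEncrypt(text, key) {

	// encrypt() returns a CipherParams object; its default toString()
	// uses the OpenSSL formatter and yields the Base64 string. Passing
	// CryptoJS.enc.Utf8 here would be treated as a formatter and would
	// not produce the ciphertext.
	return CryptoJS.AES.encrypt(text, key).toString();

};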

aes.js

/*
CryptoJS v3.0.2
code.google.com/p/crypto-js
(c) 2009-2012 by Jeff Mott. All rights reserved.
code.google.com/p/crypto-js/wiki/License
*/
var CryptoJS=CryptoJS||function(p,h){var i={},l=i.lib={},r=l.Base=function(){function a(){}return{extend:function(e){a.prototype=this;var c=new a;e&&c.mixIn(e);c.$super=this;return c},create:function(){var a=this.extend();a.init.apply(a,arguments);return a},init:function(){},mixIn:function(a){for(var c in a)a.hasOwnProperty(c)&&(this[c]=a[c]);a.hasOwnProperty("toString")&&(this.toString=a.toString)},clone:function(){return this.$super.extend(this)}}}(),o=l.WordArray=r.extend({init:function(a,e){a=
this.words=a||[];this.sigBytes=e!=h?e:4*a.length},toString:function(a){return(a||s).stringify(this)},concat:function(a){var e=this.words,c=a.words,b=this.sigBytes,a=a.sigBytes;this.clamp();if(b%4)for(var d=0;d<a;d++)e[b+d>>>2]|=(c[d>>>2]>>>24-8*(d%4)&255)<<24-8*((b+d)%4);else if(65535<c.length)for(d=0;d<a;d+=4)e[b+d>>>2]=c[d>>>2];else e.push.apply(e,c);this.sigBytes+=a;return this},clamp:function(){var a=this.words,e=this.sigBytes;a[e>>>2]&=4294967295<<32-8*(e%4);a.length=p.ceil(e/4)},clone:function(){var a=
r.clone.call(this);a.words=this.words.slice(0);return a},random:function(a){for(var e=[],c=0;c<a;c+=4)e.push(4294967296*p.random()|0);return o.create(e,a)}}),m=i.enc={},s=m.Hex={stringify:function(a){for(var e=a.words,a=a.sigBytes,c=[],b=0;b<a;b++){var d=e[b>>>2]>>>24-8*(b%4)&255;c.push((d>>>4).toString(16));c.push((d&15).toString(16))}return c.join("")},parse:function(a){for(var e=a.length,c=[],b=0;b<e;b+=2)c[b>>>3]|=parseInt(a.substr(b,2),16)<<24-4*(b%8);return o.create(c,e/2)}},n=m.Latin1={stringify:function(a){for(var e=
a.words,a=a.sigBytes,c=[],b=0;b<a;b++)c.push(String.fromCharCode(e[b>>>2]>>>24-8*(b%4)&255));return c.join("")},parse:function(a){for(var e=a.length,c=[],b=0;b<e;b++)c[b>>>2]|=(a.charCodeAt(b)&255)<<24-8*(b%4);return o.create(c,e)}},k=m.Utf8={stringify:function(a){try{return decodeURIComponent(escape(n.stringify(a)))}catch(e){throw Error("Malformed UTF-8 data");}},parse:function(a){return n.parse(unescape(encodeURIComponent(a)))}},f=l.BufferedBlockAlgorithm=r.extend({reset:function(){this._data=o.create();
this._nDataBytes=0},_append:function(a){"string"==typeof a&&(a=k.parse(a));this._data.concat(a);this._nDataBytes+=a.sigBytes},_process:function(a){var e=this._data,c=e.words,b=e.sigBytes,d=this.blockSize,q=b/(4*d),q=a?p.ceil(q):p.max((q|0)-this._minBufferSize,0),a=q*d,b=p.min(4*a,b);if(a){for(var j=0;j<a;j+=d)this._doProcessBlock(c,j);j=c.splice(0,a);e.sigBytes-=b}return o.create(j,b)},clone:function(){var a=r.clone.call(this);a._data=this._data.clone();return a},_minBufferSize:0});l.Hasher=f.extend({init:function(){this.reset()},
reset:function(){f.reset.call(this);this._doReset()},update:function(a){this._append(a);this._process();return this},finalize:function(a){a&&this._append(a);this._doFinalize();return this._hash},clone:function(){var a=f.clone.call(this);a._hash=this._hash.clone();return a},blockSize:16,_createHelper:function(a){return function(e,c){return a.create(c).finalize(e)}},_createHmacHelper:function(a){return function(e,c){return g.HMAC.create(a,c).finalize(e)}}});var g=i.algo={};return i}(Math);
(function(){var p=CryptoJS,h=p.lib.WordArray;p.enc.Base64={stringify:function(i){var l=i.words,h=i.sigBytes,o=this._map;i.clamp();for(var i=[],m=0;m<h;m+=3)for(var s=(l[m>>>2]>>>24-8*(m%4)&255)<<16|(l[m+1>>>2]>>>24-8*((m+1)%4)&255)<<8|l[m+2>>>2]>>>24-8*((m+2)%4)&255,n=0;4>n&&m+0.75*n<h;n++)i.push(o.charAt(s>>>6*(3-n)&63));if(l=o.charAt(64))for(;i.length%4;)i.push(l);return i.join("")},parse:function(i){var i=i.replace(/\s/g,""),l=i.length,r=this._map,o=r.charAt(64);o&&(o=i.indexOf(o),-1!=o&&(l=o));
for(var o=[],m=0,s=0;s<l;s++)if(s%4){var n=r.indexOf(i.charAt(s-1))<<2*(s%4),k=r.indexOf(i.charAt(s))>>>6-2*(s%4);o[m>>>2]|=(n|k)<<24-8*(m%4);m++}return h.create(o,m)},_map:"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="}})();
(function(p){function h(f,g,a,e,c,b,d){f=f+(g&a|~g&e)+c+d;return(f<<b|f>>>32-b)+g}function i(f,g,a,e,c,b,d){f=f+(g&e|a&~e)+c+d;return(f<<b|f>>>32-b)+g}function l(f,g,a,e,c,b,d){f=f+(g^a^e)+c+d;return(f<<b|f>>>32-b)+g}function r(f,g,a,e,c,b,d){f=f+(a^(g|~e))+c+d;return(f<<b|f>>>32-b)+g}var o=CryptoJS,m=o.lib,s=m.WordArray,m=m.Hasher,n=o.algo,k=[];(function(){for(var f=0;64>f;f++)k[f]=4294967296*p.abs(p.sin(f+1))|0})();n=n.MD5=m.extend({_doReset:function(){this._hash=s.create([1732584193,4023233417,
2562383102,271733878])},_doProcessBlock:function(f,g){for(var a=0;16>a;a++){var e=g+a,c=f[e];f[e]=(c<<8|c>>>24)&16711935|(c<<24|c>>>8)&4278255360}for(var e=this._hash.words,c=e[0],b=e[1],d=e[2],q=e[3],a=0;64>a;a+=4)16>a?(c=h(c,b,d,q,f[g+a],7,k[a]),q=h(q,c,b,d,f[g+a+1],12,k[a+1]),d=h(d,q,c,b,f[g+a+2],17,k[a+2]),b=h(b,d,q,c,f[g+a+3],22,k[a+3])):32>a?(c=i(c,b,d,q,f[g+(a+1)%16],5,k[a]),q=i(q,c,b,d,f[g+(a+6)%16],9,k[a+1]),d=i(d,q,c,b,f[g+(a+11)%16],14,k[a+2]),b=i(b,d,q,c,f[g+a%16],20,k[a+3])):48>a?(c=
l(c,b,d,q,f[g+(3*a+5)%16],4,k[a]),q=l(q,c,b,d,f[g+(3*a+8)%16],11,k[a+1]),d=l(d,q,c,b,f[g+(3*a+11)%16],16,k[a+2]),b=l(b,d,q,c,f[g+(3*a+14)%16],23,k[a+3])):(c=r(c,b,d,q,f[g+3*a%16],6,k[a]),q=r(q,c,b,d,f[g+(3*a+7)%16],10,k[a+1]),d=r(d,q,c,b,f[g+(3*a+14)%16],15,k[a+2]),b=r(b,d,q,c,f[g+(3*a+5)%16],21,k[a+3]));e[0]=e[0]+c|0;e[1]=e[1]+b|0;e[2]=e[2]+d|0;e[3]=e[3]+q|0},_doFinalize:function(){var f=this._data,g=f.words,a=8*this._nDataBytes,e=8*f.sigBytes;g[e>>>5]|=128<<24-e%32;g[(e+64>>>9<<4)+14]=(a<<8|a>>>
24)&16711935|(a<<24|a>>>8)&4278255360;f.sigBytes=4*(g.length+1);this._process();f=this._hash.words;for(g=0;4>g;g++)a=f[g],f[g]=(a<<8|a>>>24)&16711935|(a<<24|a>>>8)&4278255360}});o.MD5=m._createHelper(n);o.HmacMD5=m._createHmacHelper(n)})(Math);
(function(){var p=CryptoJS,h=p.lib,i=h.Base,l=h.WordArray,h=p.algo,r=h.EvpKDF=i.extend({cfg:i.extend({keySize:4,hasher:h.MD5,iterations:1}),init:function(i){this.cfg=this.cfg.extend(i)},compute:function(i,m){for(var h=this.cfg,n=h.hasher.create(),k=l.create(),f=k.words,g=h.keySize,h=h.iterations;f.length<g;){a&&n.update(a);var a=n.update(i).finalize(m);n.reset();for(var e=1;e<h;e++)a=n.finalize(a),n.reset();k.concat(a)}k.sigBytes=4*g;return k}});p.EvpKDF=function(i,l,h){return r.create(h).compute(i,
l)}})();
CryptoJS.lib.Cipher||function(p){var h=CryptoJS,i=h.lib,l=i.Base,r=i.WordArray,o=i.BufferedBlockAlgorithm,m=h.enc.Base64,s=h.algo.EvpKDF,n=i.Cipher=o.extend({cfg:l.extend(),createEncryptor:function(b,d){return this.create(this._ENC_XFORM_MODE,b,d)},createDecryptor:function(b,d){return this.create(this._DEC_XFORM_MODE,b,d)},init:function(b,d,a){this.cfg=this.cfg.extend(a);this._xformMode=b;this._key=d;this.reset()},reset:function(){o.reset.call(this);this._doReset()},process:function(b){this._append(b);return this._process()},
finalize:function(b){b&&this._append(b);return this._doFinalize()},keySize:4,ivSize:4,_ENC_XFORM_MODE:1,_DEC_XFORM_MODE:2,_createHelper:function(){return function(b){return{encrypt:function(a,q,j){return("string"==typeof q?c:e).encrypt(b,a,q,j)},decrypt:function(a,q,j){return("string"==typeof q?c:e).decrypt(b,a,q,j)}}}}()});i.StreamCipher=n.extend({_doFinalize:function(){return this._process(!0)},blockSize:1});var k=h.mode={},f=i.BlockCipherMode=l.extend({createEncryptor:function(b,a){return this.Encryptor.create(b,
a)},createDecryptor:function(b,a){return this.Decryptor.create(b,a)},init:function(b,a){this._cipher=b;this._iv=a}}),k=k.CBC=function(){function b(b,a,d){var c=this._iv;c?this._iv=p:c=this._prevBlock;for(var e=0;e<d;e++)b[a+e]^=c[e]}var a=f.extend();a.Encryptor=a.extend({processBlock:function(a,d){var c=this._cipher,e=c.blockSize;b.call(this,a,d,e);c.encryptBlock(a,d);this._prevBlock=a.slice(d,d+e)}});a.Decryptor=a.extend({processBlock:function(a,d){var c=this._cipher,e=c.blockSize,f=a.slice(d,d+
e);c.decryptBlock(a,d);b.call(this,a,d,e);this._prevBlock=f}});return a}(),g=(h.pad={}).Pkcs7={pad:function(b,a){for(var c=4*a,c=c-b.sigBytes%c,e=c<<24|c<<16|c<<8|c,f=[],g=0;g<c;g+=4)f.push(e);c=r.create(f,c);b.concat(c)},unpad:function(b){b.sigBytes-=b.words[b.sigBytes-1>>>2]&255}};i.BlockCipher=n.extend({cfg:n.cfg.extend({mode:k,padding:g}),reset:function(){n.reset.call(this);var b=this.cfg,a=b.iv,b=b.mode;if(this._xformMode==this._ENC_XFORM_MODE)var c=b.createEncryptor;else c=b.createDecryptor,
this._minBufferSize=1;this._mode=c.call(b,this,a&&a.words)},_doProcessBlock:function(b,a){this._mode.processBlock(b,a)},_doFinalize:function(){var b=this.cfg.padding;if(this._xformMode==this._ENC_XFORM_MODE){b.pad(this._data,this.blockSize);var a=this._process(!0)}else a=this._process(!0),b.unpad(a);return a},blockSize:4});var a=i.CipherParams=l.extend({init:function(a){this.mixIn(a)},toString:function(a){return(a||this.formatter).stringify(this)}}),k=(h.format={}).OpenSSL={stringify:function(a){var d=
a.ciphertext,a=a.salt,d=(a?r.create([1398893684,1701076831]).concat(a).concat(d):d).toString(m);return d=d.replace(/(.{64})/g,"$1\n")},parse:function(b){var b=m.parse(b),d=b.words;if(1398893684==d[0]&&1701076831==d[1]){var c=r.create(d.slice(2,4));d.splice(0,4);b.sigBytes-=16}return a.create({ciphertext:b,salt:c})}},e=i.SerializableCipher=l.extend({cfg:l.extend({format:k}),encrypt:function(b,d,c,e){var e=this.cfg.extend(e),f=b.createEncryptor(c,e),d=f.finalize(d),f=f.cfg;return a.create({ciphertext:d,
key:c,iv:f.iv,algorithm:b,mode:f.mode,padding:f.padding,blockSize:b.blockSize,formatter:e.format})},decrypt:function(a,c,e,f){f=this.cfg.extend(f);c=this._parse(c,f.format);return a.createDecryptor(e,f).finalize(c.ciphertext)},_parse:function(a,c){return"string"==typeof a?c.parse(a):a}}),h=(h.kdf={}).OpenSSL={compute:function(b,c,e,f){f||(f=r.random(8));b=s.create({keySize:c+e}).compute(b,f);e=r.create(b.words.slice(c),4*e);b.sigBytes=4*c;return a.create({key:b,iv:e,salt:f})}},c=i.PasswordBasedCipher=
e.extend({cfg:e.cfg.extend({kdf:h}),encrypt:function(a,c,f,j){j=this.cfg.extend(j);f=j.kdf.compute(f,a.keySize,a.ivSize);j.iv=f.iv;a=e.encrypt.call(this,a,c,f.key,j);a.mixIn(f);return a},decrypt:function(a,c,f,j){j=this.cfg.extend(j);c=this._parse(c,j.format);f=j.kdf.compute(f,a.keySize,a.ivSize,c.salt);j.iv=f.iv;return e.decrypt.call(this,a,c,f.key,j)}})}();
(function(){var p=CryptoJS,h=p.lib.BlockCipher,i=p.algo,l=[],r=[],o=[],m=[],s=[],n=[],k=[],f=[],g=[],a=[];(function(){for(var c=[],b=0;256>b;b++)c[b]=128>b?b<<1:b<<1^283;for(var d=0,e=0,b=0;256>b;b++){var j=e^e<<1^e<<2^e<<3^e<<4,j=j>>>8^j&255^99;l[d]=j;r[j]=d;var i=c[d],h=c[i],p=c[h],t=257*c[j]^16843008*j;o[d]=t<<24|t>>>8;m[d]=t<<16|t>>>16;s[d]=t<<8|t>>>24;n[d]=t;t=16843009*p^65537*h^257*i^16843008*d;k[j]=t<<24|t>>>8;f[j]=t<<16|t>>>16;g[j]=t<<8|t>>>24;a[j]=t;d?(d=i^c[c[c[p^i]]],e^=c[c[e]]):d=e=1}})();
var e=[0,1,2,4,8,16,32,64,128,27,54],i=i.AES=h.extend({_doReset:function(){for(var c=this._key,b=c.words,d=c.sigBytes/4,c=4*((this._nRounds=d+6)+1),i=this._keySchedule=[],j=0;j<c;j++)if(j<d)i[j]=b[j];else{var h=i[j-1];j%d?6<d&&4==j%d&&(h=l[h>>>24]<<24|l[h>>>16&255]<<16|l[h>>>8&255]<<8|l[h&255]):(h=h<<8|h>>>24,h=l[h>>>24]<<24|l[h>>>16&255]<<16|l[h>>>8&255]<<8|l[h&255],h^=e[j/d|0]<<24);i[j]=i[j-d]^h}b=this._invKeySchedule=[];for(d=0;d<c;d++)j=c-d,h=d%4?i[j]:i[j-4],b[d]=4>d||4>=j?h:k[l[h>>>24]]^f[l[h>>>
16&255]]^g[l[h>>>8&255]]^a[l[h&255]]},encryptBlock:function(a,b){this._doCryptBlock(a,b,this._keySchedule,o,m,s,n,l)},decryptBlock:function(c,b){var d=c[b+1];c[b+1]=c[b+3];c[b+3]=d;this._doCryptBlock(c,b,this._invKeySchedule,k,f,g,a,r);d=c[b+1];c[b+1]=c[b+3];c[b+3]=d},_doCryptBlock:function(a,b,d,e,f,h,i,g){for(var l=this._nRounds,k=a[b]^d[0],m=a[b+1]^d[1],o=a[b+2]^d[2],n=a[b+3]^d[3],p=4,r=1;r<l;r++)var s=e[k>>>24]^f[m>>>16&255]^h[o>>>8&255]^i[n&255]^d[p++],u=e[m>>>24]^f[o>>>16&255]^h[n>>>8&255]^
i[k&255]^d[p++],v=e[o>>>24]^f[n>>>16&255]^h[k>>>8&255]^i[m&255]^d[p++],n=e[n>>>24]^f[k>>>16&255]^h[m>>>8&255]^i[o&255]^d[p++],k=s,m=u,o=v;s=(g[k>>>24]<<24|g[m>>>16&255]<<16|g[o>>>8&255]<<8|g[n&255])^d[p++];u=(g[m>>>24]<<24|g[o>>>16&255]<<16|g[n>>>8&255]<<8|g[k&255])^d[p++];v=(g[o>>>24]<<24|g[n>>>16&255]<<16|g[k>>>8&255]<<8|g[m&255])^d[p++];n=(g[n>>>24]<<24|g[k>>>16&255]<<16|g[m>>>8&255]<<8|g[o&255])^d[p++];a[b]=s;a[b+1]=u;a[b+2]=v;a[b+3]=n},keySize:8});p.AES=h._createHelper(i)})();

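o(︶︿︶)o Sigh, damn, I finally made it to this point. Typing with one hand is exhausting...

I hope this post is of some help to others... The formatting is rough, please ignore it.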

Time: 2024-10-25 08:45:45
