This was my first attempt at writing a crawler. I didn't have much experience and the results weren't great, but I'm writing it up anyway.
A friend of mine has been crawling Toutiao (今日头条) data recently, although he works in Java. I had been meaning to do some data scraping with PHP anyway, so I went after Toutiao as well; that way, if I hit something I didn't understand, I'd have someone to talk it over with. Enough preamble, on to the useful part.
My prior understanding of crawlers was vague: basically curl plus regular expressions. Let me walk through what I did, step by step.
1. Observing the page
The feed data on the Toutiao homepage is loaded via AJAX; opening the browser's developer tools while browsing shows the feed requests being fired.
A new request fires every time you scroll to the bottom of the page. Right-click one of these requests and open the link in a new tab.
It is clearly an API endpoint that returns JSON. Running the response through a formatter gives the following:
{
    "has_more": true,
    "message": "success",
    "data": [{
        "chinese_tag": "财经",
        "media_avatar_url": "//p1.pstatp.com/large/4d00054b126ceaf920",
        "is_feed_ad": false,
        "tag_url": "news_finance",
        "title": "重要里程碑!中国政府采购服务器列入“中国芯”",
        "single_mode": false,
        "middle_mode": false,
        "abstract": "尽管美国对中兴销售零部件和软件的禁令将被解除。近日有媒体注意到,在中央国家机关发布的新采购名单中,服务器产品的技术要求格外引人注目。",
        "tag": "news_finance",
        "label": ["网络安全", "龙芯", "CPU", "信息安全", "英特尔"],
        "behot_time": 1527057631,
        "source_url": "/group/6558562819279684109/",
        "source": "环球网",
        "more_mode": false,
        "article_genre": "article",
        "comments_count": 628,
        "group_source": 2,
        "item_id": "6558562819279684109",
        "has_gallery": false,
        "group_id": "6558562819279684109",
        "media_url": "/c/user/5954781019/"
    }, {
        "single_mode": true,
        "abstract": "一个动作,一句话语,一个眼神,这些看似微小的细节,有时都有着丰富的内涵,传递出意味深长的信号。中美华盛顿磋商过去已有几天,但美方最近释放的两个小细节,感觉拼出了一个更完整的中美谈判成果。",
        "middle_mode": true,
        "more_mode": true,
        "tag": "news_world",
        "label": ["中美关系", "中兴", "国际"],
        "comments_count": 233,
        "tag_url": "news_world",
        "title": "这两个小细节,拼出一个更完整的中美谈判成果",
        "chinese_tag": "国际",
        "source": "牛弹琴",
        "group_source": 2,
        "has_gallery": false,
        "media_url": "/c/user/3647305700/",
        "media_avatar_url": "//p2.pstatp.com/large/1566/2791711243",
        "image_list": [{
            "url": "//p3.pstatp.com/list/pgc-image/15269858649775498ea28fe"
        }, {
            "url": "//p3.pstatp.com/list/pgc-image/15269858650368af9b6b970"
        }, {
            "url": "//p3.pstatp.com/list/pgc-image/1526985864933a9921fd8d0"
        }],
        "source_url": "/group/6558354654705484295/",
        "article_genre": "article",
        "item_id": "6558354654705484295",
        "is_feed_ad": false,
        "behot_time": 1527057631,
        "image_url": "//p3.pstatp.com/list/190x124/pgc-image/15269858649775498ea28fe",
        "group_id": "6558354654705484295",
        "middle_image": "http://p3.pstatp.com/list/pgc-image/15269858649775498ea28fe"
    }, {
        "single_mode": true,
        "abstract": "这个小品还有一个插曲就是原定的女主本来是闫妮,但是在春晚前3天临时换成了金玉婷,登上春晚后,她还被称为是“春晚第一美女”。",
        "middle_mode": true,
        "more_mode": true,
        "tag": "news_entertainment",
        "label": ["春晚", "金玉婷", "抑郁症", "我是大侦探", "潘长江"],
        "comments_count": 229,
        "tag_url": "news_entertainment",
        "title": "曾5次上春晚,患抑郁症淡出,如今做主播没人看,四处走穴很凄凉",
        "chinese_tag": "娱乐",
        "source": "猫眼娱乐",
        "group_source": 2,
        "has_gallery": false,
        "media_url": "/c/user/64781639962/",
        "media_avatar_url": "//p3.pstatp.com/large/2c6b001dd55cf954a3f6",
        "image_list": [{
            "url": "//p3.pstatp.com/list/pgc-image/15270494556167c19bc847f"
        }, {
            "url": "//p3.pstatp.com/list/pgc-image/1527049455588fe78498ed1"
        }, {
            "url": "//p3.pstatp.com/list/pgc-image/152704945540787d17fdfb9"
        }],
        "source_url": "/group/6558629445270241805/",
        "article_genre": "article",
        "item_id": "6558629445270241805",
        "is_feed_ad": false,
        "behot_time": 1527057631,
        "image_url": "//p3.pstatp.com/list/190x124/pgc-image/15270494556167c19bc847f",
        "group_id": "6558629445270241805",
        "middle_image": "http://p3.pstatp.com/list/pgc-image/15270494556167c19bc847f"
    }],
    "next": {
        "max_behot_time": 1527057631
    }
}
Looking at the structure of this JSON, the data array holds exactly what we want. So all we need to do is construct the same URL and fetch it with a curl request.
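To make the structure concrete, here is a minimal PHP sketch (my own, not code from the original post) of consuming such a response; the field names come straight from the JSON above, and the URL is only a placeholder for a fully signed feed URL like the one analyzed in the next section.

<?php
// Minimal sketch: decode one feed response and print a few fields.
// $feedUrl must be a fully signed URL (as/cp/_signature), see the next section.
$feedUrl = 'https://www.toutiao.com/api/pc/feed/?category=__all__&...'; // placeholder

$ch = curl_init($feedUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$json = curl_exec($ch);
curl_close($ch);

$feed = json_decode($json, true);
if (!empty($feed['data']) && $feed['message'] === 'success') {
    foreach ($feed['data'] as $item) {
        // field names are taken from the JSON shown above
        echo $item['title'], ' | ', $item['source'], ' | ',
             date('Y-m-d H:i:s', $item['behot_time']), PHP_EOL;
    }
    // "next.max_behot_time" is the value to use for the following request
    $nextTime = $feed['next']['max_behot_time'];
}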
2. Building the crawler
Analyzing the data API
URL: https://www.toutiao.com/api/pc/feed/?max_behot_time=1527057712&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1656B107572ABC&cp=5B05529AAB2CEE1&_signature=PNjkzBAeZ-Tys2Ie-4uYMTzY5N
Comparing several of these requests shows that only four parameters change between calls: max_behot_time, as, cp, and _signature. max_behot_time looks like a Unix timestamp, and converting it confirms that it is. The other three variable parameters, as, cp, and _signature, turn out (after some searching) to be generated by JavaScript on the page from that timestamp.
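As a quick sanity check (my own addition, not in the original post), converting the max_behot_time value from the URL above does give a plausible date:

<?php
// 1527057712 is the max_behot_time from the URL above.
date_default_timezone_set('Asia/Shanghai');
echo date('Y-m-d H:i:s', 1527057712), PHP_EOL; // 2018-05-23 14:41:52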
There are a few write-ups about as and cp on Zhihu, but searches for _signature turn up very little.
Open the developer tools again.
Find the JS that generates these parameters; after pretty-printing the script, you will see the following code.
Snippet 1: where the three parameters are set
{
    key: "_setParams",
    value: function(t) {
        var e = (0, h.default)(), i = 0;
        this.url = this._url,
        "refresh" === t
            ? (i = this.list.length > 0 ? this.list[0].behot_time : 0, this.url += "min_behot_time=" + i)
            : (i = this.list.length > 0 ? this.list[this.list.length - 1].behot_time : 0, this.url += "max_behot_time=" + i);
        var n = (0, _.sign)(i + "");
        (0, a.default)(this.params, {
            as: e.as,
            cp: e.cp,
            _signature: n
        })
    }
}
Snippet 2: the as/cp generation logic based on the timestamp. Cross-checking it against the Zhihu article, this is easy to port to PHP (a rough port is sketched right after the snippet).
function s() {
    var t = Math.floor((new Date).getTime() / 1e3),
        e = t.toString(16).toUpperCase(),
        i = (0, o.default)(t).toString().toUpperCase();
    if (8 != e.length) return {
        as: "479BB4B7254C150",
        cp: "7E0AC8874BB0985"
    };
    for (var n = i.slice(0, 5), s = i.slice(-5), a = "", r = 0; r < 5; r++) a += n[r] + e[r];
    for (var l = "", u = 0; u < 5; u++) l += e[u + 3] + s[u];
    return {
        as: "A1" + a + e.slice(-3),
        cp: e.slice(0, 3) + l + "E1"
    }
}
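For reference, here is my rough PHP port of that snippet (a sketch of my own, not code from the post; the function name makeAsCp is made up). The (0, o.default)(t) call in the JS is the md5 of the timestamp string, which matches the md5() used in the HTML page further down.

<?php
// Sketch of the as/cp logic in PHP.
function makeAsCp($t = null)
{
    $t = $t ?: time();
    $e = strtoupper(dechex($t));        // timestamp in upper-case hex
    $i = strtoupper(md5((string)$t));   // md5 of the timestamp string

    if (strlen($e) != 8) {
        // fallback values, same as in the JS
        return ['as' => '479BB4B7254C150', 'cp' => '7E0AC8874BB0985'];
    }

    $n = substr($i, 0, 5);
    $s = substr($i, -5);
    $a = '';
    for ($r = 0; $r < 5; $r++) {
        $a .= $n[$r] . $e[$r];
    }
    $l = '';
    for ($u = 0; $u < 5; $u++) {
        $l .= $e[$u + 3] . $s[$u];
    }

    return [
        'as' => 'A1' . $a . substr($e, -3),
        'cp' => substr($e, 0, 3) . $l . 'E1',
    ];
}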
That leaves _signature, which was harder to track down. With the help of some searching I eventually found the JS file that produces it.
The relevant code snippet:
function(t, e) {
    Function(function(t) {
        return 'e(e,a,r){(b[e]||(b[e]=t("x,y","x "+e+" y")(r,a)}a(e,a,r){(k[r]||(k[r]=t("x,y","new x[y]("+Array(r+1).join(",x[y]")(1)+")")(e,a)}r(e,a,r){n,t,s={},b=s.d=r?r.d+1:0;for(s["$"+b]=s,t=0;t<b;t)s[n="$"+t]=r[n];for(t=0,b=s=a;t<b;t)s[t]=a[t];c(e,0,s)}c(t,b,k){u(e){v[x]=e}f{g=,ting(bg)}l{try{y=c(t,b,k)}catch(e){h=e,y=l}}for(h,y,d,g,v=[],x=0;;)switch(g=){case 1:u(!)4:f5:u((e){a=0,r=e;{c=a<r;c&&u(e[a]),c}}(6:y=,u((y8:if(g=,lg,g=,y===c)b+=g;else if(y!==l)y9:c10:u(s(11:y=,u(+y)12:for(y=f,d=[],g=0;g<y;g)d[g]=y.charCodeAt(g)^g+y;u(String.fromCharCode.apply(null,d13:y=,h=delete [y]14:59:u((g=)?(y=x,v.slice(x-=g,y:[])61:u([])62:g=,k[0]=65599*k[0]+k[1].charCodeAt(g)>>>065:h=,y=,[y]=h66:u(e(t[b],,67:y=,d=,u((g=).x===c?r(g.y,y,k):g.apply(d,y68:u(e((g=t[b])<"<"?(b--,f):g+g,,70:u(!1)71:n72:+f73:u(parseInt(f,3675:if(){bcase 74:g=<<16>>16g76:u(k[])77:y=,u([y])78:g=,u(a(v,x-=g+1,g79:g=,u(k["$"+g])81:h=,[f]=h82:u([f])83:h=,k[]=h84:!085:void 086:u(v[x-1])88:h=,y=,h,y89:u({e{r(e.y,arguments,k)}e.y=f,e.x=c,e})90:null91:h93:h=0:;default:u((g<<16>>16)-16)}}n=this,t=n.Function,s=Object.keys||(e){a={},r=0;for(c in e)a[r]=c;a=r,a},b={},k={};r'.replace(/[-]/g,
        function(e) {
            return t[15 & e.charCodeAt(0)]
        })
    } ("v[x++]=v[--x]t.charCodeAt(b++)-32function return ))++.substrvar .length(),b+=;break;case ;break}".split("")))()('gr$Daten Иb/s!l y?y?g,(lfi~ah`{mv,-n|jqewVxp{rvmmx,&effkx[!cs"l".Pq%widthl"@q&heightl"vr*getContextx$"2d[!cs#l#,*;?|u.|uc{uq$fontl#vr(fillTextx$$龘???2<[#c}l#2q*shadowBlurl#1q-shadowOffsetXl#$$limeq+shadowColorl#vr#arcx88802[%c}l#vr&strokex[ c}l"v,)}eOmyoZB]mx[ cs!0s$l$Pb<k7l l!r&lengthb%^l$1+s$jl s#i$1ek1s$gr#tack4)zgr#tac$! +0o![#cj?o ]!l$b%s"o ]!l"l$b*b^0d#>>>s!0s%yA0s"l"l!r&lengthb<k+l"^l"1+s"jl s&l&z0l!$ +["cs\'(0l#i\'1ps9wxb&s() &{s)/s(gr&Stringr,fromCharCodes)0s*yWl ._b&s o!])l l Jb<k$.aj;l .Tb<k$.gj/l .^b<k&i"-4j!+& s+yPo!]+s!l!l Hd>&l!l Bd>&+l!l <d>&+l!l 6d>&+l!l &+ s,y=o!o!]/q"13o!l q"10o!],l 2d>& s.{s-yMo!o!]0q"13o!]*Ld<l 4d#>>>b|s!o!l q"10o!],l!& s/yIo!o!].q"13o!],o!]*Jd<l 6d#>>>b|&o!]+l &+ s0l-l!&l-l!i\'1z141z4b/@d<l"b|&+l-l(l!b^&+l-l&zl\'g,)gk}ejo{cm,)|yn~Lij~em["cl$b%@d<l&zl\'l $ +["cl$b%b|&+l-l%8d<@b|l!b^&+ q$sign ', [Object.defineProperty(e, "__esModule", {
        value: !0
    })])
}
Damn. At first glance this code is insanely long. How am I supposed to do that in PHP?!
Then a thought: why not just run the JS itself and have it produce the parameter for me? Easy.
Implementation below (watch the character encoding of the sign algorithm's source, it burned me once). I copied the obfuscated code from this thread: https://bbs.125.la/thread-14108290-1-1.html
<!DOCTYPE html>
<html>
<head>
    <title>check</title>
</head>
<body>
    <script type="text/javascript" src="{{ URL::asset('/assets/js/md5.js') }}"></script>
    <script type="text/javascript">
    // build the signed feed URL for timestamp t (same as/cp logic as above, plus TAC.sign)
    function xx(t){
        var e = t.toString(16).toUpperCase(),
            i = md5(t.toString()).toUpperCase(),
            str = '';
        if (8 != e.length){
            str = 'http://www.toutiao.com/api/pc/feed/?max_behot_time='+t+'&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=479BB4B7254C150&cp=7E0AC8874BB0985&_signature='+TAC.sign(t);
            return str;
        }
        for (var n = i.slice(0, 5), s = i.slice(-5), a = "", r = 0; r < 5; r++) {
            a += n[r] + e[r];
        }
        for (var l = "", u = 0; u < 5; u++) {
            l += e[u + 3] + s[u];
        }
        str = 'http://www.toutiao.com/api/pc/feed/?max_behot_time='+t+'&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as='+"A1" + a + e.slice(-3)+'&cp='+e.slice(0, 3) + l + "E1"+'&_signature='+TAC.sign(t);
        return str;
    }
    // the obfuscated Toutiao code; evaluating it populates TAC with the sign function
    Function(function(t) { return 'e(e,a,r){(b[e]||(b[e]=t("x,y","x "+e+" y")(r,a)}a(e,a,r){(k[r]||(k[r]=t("x,y","new x[y]("+Array(r+1).join(",x[y]")(1)+")")(e,a)}r(e,a,r){n,t,s={},b=s.d=r?r.d+1:0;for(s["$"+b]=s,t=0;t<b;t)s[n="$"+t]=r[n];for(t=0,b=s=a;t<b;t)s[t]=a[t];c(e,0,s)}c(t,b,k){u(e){v[x]=e}f{g=,ting(bg)}l{try{y=c(t,b,k)}catch(e){h=e,y=l}}for(h,y,d,g,v=[],x=0;;)switch(g=){case 1:u(!)4:f5:u((e){a=0,r=e;{c=a<r;c&&u(e[a]),c}}(6:y=,u((y8:if(g=,lg,g=,y===c)b+=g;else if(y!==l)y9:c10:u(s(11:y=,u(+y)12:for(y=f,d=[],g=0;g<y;g)d[g]=y.charCodeAt(g)^g+y;u(String.fromCharCode.apply(null,d13:y=,h=delete [y]14:59:u((g=)?(y=x,v.slice(x-=g,y:[])61:u([])62:g=,k[0]=65599*k[0]+k[1].charCodeAt(g)>>>065:h=,y=,[y]=h66:u(e(t[b],,67:y=,d=,u((g=).x===c?r(g.y,y,k):g.apply(d,y68:u(e((g=t[b])<"<"?(b--,f):g+g,,70:u(!1)71:n72:+f73:u(parseInt(f,3675:if(){bcase 74:g=<<16>>16g76:u(k[])77:y=,u([y])78:g=,u(a(v,x-=g+1,g79:g=,u(k["$"+g])81:h=,[f]=h82:u([f])83:h=,k[]=h84:!085:void 086:u(v[x-1])88:h=,y=,h,y89:u({e{r(e.y,arguments,k)}e.y=f,e.x=c,e})90:null91:h93:h=0:;default:u((g<<16>>16)-16)}}n=this,t=n.Function,s=Object.keys||(e){a={},r=0;for(c in e)a[r]=c;a=r,a},b={},k={};r'.replace(/[-]/g, function(i) { return t[15 & i.charCodeAt(0)] })}("v[x++]=v[--x]t.charCodeAt(b++)-32function return ))++.substrvar .length(),b+=;break;case ;break}".split("")))()('gr$Daten Иb/s!l y?y?g,(lfi~ah`{mv,-n|jqewVxp{rvmmx,&effkx[!cs"l".Pq%widthl"@q&heightl"vr*getContextx$"2d[!cs#l#,*;?|u.|uc{uq$fontl#vr(fillTextx$$龘???2<[#c}l#2q*shadowBlurl#1q-shadowOffsetXl#$$limeq+shadowColorl#vr#arcx88802[%c}l#vr&strokex[ c}l"v,)}eOmyoZB]mx[ cs!0s$l$Pb<k7l l!r&lengthb%^l$1+s$jl s#i$1ek1s$gr#tack4)zgr#tac$! +0o![#cj?o ]!l$b%s"o ]!l"l$b*b^0d#>>>s!0s%yA0s"l"l!r&lengthb<k+l"^l"1+s"jl s&l&z0l!$ +["cs\'(0l#i\'1ps9wxb&s() &{s)/s(gr&Stringr,fromCharCodes)0s*yWl ._b&s o!])l l Jb<k$.aj;l .Tb<k$.gj/l .^b<k&i"-4j!+& s+yPo!]+s!l!l Hd>&l!l Bd>&+l!l <d>&+l!l 6d>&+l!l &+ s,y=o!o!]/q"13o!l q"10o!],l 2d>& s.{s-yMo!o!]0q"13o!]*Ld<l 4d#>>>b|s!o!l q"10o!],l!& s/yIo!o!].q"13o!],o!]*Jd<l 6d#>>>b|&o!]+l &+ s0l-l!&l-l!i\'1z141z4b/@d<l"b|&+l-l(l!b^&+l-l&zl\'g,)gk}ejo{cm,)|yn~Lij~em["cl$b%@d<l&zl\'l $ +["cl$b%b|&+l-l%8d<@b|l!b^&+ q$sign ', [TAC = {}]);
    var t = Math.floor((new Date).getTime() / 1e3);
    var tmpstr = xx(t);
    // write the signed URL into the page as a <p> so PhantomJS can read it later
    var p = document.createElement('p');
    p.textContent = tmpstr;
    document.body.appendChild(p);
    </script>
</body>
</html>
You can see it in action at http://58.87.108.192/check (a timestamp can be appended as a parameter, e.g. http://58.87.108.192/check/1527057712).
Next comes actually fetching the data, and here is the catch: PHP on its own cannot execute the page's JavaScript, so it cannot scrape this kind of dynamically generated data directly!
This had me stuck for a while. A colleague told me the usual answer is the V8 engine extension for PHP (v8js), which can evaluate the JS.
I tried installing v8js, but the documentation is so sparse that I couldn't get it to install. Game over. Looking for another way, a search showed that a lot of people scrape dynamic content with PhantomJS, a headless browser. And hey, that's what my friend uses too.
There are plenty of blog posts on installing PhantomJS, so I won't repeat them here. One link: https://blog.csdn.net/wanght89/article/details/78320375
The JS code (it grabs the URLs generated by the page above):
check.js
var page = require('webpage').create(),
    system = require('system');
var url = 'http://58.87.108.192/check';
if (system.args.length === 2) {
    url += '/' + system.args[1];
}
page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // collect the text of every <p> the check page rendered (the signed URLs)
        var ua = page.evaluate(function() {
            var res = [];
            var domres = document.getElementsByTagName('p');
            for (var i = 0; i < domres.length; i++) {
                res.push(domres[i].textContent);
            }
            return res;
        });
        console.log(ua);
    }
    page.close();
    phantom.exit();
});
The PHP code (really just a curl wrapper). Note that exec() captures PhantomJS's console.log output, and logging a JS array prints its elements separated by commas, which is why the PHP side can recover the individual URLs with explode(','):
spider.php
<?php

/**
 * Toutiao homepage crawler
 */
class Spider
{
    private $searchTime = false;
    private $stopTime = false;

    public function __construct()
    {
        $this->stopTime = strtotime(date('Y-m-d'));
        $this->useData();
    }

    // fetch the API response with curl
    public function curlRequest($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    // run phantomjs to get the signed feed URL(s)
    public function getUrl()
    {
        $commendStr = "phantomjs ****/js/check.js"; // path redacted
        if ($this->searchTime) {
            $commendStr .= ' ' . $this->searchTime;
        }
        return (exec($commendStr));
    }

    // process the data
    public function useData()
    {
        $urlStr = $this->getUrl();
        $urlArr = explode(',', $urlStr);
        $res = [];
        foreach ($urlArr as $curlUrl) {
            sleep(1);
            $curlJsonStr = $this->curlRequest($curlUrl);
            $curlRes = json_decode($curlJsonStr, true);
            var_dump($curlRes);
            exit;
        }
    }
}

new Spider();
A sample of the crawled data: see the screenshot in the original post.
About cookies: this particular API does no cookie validation. If it did, you could simply pass the cookies along in the curl request:
curl_setopt($ch, CURLOPT_COOKIE, $cookie_str);
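If a site did require a real session, the usual curl pattern is a cookie jar. A generic sketch (not something this particular API needs; the file path is arbitrary):

<?php
// Generic cookie handling: store cookies from a first request,
// then send them back on subsequent requests.
$cookieFile = sys_get_temp_dir() . '/toutiao_cookies.txt'; // arbitrary path

$ch = curl_init('https://www.toutiao.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // write received cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // and replay them on later calls
curl_exec($ch);
curl_close($ch);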
On duplicate data: repeated requests to the API return a lot of duplicate items. My fix was to add a unique index to the table on an md5str column, which holds the md5 of the article title.
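Roughly like this, as a sketch under my own assumptions: the table name and the columns other than md5str are made up, and MySQL's INSERT IGNORE is used so the unique index silently drops duplicates.

<?php
// Hypothetical schema for the dedup idea:
// CREATE TABLE toutiao_article (
//     id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//     title VARCHAR(255) NOT NULL,
//     md5str CHAR(32) NOT NULL,
//     UNIQUE KEY uniq_md5str (md5str)
// );
$pdo = new PDO('mysql:host=127.0.0.1;dbname=spider;charset=utf8mb4', 'user', 'pass');

function saveArticle(PDO $pdo, array $item)
{
    // INSERT IGNORE skips rows whose md5str already exists, so duplicate
    // articles returned by repeated feed requests are simply dropped.
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO toutiao_article (title, md5str) VALUES (:title, :md5str)'
    );
    $stmt->execute([
        ':title'  => $item['title'],
        ':md5str' => md5($item['title']),
    ]);
    return $stmt->rowCount() > 0; // false means it was a duplicate
}

A call like this could take the place of the var_dump() inside useData() in spider.php above.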
To sum up:
This crawler isn't exactly a success. Right now I generate a batch of URLs and run through them, but from the second request onward I get no data back. That has been really frustrating, and I haven't found the cause.
The whole pipeline does run end to end: the data can be fetched, but the efficiency is very low.
My feeling is that PHP isn't well suited to crawling. For static pages a single curl call is enough, but dynamic data is much trickier; PhantomJS feels heavyweight, and it apparently is no longer being maintained.
One more thought: there are many ways to write a crawler. My friend, for instance, simulates the page interactions click by click, while I build the API URL directly and pull the data; the result is a bit of a half-and-half approach. This was a first try, so please bear with me.
Original post: https://www.cnblogs.com/jwcrxs/p/9078201.html