A few days ago I saw that someone in the community had written a crawler in Python to scrape Lagou.com for salary statistics, so I thought: why not write a crawler in C# as well?
First, analyze Lagou.com.
Pick the .NET category and set the city to Beijing,
which takes you to this page:
http://www.lagou.com/zhaopin/.NET/?labelWords=label
When I refreshed the address above repeatedly, I noticed that the page header appeared first while the job list lagged a beat behind, so I guessed that once the page itself has finished loading, the first page of listings is pulled in via AJAX.
I then captured the traffic with Fiddler to verify this guess.
Refreshing the page captured roughly 80 requests.
The HTML returned by the first of them was basically useless; at least it contained none of the information I was after.
In one of the captured requests, though, I found the information I needed: the response is JSON-formatted data.
{
    "resubmitToken": null,
    "code": 0,
    "success": true,
    "requestId": null,
    "msg": null,
    "content": {
        "totalCount": 1007,
        "pageNo": 1,
        "pageSize": 15,
        "hasNextPage": true,
        "totalPageCount": 68,
        "currentPageNo": 1,
        "hasPreviousPage": false,
        "result": [
            {
                "relScore": 965,
                "createTime": "2016-04-28 10:05:39",
                "companyId": 28818,
                "calcScore": false,
                "showOrder": 0,
                "haveDeliver": false,
                "positionName": ".NET/C#",
                "positionType": "后端开发",
                "workYear": "3-5年",
                "education": "本科",
                "jobNature": "全职",
                "companyShortName": "畅捷通信息技术股份有限公司",
                "city": "北京",
                "salary": "15k-25k",
                "financeStage": "上市公司",
                "positionId": 1765871,
                "companyLogo": "image1/M00/00/3F/CgYXBlTUXMOADN_rAADQYzTeBQE385.jpg",
                "positionFirstType": "技术",
                "companyName": "畅捷通",
                "positionAdvantage": "上市公司 免费班车 20-35W 春节14天假",
                "industryField": "移动互联网 · 企业服务",
                "companyLabelList": ["技能培训", "节日礼物", "绩效奖金", "岗位晋升"],
                "score": 1323,
                "deliverCount": 7,
                "leaderName": "曾志勇",
                "companySize": "500-2000人",
                "countAdjusted": false,
                "adjustScore": 48,
                "randomScore": 0,
                "orderBy": 99,
                "adWord": 1,
                "formatCreateTime": "2016-04-28",
                "imstate": "disabled",
                "createTimeSort": 1461809139000,
                "positonTypesMap": null,
                "hrScore": 53,
                "flowScore": 158,
                "showCount": 722,
                "pvScore": 12.26185956183834,
                "plus": "是",
                "searchScore": 0,
                "totalCount": 0
            }
        ],
        "start": 0
    }
}
After deserializing and removing some of the redundant data, we get the string above.
Comparing the JSON against what the page displays confirmed that my guess was right.
What remains is to figure out how to obtain the URL of each job posting,
and how to fetch the JSON data for every page from the first to the last.
First, create an entity class matching the returned JSON to hold the data.
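The post never shows the entity class itself, so here is a minimal sketch of what DTOModel might look like, keeping only the JSON fields that matter later. The property names must match the JSON keys exactly for JavaScriptSerializer to bind them; the nested class names ContentModel and ResultModel are my own choice.

```csharp
using System.Collections.Generic;

namespace LG
{
    // Matches the top-level JSON object: { "code": ..., "success": ..., "content": { ... } }
    public class DTOModel
    {
        public int code { get; set; }
        public bool success { get; set; }
        public ContentModel content { get; set; }
    }

    // Matches the "content" object, which carries the paging information
    public class ContentModel
    {
        public int totalCount { get; set; }
        public int pageNo { get; set; }
        public int pageSize { get; set; }
        public bool hasNextPage { get; set; }
        public int totalPageCount { get; set; }
        public List<ResultModel> result { get; set; }
    }

    // Matches one element of the "result" array; only a few of the many fields are listed
    public class ResultModel
    {
        public int positionId { get; set; }
        public string positionName { get; set; }
        public string companyShortName { get; set; }
        public string city { get; set; }
        public string salary { get; set; }
        public string workYear { get; set; }
        public string education { get; set; }
    }
}
```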
Analyzing the job posting URLs:
http://www.lagou.com/jobs/1765871.html
http://www.lagou.com/jobs/1613765.html
http://www.lagou.com/jobs/797212.html
http://www.lagou.com/jobs/224215.html
http://www.lagou.com/jobs/1638545.html
These five URLs reveal the structure of the address:
http://www.lagou.com/jobs/ + ? + .html
The ? has to be dug out of the JSON,
and it turns out to be easy to find there: the ? is "positionId": 1765871.
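Putting the two pieces together, building a detail-page URL from a positionId is a one-liner (JobUrl is a hypothetical helper name, not something from the original code):

```csharp
using System;

class UrlDemo
{
    // Builds a job-detail URL from the positionId found in the JSON
    public static string JobUrl(long positionId)
    {
        return "http://www.lagou.com/jobs/" + positionId + ".html";
    }

    static void Main()
    {
        Console.WriteLine(JobUrl(1765871)); // http://www.lagou.com/jobs/1765871.html
    }
}
```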
What's left now is how to crawl from the first page through to the last.
Take a look at the request. It carries:
Query String
    city: 北京
Form Data
    first: false
    pn: 1
    kd: .NET
where
city is the city to search in,
first indicates whether this is the initial load of the first page,
pn is the page number,
and kd is the technology keyword.
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Web.Script.Serialization;

namespace LG
{
    public static class RequestDemo
    {
        public static DTOModel RequestDTO(
            string kd = ".NET",
            string city = "北京",
            bool first = false,
            int pn = 1)
        {
            string url = "http://www.lagou.com/jobs/positionAjax.json?city="
                         + Uri.EscapeDataString(city);

            HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
            req.Method = "POST";
            req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36";
            req.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
            req.Headers.Add("Accept-Language", "zh-CN,zh;q=0.8");
            req.Headers.Add("Origin", "http://www.lagou.com");
            req.Headers.Add("X-Requested-With", "XMLHttpRequest");
            req.Host = "www.lagou.com";
            req.Referer = "http://www.lagou.com/zhaopin/" + kd + "/?labelWords=label";

            // Build the form body: first=false&pn=1&kd=.NET
            string body = "first=" + first.ToString().ToLower()
                        + "&pn=" + pn
                        + "&kd=" + Uri.EscapeDataString(kd);
            byte[] buf = Encoding.UTF8.GetBytes(body);

            // The body must actually be written to the request stream
            using (Stream requestStream = req.GetRequestStream())
            {
                requestStream.Write(buf, 0, buf.Length);
            }

            // Read the full JSON response as UTF-8 text
            StringBuilder result = new StringBuilder();
            using (HttpWebResponse res = req.GetResponse() as HttpWebResponse)
            using (StreamReader sr = new StreamReader(res.GetResponseStream(), Encoding.UTF8))
            {
                result.Append(sr.ReadToEnd());
            }

            // Deserialize the JSON into the entity class
            JavaScriptSerializer js = new JavaScriptSerializer();
            return js.Deserialize<DTOModel>(result.ToString());
        }
    }
}
Then build the corresponding URLs from the data that comes back.
"content": {
    "totalCount": 5000,
    "hasNextPage": true,
    "pageNo": 1,
    "pageSize": 15,
    "totalPageCount": 334,
    "currentPageNo": 1,
    "hasPreviousPage": false,
With these paging fields we can loop through and fetch all the data.
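That loop can be sketched as follows. The page fetcher is passed in as a delegate so the paging logic can run (and be tested) without touching the network; in the real crawler it would wrap the RequestDTO method above. PageData is a minimal stand-in of my own for the relevant fields of the deserialized JSON.

```csharp
using System;
using System.Collections.Generic;

namespace LG
{
    // Minimal stand-in for the paging fields of the deserialized JSON
    public class PageData
    {
        public int totalPageCount;
        public List<int> positionIds;
    }

    public static class CrawlDemo
    {
        // Fetches page 1 to learn totalPageCount, then walks every page,
        // collecting the detail-page URL for each positionId along the way.
        public static List<string> CollectJobUrls(Func<int, PageData> fetchPage)
        {
            var urls = new List<string>();

            PageData firstPage = fetchPage(1);
            int totalPages = firstPage.totalPageCount;

            for (int pn = 1; pn <= totalPages; pn++)
            {
                PageData page = pn == 1 ? firstPage : fetchPage(pn);
                foreach (int id in page.positionIds)
                {
                    urls.Add("http://www.lagou.com/jobs/" + id + ".html");
                }
            }
            return urls;
        }
    }
}
```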
Armed with the URLs we have built, we can go visit the pages we are after.