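I recently worked on a project that needed to extract data from web pages. Static pages are easy: request the page's URL and you get its HTML back. Some pages, however, are generated dynamically. When you page through results, for example, the URL in the address bar never changes, so retrieving the content of such pages takes more work. Below I use the paging action on https://honors.libraries.psu.edu/browse/author/all/ as an example to show how to get the HTML of a dynamic page.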
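1. Open the site in IE9: https://honors.libraries.psu.edu/browse/author/all/
2. Press F12 to open the Developer Tools
In the Developer Tools, click "Network" --> "Start capturing", then click the "next page" link on the web page
3. The whole request is now captured
Click "Go to detailed view" to see the request headers, cookies, and POST body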
4. Set the captured parameters on a C# HttpWebRequest object
// Required namespaces (at the top of the file)
using System.IO;
using System.Net;
using System.Text;

/// <summary>
/// Sends a POST request over HTTPS and returns the response body
/// </summary>
/// <param name="URL">The URL to request</param>
/// <param name="strPostdata">The data to send in the POST body</param>
/// <param name="encoding">The encoding of the page</param>
/// <returns>The HTML returned by the server</returns>
public string OpenReadWithHttps(string URL, string strPostdata, Encoding encoding)
{
    // Cookie values copied from the captured request; they belong to one browser session
    CookieContainer cc = new CookieContainer();
    cc.Add(new Cookie("csrftoken", "04696113ff3ee3e8220dd9044921e100", "/browse/author/all/", "honors.libraries.psu.edu"));
    cc.Add(new Cookie("__utma", "148028590.1404245236.1416720957.1416734716.1416748914.3", "/browse/author/all/", "honors.libraries.psu.edu"));
    cc.Add(new Cookie("__utmz", "148028590.1416720957.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)", "/browse/author/all/", "honors.libraries.psu.edu"));
    cc.Add(new Cookie("__utmb", "148028590.2.10.1416748914", "/browse/author/all/", "honors.libraries.psu.edu"));
    cc.Add(new Cookie("__utmc", "148028590", "/browse/author/all/", "honors.libraries.psu.edu"));

    // Request headers copied from the captured request
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.CookieContainer = cc;
    request.Method = "POST"; // HTTP method names are case-sensitive
    request.Accept = "text/html, application/xhtml+xml, */*";
    request.ContentType = "application/x-www-form-urlencoded";
    request.Referer = "https://honors.libraries.psu.edu/browse/author/all/";
    request.KeepAlive = true;
    request.UserAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
    request.Host = "honors.libraries.psu.edu";
    request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US");
    // Sends "Accept-Encoding: gzip, deflate" and decompresses the response automatically;
    // setting the header by hand would leave the response body compressed
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    request.Headers.Add(HttpRequestHeader.CacheControl, "no-cache");

    byte[] buffer = encoding.GetBytes(strPostdata);
    request.ContentLength = buffer.Length;
    using (Stream writer = request.GetRequestStream()) // get the request stream
    {
        writer.Write(buffer, 0, buffer.Length); // write the POST data to the stream
    } // the request stream is closed here

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
    {
        return reader.ReadToEnd();
    }
}
Parameters:
URL: the address to request; strPostdata: the data sent in the POST body; encoding: the encoding of the page
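The strPostdata argument is an application/x-www-form-urlencoded string, so rather than typing it by hand it can be assembled from name/value pairs. The following is only a minimal sketch: PostDataBuilder, Build and PostDataDemo are hypothetical names that do not appear in the original code, and the field values are the ones from the captured request.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PostDataBuilder
{
    // Hypothetical helper: joins name/value pairs into a form-urlencoded body.
    // Uri.EscapeDataString percent-encodes reserved characters (a space becomes %20).
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}

public static class PostDataDemo
{
    public static void Main()
    {
        // Field names and values as they appear in the captured POST body
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("csrfmiddlewaretoken", "04696113ff3ee3e8220dd9044921e100"),
            new KeyValuePair<string, string>("browse_start", "all"),
            new KeyValuePair<string, string>("browse_type", "author"),
            new KeyValuePair<string, string>("page", "9"),
            new KeyValuePair<string, string>("display", "50"),
            new KeyValuePair<string, string>("num_display_items", "50"),
        };
        Console.WriteLine(PostDataBuilder.Build(fields));
    }
}
```

Building the string this way keeps each field visible on its own line and escapes any value that happens to contain characters such as & or =.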
5. Call the method
private void button2_Click(object sender, EventArgs e)
{
    string url = "https://honors.libraries.psu.edu/browse/author/all/";
    // POST body copied from the captured request; "page" selects the page to fetch
    string strPostData = "csrfmiddlewaretoken=04696113ff3ee3e8220dd9044921e100&browse_start=all&browse_type=author&page=9&display=50&num_display_items=50";
    textBox1.Text = OpenReadWithHttps(url, strPostData, Encoding.UTF8);
}
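Assuming the page field is the only value that changes from one page to the next, a range of pages could be fetched by rebuilding the POST body in a loop. This is a rough sketch, not part of the original code: PageFetcher, BuildPostData and FetchPages are hypothetical names, the post delegate stands in for a call to OpenReadWithHttps, and whether the same token remains valid across several requests is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class PageFetcher
{
    // Hypothetical helper: builds the captured POST body with a different page number.
    public static string BuildPostData(string token, int page)
    {
        return "csrfmiddlewaretoken=" + Uri.EscapeDataString(token)
            + "&browse_start=all&browse_type=author"
            + "&page=" + page
            + "&display=50&num_display_items=50";
    }

    // Calls post(body) once per page, from firstPage to lastPage inclusive,
    // and returns the responses in page order.
    public static List<string> FetchPages(Func<string, string> post, string token,
                                          int firstPage, int lastPage)
    {
        var pages = new List<string>();
        for (int page = firstPage; page <= lastPage; page++)
        {
            pages.Add(post(BuildPostData(token, page)));
        }
        return pages;
    }
}
```

With the method from step 4, a call might look like PageFetcher.FetchPages(data => OpenReadWithHttps(url, data, Encoding.UTF8), token, 1, 3). Passing the request function as a delegate keeps the loop independent of how the request is actually sent.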
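To summarize: use the IE9 Developer Tools to capture the page's request, read off the request parameters, then set those parameters on an HttpWebRequest object and send the request.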
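The token and cookie values hard-coded in the example were captured from a single browser session, so they will probably stop working once that session expires. One possible workaround, sketched below, is to send a GET request first and read the csrftoken cookie the server sets. This assumes the site issues that cookie on a plain GET and expects the same value in the csrfmiddlewaretoken field, which is typical of Django-style CSRF protection but has not been verified for this site. CsrfHelper and its methods are hypothetical names.

```csharp
using System;
using System.Net;

public static class CsrfHelper
{
    // Returns the value of the named cookie stored for the given URL, or null if absent.
    public static string GetCookieValue(CookieContainer cookies, Uri uri, string name)
    {
        Cookie cookie = cookies.GetCookies(uri)[name];
        return cookie == null ? null : cookie.Value;
    }

    // Sends a GET request so the server can set its cookies,
    // then returns the csrftoken value (null if the server did not set one).
    public static string FetchCsrfToken(string url, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = cookies; // cookies from the response are stored here
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // The body is not needed; only the Set-Cookie headers matter.
        }
        return GetCookieValue(cookies, new Uri(url), "csrftoken");
    }
}
```

For this to help, the POST request would need to reuse the same CookieContainer and put the returned value in the csrfmiddlewaretoken field, instead of the fixed values copied from the developer tools.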
Time: 2024-11-11 09:39:45