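The Html Agility Pack source contains roughly 28 classes, so it is not a particularly complex library, yet it is far from weak: its support for parsing the DOM is powerful enough to rival jQuery's DOM manipulation :) The commonly used core classes are few. For DOM parsing there are really only two, HtmlDocument and HtmlNode, plus the HtmlNodeCollection collection class.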
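To make the roles of the three classes concrete, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class name HapBasics and the method ExtractLinks are hypothetical names chosen for illustration. HtmlDocument holds the parsed page, HtmlNode is a single element, and HtmlNodeCollection is what an XPath query returns.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class HapBasics
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();   // the parsed document
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes takes an XPath expression and returns null
        // (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the default used when the attribute is missing.
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check matters in practice: iterating the result of SelectNodes directly throws a NullReferenceException on pages where the query matches nothing.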
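1. ScrapySharp
Working with Html Agility Pack directly is still quite cumbersome. The component introduced here, ScrapySharp, wraps Html Agility Pack in two respects, so parsing HTML pages is no longer painful and the happiness index shoots straight up to 90 points, ha.
First, ScrapySharp provides a wrapper class that behaves like a real browser (handling the Referer, cookies, and so on). Second, it offers jQuery-like CSS selectors and LINQ syntax, which makes it very pleasant to use. Its code is hosted at https://bitbucket.org/rflechner/scrapysharp. It can also be added through NuGet.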
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace HTMLAgilityDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            var uri = new Uri("http://www.cnblogs.com/shanyou/archive/2012/05/20/2509435.html");

            // Download the page with the browser wrapper, then parse it with Html Agility Pack.
            var browser1 = new ScrapingBrowser();
            var html1 = browser1.DownloadString(uri);
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html1);
            var html = htmlDocument.DocumentNode;

            // Select by element name.
            var title = html.CssSelect("title");
            foreach (var htmlNode in title)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by element name and CSS class.
            var divs = html.CssSelect("div.postBody");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by id.
            divs = html.CssSelect("#cnblogs_post_body");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }
        }
    }
}

Basic examples of CssSelect usage:

    var divs = html.CssSelect("div"); // all div elements
    var nodes = html.CssSelect("div.content"); // all div elements with the CSS class 'content'
    var nodes = html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes
    var nodes = html.CssSelect("#postPaging"); // all HTML elements with the id postPaging
    var nodes = html.CssSelect("div#postPaging.testClass"); // all div elements with the id postPaging and the CSS class testClass
    var nodes = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with the CSS class 'content'
    var nodes = html.CssSelect("input[type=text].login"); // text box with the CSS class login

We can also select ancestors of elements:

    var nodes = html.CssSelect("p.para").CssSelectAncestors("div.content > div.widget");
2. Used with HtmlAgilityPack.CssSelectors (this one has a bug: an underscore (_) in a class name throws an exception)
var postItems = htmlDocument.QuerySelectorAll(".post-item");
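When a class name does contain an underscore, one possible workaround is to skip the CSS selector and match the class with plain XPath through Html Agility Pack itself. The sketch below is a swapped-in technique, not part of HtmlAgilityPack.CssSelectors. The names ClassSelector and SelectByClass are hypothetical, and the code assumes the class name contains no single quote.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class ClassSelector
{
    // Selects every element whose class attribute contains className as a whole token.
    // Returns null when nothing matches, as SelectNodes does.
    public static HtmlNodeCollection SelectByClass(HtmlDocument doc, string className)
    {
        // Padding both sides with spaces makes "post_item" match
        // class="post_item a" but not class="post_item_other".
        string xpath =
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
        return doc.DocumentNode.SelectNodes(xpath);
    }
}
```

For example, `ClassSelector.SelectByClass(htmlDocument, "post_item")` would stand in for a `.post_item` selector, where `post_item` is a made-up class name.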
References: http://www.cnblogs.com/shanyou/archive/2012/05/27/2520603.html
http://www.tools138.com/create/article/20141014/130844875.html
Time: 2024-10-03 23:24:31