Java - Parsing Crawled Content with XPath


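When it comes to crawling and parsing content, we have plenty of options.
Many people, for example, feel that Jsoup solves every problem.
HTTP requests, DOM manipulation, and filtering with CSS query selectors are all very convenient with it.

The catch is the selector: a single selector expression can only select nodes.
If I want a text value or an attribute value of a node, I have to fetch it in a second step from the returned Element object.
And I happened to receive an interesting requirement: use one expression alone to describe what to select, and extract the title, link, and other details of every news item on a news page.

XPath is a perfect fit for this, as in the following example: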

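// Besides the Jsoup and javax.xml imports, this needs java.io.StringReader,
// org.xml.sax.InputSource and org.w3c.dom.NodeList.
static void crawlByXPath(String url, String xpathExp) throws IOException, ParserConfigurationException, SAXException, XPathExpressionException {

    // Fetch the page with Jsoup and keep the raw HTML as a string.
    String html = Jsoup.connect(url).get().html();

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // parse(String) treats its argument as a URI, so wrap the markup in an InputSource.
    Document document = builder.parse(new InputSource(new StringReader(html)));

    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();

    XPathExpression expression = xPath.compile(xpathExp);
    // Evaluate against the parsed document, not the raw HTML string.
    NodeList nodes = (NodeList) expression.evaluate(document, XPathConstants.NODESET);
    System.out.println(nodes.getLength());

}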
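
To make the "one expression is enough" point concrete, here is a minimal sketch that uses only the JDK. The class name XPathAttributeDemo, the method select, and the sample markup are assumptions made up for illustration; the markup is deliberately well-formed so that the standard parser accepts it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAttributeDemo {

    // Evaluates one XPath expression against well-formed markup and returns
    // the node value of every matched attribute or text node.
    static List<String> select(String xml, String xpathExp) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(xpathExp, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page, not taken from any real site.
        String xml = "<html><body>"
                + "<h2><a href=\"/daily/1\">First</a></h2>"
                + "<h2><a href=\"/daily/2\">Second</a></h2>"
                + "</body></html>";
        System.out.println(select(xml, "//h2/a/@href"));  // the links
        System.out.println(select(xml, "//h2/a/text()")); // the titles
    }
}
```

The expression itself decides whether the result is a link or a title, so the calling code does not need a second step to pull a value out of an element.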

　　
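Unfortunately, hardly any website makes it past the builder.parse call in crawlByXPath, because real-world HTML is rarely well-formed XML.
XPath, however, is very strict about the DOM it works on.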
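
The strictness is easy to reproduce with the JDK alone. The sketch below is only an illustration: the class name StrictParseDemo, the method isWellFormed, and both sample strings are assumptions. It feeds the standard parser one fragment with an unclosed br tag, which is common in HTML, and one where the tag is closed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParseDemo {

    // Returns true when the markup parses as well-formed XML,
    // false when the parser rejects it with a SAXException.
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Typical HTML: <br> is never closed, so the XML parser gives up.
        System.out.println(isWellFormed("<html><body><p>news<br></p></body></html>"));
        // The same fragment with a self-closing tag is accepted.
        System.out.println(isWellFormed("<html><body><p>news<br/></p></body></html>"));
    }
}
```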
So the HTML has to be cleaned first, and for that I added this dependency:

    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.9</version>
    </dependency>

　
HtmlCleaner solves this problem for me, and it supports XPath itself.
A single call to HtmlCleaner.clean is enough:

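public static void main(String[] args) throws IOException, XPatherException {
    String url = "http://zhidao.baidu.com/daily";
    // Fetch the page with a GET request and keep the raw HTML.
    String contents = Jsoup.connect(url).get().html();

    // clean() repairs the markup and returns the root of a tidy node tree.
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(contents);
    String xpath = "//h2/a/@href";
    // HtmlCleaner evaluates the XPath itself and returns the matches as an array.
    Object[] objects = tn.evaluateXPath(xpath);
    System.out.println(objects.length);

}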

　
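But HtmlCleaner raised a new problem: when I wrote the expression as "//h2/a[contains(@href,'daily')]/@href", it reported that the contains function is not supported.
javax.xml.xpath, on the other hand, does support functions, so now there is a dilemma.
How can the two be combined? HtmlCleaner provides DomSerializer, which converts a TagNode object into an org.w3c.dom.Document object, for example: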

Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);

　
This way each library can play to its strengths.

public static void main(String[] args) throws IOException, XPatherException, ParserConfigurationException, XPathExpressionException {
    String url = "http://zhidao.baidu.com/daily";
    String exp = "//h2/a[contains(@href,'daily')]/@href";

    String html = null;
    try {
        Connection connect = Jsoup.connect(url);
        html = connect.get().body().html();
    } catch (IOException e) {
        e.printStackTrace();
        return; // nothing to parse if the request failed
    }
    // Clean the HTML with HtmlCleaner, then convert it to a W3C DOM for javax.xml.xpath.
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(html);
    Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
    XPath xPath = XPathFactory.newInstance().newXPath();
    Object result;
    result = xPath.evaluate(exp, dom, XPathConstants.NODESET);
    if (result instanceof NodeList) {
        NodeList nodeList = (NodeList) result;
        System.out.println(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            // Attribute and text nodes carry a node value; elements fall back to their text content.
            System.out.println(node.getNodeValue() == null ? node.getTextContent() : node.getNodeValue());
        }
    }
}
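
If the results are needed as data rather than console output, the NodeList loop can be moved into a small helper. The following is a sketch under stated assumptions, not part of the original program: XPathHelper, evaluateToStrings, and parse are hypothetical names, and the Document is built from a small well-formed string that stands in for the output of DomSerializer so that the example runs with the JDK alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathHelper {

    // Evaluates the expression with javax.xml.xpath and collects one string per match:
    // the node value for attribute and text nodes, the text content for elements.
    static List<String> evaluateToStrings(Document dom, String exp) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.evaluate(exp, dom, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            String value = node.getNodeValue();
            values.add(value != null ? value : node.getTextContent());
        }
        return values;
    }

    // Stand-in for DomSerializer.createDOM: builds a DOM from well-formed sample markup.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse("<html><body>"
                + "<h2><a href=\"/daily/1\">First</a></h2>"
                + "<h2><a href=\"/other/2\">Second</a></h2>"
                + "</body></html>");
        // contains() filters the links; only the first href mentions "daily".
        System.out.println(evaluateToStrings(dom, "//h2/a[contains(@href,'daily')]/@href"));
        // Selecting the element itself falls back to its text content.
        System.out.println(evaluateToStrings(dom, "//h2/a[contains(@href,'daily')]"));
    }
}
```

In the real program the Document returned by DomSerializer.createDOM would be passed to evaluateToStrings in place of the sample DOM.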

Time: 2024-10-13 12:14:24
