Ganon web scraping example

Project page: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list

Ganon is a powerful library: it locates DOM elements with tag selectors similar to those used in JavaScript.

The Ganon library gives access to HTML/XML documents in a very simple object oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries.

Ganon usage examples:

// Load the library (assumes ganon.php sits next to this script)
require 'ganon.php';

// Parse the Google Code website into a DOM
$html = file_get_dom('http://code.google.com/');

Access
Accessing elements is made easy through the CSS3-like selectors and the object model.

// Find all the paragraph tags with a class attribute and print the
// value of the class attribute
foreach($html('p[class]') as $element) {
  echo $element->class, "<br>\n";
}

// Find the first div with ID "gc-header" and print the plain text of
// the parent element (plain text means no HTML tags, just the text)
echo $html('div#gc-header', 0)->parent->getPlainText();

// Find out how many tags there are which are "ns:tag" or "div", but not
// "a" and do not have a class attribute
echo count($html('(ns|tag, div + !a)[!class]'));

Modification
Elements can be easily modified after you've found them.

// Find all paragraph tags which are nested inside a div tag, change
// their ID attribute and print the new HTML code
foreach($html('div p') as $index => $element) {
  $element->id = "id$index";
}
echo $html;

// Center all the links inside a document which start with "http://"
// and print out the new HTML
foreach($html('a[href ^= "http://"]') as $element) {
  $element->wrap('center');
}
echo $html;

// Find all odd indexed "td" elements and change the HTML to make them links
foreach($html('table td:odd') as $element) {
  $element->setInnerText('<a href="#">'.$element->getPlainText().'</a>');
}
echo $html;

Beautify
Ganon can also help you beautify your code and format it properly.

// Beautify the old HTML code and print out the new, formatted code
dom_format($html, array('attributes_case' => CASE_LOWER));
echo $html;

Time: 2024-10-07 19:11:30
