A simple PHP class for scraping remote articles

<?php
/**
 * Scraper class
 * @author Milkcy
 * @copyright            (C) 2012-2015 TCCMS.COM
 * @lastmodify             2012-07-10 14:00
 */
class gather {

    public $pagestring = '';
    private $db;

    function __construct() {
        global $db;
        $this->db = $db;
    }

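    /**
     * Fetch the raw body of $url, using cURL when available and
     * falling back to file_get_contents(). Returns '' on failure.
     */
    function geturlfile($url) {
        $url = trim($url);
        $content = '';
        if (extension_loaded('curl')) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($ch, CURLOPT_HEADER, 0);
            $content = curl_exec($ch);
            curl_close($ch);
        } else {
            $content = file_get_contents($url);
        }
        // Both curl_exec() and file_get_contents() return false on failure.
        return $content === false ? '' : trim($content);
    }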

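    /**
     * Extract every <a> link in $code.
     * Returns parallel arrays: 'name' (link text) and 'url' (href value).
     */
    function get_all_url($code) {
        preg_match_all('/<a.+?href=["\']?([^>"\' ]+)["\']?\s*[^>]*>([^>]+)<\/a>/is', $code, $arr);
        return array('name' => $arr[2], 'url' => $arr[1]);
    }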

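    /**
     * Return the part of $str between the first $start and the next $end.
     * Returns $str unchanged if either delimiter is empty, and '' if
     * $start does not occur in $str.
     */
    function get_sub_content($str, $start, $end) {
        $start = trim($start);
        $end = trim($end);
        if ($start == '' || $end == '') {
            return $str;
        }
        $str = explode($start, $str, 2);
        if (!isset($str[1])) {
            return '';
        }
        $str = explode($end, $str[1], 2);
        return $str[0];
    }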

    /** Dump $var inside a styled <pre> block for debugging. */
    function vd($var) {
        echo "<div style=\"border:1px solid #ddd;background:#F7F7F7;padding:5px 10px;\">\r\n";
        echo "<pre style=\"font-family:Arial,Vrinda;font-size:14px;\">\r\n";
        var_dump($var);
        echo "\r\n</pre>\r\n";
        echo "</div>";
    }

}

?>
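
The two parsing methods can be tried without any network access. The following is a minimal sketch, not part of the original class: `sub_content` and `all_links` are hypothetical standalone functions that mirror `get_sub_content` and `get_all_url` (with a guard for delimiters that are not found), and the sample HTML is invented purely for illustration.

```php
<?php
// Hypothetical standalone mirror of gather::get_sub_content(), with guards
// for delimiters that do not occur in the input.
function sub_content(string $str, string $start, string $end): string
{
    $start = trim($start);
    $end = trim($end);
    if ($start === '' || $end === '') {
        return $str;
    }
    $from = strpos($str, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($str, $end, $from);
    // If the end delimiter is missing, return everything after the start.
    return $to === false ? substr($str, $from) : substr($str, $from, $to - $from);
}

// Hypothetical standalone mirror of gather::get_all_url().
function all_links(string $code): array
{
    preg_match_all('/<a.+?href=["\']?([^>"\' ]+)["\']?\s*[^>]*>([^>]+)<\/a>/is', $code, $m);
    return ['name' => $m[2], 'url' => $m[1]];
}

// Invented sample markup, only for illustration.
$html = '<div id="nav"><a href="/home">Home</a></div>'
      . '<div class="list"><a href="/a/1.html">First</a> <a href="/a/2.html">Second</a></div>'
      . '<div class="pages">1 2 3</div>';

// Cut out the list region first, then collect the links inside it.
$list = sub_content($html, '<div class="list">', '<div class="pages">');
$links = all_links($list);
print_r($links);
```

Cutting the region out before matching keeps the navigation link (`/home`) out of the result, which is the reason the script below calls `get_sub_content` before `get_all_url`.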

<?php
define('ROOT_PATH', str_replace('\\', '/', dirname(__FILE__)));
include ROOT_PATH."/gather.class.php";
set_time_limit(0);
header("Content-type: text/html; charset=gb2312");
// Target URL (the news list page)
$url = 'http://news.163.com/special/00013C0O/guojibjtj_03.html';
// Instantiate the scraper
$gather = new gather();
// Fetch the HTML of the list page
$html = $gather->geturlfile($url);
// Delimiters that bound the article list
$start = '<div class="bd clearfix">';
$end = '<div class="pages-1 mt25">';
// Extract the article URLs and titles inside that region
$code = $gather->get_sub_content($html, $start, $end);
$newsAry = $gather->get_all_url($code);
// Print the result
//$gather->vd($newsAry);
$tarGetUrl = $newsAry['url'][0];
// Fetch the HTML of the first article
$html = $gather->geturlfile($tarGetUrl);
// Delimiters that bound the article body
$start = '<div id="endText">';
$end = '<span class="cDGray right" style="white-space:nowrap;">';
// Extract the article body inside that region
$code = $gather->get_sub_content($html, $start, $end);
// Strip the embedded ad iframe and the trailing site icon
$killHtml = '<iframe src="http://g.163.com/r?site=netease&affiliate=news&cat=article&type=tvscreen200x300&location=1" width="200" height="300" frameborder="no" border="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>';
$killHtml2 = '<a href="http://news.163.com/"><img src="http://img1.cache.netease.com/cnews/img07/end_i.gif" alt="netease" width="12" height="11" border="0" class="icon" /></a>';
$code = str_replace($killHtml, "", $code);
$code = str_replace($killHtml2, "", $code);
$gather->vd($code);
?>
Snippet source: http://outofmemory.cn

PHP article scraping with regular expressions

<?php
// Fetch the HTML of a page
function getwebcontent($url){
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $contents = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on failure
    return $contents === false ? '' : trim($contents);
}

// Get the titles and URLs from the list page
$string = getwebcontent('http://www.***.com/learn/zhunbeihuaiyun/jijibeiyun/2');
$article = array('title' => array(), 'link' => array(), 'content' => array());
// Match each <li> entry to capture the article path and title
preg_match_all("/<li><a href=\"\/learn\/article\/(.*?)\">(.*?)<\/a>/", $string, $out, PREG_SET_ORDER);
foreach ($out as $match) {
    $article['title'][] = $match[2];
    $article['link'][] = "http://www.***.com/learn/article/".$match[1];
}
// Fetch each article's content by URL
foreach ($article['link'] as $key => $link) {
    $content_html = getwebcontent($link);
    // Store '' when the content block is not found
    $found = preg_match("/<div id=pagenum_0[^>]*>[\s\S]*?<\/div>/", $content_html, $matches);
    $article['content'][$key] = $found ? $matches[0] : '';
}
// Convert the titles from UTF-8 to GBK; without this the files could not be saved under those names
foreach ($article['title'] as $key => $value) {
    $article['title'][$key] = iconv('utf-8', 'gbk', $value); // transcode
}
// Write each article to its own file
$num = count($article['title']);
for ($i = 0; $i < $num; $i++) {
    file_put_contents("{$article['title'][$i]}.txt", $article['content'][$i]);
}
?>
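
The script above uses each scraped title directly as a file name, so a title containing characters such as `/`, `:` or `?` can make `file_put_contents` fail. The following is a minimal sketch of one way to guard against that; `safe_filename` is a hypothetical helper that does not appear in the original code, and the replacement character and fallback name are arbitrary choices.

```php
<?php
// Hypothetical helper: turn a scraped article title into a safe file name.
function safe_filename(string $title, string $fallback = 'untitled'): string
{
    // Drop leftover markup and decode HTML entities in the title.
    $name = html_entity_decode(strip_tags($title), ENT_QUOTES, 'UTF-8');
    // Replace runs of characters that are invalid in Windows or Unix
    // file names (\ / : * ? " < > | and control characters) with "_".
    $name = preg_replace('/[\\\\\/:*?"<>|\x00-\x1F]+/', '_', $name);
    // Trim spaces, dots and underscores from both ends.
    $name = trim($name, ' ._');
    return $name === '' ? $fallback : $name;
}

echo safe_filename('a/b: c?'), "\n";
echo safe_filename('<b>???</b>'), "\n";
```

Applied to the loop above, the call would wrap the title before `.txt` is appended, with any character-set conversion done afterwards on the sanitized name.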

Time: 2024-11-10 07:05:14
