A Crawler's Journey on XX之家

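[Intro] My company works in the P2P lending industry, and analyzing industry data plays a big part in the platform's operational decisions, so I needed to crawl the relevant data from XX之家.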

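1. Analysis

Right-clicking and viewing the page source showed that the page uses a table layout, so I planned the work in four steps: 1. fetch the page with a crawler; 2. parse the page data; 3. store it in the database; 4. write a scheduled job that fetches the data every day. Our company website also uses PHP and I had recently learned a little of it, and I had heard that curl is well suited to fetching web pages, so I decided to write the crawler in PHP.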

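2. Fetching the page

There was one small hiccup: at first every request returned 404.html. Some analysis showed that the site blocks requests that do not come from a browser and redirects them straight to the 404 page. After adding the User-Agent line (highlighted in green in the original post), the data came back successfully.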

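function crawl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Do not include the response headers in the returned body.
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Send a browser User-Agent; the site answers non-browser requests with 404.html.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    // Plain GET request: the dates travel in the query string.
    // Return the page as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}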

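3. Parsing the data

The page source shows that the first row is the title row, which is dropped, and that the last two columns are the platform link and the follow button, which are both filtered out. The ID for the first column has to be cut out of the link in the next column. All the values in between carry Chinese unit characters and a few special characters, which are stripped with preg_replace. Finally, the values are concatenated into an INSERT statement that follows the layout of the XX platform's data, and the statement is returned. Done.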

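// Requires the PHP Simple HTML DOM Parser; the include path is an assumption.
require_once 'simple_html_dom.php';

function analyze($dom, $satTime){
    $html = new simple_html_dom();
    $sql = "insert into XXX_data (XXPlatId,platName,averageMonth,dayTotal,averageRate,investNumber,averageInvestMoney,averageBorrowTime,borrowNumer,borrowBidNumber,averageBorrowMoney,firstTenInvestRate,firstTenBorrowRate,bidEndTime,registerTime,registerMoney,leverageFund,invest30total,repay60total,repayTotal,statisticsTime,excuteTime) values ";
    $html->load($dom);
    $istitle = 0;
    foreach($html->find('tr') as $tr){
        $istitle = $istitle + 1;
        // Skip the first row, which holds the column titles.
        if($istitle == 1){
            continue;
        }
        $sql .= "(";
        $count = 0;
        foreach($tr->find('td') as $element){
            $count = $count + 1;
            if($count == 1){
                // Platform ID: taken from the link in the next cell,
                // keeping the text between the first '-' and the first '.'.
                $href = $element->next_sibling()->find('a', 0)->href;
                $href = strstr($href, '.', TRUE);
                $href = strstr($href, '-');
                $sql .= "'".substr($href, 1)."',";
            }elseif($count == 2){
                // Platform name.
                $val = $element->find('a', 0)->innertext;
                $sql .= "'".$val."',";
            }elseif($count < 21){
                // Numeric columns: strip Chinese unit characters and the symbols % and *.
                $patterns = array();
                $patterns[0] = '/([\x80-\xff]*)/i';
                $patterns[1] = '/[%*]/';
                $val = preg_replace($patterns, '', $element->innertext);
                $sql .= "'".$val."',";
            }
            // The remaining columns (platform link, follow button) are skipped.
        }
        // Append the statistics date and the execution time.
        $sql .= "'".$satTime."','".date('Y-m-d H:i:s')."'),";
    }
    // Drop the trailing comma.
    $sql = substr($sql, 0, strlen($sql) - 1);
    $sql = strip_tags($sql);
    return $sql;
}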
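The two string operations used during parsing can be tried in isolation. The following is a minimal sketch, not the original program: cleanCell() and platformId() are hypothetical helper names, the sample link "platform-123.html" is an assumed format, and the trim() call is an addition that analyze() does not make.

```php
<?php
// Strip Chinese unit characters and the symbols % and * from a cell,
// using the same kind of patterns as analyze(), plus a trim().
function cleanCell(string $text): string
{
    $patterns = [
        '/[\x80-\xff]+/', // any non-ASCII byte, e.g. UTF-8 encoded Chinese characters
        '/[%*]/',         // percent signs and asterisks
    ];
    return trim(preg_replace($patterns, '', $text));
}

// Extract the ID between the first '-' and the first '.' of a link.
// Returns an empty string when the link does not have that shape.
function platformId(string $href): string
{
    $beforeDot = strstr($href, '.', true);    // "platform-123"
    if ($beforeDot === false) {
        return '';
    }
    $fromDash = strstr($beforeDot, '-');      // "-123"
    return $fromDash === false ? '' : substr($fromDash, 1);
}

echo cleanCell('12.5%'), "\n";                // 12.5
echo cleanCell('3.2万元'), "\n";              // 3.2
echo platformId('platform-123.html'), "\n";   // 123
```

Note that the strstr() approach only works when the link contains no dot before the file extension, so a full URL with a host name would need a different extraction.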

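4. Storing the data

After some searching and reading online, I found that working with MySQL from PHP is much simpler than from Java; a few lines of code are enough.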

function save($sql){
    // Uses mysqli; the legacy mysql_* functions are unavailable from PHP 7 onward.
    $con = mysqli_connect("192.168.0.1", "root", "root", "xx_data");
    if (!$con){
        die('Could not connect: ' . mysqli_connect_error());
    }
    mysqli_set_charset($con, "utf8");
    mysqli_query($con, $sql);
    mysqli_close($con);
}
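Because analyze() splices scraped text directly into the SQL string, a platform name that contains a quote character would break the statement. One possible alternative is to build the statement with placeholders and pass the values separately. This is only a sketch under assumptions: buildInsert() is a hypothetical helper that is not part of the original program, and only two of the table's columns are shown.

```php
<?php
// Hypothetical helper: build a multi-row INSERT with "?" placeholders
// and a flat parameter list, instead of splicing values into the SQL text.
function buildInsert(string $table, array $columns, array $rows): array
{
    // One "(?,?,...)" group per row, with one "?" per column.
    $rowPlaceholder = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
    $sql = 'insert into ' . $table . ' (' . implode(',', $columns) . ') values '
         . implode(',', array_fill(0, count($rows), $rowPlaceholder));

    // Flatten the rows into a single list, in placeholder order.
    $params = [];
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }
    return [$sql, $params];
}

list($sql, $params) = buildInsert(
    'XXX_data',
    ['XXPlatId', 'platName'],
    [['123', "O'Neil"], ['456', 'Demo']]
);
echo $sql, "\n";
// insert into XXX_data (XXPlatId,platName) values (?,?),(?,?)
```

With a PDO connection in a variable such as $pdo (assumed here), the pair could then be run as $pdo->prepare($sql)->execute($params), which leaves the quoting to the database driver.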

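5. Batch crawling

Looking at the query parameters, every query fetches a single day's trading data based on the dates in the URL suffix, for example http://XXX/indexs.html?startTime=2015-04-01&endTime=2015-04-01. So it is enough to walk through the historical dates and build a URL for each one to crawl all of the historical transactions.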

function execute(){
    $starttime = "2014-04-15";
    $endtime = "2015-04-15";
    // Step through the range one day (86400 seconds) at a time.
    for($start = strtotime($starttime); $start <= strtotime($endtime); $start += 86400){
        $date = date('Y-m-d', $start);
        $url = "http://shuju.XX.com/indexs.html?startTime=".$date."&endTime=".$date;
        // Step 1: fetch
        $dom = crawl($url);
        // Step 2: parse
        $sql = analyze($dom, $date);
        // Step 3: store
        save($sql);
    }
    echo "execute end";
}

execute();
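Stepping by 86400 seconds assumes that every day lasts exactly 24 hours. In a time zone with daylight saving time, that can repeat or skip a date. A sketch of the same date walk using DatePeriod follows; dailyUrls() is a hypothetical helper name, and the base URL is the one used in execute().

```php
<?php
// Hypothetical helper: one URL per calendar day, from $from to $to inclusive.
function dailyUrls(string $base, string $from, string $to): array
{
    $start = new DateTime($from);
    // DatePeriod excludes the end date, so move it one day forward.
    $end   = (new DateTime($to))->modify('+1 day');
    $urls  = [];
    foreach (new DatePeriod($start, new DateInterval('P1D'), $end) as $day) {
        $date   = $day->format('Y-m-d');
        $urls[] = $base . '?startTime=' . $date . '&endTime=' . $date;
    }
    return $urls;
}

$urls = dailyUrls('http://shuju.XX.com/indexs.html', '2015-04-01', '2015-04-03');
echo count($urls), "\n"; // 3
echo $urls[0], "\n";     // http://shuju.XX.com/indexs.html?startTime=2015-04-01&endTime=2015-04-01
```

Each URL in the returned list could then go through crawl(), analyze() and save() in turn, as in execute().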

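6. Setting up the scheduled job

A scheduled task fetches the latest data at a fixed time every day, so the script does not have to be run by hand. PHP has its own ways of scheduling tasks, but from what I read online they are too complicated to implement, so I used Linux crontab instead. On Linux, run crontab -e to open the editor and add an entry that calls the script through curl. With that, the crawler is complete.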

    30 09 * * * curl  http://192.168.0.1/crawl.php

This program is for learning and discussion only. If you need the complete source code, feel free to contact me separately.

Time: 2024-11-09 02:50:01
