phpQuery对数据信息的采集进一步学习

前提:需要下载:phpQuery/phpQuery.php

链接:http://www.cnblogs.com/wuheng1991/p/5145398.html

1.对于规则的部分

<?php
header(‘Content-Type:text/html;charset=UTF-8‘);
include ‘./phpQuery/phpQuery.php‘;
set_time_limit(10000);
$id = isset($_GET[‘id‘]) ? intval($_GET[‘id‘]) : 1;

if($id > 46){
   echo "finish!";
   exit;
}
// exit;
echo "当前 id=".$id;
echo "<br/>";
$url = "http://www.genetex.com/web/Search/SearchList.aspx?&Category=494&Page=".$id."&SpecialGroupID=240";
echo "当前的URL:";
echo $url."<br/>";

phpQuery::newDocumentFile($url);

$artList = pq(".sear_list");
$count = count($artList);
// var_dump($count);
$i = 0;
foreach($artList as $li){
    $url = ‘‘;
    $tr = ‘‘;
    $tr = pq($li)->eq(0)->find("li")->eq(1)->find("a")->attr(‘href‘);
    $tr = trim($tr);
    // var_dump($tr);
    $url = "http://www.genetex.com/";
    $arr = explode("/",$tr);
    // exit;
    $end = trim($arr[2]);

    $rel_url = $url.$end;
    $rel_url = $rel_url."\r\n";
    file_put_contents(‘url.txt‘,$rel_url,FILE_APPEND); //用于装载执行失败的数据
    echo ‘<br/>‘;

}

?>

<script>
function JumpUrl(){
    location.href=‘?id=<?php echo ($id+1);?>‘;
}
setTimeout(JumpUrl,0);
</script>

2.对于不规则的部分。(通过产品链接,得到产品相关信息)

<?php
header(‘Content-Type:text/html;charset=UTF-8‘);
include ‘./phpQuery/phpQuery.php‘;
set_time_limit(10000);
$id = isset($_GET[‘id‘]) ? intval($_GET[‘id‘]) : 0;

$arr = file(‘./url.txt‘);
// var_dump($arr);

if($id > 453){
   echo "finish!";
   exit;
}

$curr_url = trim($arr[$id]);
unset($arr);
var_dump($curr_url);
echo ‘<br/>‘;
echo ‘<hr/>‘;
phpQuery::newDocumentFile($curr_url);

$arrAll = array();

$artList = pq(".table_style_1");
$app = pq(".sear_list_2")->eq(0)->find("span")->eq(1)->html();
$app = trim($app);
echo "Application Information:".$app;
echo ‘<hr/>‘;
$arrAll[‘xing_num‘]= ‘‘;
$arrAll[‘reference‘] = ‘‘;
// var_dump($artList);
foreach($artList as $key =>$li){

    if($key == 0){

    $tr1 = ‘‘;  ####  Catalog Number,Full Name,Storage Buffer
    $tr2 = ‘‘;  ####  Catalog Number,Full Name,Storage Buffer 对应的属性值
    $tr3 = ‘‘;  ####  Product Name , Product Description,Storage Instruction
    $tr4 = ‘‘;  ####  Product Name , Product Description,Storage Instruction 对应的属性值
    $tr5 = ‘‘;  ####  Applications,Background, Notes
    $tr6 = ‘‘;  ####  Applications,Background, Notes 对应的属性值

    $tr1 = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(0)->html();
    $tr1 = trim($tr1);
    $tr1 = str_replace("\r\n","",$tr1);

    $tr2 = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->html();
    $tr2 = trim($tr2);

    $tr2_a = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->find(".recommend")->eq(0)->html();
    $tr2_a = trim($tr2_a);
    // $tr2_a = str_replace("\r\n","",$tr2_a);

    $tr2_xx = ‘‘;  ###星星
    $tr2_xx = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->find(".recommend")->eq(0)
    ->find("div")->eq(1)->html();
    $tr2_xx = strip_tags($tr2_xx);
    $tr2_xx = str_replace("\r\n","",$tr2_xx);
    $tr2_xx = str_replace("(","",$tr2_xx);
    $tr2_xx = str_replace(")","",$tr2_xx);
    $tr2_xx = str_replace(" ","",$tr2_xx);

    $arrAll[‘xing_num‘] = $tr2_xx;

    // echo "<hr/>";
    // echo var_dump($tr2_xx);
    // echo "<hr/>";
    $tr2_yy = ‘‘;  ### refercence
    $tr2_yy = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->find(".recommend")->eq(0)
    ->find("div")->eq(2)->html();
    $tr2_yy = strip_tags($tr2_yy);
    $tr2_yy = str_replace("\r\n","",$tr2_yy);
    $tr2_yy = str_replace("(","",$tr2_yy);
    $tr2_yy = str_replace(")","",$tr2_yy);
    $tr2_yy = str_replace(" ","",$tr2_yy);

    $arrAll[‘reference‘] = $tr2_yy;
    // echo "<hr/>";
    // echo var_dump($tr2_yy);
    // echo "<hr/>";

    $tr2 = str_replace($tr2_a,"",$tr2);
    $tr2 = strip_tags($tr2);
    $tr2 = str_replace(" ","",$tr2);
    $tr2 = str_replace("\r\n","",$tr2);
    $tr2 = trim($tr2);

    $tr3 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(0)->html();
    $tr3 = trim($tr3);
    $tr3 = str_replace("\r\n","",$tr3);

    if($key == 0){
    $tr4 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(1)->find(‘strong‘)->eq(0)->html();
    }else{
    $tr4 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(1)->html();
    }
    // $tr4 = trim($tr4);
    // $tr4 = str_replace(" ","",$tr4);
    $tr4 = str_replace("\r\n","",$tr4);
    $tr4 = trim($tr4);

    $tr5 = pq($li)->eq(0)->find("tr")->eq(2)->find("td")->eq(0)->html();
    $tr5 = trim($tr5);
    $tr5 = str_replace("\r\n","",$tr5);

    $tr6 = pq($li)->eq(0)->find("tr")->eq(2)->find("td")->eq(1)->html();
    $tr6 = str_replace(" ","",$tr6);
    $tr6 = str_replace("\r\n","",$tr6);
    $tr6 = trim($tr6);

    // echo $tr1."  ;  ".$tr2."   ;     ".$tr3."   ;   ".$tr4."  ;   ".$tr5." ;  ".$tr6;
    // echo ‘<hr/>‘;
    // echo ‘<br/>‘;
 }
    // exit;
}

$arrAll[‘Application_Information‘] = $app;
$arrAll[‘catalog_number‘] = $tr2;
$arrAll[‘product_name‘] = $tr4;
$arrAll[‘applications‘] = $tr6;

$arrAll[‘position_controls‘] = "";    

$art = pq("#tab1 table");
// var_dump($art);
foreach($art as $k =>$v){

if($k == 0){
    // var_dump($v);
for($ii = 0;$ii < 5;$ii ++){
$a_1 = ‘‘;
$a_1 = pq($v)->eq(0)->find("tr")->eq(7+$ii)->find("td")->eq(0)->html();
$a_1 = trim($a_1);
if($a_1 == "Positive Controls"){

$a_2 = pq($v)->eq(0)->find("tr")->eq(7+$ii)->find("td")->eq(1)->html();
$a_2 = trim($a_2);
$arrAll[‘position_controls‘] = $a_2;

} 

}

// var_dump($a_1);
// exit;
}

}

// var_dump($arrAll);
$arrAll[‘full_name‘] = ‘‘;
$arrAll[‘product_description‘] = ‘‘;
$arrAll[‘synonyms‘] = ‘‘;
$arrAll[‘host‘] = ‘‘;
$arrAll[‘clonality‘] = ‘‘;

$arrAll[‘isotype‘] = ‘‘;
$arrAll[‘species_reactivity‘] = ‘‘;
$arrAll[‘conjugation‘] = ‘‘;

// echo ‘<br/>‘;
// echo ‘<hr/>‘;

foreach($artList as $key =>$li){

    if($key == 1){

    $tr1 = ‘‘;  ####  Catalog Number,Full Name,Storage Buffer
    $tr2 = ‘‘;  ####  Catalog Number,Full Name,Storage Buffer 对应的属性值
    $tr3 = ‘‘;  ####  Product Name , Product Description,Storage Instruction
    $tr4 = ‘‘;  ####  Product Name , Product Description,Storage Instruction 对应的属性值
    $tr5 = ‘‘;  ####  Applications,Background, Notes
    $tr6 = ‘‘;  ####  Applications,Background, Notes 对应的属性值

    $tr1 = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(0)->html();
    $tr1 = trim($tr1);
    $tr1 = str_replace("\r\n","",$tr1);

    $tr2 = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->html();
    $tr2 = trim($tr2);

    $tr2_a = pq($li)->eq(0)->find("tr")->eq(0)->find("td")->eq(1)->find(".recommend")->eq(0)->html();
    $tr2_a = trim($tr2_a);
    // $tr2_a = str_replace("\r\n","",$tr2_a);

    $tr2 = str_replace($tr2_a,"",$tr2);
    $tr2 = strip_tags($tr2);
    // $tr2 = str_replace(" ","",$tr2);
    $tr2 = str_replace("\r\n","",$tr2);
    $tr2 = trim($tr2);

    $tr3 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(0)->html();
    $tr3 = trim($tr3);
    $tr3 = str_replace("\r\n","",$tr3);

    if($key == 0){
    $tr4 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(1)->find(‘strong‘)->eq(0)->html();
    }else{
    $tr4 = pq($li)->eq(0)->find("tr")->eq(1)->find("td")->eq(1)->html();
    }
    // $tr4 = trim($tr4);
    // $tr4 = str_replace(" ","",$tr4);
    $tr4 = str_replace("\r\n","",$tr4);
    $tr4 = trim($tr4);

    for($jj = 0;$jj < 3;$jj ++){
        $tr5 = ‘‘;
        $tr5 = pq($li)->eq(0)->find("tr")->eq(2+$jj)->find("td")->eq(0)->html();
           $tr5 = str_replace("\r\n","",$tr5);
           $tr5 = trim($tr5);
           if($tr5 == "Synonyms"){

               $tr6 = pq($li)->eq(0)->find("tr")->eq(2+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr6 = str_replace("\r\n","",$tr6);
            $tr6 = trim($tr6);
            $arrAll[‘synonyms‘] = $tr6;
           }

    }

       ####  host  #####

      for($jj = 0;$jj < 4;$jj ++){
        $tr7 = ‘‘;
        $tr7 = pq($li)->eq(0)->find("tr")->eq(3+$jj)->find("td")->eq(0)->html();
           $tr7 = str_replace("\r\n","",$tr7);
           $tr7 = trim($tr7);
           if($tr7 == "Host"){

               $tr8 = pq($li)->eq(0)->find("tr")->eq(3+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr8 = str_replace("\r\n","",$tr8);
            $tr8 = trim($tr8);
            $arrAll[‘host‘] = $tr8;
           }

    }

    ##### clonality ####
      for($jj = 0;$jj < 4;$jj ++){
        $tr9 = ‘‘;
        $tr9 = pq($li)->eq(0)->find("tr")->eq(4+$jj)->find("td")->eq(0)->html();
           $tr9 = str_replace("\r\n","",$tr9);
           $tr9 = trim($tr9);
           if($tr9 == "Clonality"){

               $tr10 = pq($li)->eq(0)->find("tr")->eq(4+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr10 = str_replace("\r\n","",$tr10);
            $tr10 = trim($tr10);
            $arrAll[‘clonality‘] = $tr10;
           }

    }

    #### isotype ######
     for($jj = 0;$jj < 4;$jj ++){
        $tr11 = ‘‘;
        $tr11 = pq($li)->eq(0)->find("tr")->eq(5+$jj)->find("td")->eq(0)->html();
           $tr11 = str_replace("\r\n","",$tr11);
           $tr11 = trim($tr11);
           if($tr11 == "Isotype"){

               $tr12 = pq($li)->eq(0)->find("tr")->eq(5+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr12 = str_replace("\r\n","",$tr12);
            $tr12 = trim($tr12);
            $arrAll[‘isotype‘] = $tr12;
           }

    }

    #### species_reactivity #####

     for($jj = 0;$jj < 6;$jj ++){
        $tr13 = ‘‘;
        $tr13 = pq($li)->eq(0)->find("tr")->eq(6+$jj)->find("td")->eq(0)->html();
           $tr13 = str_replace("\r\n","",$tr13);
           $tr13 = trim($tr13);
           if($tr13 == "Species Reactivity"){

               $tr14 = pq($li)->eq(0)->find("tr")->eq(6+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr14 = str_replace("\r\n","",$tr14);
            $tr14 = trim($tr14);
            $arrAll[‘species_reactivity‘] = $tr14;
           }

    }

    ######## conjugation ######
      for($jj = 0;$jj < 6;$jj ++){
        $tr15 = ‘‘;
        $tr15 = pq($li)->eq(0)->find("tr")->eq(7+$jj)->find("td")->eq(0)->html();
           $tr15 = str_replace("\r\n","",$tr15);
           $tr15 = trim($tr15);
           if($tr15 == "Conjugation"){

               $tr16 = pq($li)->eq(0)->find("tr")->eq(7+$jj)->find("td")->eq(1)->html();
            // $tr6 = str_replace(" ","",$tr6);
            $tr16 = str_replace("\r\n","",$tr16);
            $tr16 = trim($tr16);
            $arrAll[‘conjugation‘] = $tr16;
           }

    }

    // echo $tr1."  ;  ".$tr2."   ;     ".$tr3."   ;   ".$tr4."  ;   ".$tr5." ;  ".$tr6;
    // echo ‘<hr/>‘;
    // echo ‘<br/>‘;
 }
    // exit;
}

$arrAll[‘full_name‘] = $tr2;
$arrAll[‘product_description‘] = $tr4;
$arrAll[‘researchArea‘] = "";

// echo ‘<br/>‘;
// echo ‘<hr/>‘;

foreach($artList as $key =>$li){
    if($key == 2){

      $tr17 = pq($li)->eq(0)->find("tr")->eq(3)->find("td")->eq(0)->html();
      $tr17 = trim($tr17);
      $tr17 = str_replace("\r\n","",$tr17);

      $tr18 = pq($li)->eq(0)->find("tr")->eq(4)->find("td")->eq(0)->html();

      $tr18 = strip_tags($tr18);
       $tr18 = str_replace("More Hide","",$tr18);
       $tr18 = str_replace("\r\n","",$tr18);
       $tr18 = trim($tr18);
      // echo  $tr18;
      // echo ‘<br/>‘;
      // echo ‘<hr/>‘;
    }
}

// echo ‘<br/>‘;
// echo "★★★★☆";
// echo ‘<br/>‘;
$arrAll[‘researchArea‘] = $tr18;
$arrAll[‘url‘] = $curr_url;
var_dump($arrAll);

$html = "";
$html .= $arrAll[‘Application_Information‘]."\t".$arrAll[‘catalog_number‘]."\t".$arrAll[‘product_name‘]."\t";
$html .= $arrAll[‘applications‘]."\t".$arrAll[‘position_controls‘]."\t".$arrAll[‘full_name‘]."\t";
$html .= $arrAll[‘product_description‘]."\t".$arrAll[‘synonyms‘]."\t".$arrAll[‘host‘]."\t";
$html .= $arrAll[‘clonality‘]."\t".$arrAll[‘isotype‘]."\t".$arrAll[‘species_reactivity‘]."\t";
$html .= $arrAll[‘conjugation‘]."\t".$arrAll[‘researchArea‘]."\t".$arrAll[‘xing_num‘]."\t";
$html .= $arrAll[‘reference‘]."\r\n";

echo $html;
echo ‘<br/>‘;
// file_put_contents(‘info.txt‘,$html,FILE_APPEND); //用于装载执行失败的数据
$fp = fopen(‘file.csv‘, ‘a‘);
fputcsv($fp,$arrAll);

unset($arrAll);
unset($artList);
?>
<script>
function JumpUrl(){
    location.href=‘?id=<?php echo ($id+1);?>‘;
}
setTimeout(JumpUrl,0);
</script>

3.对知识点的补充

a.使用phpQuery采集数据,实际上是通过链接地址,找到源码,再将源码转换为jQuery语法。然后通过jQuery语法获得数据

b.对于规则的部分,采集数据较容易。

c.对于不规则的部分,需要进行判断,较为复杂些,不过如果jQuery学的好,这也不是难事。

d.对函数:fputcsv() 的学习总结。如:

$fp = fopen(‘file.csv‘, ‘a‘);
fputcsv($fp,$arrAll);

其中,$arrAll是一维数组。fputcsv(),会将函数的值写入到csv文件中,一次写一行,如果数组为空,则不会写入,一旦数组中的元素不为空,就可以写入。

参看:http://php.net/manual/zh/function.fputcsv.php

时间: 2024-08-06 04:09:46

phpQuery对数据信息的采集进一步学习的相关文章

vue学习记录:vue引入,validator验证,数据信息,vuex数据共享

最近在学习vue,关于学习过程中所遇到的问题进行记录,包含vue引入,validator验证,数据信息,vuex数据共享,传值问题记录 1.vue 引入vue vue的大致形式如下: <template> </template> <script> </script> <style> </style> 要引入其他vue ,需要 import <template> <div> <Header></

黑客获取数据信息的目的和进攻手段

进入微软.亚马逊,谷歌等美国IT企业工作人才项目,起薪40万,百度搜索(MUMCS) 黑客使用进攻取证获取凭证,如用户名和密码.这些都允许他们访问敏感数据同时能够隐瞒自己的身份,以拖延攻击时被发现的时间并避免暴露自己的行踪.黑客寻找这种以半永久记忆的形式获取存在如 RAM 内存或交换文件中的动态/非静态数据.一旦黑客获得暂时存储在明文中的用户 ID 和密码,他们就可以进入下一个等级的访问,进一步获取资源,如内部网站.文档管理系统和 SharePoint 站点,本文来自网届网. 以下为原文: "一

Docker 学习笔记【2】 Docker 基础操作实,Docker仓库、数据卷,网络基础学习

Docker 学习笔记[3] Docker 仓库实操,创建私有仓库,实操数据卷.数据卷容器,实操 网络基础  ---------高级网络配置和部分实战案例学习 ============================================================= Docker 学习笔记[1]Docker 相关概念,基本操作--------实操记录开始 ============================================================= 被

委托的进一步学习3

嘿嘿,今天的晚上是平安夜,预祝大家节日快乐!在这个冰冷的冬天,给自己一点温暖不论怎么样,生活中的我们要心情愉悦哦,下面就来总结一下我们今天学习的内容,其实我们今天是学习了委托以及对Linq的初步认识吧,总结一下今天学习的内容吧. 一.Lamda表达式在委托中的使用 delegate string MyDel(string n,string p); public class Program { static void Main(string[] args) { #region Lamda表达式 #

Tsar 服务器系统和应用信息的采集报告工具

Tsar介绍 Tsar是淘宝的一个用来收集服务器系统和应用信息的采集报告工具,如收集服务器的系统信息(cpu,mem等),以及应用数据(nginx.swift等),收集到的数据存储在服务器磁盘上,可以随时查询历史信息,也可以将数据发送到nagios报警. Tsar能够比较方便的增加模块,只需要按照tsar的要求编写数据的采集函数和展现函数,就可以把自定义的模块加入到tsar中. Tsar安装 Tsar目前托管在github上,下载编译安装步骤: $git clone git://github.c

20155234第九周《信息安全系统设计基础》学习总结

20155234第九周<信息安全系统设计基础>学习总结 第六章 存储器结构层次 CPU执行指令,而存储器系统为CPU存放指令和数据. 存储器系统是一个线性的字节数组. 实际上,存储器系统是一个具有不同的容量.成本和访问时间的存数设备的层次结构.靠近CPU的小的.成本和访问时间的存储设备的层次结构. 高速缓存存储器是作为主存储器的数据和指令的缓冲区域. 主存暂时存放存储在容量较大的.慢速磁盘上的数据,而这些磁盘常常又作为存储在通过网络连接的其他机器的磁盘或磁带的数据的缓冲区. 局部性的基本属性:

大数据新手的0基础学习路线,从菜鸟到高手的成长之路

大数据作为一个新兴的热门行业,吸引了很多人,但是对于大数据新手来说,按照什么路线去学习,才能够学习好大数据,实现从大数据菜鸟到高手的转变.这是很多想要学习大数据的朋友们想要了解的. 今天我们就来和大家分享下大数据新手从0开始学习大数据,实现菜鸟到高手的转变的学习路线.希望能够帮助想要学习大数据的朋友. 如果你想要学好大数据最好加入一个好的学习环境,可以来这个Q群529867072 这样大家学习的话就比较方便,还能够共同交流和分享资料 以下是大数据新手学习路线的正文: Linux:因为大数据相关软

最全的抖音数据信息获取

最近开发了一套抖音采集程序,目前提供如下接口. 1.抖音综合搜索数据信息接口 2.抖音视频搜索数据信息接口 3.抖音用户信息搜索数据信息接口 4.获取抖音首页推荐列表数据信息接口 5获取抖音对应城市的推荐列表数据信息接口 6.获取抖音用户信息数据信息接口 7.获取抖音用户作品(抖音用户视频)数据信息接口 8.获取抖音用户动态数据信息接口 9.获取抖音用户关注用户列表数据信息接口 注意:关注列表请求太频繁会导致不返回数据 10.获取抖音用户粉丝列表数据信息接口 11.获取抖音评论列表数据信息接口

用python+selenium获取北上广深成五地PM2.5数据信息并按空气质量排序

从http://www.pm25.com/shenzhen.html抓取北京,深圳,上海,广州,成都的pm2.5指数,并按照空气质量从优到差排序,保存在txt文档里 #coding=utf-8 from selenium import webdriver from time import sleep class PM: def __init__(self): self.dr = webdriver.Chrome() self.pm25_info = self.get_pm25_info() de