Using the simple_html_dom class in PHP to fetch page content and act as a crawler

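When a PHP script has to act as a crawler, regular expressions are probably the first thing that comes to mind. I can never remember the regex rules and find them hard to get started with. Today at work I needed to scrape store information from a website.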

By chance I came across a handy library online called simple_html_dom.

GitHub download: https://github.com/samacs/simple_html_dom

The most important step: you first have to understand the structure of the target site and know which tag the data you want starts from.
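As a rough illustration of what knowing the structure means in practice, the sketch below pulls attributes out of a small HTML fragment. It uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it runs without any extra files. The fragment, its attribute values, and the function name extractStores are made up for the example; a real page will look different.

```php
<?php
// Hypothetical fragment: each store is an <li> whose attributes hold the data.
$sample = <<<HTML
<ul>
  <li data-counts="0" storeid="101" mapname="Store A" mapx="121.47" mapy="31.23"></li>
  <li data-counts="3" storeid="102" mapname="Store B" mapx="116.40" mapy="39.90"></li>
</ul>
HTML;

// Return the attributes of every <li data-counts="0"> as an array of rows.
function extractStores(string $html): array
{
    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely strictly valid HTML.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $stores = [];
    foreach ($xpath->query('//li[@data-counts="0"]') as $li) {
        $stores[] = [
            'storeid' => $li->getAttribute('storeid'),
            'mapname' => $li->getAttribute('mapname'),
            'mapx'    => $li->getAttribute('mapx'),
            'mapy'    => $li->getAttribute('mapy'),
        ];
    }
    return $stores;
}

$stores = extractStores($sample);
print_r($stores);
```

Only the first <li> matches, because the selector filters on data-counts="0", the same condition the simple_html_dom selector li[data-counts=0] expresses.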

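Here is a walkthrough of the process.

I split the implementation into three steps.

1. Insert each store's key information (latitude, longitude, name, and so on) into a local table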

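// Note: the mysql_* functions were removed in PHP 7.0; on newer versions use mysqli or PDO instead.
set_time_limit(0); // no time limit, the crawl may run for a while
$host = '127.0.0.1';
$user = 'root';
$user_pwd = '';
$database = 'dataname';
$conn = mysql_connect($host, $user, $user_pwd) or die('Could not connect to MySQL');
mysql_select_db($database, $conn) or die('Could not select the database');
mysql_query('set names utf8');
include('./simple_html_dom-master/simple_html_dom.php');
$url = 'URL of the site to crawl';
$html = file_get_html($url);
if (!$html) {
    die('Could not load the page');
}
$n = 1;
// Each matching <li> carries the store data in its attributes
foreach ($html->find('li[data-counts=0]') as $e) {
    $storeid = (int) $e->storeid;
    // Escape every scraped value before it goes into the SQL string
    $star = mysql_real_escape_string($e->level . '.0');
    $work_time = mysql_real_escape_string($e->time);
    $mapx = mysql_real_escape_string($e->mapx);
    $mapy = mysql_real_escape_string($e->mapy);
    $nickname = mysql_real_escape_string($e->mapname);
    $mapadd = mysql_real_escape_string($e->mapadd);
    $maptel = mysql_real_escape_string($e->maptel);
    $time = date('Y-m-d H:i:s');
    $query = "INSERT INTO `store` (`storeid`,`star`,`work_time`,`longitude`,`latitude`,`create_time`,`nickname`,`address`,`tel`)
        VALUES ($storeid,'$star','$work_time','$mapx','$mapy','$time','$nickname','$mapadd','$maptel')";
    $res = mysql_query($query);
    //echo $query;exit();
    if ($res) {
        echo 'Imported store #' . $n . '<br>';
        $n++;
    } else {
        die('Failed<br>');
    }
}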
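Since the mysql_* extension no longer exists from PHP 7.0 onward, the insert in step 1 could be sketched with PDO and a prepared statement, which also removes the need to escape values by hand. This is only one possible shape under stated assumptions: the function names buildStoreRow and insertStore and the sample attribute values are invented for the example, and the column list is taken from the INSERT statement above.

```php
<?php
// Map the scraped attributes to the named placeholders of the INSERT.
function buildStoreRow(array $attrs, string $now): array
{
    return [
        ':storeid'     => (int) $attrs['storeid'],
        ':star'        => $attrs['level'] . '.0',
        ':work_time'   => $attrs['time'],
        ':longitude'   => $attrs['mapx'],
        ':latitude'    => $attrs['mapy'],
        ':create_time' => $now,
        ':nickname'    => $attrs['mapname'],
        ':address'     => $attrs['mapadd'],
        ':tel'         => $attrs['maptel'],
    ];
}

// Insert one row; the driver binds the values, so no manual escaping is needed.
function insertStore(PDO $pdo, array $row): bool
{
    $sql = 'INSERT INTO `store`
            (`storeid`,`star`,`work_time`,`longitude`,`latitude`,`create_time`,`nickname`,`address`,`tel`)
            VALUES
            (:storeid,:star,:work_time,:longitude,:latitude,:create_time,:nickname,:address,:tel)';
    $stmt = $pdo->prepare($sql);
    return $stmt !== false && $stmt->execute($row);
}

// Made-up attribute values, standing in for one scraped <li>.
$row = buildStoreRow([
    'storeid' => '101', 'level' => '5', 'time' => '09:00-21:00',
    'mapx' => '121.47', 'mapy' => '31.23', 'mapname' => 'Store A',
    'mapadd' => '1 Example Road', 'maptel' => '000-0000',
], date('Y-m-d H:i:s'));

// With a real connection (credentials as in the script above):
// $pdo = new PDO('mysql:host=127.0.0.1;dbname=dataname;charset=utf8', 'root', '');
// insertStore($pdo, $row);
```

Inside the scraping loop, the attribute array would be filled from $e->storeid, $e->level and the other attributes read in step 1.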

2. Go to another page on the site to fetch each store's logo image

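$query = "SELECT storeid FROM store order by id desc";
$row = mysql_query($query);
while ($rows = mysql_fetch_array($row)) {
    $url = 'http://<target site domain>/' . $rows['storeid'] . '.jhtml';
    $html = file_get_html($url);
    if (!$html) {
        continue; // skip pages that fail to load
    }
    foreach ($html->find('div.onlyOnePic') as $e) {
        // Get the src attribute of the img
        $img = $e->firstChild()->src;
        // Save the remote image locally
        $content = file_get_contents($img);
        if ($content !== false) {
            file_put_contents('./store/' . $rows['storeid'] . '.jpeg', $content);
        }
    }
    $html->clear(); // free the parser's memory before loading the next page
}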
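The download part of step 2 can be made a little more defensive. The sketch below is an assumption-laden illustration, not part of the original script: logoPath and saveImage are hypothetical helper names, and the path helper assumes store IDs are numeric, as the unquoted storeid in the SQL above suggests.

```php
<?php
// Build the local path for a store logo; only the digits of the ID are kept,
// so a malformed ID cannot point outside the target directory.
function logoPath(string $dir, string $storeid): string
{
    $safeId = preg_replace('/\D/', '', $storeid);
    return rtrim($dir, '/') . '/' . $safeId . '.jpeg';
}

// Copy $src (a URL or a local path) to $dest.
// Returns false instead of writing an empty file when the download fails.
function saveImage(string $src, string $dest): bool
{
    $content = @file_get_contents($src);
    if ($content === false || $content === '') {
        return false;
    }
    $dir = dirname($dest);
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        return false;
    }
    return file_put_contents($dest, $content) !== false;
}
```

In the loop above, the two lines that fetch and write the image would become a single call such as saveImage($img, logoPath('./store', $rows['storeid'])).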

3. Update the logo field for each store in the table

$query = "SELECT storeid FROM store order by id desc";
$row = mysql_query($query);
$n = 1;
while ($rows = mysql_fetch_array($row)) {
    // Public URL under which the saved logo is served
    $img = "https://<my own site's domain>/" . $rows['storeid'] . ".jpeg";
    $sql = "UPDATE store set img_url='" . mysql_real_escape_string($img) . "' where storeid=" . (int) $rows['storeid'];
    $res = mysql_query($sql);
    if ($res) {
        echo 'Updated store #' . $n . '<br>';
        $n++;
    } else {
        echo 'Failed';
    }
}
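Because every logo URL is just the site's base URL plus the store ID plus ".jpeg", step 3 could probably be collapsed into a single UPDATE that lets MySQL build the string with CONCAT. The sketch below only prepares the statement and its parameter; buildLogoUpdate is a hypothetical helper, and https://example.com stands in for your own site's domain.

```php
<?php
// Return the SQL and the bound parameter for one UPDATE that fills
// img_url for every row of the store table.
function buildLogoUpdate(string $baseUrl): array
{
    $sql = "UPDATE store SET img_url = CONCAT(?, storeid, '.jpeg')";
    return [$sql, [rtrim($baseUrl, '/') . '/']];
}

[$sql, $params] = buildLogoUpdate('https://example.com');

// With a PDO connection this would run as:
// $pdo->prepare($sql)->execute($params);
```

This replaces one query per store with a single statement, at the cost of losing the per-store progress output.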

OK, that does the job. I have not yet explored the library's other features in any depth; this is just a note so I can find it again when I need it.

Date: 2024-10-14 19:01:46
