Scraping Web Page Tables with PowerShell

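Today I happened across a blog post by 传教士, http://www.cnblogs.com/piapia/p/5367556.html ("Two Crawlers in PowerShell"), and found it very inspiring. I tried it myself and managed to scrape a table from a web page. Because I run an English-language system, Chinese-language content turned into garbled characters when converted to strings, so all my tests were done on English web pages.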

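PowerShell 5 introduces a new cmdlet called ConvertFrom-String, which converts strings into objects. One of its parameters takes a template and generates objects from the parts of the string that match it, and we can use this to scrape tables from web pages.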

Link to the detailed help documentation:

https://technet.microsoft.com/library/dn807178(v=wps.640).aspx

Let's start with a basic example.

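# Sample text: three rows of space-separated numbers
$a = @'
1 2 3 4
5 6 7 8
9 2 2 3
'@
# Template: name the fields in the first row; * marks the field that starts each record
$t = @'
{Co1*:1} {Co2:2} {Co3:3} {Co4:4}
{Co1*:5} 6 7 8
'@
# Split on line breaks (-Delimiter takes a regular expression; \r?\n matches CRLF or LF)
$c = $a | ConvertFrom-String -Delimiter "\r?\n"
# Parse the same text with the custom template
$d = $a | ConvertFrom-String -TemplateContent $t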

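The same string is processed twice: the first call uses a line break as the delimiter to produce a single object, and the second uses a custom template to do the matching. Note the syntax for defining properties: each one is wrapped in {}, the first property of a record is written as {PropertyName*:value}, and the later ones do not need the *. The template needs at least two rows of sample data to work.

You can see that the first object has three properties: P1 is 1 2 3 4, P2 is 5 6 7 8, and P3 is 9 2 2 3.

The second result is matched automatically column by column (the template already covers the first two rows).

Next, let's look at two real examples.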

The first example is the web page below, which contains a list of Australian proxy servers that I want to extract.

http://www.proxylisty.com/country/Australia-ip-list

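Basic approach: fetch the whole page with Invoke-RestMethod, which returns it as a string.

Then design a matching template. Because the page is an HTML file, the string contains all of its HTML markup, so the key is producing a template for the table that includes that markup.

This is simple: every web page lets you view its HTML source, so the large block of HTML below can be made by copying the markup for two table rows straight from the page and editing it slightly to add the property names.

Matching against the template then generates the table objects automatically.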
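In case ConvertFrom-String is not available in your session (to my knowledge it ships with Windows PowerShell 5.x rather than the cross-platform releases), the same "two rows of HTML become objects" idea can be roughly approximated with a regular expression. This is a different technique from the template matching described here, shown only as a fallback sketch. The trimmed-down `$html` sample, the `$pattern`, and the variable names are my own assumptions, not part of the original script.

```powershell
# Hypothetical, shortened sample standing in for the string Invoke-RestMethod returns.
$html = @'
<tr>
<td>203.56.188.145</td>
<td><a href='http://www.proxylisty.com/port/8080-ip-list'>8080</a></td>
</tr>
<tr>
<td>103.25.182.1</td>
<td><a href='http://www.proxylisty.com/port/8081-ip-list'>8081</a></td>
</tr>
'@

# (?s) lets "." span line breaks; the lazy ".*?" keeps each match inside one row.
$pattern = '(?s)<tr>\s*<td>(?<IP>[\d.]+)</td>\s*<td><a[^>]*>(?<Port>\d+)</a></td>.*?</tr>'

# Build one object per matched row from the named capture groups.
$rows = foreach ($m in [regex]::Matches($html, $pattern)) {
    [pscustomobject]@{
        IP   = $m.Groups['IP'].Value
        Port = [int]$m.Groups['Port'].Value
    }
}

$rows | Format-Table
```

The trade-off is that the regular expression has to be written by hand, whereas the template approach only needs sample rows pasted from the page source.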

$web = 'http://www.proxylisty.com/country/Australia-ip-list'
# Two sample rows copied from the page source, with property names added
$template = @'
<tr>
<td>{IP*:203.56.188.145}</td>
<td><a href='http://www.proxylisty.com/port/8080-ip-list' title='Port 8080 Proxy List'>{Port:8080}</a></td>
<td>HTTP</td>
<td><a style='color:red;' href='http://www.proxylisty.com/anonymity/High anonymous / Elite proxy-ip-list' title='High anonymous / Elite proxy Proxy List'>High anonymous / Elite proxy</a></td>
<td>No</td>
<td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img style='margin: 0px 5px 0px 0px; padding: 0px;' src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td>
<td>13 Months</td>
<td>2.699 Sec</td>
<td><div id="progress-bar" class="all-rounded">
<div title='50%' id="progress-bar-percentage" class="all-rounded" style="width: 50%">{Reliability:50%}</div></div></td>
</tr>
<tr>
<td>{IP*:103.25.182.1}</td>
<td><a href='http://www.proxylisty.com/port/8081-ip-list' title='Port 8081 Proxy List'>{Port:8081}</a></td>
<td>HTTP</td>
<td><a style='color:red;' href='http://www.proxylisty.com/anonymity/Anonymous proxy-ip-list' title='Anonymous proxy Proxy List'>Anonymous proxy</a></td>
<td>No</td>
<td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img style='margin: 0px 5px 0px 0px; padding: 0px;' src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td>
<td>15 Months</td>
<td>7.242 Sec</td>
<td><div id="progress-bar" class="all-rounded">
<div title='55%' id="progress-bar-percentage" class="all-rounded" style="width: 55%">{Reliability:55%}</div></div></td>
</tr>
'@
# Fetch the page as a string, then match it against the template
$temp = Invoke-RestMethod -Uri $web
$result = ConvertFrom-String -TemplateContent $template -InputObject $temp
$result | Sort-Object Reliability

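The table was scraped successfully.

Similarly, I (豆子) have been paying more attention to healthy food lately, and I wanted to see which foods have a low GI.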
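One thing to watch in the final sort on Reliability: assuming the values come back as text such as "50%", they are compared as strings, so "100%" would land before "50%". A minimal sketch of a numeric sort follows. The sample rows are hypothetical stand-ins for the scraped result (the third IP is a documentation-range address I made up), so the snippet runs without touching the network.

```powershell
# Hypothetical rows shaped like the scraped result; Reliability is text such as "50%".
$result = @(
    [pscustomobject]@{ IP = '203.56.188.145'; Port = '8080'; Reliability = '50%' }
    [pscustomobject]@{ IP = '103.25.182.1';   Port = '8081'; Reliability = '55%' }
    [pscustomobject]@{ IP = '192.0.2.10';     Port = '3128'; Reliability = '100%' }
)

# Strip the percent sign and cast to int so 100 sorts after 55, not before 50.
$sorted = $result | Sort-Object { [int]($_.Reliability -replace '%', '') }

$sorted | Format-Table
```

Adding -Descending to Sort-Object would put the most reliable proxies first.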

http://ultimatepaleoguide.com/glycemic-index-food-list

I need to extract the table on that page.

# Two sample rows copied from the page source, with property names added
$t2 = @'
<tr>
<td valign="top">{Food*:Banana cake, made with sugar}</td>
<td valign="top">{GI:47}</td>
<td valign="top">{Size:60}</td>
</tr>
<tr>
<td valign="top">{Food*:Banana cake, made without sugar}</td>
<td valign="top">{GI:55}</td>
<td valign="top">{Size:60}</td>
</tr>
'@
$web2 = 'http://ultimatepaleoguide.com/glycemic-index-food-list/'
# Fetch the page as a string, match it against the template, and browse the result
$temp = Invoke-RestMethod -Uri $web2
$result1 = ConvertFrom-String -TemplateContent $t2 -InputObject $temp
$result1 | Out-GridView

Success!
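Since the goal was to find low-GI foods, the scraped rows can be filtered once they are objects. The sketch below assumes the GI values arrive as text and uses a cutoff of 55, a commonly quoted boundary for "low GI" that is my assumption rather than something from the page. The first two sample rows come from the template above, and the third is a made-up entry added only to show the filter working.

```powershell
# Sample rows; in a real run these would come from $result1.
$foods = @(
    [pscustomobject]@{ Food = 'Banana cake, made with sugar';    GI = '47'; Size = '60' }
    [pscustomobject]@{ Food = 'Banana cake, made without sugar'; GI = '55'; Size = '60' }
    [pscustomobject]@{ Food = 'Hypothetical high-GI item';       GI = '70'; Size = '30' }
)

# Assumed cutoff: a GI of 55 or less is treated as "low" here.
$lowGiThreshold = 55

# Cast GI to int so the comparison and the sort are numeric, not textual.
$lowGi = $foods |
    Where-Object { [int]$_.GI -le $lowGiThreshold } |
    Sort-Object { [int]$_.GI }

$lowGi | Format-Table
```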

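This approach is very handy, especially when you need to pull list data out of a web page. Of course, if the site itself offers a RESTful interface that returns JSON, fetching that directly is even easier.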
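To illustrate why a JSON interface is less work: Invoke-RestMethod converts a JSON response into objects by itself, so no template is needed. The sketch below uses ConvertFrom-Json on a hypothetical payload to show the same conversion without a network call. The field names `food` and `gi` are invented for illustration and do not come from any real API.

```powershell
# Hypothetical JSON payload, standing in for what a REST endpoint might return.
$json = @'
[
  { "food": "Banana cake, made with sugar", "gi": 47 },
  { "food": "Banana cake, made without sugar", "gi": 55 }
]
'@

# Invoke-RestMethod performs this conversion itself for JSON responses;
# ConvertFrom-Json is used here only so the sketch runs offline.
$items = $json | ConvertFrom-Json

# The numbers are already numeric, so they can be compared directly.
$items | Where-Object { $_.gi -lt 50 } | Format-Table
```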

Time: 2024-10-29 19:08:32
