Spidering Hacks study notes (2)
8: Hack 8 Installing Perl Modules
Ways to install:
On Linux, Mac, and Unix: CPAN (Comprehensive Perl Archive Network)
On Windows: PPM (Programmer's Package Manager)
Example: installing the LWP module (full name: The World-Wide Web library for Perl)
In a terminal (I am using Ubuntu), either:
(1) sudo perl -MCPAN -e 'install LWP'
(2) sudo perl -MCPAN -e shell
    install LWP
(LWP is the module shipped in the libwww-perl distribution.)
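Manual install (not covered in detail here; it is in my Perl study notes):
Normally: perl Makefile.PL (followed by make and make install) installs the module system-wide, under /usr/local
But if you have little more than user access to the system, you should force the install into a directory of your own, such as /home/hqh/lib:
(perl Makefile.PL LIB=/home/hqh/lib)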
9: Hack 9 Simply Fetching with LWP::Simple
Code:
#!/usr/bin/perl -w
# The -w on the line above turns on warnings; the strict checking comes from "use strict"
use strict;
use LWP::Simple;
my $url = "http://www.baidu.com";
my $content = get($url);
die "could not get $url" unless defined $content;
if ($content =~ m/baidu/i) {
    print "Found the string \"baidu\"\n";
} else {
    print "Did not find the string \"baidu\"\n";
}
# Review of the three operators m//, s/// and tr///
my $str = "i love perl,oh yeah!";
if ($str =~ m/lo/) {
    print "have lo\n";
}
$str = "i love perl,oh yeah!";
if ($str =~ /lo/) {
    print "have lo as well\n";
}
# The leading m is optional!
my $name = "my name is huangqihao haha";
$name =~ s/name/handsome name/;
print "$name\n";
$name =~ s/m/heihei/;
print "$name\n";
# Note that s/// replaces only the first m with heihei. To replace every m in $name, add the g flag:
$name =~ s/m/heihei/g;
print "$name\n";
# See, now every m has been replaced!
# If a Perl regular expression contains (), then after a match or substitution the text matched by each () is stored in $1, $2, ...
my $office = "hangzhou wenyixilu ";
$office =~ s/(yi)(xi)(lu)/<$2>,<$3>,<$1>/;
print "$office\n";
# Explanation: yi goes into $1, xi into $2, lu into $3; the matched "yixilu" is then replaced by "<xi>,<lu>,<yi>"
# tr
my $car = "my car's brand is bora";
$car =~ s/bora/bmw/;
print "$car\n";
# tr maps single characters (b->B, m->M, w->W) across the whole string, not just inside "bmw"
$car =~ tr/bmw/BMW/;
print "$car\n";
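LWP::Simple also has a head function. It returns only a small part of the HTTP headers (content type, document length, modification time, expiry time, server), whereas get returns the whole document.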
10: Hack 10 More Involved Requests with LWP::UserAgent
LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.
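To make the HTTP::Response side concrete, here is a minimal sketch that builds a response object by hand, so it runs without a network. In real code the object would come back from $browser->get($url); the status, header, and body values below are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for what $browser->get($url) returns.
# The status, header and body are made-up illustration values.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html' ],
    '<html><body>hello</body></html>',
);

my $type = $response->content_type;    # scalar context: just the media type

if ( $response->is_success ) {
    print "Status: ", $response->status_line, "\n";
    print "Type:   $type\n";
    print "Length: ", length( $response->content ), "\n";
} else {
    die "Request failed: ", $response->status_line, "\n";
}
```

Checking is_success before touching content is the usual pattern, since a failed request still returns an object (carrying the error message) rather than undef.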
11: Hack 11 Adding HTTP Headers to Your Request
Q1: Why?
Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.
Q2: How?
$response = $browser->get($url, ...)
Example:
"you're telling the remote server which types of data you're willing to Accept"
To change the User-Agent:
$browser->agent('Mozilla/4.76 [en] (Win98; U)')
#!/usr/bin/perl
# 11 Hack 11: Adding HTTP Headers to Your Request
=xxx
# Review of the basic request/response flow: calling new on the LWP::UserAgent class creates an object, and that object gets the class's methods and attributes
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
my $url = "http://www.qq.com";
my $response = $browser->get($url);
if ($response->content =~ /qq/) {
    print "response has 'qq'\n";
}
else {
    print "no 'qq'\n";
}
# To add headers, look at the book's code to see what the headers contain; the call to use is $response = $browser->get($url, ...)
# The structure of the headers in the book:
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
# Notes: User-Agent: identifies the browser and its version;
# Accept: the data types the client accepts;
# Accept-Charset: the character sets;
# Accept-Language: the preferred language;
# OK, if you only want to change the browser identification, use the agent method of LWP::UserAgent:
# $browser->agent('Mozilla/4.76 [en] (Win98; U)');
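One way to see what such a header list does is to build the request object without sending it. This is only a sketch: the URL is a hypothetical placeholder, and the header list is shortened to two of the entries from the notes.

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical target URL, used only for illustration.
my $url = 'http://www.example.com/';

my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
);

# Build the request object; nothing is sent over the network here.
my $request = HTTP::Request->new( GET => $url, \@ns_headers );

# Show the request line and headers as they would go over the wire.
print $request->as_string;

my $agent = $request->header('User-Agent');
print "Browser identification: $agent\n";
```

With a live connection, $browser->get($url, @ns_headers) sends the same headers in a single call.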
# 12: Hack 12 Posting Form Data with LWP
=xxx
Example: http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22
Notes: num, after the ?, is the number of results returned per page
hl is the language
q is the query, written as its URL-encoded equivalent
=cut
#!/usr/bin/perl -w
use strict;
use LWP;
my $word = shift;
$word or die "Usage: perl altavista_post.pl [keyword]\n";
my $browser = LWP::UserAgent->new;
my $url = 'http://www.altavista.com/web/results';
my $response = $browser->post( $url,
    [ 'q'    => $word,   # the Altavista query string
      'pg'   => 'q',
      'avkw' => 'tgz',
      'kl'   => 'XX',
    ]);
# This switches the request method to POST. Roughly speaking, POST is like an update, while GET is like a query or fetch.
# Having switched to POST, check whether the result that comes back is what the request asked for.
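The difference between the two methods is easier to see with the encoded data side by side. The sketch below rebuilds the Google URL from the notes with URI's query_form and builds (but does not send) a POST request with HTTP::Request::Common. The keyword 'perl' is a made-up stand-in for $word.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw(POST);

# GET: the form fields travel in the query string of the URL.
my $uri = URI->new('http://www.google.com/search');
$uri->query_form( num => 100, hl => 'en', q => '"three blind mice"' );
print "GET URL:   $uri\n";

# POST: the same kind of key/value pairs travel in the request body.
# 'perl' is a made-up search keyword standing in for $word.
my $request = POST 'http://www.altavista.com/web/results',
    [ q => 'perl', pg => 'q', avkw => 'tgz', kl => 'XX' ];
print "POST body: ", $request->content, "\n";
print "POST type: ", $request->header('Content-Type'), "\n";
```

Neither call touches the network, so this is a convenient way to check the encoding before pointing a spider at a real form.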
13: Hack 13 Authentication, Cookies, and Proxies
=xxx
# For all the talk about authentication, it comes down to this:
$browser->credentials(
    'servername:portnumber',
    'realm-name',
    'username' => 'password'
);
# This has to be done before making the request!
Example:
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
);
Cookies:
Keep cookies in a file on disk (loaded from the file, and saved back automatically):
use HTTP::Cookies;
$browser->cookie_jar( HTTP::Cookies->new(
    'file'     => '/some/where/cookies.lwp',
    'autosave' => 1,
));
Read the cookies file that a Netscape browser has stored on disk:
use HTTP::Cookies; # yes, loads HTTP::Cookies::Netscape too
$browser->cookie_jar( HTTP::Cookies::Netscape->new(
    'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
));
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
$browser->env_proxy;
Argh, the book does not cover the other proxy methods and tells me to go look them up myself. That is just lazy!
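For reference, LWP::UserAgent also has proxy and no_proxy methods for setting a proxy explicitly instead of reading it from the environment. This is a minimal sketch rather than anything from the book; the proxy address and the domain names are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy address, used only for illustration.
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.com:8080/' );

# Hosts in these (made-up) domains are contacted directly, bypassing the proxy.
$browser->no_proxy( 'localhost', 'intranet.example.com' );

# Calling proxy() with only a scheme returns the current setting.
my $http_proxy = $browser->proxy('http');
print "HTTP proxy: $http_proxy\n";
```

env_proxy does the same job by reading environment variables such as http_proxy and no_proxy, which suits scripts that should follow whatever the machine is already configured to use.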
14: Hack 14 Handling Relative and Absolute URLs
Use the URI class:
$url->scheme returns the scheme, such as http or ftp
$url->host returns the host, such as www.baidu.com
URI->new_abs takes a URL string that is most likely relative and gives back an absolute URL
use URI;
my $abs = URI->new_abs($maybe_relative, $base);
This hack also shows how to match http URLs
=cut
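A short sketch of these URI methods in action. The base page and the relative link are hypothetical values chosen for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical page and link, used only for illustration.
my $base           = 'http://www.example.com/a/b/index.html';
my $maybe_relative = '../images/logo.png';

# Resolve the link against the page it was found on.
my $abs = URI->new_abs( $maybe_relative, $base );

print "absolute: $abs\n";
print "scheme:   ", $abs->scheme, "\n";
print "host:     ", $abs->host,   "\n";
print "path:     ", $abs->path,   "\n";

# A URL that is already absolute comes back unchanged.
my $same = URI->new_abs( 'http://www.baidu.com/', $base );
print "already absolute: $same\n";
```

When spidering, the base is normally the URL of the page the link was extracted from, so every link can be passed through new_abs without first checking whether it is relative.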
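15: Hack 15 Secured Access and Browser Attributes
The hack explains that a site such as a bank's usually runs SSL (Secure Sockets Layer) between the browser and the server.
You can usually recognize a secured site by the https at the start of its URL.
In other words, you need to install HTTPS support; see the reference the hack points to.
It also introduces other browser methods.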
16: Hack 16 Respecting Your Scrapee's Bandwidth
time2str($response->last_modified) returns the time the URL was last modified
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;
my $url = 'http://disobey.com/amphetadesk/';
my $date = "Thu, 31 Oct 2002 01:05:16 GMT";
my %headers = ( 'If-Modified-Since' => $date );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
This code checks whether the page has changed since $date; if it has not, the server answers 304 Not Modified instead of sending the content again.
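What to do with the answer can be sketched as follows. To keep it runnable offline, the response is built by hand as a stand-in for what the server would send; in a real script it would be the $response returned by $browser->get. The status messages are my own wording.

```perl
use strict;
use warnings;
use HTTP::Date qw(time2str str2time);
use HTTP::Response;

my $date = "Thu, 31 Oct 2002 01:05:16 GMT";

# HTTP dates convert to epoch seconds and back again.
my $epoch = str2time($date);
print "round trip: ", time2str($epoch), "\n";

# Hand-built stand-in for the reply to a conditional GET; a real one
# would come from $browser->get( $url, 'If-Modified-Since' => $date ).
my $response = HTTP::Response->new( 304, 'Not Modified' );

my $status;
if ( $response->code == 304 ) {
    $status = "unchanged since $date, reuse the cached copy";
} elsif ( $response->is_success ) {
    $status = "changed, process the new content";
} else {
    $status = "failed: " . $response->status_line;
}
print "$status\n";
```

The 304 check has to come before is_success, because a 304 is not a success code even though it is the outcome a polite spider is hoping for.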
ETags: Instead of a date, it returns a unique string based on the content you're downloading. In other words, a unique string that corresponds to the content.
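ETags are used much like If-Modified-Since: the client sends back the string it saw last time in an If-None-Match header. A minimal sketch follows, in which the ETag value is invented; a real one would be read from an earlier response with $response->header('ETag').

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical ETag remembered from an earlier response.
my $etag = '"abc123"';

# Build (but do not send) a conditional request carrying the ETag.
my $request = HTTP::Request->new(
    GET => 'http://disobey.com/amphetadesk/',
    [ 'If-None-Match' => $etag ],
);

print $request->as_string;
```

If the content still matches that ETag, the server can answer 304 Not Modified and skip the body.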
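Compressed Data:
After a long explanation it boils down to compression, and after another long explanation, to decompression. OK, so easy; here is the code from the book: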
use strict;
use Compress::Zlib;
use LWP 5.64;
my $url = 'http://www.disobey.com/';
# Encodings in Accept-Encoding are separated by commas
my %headers = ( 'Accept-Encoding' => 'gzip, deflate' );
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url, %headers );
my $data = $response->content;
if (my $encoding = $response->content_encoding) {
    $data = Compress::Zlib::memGunzip($data) if $encoding =~ /gzip/i;
    $data = Compress::Zlib::uncompress($data) if $encoding =~ /deflate/i;
}
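The two decompression calls can be tried offline with a round trip. In this sketch a made-up page body stands in for $response->content, and it is compressed locally to play the role of the server.

```perl
use strict;
use warnings;
use Compress::Zlib;

# Made-up page body standing in for the content a server would send.
my $original = "<html><body>" . ( "hello " x 100 ) . "</body></html>";

# gzip encoding: memGzip on the server side, memGunzip on ours.
my $gzipped   = Compress::Zlib::memGzip($original);
my $from_gzip = Compress::Zlib::memGunzip($gzipped);

# deflate encoding (zlib format): compress / uncompress.
my $deflated     = Compress::Zlib::compress($original);
my $from_deflate = Compress::Zlib::uncompress($deflated);

printf "original %d bytes, gzip %d bytes, deflate %d bytes\n",
    length($original), length($gzipped), length($deflated);
print "gzip round trip ok\n"    if $from_gzip eq $original;
print "deflate round trip ok\n" if $from_deflate eq $original;
```

Later LWP releases also provide $response->decoded_content, which undoes the Content-Encoding for you, so the manual branch is mainly needed on old installations like the LWP 5.64 the book targets.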