1. Background
Giving a child a perfect name is always difficult, but parents never stop trying. A friend of mine is facing this problem right now: his baby girl has just been born, and he wants to find a perfect name for her. He found a web page where he can enter a baby's name, birthday and birth time, and the page returns two scores indicating whether the name is good or bad for the baby according to the ancient Chinese classic "The Book of Changes (易经)". The two scores, which I will call score1 and score2, range from 0 to 100. My friend asked me whether I could write a script that submits thousands of popular names in batches, so that he can then choose among the top-scoring names, for example those with both score1 and score2 over 95.
[Figure: the website]
2. Analysis and Plan
A Chinese name usually consists of a family name and a given name. The family name is usually one or two Chinese characters; my friend's family name is one character. The given name is also usually one or two characters, and two-character given names have become more popular recently. My friend wants a two-character given name. Since the baby girl's family name is known (the same as her father's), I just need to generate thousands of given names suitable for a girl, submit them to the website automatically, and collect the displayed score1 and score2.
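To make the plan concrete, here is a minimal sketch of how two-character given names can be built by pairing every first character with every second character. The sample characters and the helper names are my own placeholders, not the real lists; the family name 刘 is the one used in the spider later in this post.

```python
from itertools import product

# Hypothetical sample characters; the real lists come from the scraped page.
first_chars = ["雅", "欣", "诗"]
second_chars = ["涵", "怡"]


def make_given_names(firsts, seconds):
    """Return every two-character given name formed from the two lists."""
    return [a + b for a, b in product(firsts, seconds)]


def make_full_names(family_name, given_names):
    """Prefix each given name with the family name."""
    return [family_name + given for given in given_names]


given_names = make_given_names(first_chars, second_chars)
full_names = make_full_names("刘", given_names)
print(len(given_names))  # 3 first characters * 2 second characters
print(full_names)
```

With a few dozen characters in each list, the number of candidates quickly reaches the thousands, which is why the submission has to be automated.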
3. Steps
A. Obtain Chinese characters suitable for naming a girl
Traditionally, certain characters are favored for girls' names. I found a web page that lists them and scraped it with the Scrapy spider below.
# Spider code
# -*- coding: utf-8 -*-
import scrapy
from getName.items import GetnameItem


class DownnameSpider(scrapy.Spider):
    name = 'downName'
    start_urls = ['http://xh.5156edu.com/xm/nu.html']  # the default HTTP request

    # Callback that handles the HTTP response returned by the default request.
    def parse(self, response):
        item = GetnameItem()
        item['ming'] = response.xpath('//a[@class="fontbox"]/text()').extract()
        yield item
# An item that stores the scraped characters
import scrapy


class GetnameItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ming = scrapy.Field()
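The scoring spider further down reads the characters from CSV files (ming1.csv and ming2.csv) and has to look up the column under the odd key '\ufeffming1'. The sketch below shows one way that can happen, assuming the file was saved with a byte-order mark (BOM), as some spreadsheet tools do. The sample characters, the helper names and the temporary path are placeholders of mine.

```python
import csv
import os
import tempfile

# Hypothetical sample; in the post the characters come from item['ming'].
characters = ["雅", "欣", "诗"]


def save_characters(path, header, chars, encoding):
    """Write one character per row under a single column header."""
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow([header])
        writer.writerows([c] for c in chars)


def read_header(path, encoding):
    """Return the first column name as csv.DictReader sees it."""
    with open(path, newline="", encoding=encoding) as f:
        return csv.DictReader(f).fieldnames[0]


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ming1.csv")

# 'utf-8-sig' writes a BOM at the start of the file.
save_characters(path, "ming1", characters, encoding="utf-8-sig")

key_plain = read_header(path, encoding="utf-8")    # BOM stays in the key
key_sig = read_header(path, encoding="utf-8-sig")  # BOM is stripped
print(repr(key_plain), repr(key_sig))
```

Reading with encoding='utf-8-sig' works whether or not the file carries a BOM, so it is a possible alternative to hard-coding '\ufeff' in the key.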
4. Tricky Points and Techniques
Open Chrome's Developer Tools, find the line of the source that contains the target element, right-click it, choose Copy, then Copy XPath.
item['ming'] = response.xpath('//a[@class="fontbox"]/text()').extract()
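To see what this XPath selects without running a crawl, here is a small sketch on a made-up snippet of markup. It uses the standard library's ElementTree instead of Scrapy's selector, so the expression is adapted: ElementTree supports only a subset of XPath, and `//a[...]/text()` becomes `.//a[...]` plus the `.text` attribute. The sample markup is an assumption, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's markup.
SAMPLE = """
<html><body>
  <a class="fontbox" href="#">雅</a>
  <a class="other" href="#">skip me</a>
  <a class="fontbox" href="#">欣</a>
</body></html>
"""


def extract_fontbox_text(markup):
    """Return the text of every <a class="fontbox"> element."""
    root = ET.fromstring(markup.strip())
    return [a.text for a in root.findall(".//a[@class='fontbox']")]


chars = extract_fontbox_text(SAMPLE)
print(chars)
```

Only the links with class "fontbox" are kept, which is the same filtering the spider relies on.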
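Automating the input does not mean locating the input boxes, filling in the form and simulating a click on the submit button. Instead, we send the data straight to the target web page by simulating the HTTP request. The two most common methods are HTTP GET and HTTP POST. Observing the links on the target site "https://www.threetong.com/ceming/" shows that it uses POST. But which page is the data sent to, and what does the submitted form look like? Here our good friend, Chrome's Developer Tools, helps again.
Then open Chrome Developer Tools and inspect the source (the Elements tab).
On this page, open Chrome's Developer Tools, go to the Network tab, select the xingmingceshi.php entry and click Headers on the right to see the details of the request.
This shows how the page was reached, including the Request URL and the Request Method (POST). Scrolling down reveals the submitted form data.
So all we need to do is simulate an HTTP POST request that sends the form data to the Request URL.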
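As an illustration of what "simulating an HTTP POST request" means, the sketch below builds such a request with the standard library only, without sending it. The URL and form field names are the ones used by the spider in this post; the helper name, the sample given name, and the assumption that the server accepts UTF-8 form encoding are mine.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# URL taken from the spider in this post.
URL = "https://www.threetong.com/ceming/baziceming/xingmingceshi.php"


def build_post_request(family_name, given_name):
    """Build (but do not send) a form-encoded POST request."""
    form = {
        "isbz": "1",
        "txtName": family_name,
        "name": given_name,
        "rdoSex": "0",
        "data_type": "0",
        "cboYear": "2017",
        "cboMonth": "7",
        "cboDay": "30",
        "cboHour": "20-戌时",
        "cboMinute": "39分",
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 by default
    # (assumption: the server accepts UTF-8).
    body = urlencode(form).encode("ascii")
    return Request(
        URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_post_request("刘", "雅涵")  # "雅涵" is a hypothetical given name
print(req.get_method())
print(parse_qs(req.data.decode("ascii"))["name"])
```

Passing `req` to `urllib.request.urlopen` would actually send it. In the spider, Scrapy's FormRequest plays this role and additionally schedules the requests and routes each response to a callback.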
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

# An item that holds the final results
import scrapy


class DaxiangnameItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    score1 = scrapy.Field()
    score2 = scrapy.Field()
    name = scrapy.Field()
# -*- coding: utf-8 -*-
import csv

import scrapy

# Import the item defined above
from daxiangName.items import DaxiangnameItem


class CemingSpider(scrapy.Spider):
    name = 'ceming'

    # Scrapy's default entry point; the crawl starts here.
    def start_requests(self):
        # Nested loop over two CSV files, reading one character at a time.
        # Mind the file encoding: UTF-8 keeps the Chinese characters intact.
        with open(getattr(self, 'file', './ming1.csv'), encoding='UTF-8') as f:
            reader = csv.DictReader(f)
            for line in reader:
                # print(line['\ufeffmingzi'])
                with open(getattr(self, 'file2', './ming2.csv'), encoding='UTF-8') as f2:
                    reader2 = csv.DictReader(f2)
                    for line2 in reader2:
                        # print(line)
                        # print(line2)
                        # Note the keys: the header read from the CSV carries an extra
                        # leading character (a BOM), which I found by testing with print().
                        mingzi = line['\ufeffming1'] + line2['\ufeffming2']
                        # print(mingzi)
                        # The core call: FormRequest is how Scrapy simulates
                        # an HTTP POST request.
                        request = scrapy.http.FormRequest(
                            url='https://www.threetong.com/ceming/baziceming/xingmingceshi.php',
                            formdata={
                                'isbz': '1',
                                'txtName': '刘',
                                'name': mingzi,
                                'rdoSex': '0',
                                'data_type': '0',
                                'cboYear': '2017',
                                'cboMonth': '7',
                                'cboDay': '30',
                                'cboHour': '20-戌时',
                                'cboMinute': '39分',
                            },
                            # The callback that handles the response to this request.
                            callback=self.after_login,
                        )
                        # Important: Scrapy keeps a pool of pending HTTP requests;
                        # yielding from this generator feeds requests into that pool.
                        yield request

    def after_login(self, response):
        # Uncomment to save the response body to a file:
        # filename = 'source.html'
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        # self.log('Saved file %s' % filename)

        # Extract the scores from the response. The regular expression is a small
        # trick that matches both integers and numbers with a decimal point.
        score1 = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div[1]/span[1]/text()').re(r'[\d.]+')
        score2 = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div[1]/span[2]/text()').re(r'[\d.]+')
        name = response.xpath('/html/body/div[6]/div/div[2]/div[3]/ul[1]/li[1]/text()').extract()
        # print(score1)
        # print(score2)
        print(name)
        # Keep only the "good" scores (and skip responses where no score was found).
        if score1 and score2 and float(score1[0]) >= 90 and float(score2[0]) >= 90:
            item = DaxiangnameItem()
            item['score1'] = score1
            item['score2'] = score2
            item['name'] = name
            # Yielded items form the output stream; run the spider with the
            # -o option to write all items to a file.
            yield item
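The regular-expression trick and the score filter in the spider above can be tried in isolation. Here is a minimal sketch with the standard `re` module; the sample strings such as "96.5分" and the helper names are my own assumptions about what the score text might look like, and the threshold of 90 is the one the spider uses.

```python
import re

# Same pattern as in the spider: one or more digits or dots.
NUMBER = re.compile(r"[\d.]+")


def first_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = NUMBER.search(text)
    if match is None:
        return None
    try:
        return float(match.group())
    except ValueError:
        # The pattern also matches strings like "." or "1.2.3".
        return None


def is_good(text1, text2, threshold=90.0):
    """True when both texts contain a score at or above the threshold."""
    s1, s2 = first_number(text1), first_number(text2)
    return s1 is not None and s2 is not None and s1 >= threshold and s2 >= threshold


print(first_number("96.5分"))
print(is_good("96.5分", "88分"))
```

Raising the threshold to 95 would match the stricter cut-off mentioned in the background section.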
5. Afterword