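Mobile phones usually have a security app installed, such as 360 Security Guard (360安全卫士), a mobile security guard, or something similar. These apps can flag and filter phone numbers as fraud calls, harassment calls, or advertising and sales calls.
Our company is in the telephone line business, and many of our numbers are assigned to customers for telemarketing, so they end up tagged with all kinds of marks. Numbers carrying marks that put people off need to be filtered out. With so many numbers, checking them one by one is a lot of work and far too slow, so I decided to collect the data with a Python crawler.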
Example:
#!/usr/bin/env python3
# coding: utf-8
# author : soul
from urllib.parse import quote
from urllib.request import urlopen

from bs4 import BeautifulSoup

abc = '诈骗电话'  # the "fraud call" mark as it appears on the results page

# test1.txt is the list of numbers to check, one per line;
# result.txt starts out empty and receives the results.
with open("/home/py/test1.txt", encoding="utf-8") as f, \
        open("/home/py/result.txt", "w", encoding="utf-8") as w:
    for h in f:  # read test1.txt line by line
        number = h.strip()
        if not number:
            continue  # skip blank lines
        # The number goes into the query, e.g. https://www.so.com/s?q=02081452010
        url = 'https://www.so.com/s?q=%s' % quote(number)
        page = urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')

        # Reset for every number: xingzhi defaults to '未收录' (not listed), biaoshi to 0.
        # Doing this before the lookups avoids a NameError when the page has no mark.
        xingzhi = '未收录'
        biaoshi = 0

        # xingzhi: text of the highlighted span, last "|"-separated term
        for e in soup.find_all('span', {'style': 'background-color:#e76639'}):
            for term1 in e.get_text().split("|"):
                xingzhi = term1
        # biaoshi: text of the <b> tags, last "|"-separated term
        for b in soup.find_all('b'):
            for term2 in b.get_text().split("|"):
                biaoshi = term2

        if xingzhi == abc:
            # Print fraud-marked numbers in yellow on the terminal
            print('\033[33mThe number %s marked as %s about %s\033[0m' % (number, xingzhi, biaoshi))
        # Write one line per number to result.txt
        w.write('%s %s %s\n' % (number, xingzhi, biaoshi))
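The extraction step can be tried without touching the network. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages, and the HTML fed to it is a made-up fragment, not the real so.com markup. `MarkParser` and `extract_mark` are hypothetical names, and nested tags are handled more simply than `get_text()` would handle them.

```python
from html.parser import HTMLParser


class MarkParser(HTMLParser):
    """Collects text from the highlighted <span> and from <b> tags."""

    MARK_STYLE = "background-color:#e76639"  # style value used by the script

    def __init__(self):
        super().__init__()
        self._stack = []          # one entry per open span/b: "mark", "bold" or None
        self.xingzhi = "未收录"    # default when no mark is found (not listed)
        self.biaoshi = 0          # default when no <b> text is found

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style")
            self._stack.append("mark" if style == self.MARK_STYLE else None)
        elif tag == "b":
            self._stack.append("bold")

    def handle_endtag(self, tag):
        if tag in ("span", "b") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack or self._stack[-1] is None:
            return  # text outside the tags we care about
        text = data.strip()
        if not text:
            return
        last = text.split("|")[-1]  # keep the last "|"-separated term
        if self._stack[-1] == "mark":
            self.xingzhi = last
        else:
            self.biaoshi = last


def extract_mark(html):
    """Return (xingzhi, biaoshi) for one page of HTML."""
    parser = MarkParser()
    parser.feed(html)
    parser.close()
    return parser.xingzhi, parser.biaoshi


# Made-up fragment for illustration only
sample = '<p><span style="background-color:#e76639">诈骗电话</span><b>sample</b></p>'
print(extract_mark(sample))
```

Keeping the parsing in its own function makes it possible to check the defaults (`未收录` and `0`) against a saved page before running the crawler over the whole number list.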
test1.txt (input file, one phone number per line)
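Since the script strips each line it reads, the input is assumed to hold one number per line. A minimal sketch of loading such a list is shown here; `load_numbers` is a hypothetical helper, and dropping duplicates is an addition that the original script does not do.

```python
def load_numbers(lines):
    """Return the unique, non-empty numbers in their original order."""
    seen = set()
    numbers = []
    for line in lines:
        number = line.strip()
        if number and number not in seen:
            seen.add(number)
            numbers.append(number)
    return numbers


# The number below is the one used in the URL comment of the script
sample = ["02081452010\n", "\n", " 02081452010 \n"]
print(load_numbers(sample))  # → ['02081452010']
```

An open file can be passed directly, for example `load_numbers(open("/home/py/test1.txt"))`, because iterating over a file yields its lines.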
Result:
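Each line of result.txt has the form `number xingzhi biaoshi`, so the numbers to drop can be separated out afterwards. The sketch below assumes that format and assumes the mark itself contains no spaces; `split_results` is a hypothetical helper, and the sample lines are invented placeholders, not real output.

```python
def split_results(lines, bad_marks):
    """Partition result lines into (flagged, clean) lists of numbers."""
    flagged, clean = [], []
    for line in lines:
        parts = line.split(None, 2)  # number, xingzhi, rest of the line
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        number, xingzhi = parts[0], parts[1]
        if xingzhi in bad_marks:
            flagged.append(number)
        else:
            clean.append(number)
    return flagged, clean


# Invented placeholder lines, for illustration only
sample = ["10000000001 诈骗电话 x\n", "10000000002 未收录 0\n", "\n"]
flagged, clean = split_results(sample, {"诈骗电话"})
print(flagged, clean)
```

Other marks can be added to `bad_marks` once their exact wording on the results page is known.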