Python CAPTCHA recognition: feature vector extraction and cosine similarity comparison

0. Contents

1. References
2. A flowchart, just for fun
3. Complete code
4. Directions for improvement

1. References

https://en.wikipedia.org/wiki/Cosine_similarity

https://zh.wikipedia.org/wiki/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7

Cosine similarity
Given two vectors of attributes, A and B, the cosine similarity, cos(θ),
is represented using a dot product and magnitude as...
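Cosine similarity measures how similar two vectors are by the cosine of the angle between them. The cosine of a 0° angle is 1.
Cosine similarity is typically used in positive space, so its value lies between 0 and 1.
A norm is a function that carries the notion of "length". The Euclidean norm of a two-dimensional vector is the length of its arrow.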
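For reference, the standard textbook definition for two n-dimensional vectors A and B can be written as follows (a restatement in LaTeX, not a quote from either page):

```latex
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{i=1}^{n} A_i B_i}
                  {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

When every component of A and B is non-negative, the numerator cannot be negative, which is why the value stays between 0 and 1.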

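Cracking CAPTCHAs with Python

CAPTCHA machine learning with Python 3

Both of these articles name the function parameter concordance rather than coordinate when computing the vector magnitude. Why?

Euclidean distance and cosine similarity

NumPy provides a tool for computing norms: linalg.norm()

This makes computing cos θ very convenient (assuming A and B are both column vectors):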

num = float(A.T * B)  # for row vectors use A * B.T
denom = linalg.norm(A) * linalg.norm(B)
cos = num / denom  # cosine value
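The snippet above depends on `np.matrix` semantics, where `*` means matrix multiplication. With plain `ndarray` vectors `*` multiplies element by element, so `np.dot` is the safer choice. Here is a minimal sketch, written for Python 3; the helper name `cosine_similarity` and the sample vectors are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])             # perpendicular vectors
elementwise = np.array([1, 2, 3]) * np.array([2, 4, 6])    # NOT a dot product
```

Parallel vectors score 1 (up to floating-point rounding), perpendicular ones score 0, and `elementwise` is still a three-element array rather than a scalar.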

2. A flowchart, just for fun

Flowchart: Graphviz - Graph Visualization Software
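One way such a flowchart could be generated is sketched below. It assumes the stages follow the method names in the code of section 3, and it is not the author's original diagram. The helper `to_dot` and the step descriptions are hypothetical. It only builds Graphviz DOT text with plain Python, so no extra package is needed:

```python
# Pipeline stages inferred from the method names in the code of section 3.
steps = [
    ("load_mimvp", "download the proxy list page"),
    ("load_image_from_src", "download one port image"),
    ("split_image", "grayscale, binarize, split into digits"),
    ("build_vector", "column sums + row sums"),
    ("cos_similarity", "compare with stored feature vectors"),
    ("merge_result", "combine ip, port and type"),
]

def to_dot(steps):
    """Render the ordered steps as Graphviz DOT source (hypothetical helper)."""
    lines = ["digraph captcha {", "    rankdir=TB;", "    node [shape=box];"]
    for name, note in steps:
        # "\\n" emits a literal backslash-n, which DOT reads as a line break.
        lines.append('    %s [label="%s\\n%s"];' % (name, name, note))
    for (src, _), (dst, _) in zip(steps, steps[1:]):
        lines.append("    %s -> %s;" % (src, dst))
    lines.append("}")
    return "\n".join(lines)

dot_source = to_dot(steps)
print(dot_source)
```

Saving the printed text as flow.dot and running `dot -Tpng flow.dot -o flow.png` renders the chart.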

3. Complete code

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import os
import time
import re
from urlparse import urljoin

import requests
ss = requests.Session()
ss.headers.update({'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})

from PIL import Image
# https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001431918785710e86a1a120ce04925bae155012c7fc71e000
# Like StringIO, a BytesIO can be initialized with bytes and then read like a file:
from io import BytesIO
from string import ascii_letters, digits

import numpy as np 

# ip_port_type_tuple_list = []

class Mimvp():
    def __init__(self, num_width=None, feature_vectors=None, white_before_black=2, threshhold=100, max_nums=None, filepath=None, page=None):
        self.ip_port_type_tuple_list = []

        # Fluent Python, p. 189: do not share a mutable default argument
        if feature_vectors is None:
            self.feature_vectors = []
        else:
            self.feature_vectors = list(feature_vectors)

        self.num_width = num_width
        self.white_before_black = white_before_black
        self.threshhold = threshhold
        self.max_nums = max_nums

        self.filepath = filepath

        if page is None:
            self.url = 'http://proxy.%s.com/free.php?proxy=in_hp'%'mimvp'
        else:
            self.url = 'http://proxy.%s.com/free.php?proxy=in_hp&sort=&page=%s' %('mimvp', page)

    def get_mimvp(self):

        # Preprocessing: extracting the feature set needs self.port_src_list to be fetched first
        if self.feature_vectors == []:
            self.extract_features()

        self.load_mimvp()
        self.get_port_list()
        self.merge_result()
        return self.ip_port_type_tuple_list

    def load_mimvp(self):
        resp = ss.get(self.url)
        self.ip_list = re.findall(r"class='tbl-proxy-ip'.*?>(.*?)<", resp.text)
        self.port_src_list = re.findall(r"class='tbl-proxy-port'.*?src=(.*?)\s*/>", resp.text)      # image links
        self.type_list = re.findall(r"class='tbl-proxy-type'.*?>(.*?)<", resp.text)

    def get_port_list(self):
        self.port_list = []
        for src in self.port_src_list:
            port = self.get_port(src)
            self.port_list.append(port)

    def get_port(self, src):
        img = self.load_image_from_src(src)
        split_imgs = self.split_image(img)

        port = ''
        for split_img in split_imgs:
            vector = self.build_vector(split_img)
            compare_results = []
            for t in self.feature_vectors:
                cos = self.cos_similarity(vector, t.values()[0])
                compare_results.append((cos, t.keys()[0]))
            # print sorted(compare_results, reverse=True)
            port += sorted(compare_results, reverse=True)[0][1]
        print port
        return port

    def load_image_from_src(self, src):
        src = urljoin(self.url, src)
        print src,
        resp = ss.get(src)

        fp = BytesIO(resp.content)
        img = Image.open(fp)
        return img

    def split_image(self, img):
        gray = img.convert('L')

        if self.num_width is None:
            img.show()
            print gray.getcolors()
            self.num_width = int(raw_input('num_width:'))
            self.white_before_black = int(raw_input('white_before_black:'))
            self.threshhold = int(raw_input('BLACK < (threshhold) < WHITE:'))

        gray_array = np.array(gray)
        bilevel_array = np.where(gray_array<self.threshhold,1,0)  # mark black pixels as 1 to simplify the later scan

        left_list = []
        # Sum each column, left to right
        vertical = bilevel_array.sum(0)
        # print vertical
        # Scan the columns left to right; two white columns followed by a black one mark a digit's left edge
        for i,c in enumerate(vertical[:-self.white_before_black]):
            if self.white_before_black == 1:
                if vertical[i] == 0 and vertical[i+1] != 0:
                    left_list.append(i+1)
            else:
                if vertical[i] == 0 and vertical[i+1] == 0 and vertical[i+2] != 0:
                    left_list.append(i+2)
            if len(left_list) == self.max_nums:
                break

        # Split into viewable images
        # bilevel = Image.fromarray(bilevel_array)    # 0/1 array for extracting features by hand; show() displays a black block; saving as gif not done yet
        bilevel = Image.fromarray(np.where(gray_array<self.threshhold,0,255))
        # the left, upper, right, and lower pixel
        split_imgs = [bilevel.crop((each_left, 0, each_left+self.num_width, img.height)) for each_left in left_list]

        return split_imgs

    def build_vector(self, img):
        # img = Image.open(img)
        img_array = np.array(img)
        # Traverse w first, then h, for w+h dimensions in total; no extra processing such as dividing by 255 or counting black pixels is needed
        return list(img_array.sum(0)) + list(img_array.sum(1))

    def cos_similarity(self, a, b):
        A = np.array(a)
        B = np.array(b)
        dot_product = float(np.dot(A, B))   # A*(B.T) does not work here: * is elementwise on ndarrays
        magnitude_product = np.linalg.norm(A) * np.linalg.norm(B)
        cos = dot_product / magnitude_product
        return cos

    def merge_result(self):
        for ip, port, _type in zip(self.ip_list, self.port_list, self.type_list):
            if '/' in _type:
                self.ip_port_type_tuple_list.append((ip, port, 'both'))
            elif _type == 'HTTPS':
                self.ip_port_type_tuple_list.append((ip, port, 'HTTPS'))
            else:
                self.ip_port_type_tuple_list.append((ip, port, 'HTTP'))

    def extract_features(self):
        if self.filepath is not None:
            img_list = self.load_images_from_filepath()
        else:
            self.load_mimvp()
            img_list = self.load_images_from_src_list()
        for img in img_list:
            split_imgs = self.split_image(img)
            for split_img in split_imgs:
                split_img.show()
                print split_img.getcolors()
                input = raw_input('input:')
                vector = self.build_vector(split_img)
                item = {input: vector}
                if item not in self.feature_vectors:
                    print item
                    self.feature_vectors.append(item)

        for i in sorted(self.feature_vectors):
            print i,','

    def load_images_from_filepath(self):
        img_list = []
        postfix = ['jpg', 'png', 'gif', 'bmp']
        for filename in [i for i in os.listdir(self.filepath) if i[-3:] in postfix]:
            file = os.path.join(self.filepath, filename)
            img_list.append(Image.open(file))
        return img_list

    def load_images_from_src_list(self):
        img_list = []
        for src in self.port_src_list:
            img = self.load_image_from_src(src)
            img_list.append(img)
        return img_list

if __name__ == '__main__':

    feature_vectors = [
    {'0': [4845, 5865, 5865, 5865, 5865, 4845, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1020, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'1': [5865, 5865, 3825, 6120, 6120, 6375, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1275, 1020, 1020, 1275, 1275, 1275, 1275, 1275, 1275, 255, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'2': [5100, 5610, 5610, 5610, 5610, 5355, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1275, 1020, 1275, 1275, 1275, 1275, 0, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'3': [5355, 5865, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1275, 765, 1275, 1275, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'4': [5610, 5865, 5865, 5865, 3825, 6120, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1275, 1020, 1020, 1020, 1020, 1020, 0, 1275, 1275, 1275, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'5': [4845, 5610, 5610, 5610, 5610, 5100, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 0, 1275, 1275, 1275, 255, 1275, 1275, 1275, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'6': [4590, 5610, 5610, 5610, 5610, 5355, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 765, 1275, 1275, 1275, 255, 1020, 1020, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'7': [6120, 6120, 6120, 5100, 5355, 5610, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 0, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1275, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'8': [4590, 5610, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1020, 510, 1020, 1020, 1020, 1020, 510, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    {'9': [5610, 5610, 5610, 5610, 5610, 4590, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 1530, 510, 1020, 1020, 1020, 255, 1275, 1275, 1275, 1275, 765, 1530, 1530, 1530, 1530, 1530, 1530, 1530]} ,
    ]   

    # def __init__(self, feature_vectors=None, filepath=None, page=None):

    obj = Mimvp(num_width=6, feature_vectors=feature_vectors)
    # obj = Mimvp()
    # obj = Mimvp(filepath='temp/')

    ip_port_type_tuple_list = obj.get_mimvp()

    from pprint import pprint
    pprint(ip_port_type_tuple_list)
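The split, vectorize and compare steps in the class above can be tried offline with a small sketch, written for Python 3. Everything in it is an illustrative assumption rather than the site's real data: the tiny 3x5 glyphs, the convention that 1 marks a black pixel (the class itself sums a 0/255 image), and the helper names `find_left_edges` and `recognize`.

```python
import numpy as np

def find_left_edges(black_per_column, white_before_black=2):
    """Return the columns where a glyph starts: the first non-empty column
    after `white_before_black` consecutive empty ones."""
    n = white_before_black
    edges = []
    for i in range(len(black_per_column) - n):
        empty_run = all(black_per_column[i + k] == 0 for k in range(n))
        if empty_run and black_per_column[i + n] != 0:
            edges.append(i + n)
    return edges

def build_vector(glyph):
    """Column sums followed by row sums, giving w + h dimensions."""
    return list(glyph.sum(0)) + list(glyph.sum(1))

def cos_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(glyph, templates):
    """Return the label whose template vector is most similar to the glyph."""
    vector = build_vector(glyph)
    return max(templates, key=lambda label: cos_similarity(vector, templates[label]))

# Two made-up glyphs, 3 columns wide and 5 rows tall (1 = black pixel).
seven = np.array([[1, 1, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]])
one = np.array([[1, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0], [1, 1, 1]])
templates = {"7": build_vector(seven), "1": build_vector(one)}

# Synthetic "port image": gap, 7, gap, 1, gap, plus one stray black pixel.
gap = np.zeros((5, 2), dtype=int)
image = np.hstack([gap, seven, gap, one, gap])
image[2, 3] = 1  # noise inside the "7"

edges = find_left_edges(image.sum(0))
port = "".join(recognize(image[:, e:e + 3], templates) for e in edges)
print(edges, port)
```

Even with the stray pixel, the noisy "7" stays much closer to its own template than to the "1" template, which is the property the cosine comparison relies on.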

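4. Directions for improvement

Record the actual x-axis width of each segmented digit. Because the blank margins above and below the digits are left untrimmed, every digit has a fixed h but a variable w, so the feature vector should be built by traversing h first and then w.

When comparing cosine similarity, the dot product requires both vectors to have the same dimension, so each comparison must first truncate both vectors to the smaller dimension.

This way, the feature vector set can be built without fixing every segmented digit to a predetermined width.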
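A minimal sketch of this idea, written for Python 3. The helper names `build_vector_h_first` and `cos_similarity_truncated` and the two toy glyphs are assumptions for illustration, not part of the code above:

```python
import numpy as np

def build_vector_h_first(glyph):
    """Row sums (h is fixed) first, then column sums (w varies)."""
    return list(glyph.sum(1)) + list(glyph.sum(0))

def cos_similarity_truncated(a, b):
    """Cosine similarity over the first min(len(a), len(b)) components."""
    n = min(len(a), len(b))
    a = np.asarray(a[:n], dtype=float)
    b = np.asarray(b[:n], dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same stroke cropped at two different widths (1 = black pixel).
narrow = np.array([[1, 1], [0, 1], [0, 1]])             # h=3, w=2
wide = np.array([[1, 1, 0], [0, 1, 0], [0, 1, 0]])      # h=3, w=3

v_narrow = build_vector_h_first(narrow)   # 3 + 2 = 5 dimensions
v_wide = build_vector_h_first(wide)       # 3 + 3 = 6 dimensions
score = cos_similarity_truncated(v_narrow, v_wide)
```

Putting the h components first keeps them aligned across glyphs of different widths, so truncation only drops trailing column sums.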

Time: 2024-12-21 03:50:02
