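CAPTCHA Recognition: the w3cschool Character Image CAPTCHA (Easy Level)

Motivation:

I have been practicing CAPTCHA solving lately, and this site's CAPTCHA looked fairly simple, so I used it to gain some experience. No offense is intended...

Page containing the CAPTCHA: https://www.w3cschool.cn/checkmphone?type=findpwd

CAPTCHA URL: https://www.w3cschool.cn/scode

1. Analyzing the patterns

Open https://www.w3cschool.cn/scode and keep pressing F5 to refresh. The characters and their positions change every time, but the font style never does. When the font style is fixed like this, a recognition rate of 100% is achievable as long as the denoising is done well.

Next, look at the noise. Download one image and open it in the Paint program that ships with Windows:

It is almost entirely isolated noise pixels, which an 8-neighborhood check is enough to handle. Every image I looked at contained only noise pixels, but I cannot be sure there are never any noise blobs, and I am thoroughly sick of writing 8-neighborhood code, so here I remove the noise with connected components instead. (If you see no noise blobs, feel free to try the 8-neighborhood approach; it is simple, so I won't go into it here. While writing this paragraph I felt rather silly for passing over the simple 8-neighborhood check in favor of connected components...)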
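The 8-neighborhood check is only mentioned in this article, not shown. The following is a minimal sketch of the idea, not code from the original project: it assumes the background has already been normalized to white, and the class name `EightNeighborDenoise` and the `minNeighbors` threshold are made up for illustration. An ink pixel is treated as noise when fewer than `minNeighbors` of its 8 neighbors are ink.

```java
import java.awt.image.BufferedImage;

public class EightNeighborDenoise {

    private static final int WHITE = 0xFFFFFF;

    /** Counts the non-white pixels among the 8 neighbours of (x, y). */
    static int countInkNeighbors(BufferedImage img, int x, int y) {
        int count = 0;
        for (int dx = -1; dx <= 1; dx++) {
            for (int dy = -1; dy <= 1; dy++) {
                if (dx == 0 && dy == 0) continue; // skip the pixel itself
                int nx = x + dx, ny = y + dy;
                if (nx < 0 || nx >= img.getWidth() || ny < 0 || ny >= img.getHeight()) continue;
                if ((img.getRGB(nx, ny) & 0xFFFFFF) != WHITE) count++;
            }
        }
        return count;
    }

    /** Returns a copy in which ink pixels with fewer than minNeighbors ink neighbours are whitened. */
    public static BufferedImage clean(BufferedImage src, int minNeighbors) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                int rgb = src.getRGB(x, y) & 0xFFFFFF;
                boolean isolated = rgb != WHITE && countInkNeighbors(src, x, y) < minNeighbors;
                out.setRGB(x, y, isolated ? WHITE : rgb);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic 7x7 white image with one isolated dot and one 3x3 "stroke"
        BufferedImage img = new BufferedImage(7, 7, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < 7; x++) for (int y = 0; y < 7; y++) img.setRGB(x, y, WHITE);
        img.setRGB(1, 1, 0x000000);
        for (int x = 3; x <= 5; x++) for (int y = 3; y <= 5; y++) img.setRGB(x, y, 0x000000);

        BufferedImage cleaned = clean(img, 2);
        System.out.println(Integer.toHexString(cleaned.getRGB(1, 1) & 0xFFFFFF)); // ffffff (dot removed)
        System.out.println(Integer.toHexString(cleaned.getRGB(4, 4) & 0xFFFFFF)); // 0 (stroke kept)
    }
}
```

A check like this only removes isolated dots; a noise blob larger than the threshold survives, which is the case the connected-component approach used in this article covers.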

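The background color also changes, so it cannot be hard-coded; the program has to detect it automatically. That is easy: while computing connected components, treat the largest one as the background.

Summary:

1. The font style never changes, so the features are extremely stable and the recognition rate is high

2. There is noise, which can be filtered out with connected components

3. The background color is random, so it must be detected and normalized to white; the largest connected component is marked as the background

Tip: CAPTCHA URLs usually have no User-Agent check, rate limit, or similar protection, so you can open the image URL directly and refresh rapidly to observe the patterns.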

2. Downloading samples

First things first: download some samples locally so they can be examined at leisure:

/**
 * CAPTCHA download URL
 */
public static final String CAPTCHA_URL = "https://www.w3cschool.cn/scode?rand=";

public static void download(String saveDirectory, int howMany) {

	Random random = new Random();
	ExecutorService executorService = Executors.newFixedThreadPool(10);

	while (howMany-- > 0) {
		executorService.submit(() -> {
			Response response = null;
			try {
				long currentMillis = System.currentTimeMillis();
				Request request = Request.Get(CAPTCHA_URL + currentMillis);
				response = request.connectTimeout(2000).socketTimeout(2000).execute();
				response.saveContent(new File(saveDirectory + random.nextLong() + ".png"));
				System.out.println("download...");
			} catch (IOException e) {
				e.printStackTrace();
			} finally {
				if (response != null) {
					response.discardContent();
				}
			}
		});
	}

	try {
		executorService.shutdown();
		executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
	} catch (InterruptedException e) {
		e.printStackTrace();
	}

}

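Here I downloaded 5000 images:

I downloaded this many because the dictionary will be generated automatically from these images, and with too few samples some characters might be missed.

3. Filtering noise

Next, process the downloaded images to remove the noise:

/**
 * Removes noise, judged by connected-component size
 *
 * @param originalCaptcha the original CAPTCHA image
 * @param areaSizeFilter  connected components of this size or smaller are filtered out
 * @return the denoised image on a white background
 */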
public static BufferedImage noiseClean(BufferedImage originalCaptcha, int areaSizeFilter) {

	// There are some interfering border lines, so crop the edges away
	int edgeDropWidth = 15;
	BufferedImage captcha = originalCaptcha.getSubimage(edgeDropWidth / 2, edgeDropWidth / 2,  //
			originalCaptcha.getWidth() - edgeDropWidth, originalCaptcha.getHeight() - edgeDropWidth);

	int w = captcha.getWidth();
	int h = captcha.getHeight();
	int[][] book = new int[w][h];

	// The largest connected component is treated as the background, which detects the background color automatically
	Map<Integer, Integer> flagAreaSizeMap = new HashMap<>();
	int currentFlag = 1;
	int maxAreaSizeFlag = currentFlag;
	int maxAreaSizeColor = 0XFFFFFFFF;

	// Label every connected component
	for (int i = 0; i < w; i++) {
		for (int j = 0; j < h; j++) {

			if (book[i][j] != 0) {
				continue;
			}

			book[i][j] = currentFlag;
			int currentColor = captcha.getRGB(i, j);
			int areaSize = waterFlow(captcha, book, i, j, currentColor, currentFlag);

			if (areaSize > flagAreaSizeMap.getOrDefault(maxAreaSizeFlag, 0)) {
				maxAreaSizeFlag = currentFlag;
				maxAreaSizeColor = currentColor;
			}

			flagAreaSizeMap.put(currentFlag, areaSize);
			currentFlag++;
		}
	}

	// Copy, whitening the background and the small components
	BufferedImage resultImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
	for (int i = 0; i < w; i++) {
		for (int j = 0; j < h; j++) {
			int currentColor = captcha.getRGB(i, j);
			if (book[i][j] == maxAreaSizeFlag //
					|| (currentColor & 0XFFFFFF) == (maxAreaSizeColor & 0XFFFFFF) //
					|| flagAreaSizeMap.get(book[i][j]) <= areaSizeFilter) {
				resultImage.setRGB(i, j, 0XFFFFFFFF);
			} else {
				resultImage.setRGB(i, j, currentColor);
			}
		}
	}
	return resultImage;
}

/**
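 * Flood fill: marks every pixel that is 8-connected to (x, y) and has the same color with flag in book, and returns the size of that region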
 *
 * @param img
 * @param book
 * @param x
 * @param y
 * @param color
 * @param flag
 * @return
 */
private static int waterFlow(BufferedImage img, int[][] book, int x, int y, int color, int flag) {

	if (x < 0 || x >= img.getWidth() || y < 0 || y >= img.getHeight()) {
		return 0;
	}

	// This 1 counts the current pixel
	int areaSize = 1;
	for (int i = -1; i <= 1; i++) {
		for (int j = -1; j <= 1; j++) {
			int nextX = x + i;
			int nextY = y + j;

			if (nextX < 0 || nextX >= img.getWidth() || nextY < 0 || nextY >= img.getHeight()) {
				continue;
			}

			// If this pixel has not been visited yet and has the same color
			//				if (book[nextX][nextY] == 0 && isSimilar(img.getRGB(nextX, nextY), color, 0)) {
			if (book[nextX][nextY] == 0 && (img.getRGB(nextX, nextY) & 0XFFFFFF) == (color & 0XFFFFFF)) {
				book[nextX][nextY] = flag;
				areaSize += waterFlow(img, book, nextX, nextY, color, flag);
			}

		}
	}

	return areaSize;
}
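waterFlow above recurses once per pixel of a region, and the background region of a CAPTCHA can contain several thousand pixels, so on a thread with a small stack the recursion could in principle end in a StackOverflowError. The following sketch shows the same labeling done with an explicit stack instead. It is an illustration under assumptions, not part of the original project: the class name `IterativeFloodFill` and the method name `fill` are invented here, and `book` follows the same convention as above (0 means not yet visited).

```java
import java.awt.image.BufferedImage;
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeFloodFill {

    /**
     * Labels the 8-connected region of same-coloured pixels containing (startX, startY)
     * with flag in book and returns the region size. book[x][y] == 0 means "not visited".
     */
    public static int fill(BufferedImage img, int[][] book, int startX, int startY, int flag) {
        int w = img.getWidth(), h = img.getHeight();
        int color = img.getRGB(startX, startY) & 0xFFFFFF;
        Deque<int[]> stack = new ArrayDeque<>();
        book[startX][startY] = flag;
        stack.push(new int[]{startX, startY});
        int areaSize = 0;
        while (!stack.isEmpty()) {
            int[] p = stack.pop();
            areaSize++;
            for (int dx = -1; dx <= 1; dx++) {
                for (int dy = -1; dy <= 1; dy++) {
                    int nx = p[0] + dx, ny = p[1] + dy;
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                    if (book[nx][ny] == 0 && (img.getRGB(nx, ny) & 0xFFFFFF) == color) {
                        book[nx][ny] = flag; // mark when pushed so a pixel is never queued twice
                        stack.push(new int[]{nx, ny});
                    }
                }
            }
        }
        return areaSize;
    }

    public static void main(String[] args) {
        // A new TYPE_INT_RGB image starts out all black
        BufferedImage img = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF0000);
        img.setRGB(1, 1, 0xFF0000); // diagonal neighbour, still 8-connected
        int[][] book = new int[4][3];
        System.out.println(fill(img, book, 0, 0, 1)); // 2  (the red region)
        System.out.println(fill(img, book, 3, 2, 2)); // 10 (the remaining black region)
    }
}
```

The explicit stack lives on the heap, so the depth of a region no longer depends on the thread's stack size.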

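This is the earlier image after denoising; there was not much noise to begin with, so the result is decent:

4. Splitting characters

Next, cut the cleaned image into individual characters. Splitting thousands of images yields a great many character images, and picking the dictionary entries out by hand one by one would be silly, so I let the program derive the dictionary itself: deduplicate the character images after splitting and before saving. To make that convenient, each small image is compressed into a single integer:

/**
 * Splits the image into characters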
 *
 * @param img
 * @return
 */
public static List<BufferedImage> mattingCharacter(BufferedImage img) {
	List<BufferedImage> list = new ArrayList<>();

	int w = img.getWidth();
	int h = img.getHeight();

	boolean lastColumnIsBlack = true;
	int beginColumn = -1;

	for (int i = 0; i < w; i++) {

		boolean currentColumnIsBlack = true;
		for (int j = 0; j < h; j++) {
			if ((img.getRGB(i, j) & 0XFFFFFF) != 0XFFFFFF) {
				currentColumnIsBlack = false;
			}
		}

		// Entering a character region
		if (lastColumnIsBlack && !currentColumnIsBlack) {
			beginColumn = i;
		} else if (!lastColumnIsBlack && currentColumnIsBlack) {
			// Leaving a character region
			BufferedImage charImage = img.getSubimage(beginColumn, 0, i - beginColumn, h);
			BufferedImage trimCharImage = trimUpAndDown(charImage);
			list.add(trimCharImage);
		}

		lastColumnIsBlack = currentColumnIsBlack;

	}

	return list;
}

private static BufferedImage trimUpAndDown(BufferedImage img) {
	int w = img.getWidth();
	int h = img.getHeight();

	// Find where the blank rows at the top end
	int upBeginLine = -1;
	for (int i = 0; i < h; i++) {

		boolean currentColumnIsBlack = true;
		for (int j = 0; j < w; j++) {
			if ((img.getRGB(j, i) & 0XFFFFFF) != 0XFFFFFF) {
				currentColumnIsBlack = false;
			}
		}

		if (!currentColumnIsBlack) {
			upBeginLine = i;
			break;
		}

	}

	// Find where the blank rows at the bottom begin
	int downBeginLine = -1;
	for (int i = h - 1; i >= 0; i--) {

		boolean currentColumnIsBlack = true;
		for (int j = 0; j < w; j++) {
			if ((img.getRGB(j, i) & 0XFFFFFF) != 0XFFFFFF) {
				currentColumnIsBlack = false;
			}
		}

		if (!currentColumnIsBlack) {
			downBeginLine = i;
			break;
		}
	}

	return img.getSubimage(0, upBeginLine, w, downBeginLine - upBeginLine + 1);
}
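The splitting above is a column projection: a column counts as blank when every pixel in it is white, and each run of non-blank columns is one character. Note that mattingCharacter only emits a character when it reaches a blank column after it, so it relies on the right margin being blank (which the edge crop in noiseClean appears to guarantee in practice). The sketch below isolates the projection step and also closes a run that touches the right edge. The class name `ColumnProjection` and its methods are hypothetical and not from the original code.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public class ColumnProjection {

    /** True if every pixel in column x is white. */
    static boolean isBlankColumn(BufferedImage img, int x) {
        for (int y = 0; y < img.getHeight(); y++) {
            if ((img.getRGB(x, y) & 0xFFFFFF) != 0xFFFFFF) return false;
        }
        return true;
    }

    /** Returns the [start, end) column ranges of consecutive non-blank columns. */
    public static List<int[]> characterRanges(BufferedImage img) {
        List<int[]> ranges = new ArrayList<>();
        int begin = -1; // -1 means "currently outside a character"
        for (int x = 0; x < img.getWidth(); x++) {
            boolean blank = isBlankColumn(img, x);
            if (!blank && begin < 0) {
                begin = x;                        // entering a character
            } else if (blank && begin >= 0) {
                ranges.add(new int[]{begin, x});  // leaving a character
                begin = -1;
            }
        }
        if (begin >= 0) {
            ranges.add(new int[]{begin, img.getWidth()}); // run touching the right edge
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10x5 white image with ink in columns 1-2 and 8-9
        BufferedImage img = new BufferedImage(10, 5, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < 10; x++) for (int y = 0; y < 5; y++) img.setRGB(x, y, 0xFFFFFF);
        img.setRGB(1, 2, 0);
        img.setRGB(2, 3, 0);
        img.setRGB(8, 1, 0);
        img.setRGB(9, 4, 0);
        for (int[] r : characterRanges(img)) {
            System.out.println(r[0] + ".." + r[1]); // prints 1..3 then 8..10
        }
    }
}
```

Each range can then be passed to getSubimage(start, 0, end - start, height) and trimmed vertically, as trimUpAndDown does.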

/**
 * Computes a hash of the image, i.e. compresses the image content into a single integer
 * <p>
 * NOTE: intended for small images
 *
 * @param img
 * @return
 */
public static int imgHashCode(BufferedImage img) {
	StringBuilder sb = new StringBuilder();
	for (int i = 0; i < img.getWidth(); i++) {
		for (int j = 0; j < img.getHeight(); j++) {
			sb.append(i).append("|").append(j).append("|").append(img.getRGB(i, j) & 0XFFFFFF).append("|");
		}
	}
	return sb.toString().hashCode();
}
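One caveat about using this integer as a dictionary key: String.hashCode() is only 32 bits wide, so two different glyphs could in principle collide, even though that is unlikely with a few dozen entries. If that ever mattered, an exact pixel comparison could back up the hash. The sketch below is an assumption-labeled illustration: `GlyphCompare` and `samePixels` are names invented here, and `imgHashCode` is reproduced from above only so the example is self-contained.

```java
import java.awt.image.BufferedImage;

public class GlyphCompare {

    /** Exact pixel-by-pixel comparison (alpha ignored); false if the sizes differ. */
    public static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int x = 0; x < a.getWidth(); x++) {
            for (int y = 0; y < a.getHeight(); y++) {
                if ((a.getRGB(x, y) & 0xFFFFFF) != (b.getRGB(x, y) & 0xFFFFFF)) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Same scheme as the article's imgHashCode, repeated here for self-containment. */
    public static int imgHashCode(BufferedImage img) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < img.getWidth(); i++) {
            for (int j = 0; j < img.getHeight(); j++) {
                sb.append(i).append("|").append(j).append("|").append(img.getRGB(i, j) & 0xFFFFFF).append("|");
            }
        }
        return sb.toString().hashCode();
    }

    public static void main(String[] args) {
        BufferedImage a = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        a.setRGB(1, 1, 0xFFFFFF);
        b.setRGB(1, 1, 0xFFFFFF);
        // Identical content always yields identical hashes
        System.out.println(imgHashCode(a) == imgHashCode(b)); // true
        System.out.println(samePixels(a, b));                 // true

        b.setRGB(0, 0, 0xFFFFFF); // change a single pixel
        System.out.println(samePixels(a, b));                 // false
    }
}
```

Equal hashes are necessary but not sufficient for equal images, so a lookup could confirm a hash hit with samePixels against the stored glyph.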

Here is the code that deduplicates while saving:

/**
 * Builds the set of distinct character images
 *
 * @param srcDirectory
 * @param destDirectory
 */
public static void splitCharacter(String srcDirectory, String destDirectory) {
	File file = new File(srcDirectory);
	File[] imgFileArray = file.listFiles();
	Map<Integer, BufferedImage> charDictionary = new HashMap<>();
	for (File imgFile : imgFileArray) {
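		BufferedImage image = null;
		try {
			image = ImageIO.read(imgFile);
		} catch (IOException e) {
			e.printStackTrace();
		}
		if (image == null) {
			// Unreadable or non-image file: skip it instead of failing with a NullPointerException below
			continue;
		}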
		List<BufferedImage> charList = W3cSchoolCaptchaUtil.mattingCharacter(image);
		charList.forEach(x -> {
			int hashcode = W3cSchoolCaptchaUtil.imgHashCode(x);
			System.out.println(hashcode);
			charDictionary.put(hashcode, x);
		});
		System.out.println("split...");
	}
	charDictionary.forEach((k, v) -> {
		try {
			ImageIO.write(v, "png", new File(destDirectory + k + ".png"));
			System.out.println("write...");
		} catch (IOException e) {
			e.printStackTrace();
		}
	});

}

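These are the characters the program derived automatically. The file names do not correspond to the characters yet; they have to be labeled by hand next:

5. Generating the dictionary

Next comes the manual labeling: rename each file to the character its image shows. The result looks like this:

Uppercase letters plus digits should give 36 characters, but there are only 34 here, because the easily confused 0 and O were left out when the CAPTCHAs are generated. So they did think about user experience after all...

Then read every file in this directory and hash each image's content, so that each hash (an integer) maps to the character given by the file name:

/**
 * Generates the character dictionary from the labeled character images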
 *
 * @param charDirectory
 */
public static void genDictionary(String charDirectory) {
	File[] charImgs = new File(charDirectory).listFiles();
	for (File charImgFile : charImgs) {
		try {
			BufferedImage charBufferedImage = ImageIO.read(charImgFile);
			int charHashCode = W3cSchoolCaptchaUtil.imgHashCode(charBufferedImage);
			System.out.printf("charMapping.put(%d, '%c');\n", charHashCode,
					charImgFile.getName().split("\\.")[0].charAt(0));
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

The printed output is the code that initializes the Map, so paste it straight in:

private static Map<Integer, Character> charMapping = new HashMap<>();

static {
	charMapping.put(1844796036, '1');
	charMapping.put(1594429278, '2');
	charMapping.put(-222305694, '3');
	charMapping.put(452270032, '4');
	charMapping.put(-1898118878, '5');
	charMapping.put(999670338, '6');
	charMapping.put(-965770966, '7');
	charMapping.put(-337170896, '8');
	charMapping.put(585835558, '9');
	charMapping.put(-724014232, 'A');
	charMapping.put(-428164778, 'B');
	charMapping.put(-886387444, 'C');
	charMapping.put(1946490946, 'D');
	charMapping.put(416715843, 'E');
	charMapping.put(-917974862, 'F');
	charMapping.put(-764688176, 'G');
	charMapping.put(28434468, 'H');
	charMapping.put(10891004, 'I');
	charMapping.put(-2084516900, 'J');
	charMapping.put(259070252, 'K');
	charMapping.put(1209338035, 'L');
	charMapping.put(486706942, 'M');
	charMapping.put(983181712, 'N');
	charMapping.put(1065112842, 'P');
	charMapping.put(183746070, 'Q');
	charMapping.put(782513722, 'R');
	charMapping.put(-984311436, 'S');
	charMapping.put(-1276745734, 'T');
	charMapping.put(-796848932, 'U');
	charMapping.put(-967446486, 'V');
	charMapping.put(331594374, 'W');
	charMapping.put(1503060590, 'X');
	charMapping.put(-507424510, 'Y');
	charMapping.put(468466871, 'Z');
}

Then, building on the code written so far, add a method that solves a CAPTCHA image:

/**
 * Solves the given CAPTCHA
 *
 * @param captcha
 * @return
 */
public static String ocr(BufferedImage captcha) {
	BufferedImage noiseCleaned = noiseClean(captcha, 20);
	List<BufferedImage> charImageList = mattingCharacter(noiseCleaned);
	// An unknown glyph maps to '?' instead of causing a NullPointerException
	return charImageList.stream().map(x -> String.valueOf(charMapping.getOrDefault(imgHashCode(x), '?'))).collect(joining());
}

6. Verifying the results

Now write some code to verify that the recognition algorithm above is correct:

package bar.ocr.w3cschool;

import org.apache.http.client.fluent.Request;
import org.apache.http.client.fluent.Response;
import org.apache.http.message.BasicNameValuePair;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;

/**
 * Verifies the correctness of the code written above
 *
 * @author CC11001100
 */
public class VerifyAccuracy {

	/**
	 * Runs one verification round and returns whether it succeeded; the result is only used to check the CAPTCHA recognition
	 *
	 * @return
	 */
	private static boolean once() {

		Request request = Request.Get(DownloadCaptcha.CAPTCHA_URL + System.currentTimeMillis());
		Response response = null;
		String captchaString = "";
		try {
			response = request.connectTimeout(2000).socketTimeout(2000).execute();
			BufferedImage captchaImg = ImageIO.read(response.returnContent().asStream());
			captchaString = W3cSchoolCaptchaUtil.ocr(captchaImg);
			System.out.printf("captcha is: %s\n", captchaString);
		} catch (IOException e) {
			e.printStackTrace();
			return false;
		} finally {
			if (response != null) {
				response.discardContent();
			}
		}

		Request postSms = Request.Post("https://www.w3cschool.cn/sendsmscode");
		// Use an invalid phone number: the backend validates it, so no SMS is actually sent; otherwise.... - -
		postSms.bodyForm(new BasicNameValuePair("mphone", "123456789"), //
				new BasicNameValuePair("type", "findpwd"), //
				new BasicNameValuePair("scode", captchaString));
		try {
			response = postSms.socketTimeout(2000).connectTimeout(2000).execute();
			String json = response.returnContent().asString();
			System.out.printf("response is: %s\n", json);
			return !json.contains("验证码错误");
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			if (response != null) {
				response.discardContent();
			}
		}

		return false;
	}

	public static void main(String[] args) {

		int totalTimes = 100;
		int successCount = 0;
		for (int i = 0; i < totalTimes; i++) {
			System.out.printf("%d :\n", i + 1);
			if (once()) {
				successCount++;
				System.out.println("ocr success");
			} else {
				System.out.println("ocr failed");
			}
			System.out.println();
		}
		System.out.printf("success times %d, accuracy is %g%%\n", successCount, 1.0 * successCount / totalTimes * 100);

	}

}

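Run it and look at the result:

Because the font never changes at all, direct comparison can reach 100% accuracy.

Summary: when the font style and so on never change, there is no need to show off with model training; direct comparison already gives 100% accuracy, provided the denoising is done well.

The complete code follows: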

DownloadCaptcha.java:
package bar.ocr.w3cschool;

import org.apache.http.client.fluent.Request;
import org.apache.http.client.fluent.Response;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * @author CC11001100
 */
public class DownloadCaptcha {

	/**
	 * CAPTCHA download URL
	 */
	public static final String CAPTCHA_URL = "https://www.w3cschool.cn/scode?rand=";

	public static void download(String saveDirectory, int howMany) {

		Random random = new Random();
		ExecutorService executorService = Executors.newFixedThreadPool(10);

		while (howMany-- > 0) {
			executorService.submit(() -> {
				Response response = null;
				try {
					long currentMillis = System.currentTimeMillis();
					Request request = Request.Get(CAPTCHA_URL + currentMillis);
					response = request.connectTimeout(2000).socketTimeout(2000).execute();
					response.saveContent(new File(saveDirectory + random.nextLong() + ".png"));
					System.out.println("download...");
				} catch (IOException e) {
					e.printStackTrace();
				} finally {
					if (response != null) {
						response.discardContent();
					}
				}
			});
		}

		try {
			executorService.shutdown();
			executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
		} catch (InterruptedException e) {
			e.printStackTrace();
		}

	}

	/**
	 * Removes noise pixels, noise blobs and the like
	 *
	 * @param srcDirectory
	 * @param destDirectory
	 */
	public static void processNoise(String srcDirectory, String destDirectory) {
		File file = new File(srcDirectory);
		File[] imgFileArray = file.listFiles();
		for (File imgFile : imgFileArray) {
			try {
				BufferedImage image = ImageIO.read(imgFile);
				BufferedImage noiseCleanImage = W3cSchoolCaptchaUtil.noiseClean(image, 20);
				ImageIO.write(noiseCleanImage, "png", new File(destDirectory + imgFile.getName()));
				System.out.println("process noise...");
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}

	/**
	 * Builds the set of distinct character images
	 *
	 * @param srcDirectory
	 * @param destDirectory
	 */
	public static void splitCharacter(String srcDirectory, String destDirectory) {
		File file = new File(srcDirectory);
		File[] imgFileArray = file.listFiles();
		Map<Integer, BufferedImage> charDictionary = new HashMap<>();
		for (File imgFile : imgFileArray) {
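			BufferedImage image = null;
			try {
				image = ImageIO.read(imgFile);
			} catch (IOException e) {
				e.printStackTrace();
			}
			if (image == null) {
				// Unreadable or non-image file: skip it instead of failing with a NullPointerException below
				continue;
			}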
			List<BufferedImage> charList = W3cSchoolCaptchaUtil.mattingCharacter(image);
			charList.forEach(x -> {
				int hashcode = W3cSchoolCaptchaUtil.imgHashCode(x);
				System.out.println(hashcode);
				charDictionary.put(hashcode, x);
			});
			System.out.println("split...");
		}
		charDictionary.forEach((k, v) -> {
			try {
				ImageIO.write(v, "png", new File(destDirectory + k + ".png"));
				System.out.println("write...");
			} catch (IOException e) {
				e.printStackTrace();
			}
		});

	}

	/**
	 * Generates the character dictionary from the labeled character images
	 *
	 * @param charDirectory
	 */
	public static void genDictionary(String charDirectory) {
		File[] charImgs = new File(charDirectory).listFiles();
		for (File charImgFile : charImgs) {
			try {
				BufferedImage charBufferedImage = ImageIO.read(charImgFile);
				int charHashCode = W3cSchoolCaptchaUtil.imgHashCode(charBufferedImage);
				System.out.printf("charMapping.put(%d, '%c');\n", charHashCode,
						charImgFile.getName().split("\\.")[0].charAt(0));
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}

	public static void main(String[] args) {

		//		download("D:/test/ocr/w3cschool/original/", 5000);
		//		processNoise("D:/test/ocr/w3cschool/original", "D:/test/ocr/w3cschool/stage01/");
		//		splitCharacter("D:/test/ocr/w3cschool/stage01", "D:/test/ocr/w3cschool/stage02/");

		genDictionary("D:/test/ocr/w3cschool/stage03");

	}

}
W3cSchoolCaptchaUtil.java:
package bar.ocr.w3cschool;

import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static java.util.stream.Collectors.joining;

/**
 * @author CC11001100
 */
public class W3cSchoolCaptchaUtil {

	private static Map<Integer, Character> charMapping = new HashMap<>();

	static {
		charMapping.put(1844796036, '1');
		charMapping.put(1594429278, '2');
		charMapping.put(-222305694, '3');
		charMapping.put(452270032, '4');
		charMapping.put(-1898118878, '5');
		charMapping.put(999670338, '6');
		charMapping.put(-965770966, '7');
		charMapping.put(-337170896, '8');
		charMapping.put(585835558, '9');
		charMapping.put(-724014232, 'A');
		charMapping.put(-428164778, 'B');
		charMapping.put(-886387444, 'C');
		charMapping.put(1946490946, 'D');
		charMapping.put(416715843, 'E');
		charMapping.put(-917974862, 'F');
		charMapping.put(-764688176, 'G');
		charMapping.put(28434468, 'H');
		charMapping.put(10891004, 'I');
		charMapping.put(-2084516900, 'J');
		charMapping.put(259070252, 'K');
		charMapping.put(1209338035, 'L');
		charMapping.put(486706942, 'M');
		charMapping.put(983181712, 'N');
		charMapping.put(1065112842, 'P');
		charMapping.put(183746070, 'Q');
		charMapping.put(782513722, 'R');
		charMapping.put(-984311436, 'S');
		charMapping.put(-1276745734, 'T');
		charMapping.put(-796848932, 'U');
		charMapping.put(-967446486, 'V');
		charMapping.put(331594374, 'W');
		charMapping.put(1503060590, 'X');
		charMapping.put(-507424510, 'Y');
		charMapping.put(468466871, 'Z');
	}

	/**
	 * Removes noise, judged by connected-component size
	 *
	 * @param originalCaptcha the original CAPTCHA image
	 * @param areaSizeFilter  connected components of this size or smaller are filtered out
	 * @return
	 */
	public static BufferedImage noiseClean(BufferedImage originalCaptcha, int areaSizeFilter) {

		// There are some interfering border lines, so crop the edges away
		int edgeDropWidth = 15;
		BufferedImage captcha = originalCaptcha.getSubimage(edgeDropWidth / 2, edgeDropWidth / 2,  //
				originalCaptcha.getWidth() - edgeDropWidth, originalCaptcha.getHeight() - edgeDropWidth);

		int w = captcha.getWidth();
		int h = captcha.getHeight();
		int[][] book = new int[w][h];

		// The largest connected component is treated as the background, which detects the background color automatically
		Map<Integer, Integer> flagAreaSizeMap = new HashMap<>();
		int currentFlag = 1;
		int maxAreaSizeFlag = currentFlag;
		int maxAreaSizeColor = 0XFFFFFFFF;

		// Label every connected component
		for (int i = 0; i < w; i++) {
			for (int j = 0; j < h; j++) {

				if (book[i][j] != 0) {
					continue;
				}

				book[i][j] = currentFlag;
				int currentColor = captcha.getRGB(i, j);
				int areaSize = waterFlow(captcha, book, i, j, currentColor, currentFlag);

				if (areaSize > flagAreaSizeMap.getOrDefault(maxAreaSizeFlag, 0)) {
					maxAreaSizeFlag = currentFlag;
					maxAreaSizeColor = currentColor;
				}

				flagAreaSizeMap.put(currentFlag, areaSize);
				currentFlag++;
			}
		}

		// Copy, whitening the background and the small components
		BufferedImage resultImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
		for (int i = 0; i < w; i++) {
			for (int j = 0; j < h; j++) {
				int currentColor = captcha.getRGB(i, j);
				if (book[i][j] == maxAreaSizeFlag //
						|| (currentColor & 0XFFFFFF) == (maxAreaSizeColor & 0XFFFFFF) //
						|| flagAreaSizeMap.get(book[i][j]) <= areaSizeFilter) {
					resultImage.setRGB(i, j, 0XFFFFFFFF);
				} else {
					resultImage.setRGB(i, j, currentColor);
				}
			}
		}
		return resultImage;
	}

	/**
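	 * Flood fill: marks every pixel that is 8-connected to (x, y) and has the same color with flag in book, and returns the size of that region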
	 *
	 * @param img
	 * @param book
	 * @param x
	 * @param y
	 * @param color
	 * @param flag
	 * @return
	 */
	private static int waterFlow(BufferedImage img, int[][] book, int x, int y, int color, int flag) {

		if (x < 0 || x >= img.getWidth() || y < 0 || y >= img.getHeight()) {
			return 0;
		}

		// This 1 counts the current pixel
		int areaSize = 1;
		for (int i = -1; i <= 1; i++) {
			for (int j = -1; j <= 1; j++) {
				int nextX = x + i;
				int nextY = y + j;

				if (nextX < 0 || nextX >= img.getWidth() || nextY < 0 || nextY >= img.getHeight()) {
					continue;
				}

				// If this pixel has not been visited yet and has the same color
				//				if (book[nextX][nextY] == 0 && isSimilar(img.getRGB(nextX, nextY), color, 0)) {
				if (book[nextX][nextY] == 0 && (img.getRGB(nextX, nextY) & 0XFFFFFF) == (color & 0XFFFFFF)) {
					book[nextX][nextY] = flag;
					areaSize += waterFlow(img, book, nextX, nextY, color, flag);
				}

			}
		}

		return areaSize;
	}

	//	/**
	//	 * Checks whether two pixels are similar
	//	 *
	//	 * @param rgb1
	//	 * @param rgb2
	//	 * @param distance
	//	 * @return
	//	 */
	//	private static boolean isSimilar(int rgb1, int rgb2, int distance) {
	//		// Parentheses are required: >> binds more tightly than & in Java
	//		int r1 = (rgb1 & 0XFF0000) >> 16;
	//		int g1 = (rgb1 & 0X00FF00) >> 8;
	//		int b1 = rgb1 & 0X0000FF;
	//
	//		int r2 = (rgb2 & 0XFF0000) >> 16;
	//		int g2 = (rgb2 & 0X00FF00) >> 8;
	//		int b2 = rgb2 & 0X0000FF;
	//
	//		return (Math.abs(r1 - r2) <= distance) && (Math.abs(g1 - g2) <= distance) && (Math.abs(b1 - b2) <= distance);
	//	}

	/**
	 * Splits the image into characters
	 *
	 * @param img
	 * @return
	 */
	public static List<BufferedImage> mattingCharacter(BufferedImage img) {
		List<BufferedImage> list = new ArrayList<>();

		int w = img.getWidth();
		int h = img.getHeight();

		boolean lastColumnIsBlack = true;
		int beginColumn = -1;

		for (int i = 0; i < w; i++) {

			boolean currentColumnIsBlack = true;
			for (int j = 0; j < h; j++) {
				if ((img.getRGB(i, j) & 0XFFFFFF) != 0XFFFFFF) {
					currentColumnIsBlack = false;
				}
			}

			// Entering a character region
			if (lastColumnIsBlack && !currentColumnIsBlack) {
				beginColumn = i;
			} else if (!lastColumnIsBlack && currentColumnIsBlack) {
				// Leaving a character region
				BufferedImage charImage = img.getSubimage(beginColumn, 0, i - beginColumn, h);
				BufferedImage trimCharImage = trimUpAndDown(charImage);
				list.add(trimCharImage);
			}

			lastColumnIsBlack = currentColumnIsBlack;

		}

		return list;
	}

	private static BufferedImage trimUpAndDown(BufferedImage img) {
		int w = img.getWidth();
		int h = img.getHeight();

		// Find where the blank rows at the top end
		int upBeginLine = -1;
		for (int i = 0; i < h; i++) {

			boolean currentColumnIsBlack = true;
			for (int j = 0; j < w; j++) {
				if ((img.getRGB(j, i) & 0XFFFFFF) != 0XFFFFFF) {
					currentColumnIsBlack = false;
				}
			}

			if (!currentColumnIsBlack) {
				upBeginLine = i;
				break;
			}

		}

		// Find where the blank rows at the bottom begin
		int downBeginLine = -1;
		for (int i = h - 1; i >= 0; i--) {

			boolean currentColumnIsBlack = true;
			for (int j = 0; j < w; j++) {
				if ((img.getRGB(j, i) & 0XFFFFFF) != 0XFFFFFF) {
					currentColumnIsBlack = false;
				}
			}

			if (!currentColumnIsBlack) {
				downBeginLine = i;
				break;
			}
		}

		return img.getSubimage(0, upBeginLine, w, downBeginLine - upBeginLine + 1);
	}

	/**
	 * Computes a hash of the image, i.e. compresses the image content into a single integer
	 * <p>
	 * NOTE: intended for small images
	 *
	 * @param img
	 * @return
	 */
	public static int imgHashCode(BufferedImage img) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < img.getWidth(); i++) {
			for (int j = 0; j < img.getHeight(); j++) {
				sb.append(i).append("|").append(j).append("|").append(img.getRGB(i, j) & 0XFFFFFF).append("|");
			}
		}
		return sb.toString().hashCode();
	}

	/**
	 * Solves the given CAPTCHA
	 *
	 * @param captcha
	 * @return
	 */
	public static String ocr(BufferedImage captcha) {
		BufferedImage noiseCleaned = noiseClean(captcha, 20);
		List<BufferedImage> charImageList = mattingCharacter(noiseCleaned);
		// An unknown glyph maps to '?' instead of causing a NullPointerException
		return charImageList.stream().map(x -> String.valueOf(charMapping.getOrDefault(imgHashCode(x), '?'))).collect(joining());
	}

}

References:

1. https://www.w3cschool.cn/checkmphone?type=findpwd

2. https://www.w3cschool.cn/scode

3. Image CAPTCHA Recognition (5): Removing Noise


Original article: https://www.cnblogs.com/cc11001100/p/8364016.html

Time: 2024-10-30 07:37:02
