This little Java crawler does exactly one thing: it scrapes email addresses off the web. It uses Java I/O and regular expressions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailSpider {

    public static void main(String[] args) throws IOException {
        // List<String> list = getEmail();
        List<String> list = getEmailFromWeb();
        for (String string : list) {
            System.out.println(string);
        }
    }

    public static List<String> getEmail() throws IOException {
        // 1. Read the local source file
        BufferedReader bufferedReader = new BufferedReader(new FileReader("G:\\index.htm"));
        // 2. Match the data against the rule: word chars, '@', host, then 1-3 dot-suffixes of 2-3 letters
        String regex_email = "\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,3}";
        Pattern pattern = Pattern.compile(regex_email);
        String line;
        List<String> list = new ArrayList<>();
        while ((line = bufferedReader.readLine()) != null) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                list.add(matcher.group());
            }
        }
        bufferedReader.close();
        return list;
    }

    public static List<String> getEmailFromWeb() throws IOException {
        // 1. Read the web source file
        URL url = new URL("http://news.qq.com/zt2015/wxghz/index.htm");
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
        // 2. Match the data against the rule
        String regex_email = "\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,2}";
        Pattern pattern = Pattern.compile(regex_email);
        String line;
        List<String> list = new ArrayList<>();
        while ((line = bufferedReader.readLine()) != null) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                list.add(matcher.group());
            }
        }
        bufferedReader.close();
        return list;
    }
}
output:
[email protected]
Haha, it crawls a page from Tencent News.
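The matching rule can be exercised on its own before wiring it to a file or URL stream. Below is a minimal sketch using the same `Pattern`/`Matcher` API; the class name, sample line, and addresses in it are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexDemo {
    // Same rule shape as the crawler uses: word chars, '@', host,
    // then one to three dot-suffixes of 2-3 letters (so ".com.cn" also matches)
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,3}");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical line, like one a page's HTML source might contain
        String line = "contact alice@example.com or news@qq.com.cn, not-an-email@";
        System.out.println(extract(line));
        // prints [alice@example.com, news@qq.com.cn]
    }
}
```

Note that `news@qq.com.cn` is captured in full because the `{1,3}` quantifier allows the dot-suffix group to repeat, while the trailing `not-an-email@` finds no word characters after the `@` and is skipped.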
Posted: 2024-10-14 20:49:43