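A recent project required me to extract strings from English text for topic clustering, so I spent a day learning Java regular expressions. The small examples below are my practice exercises. If anything in them is unreasonable, corrections are welcome!
1. This example filters URLs out of English text and prints the filtered strings
First, here is the English text I need to filter. I saved it in a file named englishtxt.txt, with the following content: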
www.baidu.com
银行挤兑:可能引发下一轮金融危机的盲点 http://mp.weixin.qq.com/s?__biz=MjM5MDY4Mzg2MA==&mid=200223248&idx=1&sn=a5b668754a60a8e07f335bd59521fb03#rd?…
Beijing CBD right now 01 pic.twitter.com/zCNP4CFrrk
I see more and more Chinese ask the same question online: what if most #MH370 passengers were Americans; how would the US government react?
10:27:01 Chinese Net friend expectations http://chinafree.greatzhonghua.org/showthread.php?tid=5377?… Chinese Net friend expectations -...
01:47:01 Times silly and fantastic notions, Gu Xiaojun Thought Yiu glorious http://chinafree.greatzhonghua.org/showthread.php?tid=4969?… T...
[強國空氣問題比愛滋更嚴重] China Smog at Center of <<Air Pollution Deaths Cited>> by WHO http://bloom.bg/1rqNRBP? /via @BloombergNews
[Android 高登仔] LIHK 已重生,你會花 HK$10 買嗎? https://play.google.com/store/apps/details?id=com.lihk.hkgolden.app.reborn?…
#Taiwan protests: Water cannons are an indiscriminate tool for dispersing protesters & can result in serious injury
NASA 的新太空衣... http://jscfeatures.jsc.nasa.gov/z2/?
PHOTOS: Marijuana through the years http://ow.ly/uXzuq? (AP Photo/DEA) pic.twitter.com/4LSP4nlLMQ
Protest in Taiwan http://blog.flickr.net/en/2014/03/24/protest-in-taiwan/?… /via @flickr
[原來昨天說的那位嬰兒已經...] Baby born on board diverted Cathay flight dies http://www.scmp.com/news/hong-kong/article/1456417/baby-born-board-diverted-cathay-flight-dies?… /via @SCMP_News
What does Apple think about the lack of diversity in emojis? We have their response. http://on.mtv.com/OWu6D7? /via @MTVact
Linkin Park releases customizable music video powered by Xbox‘s Project Spark http://www.theverge.com/2014/3/25/5546982/linkin-park-releases-customizable-music-video-powered-by-xboxs?…
Full draw for @afcasiancup 2015 is here pic.twitter.com/nrYJo1mm9G #AC2015
Interesting draw RT @afcasiancup: Group B: Saudi Arabia, China PR, DPR Korea, Uzbekistan #AC2015
Finally: @emirates are activating their Twitter account.
Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet —SPA pic.twitter.com/NDSGJVbXTs
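As the file shows, the text contains a large number of URLs. Feeding it directly into topic clustering would introduce a lot of noisy data, so the URLs need to be removed first. My code is as follows: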
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLMatcher {
    public static void main(String[] args) throws IOException {
        // Compile the word pattern once, outside the read loop.
        Pattern ptn = Pattern.compile("\\w+");
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader("D://englishtxt.txt"))) {
            System.out.println("开始从文本中读数据"); // "Start reading data from the text"
            String line;
            while ((line = br.readLine()) != null) {
                String value = line
                        // Remove URLs: optional scheme, dotted host, optional port, then the rest up to '#' or whitespace.
                        .replaceAll("(http://|https://|ftp://)?(\\w+\\.)+\\w+(:\\d*)?([^#\\s]*)", "")
                        // Remove punctuation and digits. ASSUMPTION: this character class is a reconstruction,
                        // because the published listing was garbled here; digits are included since the
                        // sample output contains none.
                        .replaceAll("[\\d\\/?:;!@#$%^&*+()【】<>.-]", "");
                // Keep only the remaining ASCII word tokens, separated by single spaces.
                StringBuilder strb = new StringBuilder();
                Matcher mch = ptn.matcher(value);
                while (mch.find()) {
                    strb.append(mch.group()).append(" ");
                }
                System.out.println(strb.toString());
            }
        }
    }
}
The code above not only filters out the URLs but also strips some special punctuation.
The output is as follows:
开始从文本中读数据 rd Beijing CBD right now I see more and more Chinese ask the same question online what if most MH passengers were Americans how would the US government react Chinese Net friend expectations Chinese Net friend expectations Times silly and fantastic notions Gu Xiaojun Thought Yiu glorious T China Smog at Center of Air Pollution Deaths Cited by WHO via BloombergNews Android LIHK HK I Taiwan protests Water cannons are an indiscriminate tool for dispersing protesters can result in serious injury NASA PHOTOS Marijuana through the years AP PhotoDEA Protest in Taiwan via flickr f Baby born on board diverted Cathay flight dies via SCMP News What does Apple think about the lack of diversity in emojis We have their response via MTVact Linkin Park releases customizable music video powered by Xbox s Project Spark Full draw for afcasiancup is here AC Interesting draw RT afcasiancup Group B Saudi Arabia China PR DPR Korea Uzbekistan AC Finally emirates are activating their Twitter account Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet SPA
As the output above shows, essentially all of the URLs have been filtered out.
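To try the URL-stripping step without a file on disk, the following is a minimal sketch that applies the same URL pattern to a few in-memory lines taken from the sample text. The class name `UrlStripDemo` and the method `stripUrls` are illustrative names, and collapsing the leftover whitespace is an addition that the original code does not perform.

```java
import java.util.List;
import java.util.regex.Pattern;

final class UrlStripDemo {
    // Same shape as the pattern in the example above, with non-capturing groups:
    // optional scheme, dotted host, optional port, then the rest up to '#' or whitespace.
    private static final Pattern URL = Pattern.compile(
            "(?:http://|https://|ftp://)?(?:\\w+\\.)+\\w+(?::\\d*)?[^#\\s]*");

    // Removes every URL-like token, then collapses the whitespace left behind (an addition).
    static String stripUrls(String line) {
        String withoutUrls = URL.matcher(line).replaceAll("");
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "www.baidu.com",
                "Beijing CBD right now 01 pic.twitter.com/zCNP4CFrrk",
                "Protest in Taiwan http://blog.flickr.net/en/2014/03/24/protest-in-taiwan/ /via @flickr");
        for (String sample : samples) {
            // Brackets make empty results visible.
            System.out.println("[" + stripUrls(sample) + "]");
        }
    }
}
```

Because the scheme is optional, the pattern treats any dotted run of word characters as a URL, so a token such as `3.14` is removed as well. If that matters for the clustering input, requiring the scheme or a known top-level domain may be a safer choice.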
2. The next small example matches US Social Security numbers
The code is as follows:
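import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeNumMatcher { // wrapper class name is illustrative
    public static void main(String[] args) {
        String safeNum = "This is a safe num 999-99-9999,this is the second num 456003348,this is the third num 456-909090,this is the fourth num 45677-0764";
        // Three digits, two digits, four digits; each hyphen is optional on its own.
        Pattern ptn = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");
        Matcher mch = ptn.matcher(safeNum);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}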
The final output is:
999-99-9999
456003348
456-909090
45677-0764
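Since each hyphen is optional independently, the pattern also accepts mixed forms such as 456-909090 and 45677-0764. If only the fully hyphenated and fully unhyphenated forms should count, one option is to capture the first separator and require the second to be identical with a backreference. The sketch below assumes that stricter reading of the 3-2-4 digit layout; `StrictSsnDemo` and `findAll` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictSsnDemo {
    // (?<!\d) and (?!\d) keep the match from starting or ending inside a longer digit run.
    // (-?) captures the first separator (possibly empty); \1 requires the second to be the same.
    private static final Pattern SSN = Pattern.compile(
            "(?<!\\d)\\d{3}(-?)\\d{2}\\1\\d{4}(?!\\d)");

    // Returns every match in order of appearance.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "first 999-99-9999,second 456003348,third 456-909090,fourth 45677-0764";
        // Only the consistently formatted numbers are reported.
        System.out.println(findAll(text)); // prints [999-99-9999, 456003348]
    }
}
```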
3. This small example matches dates in English text
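import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateMatcher { // wrapper class name is illustrative
    public static void main(String[] args) {
        String strDate = "this is a date June 26,1951";
        // Month name, one whitespace character, 1-2 digit day, comma, optional whitespace, 4-digit year.
        Pattern ptn = Pattern.compile("([a-zA-Z]+)\\s[0-9]{1,2},\\s*[0-9]{4}");
        Matcher mch = ptn.matcher(strDate);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}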
The output is:
June 26,1951
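The pattern above only reports the matched text. When the month, day, and year are needed separately, named capture groups can pull them out, and `java.time` can then check that the month word is a real month. The following is a rough sketch under those assumptions; `DateGroupsDemo`, `firstDate`, and the group names are illustrative and not part of the original example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateGroupsDemo {
    // Same shape as the pattern above, with a named group for each part.
    private static final Pattern DATE = Pattern.compile(
            "(?<month>[a-zA-Z]+)\\s(?<day>[0-9]{1,2}),\\s*(?<year>[0-9]{4})");
    // Full English month name, day, year, separated by single spaces.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MMMM d yyyy", Locale.ENGLISH);

    // Returns the first date found, or empty if nothing matches or the match is not a valid date.
    static Optional<LocalDate> firstDate(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Rebuild the match in one fixed layout so a single formatter can parse it.
        String normalized = m.group("month") + " " + m.group("day") + " " + m.group("year");
        try {
            return Optional.of(LocalDate.parse(normalized, FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstDate("this is a date June 26,1951")); // prints Optional[1951-06-26]
    }
}
```

This sketch only validates the first regex match and expects the month name capitalized as in the sample, so lowercase or abbreviated months would come back empty.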
These three small examples are the exercises I wrote while learning regular expressions. I hope they are helpful for your own learning!
Time: 2024-11-04 07:15:09