[Python Regular Expressions] Matching XML Tags in a String

  Here is the requirement. Suppose we are given data like the following:

2009-2-12 9:22:2 #### valentine s day 2011 #### sex is good for you ####  Making love pleasures life genuinely good say researchers does healthy sex life boost mood growing evidence boosts physical increasing longevity reducing risk erectile dysfunction heart attack month researchers Nottingham University concluded men kept regular sex life lower risk developing prostate cancer onversely sexual activity times month increase risk In fact research suggest men particularly older men benefit healthy effects sex Feel good hormones help explain benefits mood boosting explanation obvious thing clear applies men women need having sex regularly don want lose ability Use lose advice given older men Finnish scientists recently followed men aged years sex week start study twice likely develop erectile dysfunction see below week sex times week lowered risk fourfold women older oestrogen levels drop says Dr Peter Bowen Simpkins consultant gynaecologist London Women linic spokesperson Royal ollege Obstetricians Gynaecologists hormone key woman sexual enjoyment lower levels make sex uncomfortable explains American research menopausal women sex week oestrogen levels twice high abstaining counterparts Agencies
2009-2-13 8:37:21 #### valentine s day 2012 #### you should believe in love at first sight  ####  evidently case love sight swam new comer caressingly overtures affection harles Darwin described female mallard duck infatuated male pintail duck duck different species make mistakes Darwin believed animals feel romantic love male blackbird female thrush black grouse pheasant stickleback fish creatures reported fell love Scientists endorsed view despite vast evidence Darwin right Hundreds articles written mate choice habit creatures express attraction assiduously avoiding fact animal literature uses terms favoritism including mate preference selective proceptivity individual preference favoritism sexual choice mate choice scientists recorded core elements romantic love creatures carefully creatures rhinos butterflies focus mating energy specific preferred individual time Focus special central component human romantic love creatures obsessively follow humans stroke kiss nip nuzzle pat tap lick tug chase chosen behaviors regularly humans sing dance strut preen beloved just like men women ourting creatures great small excessive energy sleeplessness core traits human romantic passion adversity heightens pursuit just barriers intensify romantic love animals possessive jealously guarding mate breeding time passed animals express magnetism seconds infatuated hours days weeks just Darwin said fall love sight Violet female pug Elizabeth Marshall Thomas wrote Violet feelings pug Bingo saying moment set eyes adored Wanting near lavish affection followed went sound voice bark instant attraction happened Thomas Jefferson Historian Fawn Brodie wrote Jefferson told advance Maria osway irrelevant man fell love single afternoon Today evidence Darwin right colleagues madly love brain scanner fMRI mapped brain pathways generate feelings romance dramatic activity occurs reward wanting brain gives lovers focus energy ecstasy motivation seek life greatest prize mating partner brain active 
mammals express attraction did Darwin manage continuity man beast suspect biology played role new book maintain humanity evolved broad basic styles thinking behaving based brain chemistry Men women dubbed Explorers especially express dopamine systems predisposing risks seek novelty curious creative energetic impulsive flexible optimistic Builders express serotonin predisposing calm social networking cautious loyal managerial traditional Directors high testosterone type analytical strategic direct decisive tough minded competitive good understanding math machines spatial rule based systems Negotiators women men particularly expressive estrogen giving broad holistic contextual view exquisite imagination intuition verbal skills emotional expressivity compassion Darwin believe Negotiator man predisposed vast physical emotional connections living creatures Darwin wrote books scientific papers phenomena varied orchids barnacles earthworms grand synthesizing theories natural selection sexual selection explain evolution proliferation life earth Biologist Richard Dawkins called set principles important idea occur human mind someday scientists laymen alike come understand Darwin mind including told years ago animals share human drive love

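  The task: line by line, replace the spaces in the text inside each <></> tag pair with underscores (_), and convert the data to a new form. For example, <X>A B C</X> must become A_B_C/X.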

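  Regex quantifiers are greedy by default, meaning they match as far to the right as they can, which makes this awkward: a greedy match runs from the first opening tag to the last closing tag on the line. Adding ? makes the quantifier lazy, but on its own that does not guarantee the closing tag has the same name as the opening one. So the tag name matched earlier is given a group name, and the closing tag refers back to it.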
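To see the difference in practice, here is a minimal sketch. The sample strings are made up for illustration, and tag names are matched with \w+ rather than the \S+ used in this post, purely so that the mismatched-tag case is easy to read.

```python
import re

# Hypothetical sample lines, invented for this sketch.
line = "<PERSON>Tom Hooper</PERSON> met <LOCATION>LOS ANGELES</LOCATION>"
mismatch = "<A>x</B> y</A>"

# Greedy: .+ runs to the last closing tag on the line.
greedy = re.findall(r"<\w+>.+</\w+>", line)

# Lazy: .+? stops at the first closing tag, whatever its name is.
lazy = re.findall(r"<\w+>.+?</\w+>", line)
lazy_mismatch = re.findall(r"<\w+>.+?</\w+>", mismatch)

# Named group + backreference: the closing tag must repeat the opening name.
named_re = re.compile(r"<(?P<label>\w+)>.+?</(?P=label)>")
named_mismatch = [m.group(0) for m in named_re.finditer(mismatch)]

print(greedy)          # one match covering the whole line
print(lazy)            # two separate tagged substrings
print(lazy_mismatch)   # ['<A>x</B>']
print(named_mismatch)  # ['<A>x</B> y</A>']
```

The last two results show why laziness alone is not enough: only the backreference forces the opening and closing names to agree.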

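In Python, the final regex looks like this:

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')

  Next we need to find the matching substrings in the original string and record them, then precompute what each one should be replaced with. One more regex is enough for that:

LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')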

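  Map over the whole collection of strings, match each string against both patterns, then zip the two sets of results together. This gives (original, replacement) tuples, roughly like this: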

('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')
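As a rough illustration of how such tuples come out of the two patterns, the sketch below runs both on a single line. The sample line is invented for this example and does not come from the real data file.

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

# Hypothetical sample line, made up for illustration.
sample = 'director <PERSON>Tom Hooper</PERSON> in <LOCATION>LOS ANGELES</LOCATION>'

# One tuple per match: (whole tagged substring, label).
whole = LABEL_PATTERN.findall(sample)
# One tuple per match: (label, text between the tags).
inner = LABEL_CONTENT_PATTERN.findall(sample)

# Pair each tagged substring with its replacement form.
pairs = [(w[0], text.replace(' ', '_') + '/' + label)
         for w, (label, text) in zip(whole, inner)]

for pair in pairs:
    print(pair)
```

Because findall returns tuples as soon as a pattern has more than one group, the whole tagged substring has to be picked out with index 0.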

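  The processing code is as follows:

# Uses os, re and the two compiled patterns defined above.
def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with statement closes the file, so no try/finally is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Pair every tagged substring with its replacement,
    # e.g. ('<X>A B C</X>', 'A_B_C/X').
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
# One list of (original, replacement) tuples per input line.
pattern = [get_label(each) for each in src]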

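  After that, a simple pass finishes the job:

for i in range(len(src)):
    for pat in pattern[i]:
        # pat[0] is literal text, not a regex, so use str.replace;
        # re.sub would misread characters such as ( or + in the text.
        src[i] = src[i].replace(pat[0], pat[1])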
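For comparison, the find-then-replace steps can also be folded into one pass by giving re.sub a function as the replacement. This is a sketch of an alternative, not the approach taken in this post; the names TAG_PATTERN, to_slash_form and convert_line are made up for illustration.

```python
import re

# Same idea as LABEL_CONTENT_PATTERN, with the inner text in a named group.
TAG_PATTERN = re.compile(r'<(?P<label>\S+)>(?P<text>.*?)</(?P=label)>')

def to_slash_form(match):
    # Build 'A_B_C/X' from a match of '<X>A B C</X>'.
    return match.group('text').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    # re.sub calls to_slash_form once for each non-overlapping match.
    return TAG_PATTERN.sub(to_slash_form, line)

print(convert_line('<X>A B C</X> and <PERSON>Tom Hooper</PERSON>'))
# A_B_C/X and Tom_Hooper/PERSON
```

With a callback the matched text is never reused as a pattern, so regex metacharacters inside the tagged text need no escaping.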

  Full code:

# -*- coding: utf-8 -*-
import os
import re

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
# Group 1 is the whole tagged substring; "label" is the tag name.
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
# "label" is the tag name; group 2 is the text between the tags.
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print("path : '" + path + "' not found.")
        return []
    # The with statement closes the file, so no try/finally is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Pair every tagged substring with its replacement,
    # e.g. ('<X>A B C</X>', 'A_B_C/X').
    pair = zip(LABEL_PATTERN.findall(each),
               map(lambda x: x[1].replace(' ', '_') + '/' + x[0],
                   LABEL_CONTENT_PATTERN.findall(each)))
    return [(x[0][0], x[1]) for x in pair]

src = read_file(FILE_PATH)
# One list of (original, replacement) tuples per input line.
pattern = [get_label(each) for each in src]

for i in range(len(src)):
    for pat in pattern[i]:
        # pat[0] is literal text, not a regex, so use str.replace.
        src[i] = src[i].replace(pat[0], pat[1])
Time: 2024-10-16 09:54:14
