[Python Regular Expressions] Matching XML Tags in a String / 憋错料

Suppose we have the following requirement. We are given data like this:

2009-2-12 9:22:2 #### valentine s day 2011 #### sex is good for you ####  Making love pleasures life genuinely good say researchers does healthy sex life boost mood growing evidence boosts physical increasing longevity reducing risk erectile dysfunction heart attack month researchers Nottingham University concluded men kept regular sex life lower risk developing prostate cancer onversely sexual activity times month increase risk In fact research suggest men particularly older men benefit healthy effects sex Feel good hormones help explain benefits mood boosting explanation obvious thing clear applies men women need having sex regularly don want lose ability Use lose advice given older men Finnish scientists recently followed men aged years sex week start study twice likely develop erectile dysfunction see below week sex times week lowered risk fourfold women older oestrogen levels drop says Dr Peter Bowen Simpkins consultant gynaecologist London Women linic spokesperson Royal ollege Obstetricians Gynaecologists hormone key woman sexual enjoyment lower levels make sex uncomfortable explains American research menopausal women sex week oestrogen levels twice high abstaining counterparts Agencies
2009-2-13 8:37:21 #### valentine s day 2012 #### you should believe in love at first sight  ####  evidently case love sight swam new comer caressingly overtures affection harles Darwin described female mallard duck infatuated male pintail duck duck different species make mistakes Darwin believed animals feel romantic love male blackbird female thrush black grouse pheasant stickleback fish creatures reported fell love Scientists endorsed view despite vast evidence Darwin right Hundreds articles written mate choice habit creatures express attraction assiduously avoiding fact animal literature uses terms favoritism including mate preference selective proceptivity individual preference favoritism sexual choice mate choice scientists recorded core elements romantic love creatures carefully creatures rhinos butterflies focus mating energy specific preferred individual time Focus special central component human romantic love creatures obsessively follow humans stroke kiss nip nuzzle pat tap lick tug chase chosen behaviors regularly humans sing dance strut preen beloved just like men women ourting creatures great small excessive energy sleeplessness core traits human romantic passion adversity heightens pursuit just barriers intensify romantic love animals possessive jealously guarding mate breeding time passed animals express magnetism seconds infatuated hours days weeks just Darwin said fall love sight Violet female pug Elizabeth Marshall Thomas wrote Violet feelings pug Bingo saying moment set eyes adored Wanting near lavish affection followed went sound voice bark instant attraction happened Thomas Jefferson Historian Fawn Brodie wrote Jefferson told advance Maria osway irrelevant man fell love single afternoon Today evidence Darwin right colleagues madly love brain scanner fMRI mapped brain pathways generate feelings romance dramatic activity occurs reward wanting brain gives lovers focus energy ecstasy motivation seek life greatest prize mating partner brain active 
mammals express attraction did Darwin manage continuity man beast suspect biology played role new book maintain humanity evolved broad basic styles thinking behaving based brain chemistry Men women dubbed Explorers especially express dopamine systems predisposing risks seek novelty curious creative energetic impulsive flexible optimistic Builders express serotonin predisposing calm social networking cautious loyal managerial traditional Directors high testosterone type analytical strategic direct decisive tough minded competitive good understanding math machines spatial rule based systems Negotiators women men particularly expressive estrogen giving broad holistic contextual view exquisite imagination intuition verbal skills emotional expressivity compassion Darwin believe Negotiator man predisposed vast physical emotional connections living creatures Darwin wrote books scientific papers phenomena varied orchids barnacles earthworms grand synthesizing theories natural selection sexual selection explain evolution proliferation life earth Biologist Richard Dawkins called set principles important idea occur human mind someday scientists laymen alike come understand Darwin mind including told years ago animals share human drive love

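The task is to go through the data line by line, replace the spaces inside each <></> tag pair with underscores (_), and convert the form of the data. For example, <X>A B C</X> must become A_B_C/X.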

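Regular-expression quantifiers are greedy by default, that is, they match as far to the right as possible, which makes this awkward. Adding ? to make a quantifier non-greedy is not enough on its own either: it only makes the match stop at the first closing tag it meets, and does not guarantee that this closing tag belongs to the opening tag. So the tag name matched earlier needs to be given a name inside the pattern, so that the closing tag can refer back to it.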

In Python, the final regular expression looks like this:

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
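To make the role of the named group concrete, here is a minimal sketch that compares a greedy pattern, a non-greedy pattern without a backreference, and the pattern above. The sample string (including its nested <B> tag) and the names `greedy`, `lazy`, and `sample` are assumptions made up for illustration; they do not come from the original data.

```python
import re

# Hypothetical sample line; the nested <B> tag is there to trip up the naive patterns.
sample = 'by <PERSON>Tom <B>Hooper</B> Jr</PERSON> in <LOCATION>LOS ANGELES</LOCATION>'

# Greedy: .+ runs on to the last closing tag in the line.
greedy = re.compile(r'<\S+>.+</\S+>')
# Non-greedy, but the closing tag is not tied to the opening tag name.
lazy = re.compile(r'<\S+>.+?</\S+>')
# The pattern from the article: (?P=label) must repeat the text captured as "label".
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')

greedy_hits = greedy.findall(sample)   # a single match that swallows both entities
lazy_hits = lazy.findall(sample)       # first match stops early at </B>
label_hits = [m.group(0) for m in LABEL_PATTERN.finditer(sample)]

for name, hits in (('greedy', greedy_hits), ('lazy', lazy_hits), ('label', label_hits)):
    print(name, hits)
```

On this input only the third pattern returns the two complete entities, because the closing tag has to carry the same name as the opening tag.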

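Next, we need to find the matching substrings in the original string and record them. Then we precompute what each one should be replaced with, which only takes one more regular expression: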

LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')
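As a rough sketch of how this second pattern is used: because it has two capturing groups, `findall` returns one (tag name, content) tuple per match, and the target form can be built directly from those tuples. The sample line and the names `sample`, `parts`, and `converted` are assumptions for illustration.

```python
import re

LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

# Hypothetical sample line for illustration.
sample = 'filmed in <LOCATION>LOS ANGELES</LOCATION> by <PERSON>Tom Hooper</PERSON>'

# With two groups, findall yields (label, content) tuples.
parts = LABEL_CONTENT_PATTERN.findall(sample)

# Target form: spaces become underscores, followed by "/" and the tag name.
converted = [content.replace(' ', '_') + '/' + label for label, content in parts]

print(parts)
print(converted)
```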

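Map over the whole collection of lines, run both patterns on each line, and zip the two result lists together. This yields tuples of (original tagged substring, replacement string), roughly like this: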

('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')

The processing code is as follows:

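import os

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print('path: \'' + path + '\' not found.')
        return []
    # The with block closes the file, so no explicit try/finally is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Whole tagged substrings; findall yields (whole match, label) tuples here.
    originals = [m[0] for m in LABEL_PATTERN.findall(each)]
    # Replacement strings built from (label, content) tuples, e.g. 'A_B_C/X'.
    replacements = [content.replace(' ', '_') + '/' + label
                    for label, content in LABEL_CONTENT_PATTERN.findall(each)]
    return list(zip(originals, replacements))

src = read_file(FILE_PATH)
# One list of (original, replacement) pairs per input line.
pattern = [get_label(line) for line in src]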

After that, a simple pass over the lines finishes the job:

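for i in range(len(src)):
    for original, replacement in pattern[i]:
        # Literal replacement: the matched text may contain regex metacharacters,
        # so it should not be passed to re.sub as a pattern.
        src[i] = src[i].replace(original, replacement)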
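As an alternative to the two-step approach above (not the method used in this article), the same conversion can be sketched in a single pass by giving `re.sub` a replacement function. The names `convert_line` and `repl` and the sample sentence are hypothetical, introduced only for this example.

```python
import re

LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def convert_line(line):
    """Rewrite every <X>A B C</X> in line as A_B_C/X."""
    def repl(match):
        # group(2) is the text between the tags; group('label') is the tag name.
        return match.group(2).replace(' ', '_') + '/' + match.group('label')
    return LABEL_CONTENT_PATTERN.sub(repl, line)

# Hypothetical sample sentence for illustration.
result = convert_line('Dec <LOCATION>LOS ANGELES</LOCATION> and <PERSON>Firth</PERSON> won')
print(result)
```

Because the replacement is computed from the match object itself, there is no need to pair up two separate `findall` results or to search for the original substring a second time.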

Full code:

# -*- coding: utf-8 -*-
import os
import re

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
# Whole tagged substring; (?P=label) ties the closing tag to the opening tag name.
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
# Tag name and inner text as separate groups.
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def read_file(path):
    # Return the file's lines, or an empty list if the path does not exist.
    if not os.path.exists(path):
        print('path: \'' + path + '\' not found.')
        return []
    # The with block closes the file, so no explicit try/finally is needed.
    with open(path, 'r') as fp:
        content = fp.read()
    return content.split('\n')

def get_label(each):
    # Whole tagged substrings; findall yields (whole match, label) tuples here.
    originals = [m[0] for m in LABEL_PATTERN.findall(each)]
    # Replacement strings built from (label, content) tuples, e.g. 'A_B_C/X'.
    replacements = [content.replace(' ', '_') + '/' + label
                    for label, content in LABEL_CONTENT_PATTERN.findall(each)]
    return list(zip(originals, replacements))

src = read_file(FILE_PATH)
# One list of (original, replacement) pairs per input line.
pattern = [get_label(line) for line in src]

for i in range(len(src)):
    for original, replacement in pattern[i]:
        # Literal replacement, so regex metacharacters in the text are harmless.
        src[i] = src[i].replace(original, replacement)

Time: 2024-10-16 09:54:14
