Python Basics - Regular Expressions

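A regular expression describes a rule for matching strings. The idea is independent of any particular programming language, and it is very powerful. In Python, the built-in re module provides strong support for regular expressions.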

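Basic regular expression syntax

.  any single character except \n
*  zero or more repetitions of the preceding subpattern
+  one or more repetitions of the preceding subpattern
-  inside [ ], denotes a range, e.g. [0-9]
|  alternation: matches the pattern before it or the pattern after it
^  matches at the start of the string (the pattern that follows must begin there)
$  matches at the end of the string (the pattern that precedes it must end there)
?  zero or one occurrence of the preceding subpattern; placed after a quantifier, it makes that quantifier non-greedy
\  escapes the following character
\num  backreference to subpattern number num (named subpatterns are referenced with (?P=name))
\f  form feed
\n  newline
\r  carriage return (back to the start of the line)
\b  word boundary (start or end of a word)
\B  not a word boundary (the opposite of \b)
\d  a digit, [0-9]
\D  a non-digit, [^0-9]
\s  a whitespace character: space, \t, \n, \r, \f, \v
\S  a non-whitespace character
\w  a letter, digit, or underscore; for str patterns in Python 3 this also covers Unicode word characters such as Chinese
\W  any character not matched by \w
()  treats the enclosed content as a single unit (a subpattern)
{m}  exactly m repetitions of the preceding subpattern
{m,n}  m to n repetitions of the preceding subpattern (closed range)
[ ]  any single character listed inside
[^x]  any single character not listed inside
[a-zA-Z0-9_]  any letter, digit, or underscore
[^a-z]  any single character other than a lowercase letter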
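A few of the metacharacters above can be tried out directly. This is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '.' matches any single character except a newline
dot_hits = re.findall(r'c.t', 'cat cot c\nt cut')

# '*' allows zero repetitions, '+' requires at least one
star_hit = re.fullmatch(r'ab*', 'a') is not None
plus_hit = re.fullmatch(r'ab+', 'a') is not None

# {m,n} is a closed range: whole numbers of two to three digits here
digit_runs = re.findall(r'\b\d{2,3}\b', '7 42 123 2024')

# [^...] negates a character class: neither a vowel nor whitespace
non_vowels = re.findall(r'[^aeiou\s]', 'hi you')

print(dot_hits)      # ['cat', 'cot', 'cut']
print(star_hit)      # True
print(plus_hit)      # False
print(digit_runs)    # ['42', '123']
print(non_vowels)    # ['h', 'y']
```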

Extended syntax

( ) denotes a subpattern; its content is treated as a single unit

import re
re.match(r'(cj){3}', 'cjcjcjcjcjcjxxx').group()  # always use a raw string r''

'cjcjcj'

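(?P<groupname>...)  gives the subpattern a name
(?#...)  a comment
(?:...)  groups without capturing the matched text
(?<=...)  positive lookbehind: matches only if preceded by ..., which is not part of the result
(?=...)  positive lookahead: matches only if followed by ..., which is not part of the result
(?<!...)  negative lookbehind: matches only if not preceded by ...
(?!...)  negative lookahead: matches only if not followed by ...
These are used less often and can be skipped for now.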
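For readers who do want to see the extended syntax in action, here is a rough sketch. The date, price, and pixel strings are invented sample data, not taken from the article.

```python
import re

# (?P<name>...) names a group; (?:...) groups without capturing
m = re.match(r'(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})', '2020-02-20')
named = m.groupdict()      # the month is matched but not captured

# (?<=...) and (?=...) check the context without including it in the match
price = re.findall(r'(?<=\$)\d+', 'cost: $45, qty: 12')
before_px = re.findall(r'\d+(?=px)', '12px 30em 8px')

# (?<!...) and (?!...) are the negative forms
not_after_dollar = re.findall(r'(?<![\$\d])\d+', 'cost: $45, qty: 12')
not_before_px = re.findall(r'\d+(?!\d|px)', '12px 30em 8px')

print(named)              # {'year': '2020', 'day': '20'}
print(price)              # ['45']
print(before_px)          # ['12', '8']
print(not_after_dollar)   # ['12']
print(not_before_px)      # ['30']
```

The extra \d inside the negative assertions stops the engine from backtracking into the middle of a number and reporting a partial match.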

A collection of regular expressions

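abcde  matches abcde
[cj]python  matches cpython and jpython
[a-zA-Z0-9_]  an uppercase or lowercase letter, a digit, or an underscore
python|java  matches python or java
[^abc]  any single character other than a, b, and c
r'(http://)?(www\.)?python\.org'  matches only python.org, www.python.org, http://python.org, and http://www.python.org
^(http)  starts with http
(\.com)$  ends with .com
(pattern)*  zero or more times
(pattern)+  one or more times
(pattern)?  zero or one time
(pattern){m}  exactly m times
(pattern){m,n}  m to n times (closed range)
(a|b)*c  zero or more a's or b's, immediately followed by the letter c
ab{1,}  equivalent to ab+
^[a-zA-Z]{1}([a-zA-Z0-9._]){4,19}$  a string of 5 to 20 characters that starts with a letter, followed by letters, digits, underscores, or dots
^(\w){6,20}$  a string of 6 to 20 characters made of digits, letters, Chinese characters, and underscores
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}  checks whether a string has the shape of an IP address (it does not check that each number is at most 255)
(13[4-9]\d{8})|(15[01289]\d{8})  China Mobile phone numbers
\w+@\w+(\.\w+)+  a valid email address, e.g. [email protected]
(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[,.]).{8,}  strong-password check: must contain lowercase and uppercase letters, a digit, and a special character, and be at least 8 characters long
(?!.*['";=%?]).+  the match fails if the string contains any of ' " ; = % ?
(.)\1+  any character followed by one or more repeats of itself
Taking a breather here...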

re.match('\w+@(\w+\.)+\w+ ', '[email protected]').group()  # fails: the pattern ends with a space that the string does not contain

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-14-4b9af62021f0> in <module>
----> 1 re.match('\w+@(\w+\.)+\w+ ', '[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

re.match('\w+@\w+(\.\w+)+', '[email protected]').group()

'[email protected]'
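Some of the patterns in the collection can be wrapped into small checks. The sketch below is one possible way to do it, not the article's own code: the helper names (is_ipv4, is_strong) and the test strings are hypothetical, the password pattern assumes the special characters are just `,` and `.`, and the 0 to 255 range check is an addition that a regex shape check alone does not give.

```python
import re

# Assumed names; the patterns follow the collection, with literal dots escaped.
IP_SHAPE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
STRONG_PWD = re.compile(r'(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[,.]).{8,}')
REPEATED = re.compile(r'(.)\1+')


def is_ipv4(text):
    """Check the shape with a regex, then the numeric range of each part."""
    if not IP_SHAPE.fullmatch(text):
        return False
    return all(0 <= int(part) <= 255 for part in text.split('.'))


def is_strong(password):
    """True if the whole password satisfies every lookahead and the length."""
    return STRONG_PWD.fullmatch(password) is not None


print(is_ipv4('192.168.1.1'), is_ipv4('999.1.1.1'), is_ipv4('1-2-3-4'))
# True False False
print(is_strong('Abcdef1.'), is_strong('abcdef12'))
# True False
print([m.group() for m in REPEATED.finditer('aabcccd')])
# ['aa', 'ccc']
```

Using fullmatch() instead of match() means the whole string has to fit the pattern, so no explicit ^ and $ anchors are needed.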

The re module

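match(pattern, string)  matches from the start of the string; returns a match object or None, and group() must be called to see the matched text
compile(pattern[, flags])  creates a pattern object
search(pattern, string)  scans from left to right and returns a match object for the first match, otherwise None
findall(pattern, string)  returns all matches as a list
sub(pat, repl, string[, count=0])  replaces the substrings of string matched by pat with repl (a string or a function); the default count=0 replaces all occurrences
split(pattern, string)  splits string at every match of pattern and returns a list
finditer(pattern, string)  returns an iterator that yields a match object for every match
purge()  clears the regular expression cache
escape(string)  escapes the special regex characters in string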
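As a quick illustration of compile(), search(), finditer(), and escape() from the list above, here is a minimal sketch. The variable names are arbitrary and the sample text is only for demonstration.

```python
import re

text = 'alpha.beta....gama delta'

# compile() builds a reusable pattern object
word = re.compile(r'[a-zA-Z]+')

# search() returns a match object for the first match only
first = word.search(text).group()

# finditer() yields one match object per match, with position information
spans = [(m.group(), m.start(), m.end()) for m in word.finditer(text)]

# escape() turns '.' into a literal dot, so it can be used safely in a pattern
literal_dot = re.escape('.')
parts = re.split(literal_dot, 'a.b.c')

print(first)        # alpha
print(spans)        # [('alpha', 0, 5), ('beta', 6, 10), ('gama', 14, 18), ('delta', 19, 24)]
print(literal_dot)  # \.
print(parts)        # ['a', 'b', 'c']
```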

import re
re.findall('\d', 'sfs8sdfjsd8sdfjsd90dfs8')

['8', '8', '9', '0', '8']

text = 'alpha.beta....gama delta'

re.split(r'[\.]+', text)  # split the string on the pattern (one or more dots)

['alpha', 'beta', 'gama delta']

re.split(r'[\.]+', text, maxsplit=1)  # split at most once

['alpha', 'beta....gama delta']

re.findall('[a-zA-Z]+', text)  # find all words

['alpha', 'beta', 'gama', 'delta']

re.sub('{name}', 'chenjie', 'Dear {name}')  # replace the matched text in the string with chenjie

'Dear chenjie'

re.sub('a|s|d', 'good', 'as')

'goodgood'

s = 'it is a good good good idea idea'
re.sub(r'\b(\w+)( \1\b)+', r'\1', s)  # collapse consecutive repeated words into one

'it is a good idea'

re.sub('a', lambda x: x.group(0).upper(), 'aaa, aab, abcdas')   # turn every lowercase a into uppercase A

'AAA, AAb, AbcdAs'

re.sub('[a-zA-Z]', lambda x: chr(ord(x.group(0))^32), 'aaa Bbc agbDs')  # swap the case of English letters

'AAA bBC AGBdS'

re.subn('a', 'chenjie', 'aasksdf afk jfej ak fsd ')   # returns the new string and the number of replacements

('chenjiechenjiesksdf chenjiefk jfej chenjiek fsd ', 4)

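re.escape('http://www.python.org')  # escape a string (the output below is from Python 3.6 or earlier; since Python 3.7 only regex special characters are escaped, giving 'http://www\\.python\\.org')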

'http\\:\\/\\/www\\.python\\.org'

re.match('yes|no', 'yesnofsfsdfdsf')  # match from the start of the string; returns a match object on success

<_sre.SRE_Match object; span=(0, 3), match='yes'>

re.match('\w+@\w+(\.\w+)+','[email protected]').group()  # match() succeeds and group() returns the matched text

'[email protected]'

1. Remove extra spaces: collapse each run of spaces into one, and strip all whitespace from both ends of the string

s = 'aaa     bbb,    cd, fff   fs    ,  '

' '.join(s.split())   # split on whitespace, then join; clearly not enough, since the text has commas as well as spaces

'aaa bbb, cd, fff fs ,'

''.join(re.split(r',|\s+', s.strip()))  # re.split(r',|\s+', s.strip()) splits on a comma or a run of whitespace; joining with '' glues the pieces together

'aaabbbcdffffs'

re.sub(r',|\s+', ' ', s.strip())    # replace each comma or run of whitespace with a single space directly

'aaa bbb  cd  fff fs  '
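The sub() result still contains double spaces wherever a comma sits next to whitespace. One possible refinement, sketched here under the assumption that commas should either be dropped or kept tight against the preceding word, is to treat commas and whitespace as a single separator class:

```python
import re

s = 'aaa     bbb,    cd, fff   fs    ,  '

# Treat any run of commas and whitespace as one separator, then trim the ends.
cleaned = re.sub(r'[,\s]+', ' ', s).strip()

# Alternative: keep the commas but normalise the spacing around them.
# Step 1 collapses whitespace runs, step 2 drops whitespace before a comma.
normalised = re.sub(r'\s+,', ',', re.sub(r'\s+', ' ', s.strip()))

print(cleaned)      # aaa bbb cd fff fs
print(normalised)   # aaa bbb, cd, fff fs,
```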

2. Remove specified content from a string

email = '[email protected]'  # goal: remove _marketing

m = re.search('_marketing', email)  # locate _marketing
email[:m.start()] + email[m.end():]   # the old way: string slicing and concatenation

'[email protected]'

re.sub('_marketing', '', email)  # use sub() directly: every _marketing found is replaced with an empty string

'[email protected]'

re.sub('a', 'b', 'aa,aa,gdsdaa')

'bb,bb,gdsdbb'

Summary: sub() is extremely powerful; it works just like find-and-replace in Word.
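sub() gets even more useful once the replacement refers back to captured groups. The following is a small sketch with invented sample strings; \1 and \2 refer to groups by number and \g<name> by name.

```python
import re

# Swap "last, first" into "first last" using numbered backreferences.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'chen, jie')

# Reorder a date using named groups in the replacement string.
iso = re.sub(r'(?P<d>\d{2})/(?P<m>\d{2})/(?P<y>\d{4})',
             r'\g<y>-\g<m>-\g<d>', 'released 20/02/2020')

# count limits the number of replacements (0, the default, means all).
once = re.sub('a', 'b', 'aa,aa,gdsdaa', count=1)

print(swapped)   # jie chen
print(iso)       # released 2020-02-20
print(once)      # ba,aa,gdsdaa
```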

3. Searching for specific text

text = 'Beautiful is better than ugly.'

re.findall('\\bb.*?\\b', text)  # whole words starting with the letter b: wrap the pattern in \b...\b; without a raw string each backslash must be doubled, \\b...\\b

['better']

re.findall(r'\bb.*?\b', text)  # from now on always add r; wrapped in \b...\b, non-greedy here

['better']

re.findall(r'\bb.*\b', text)  # wrapped in \b...\b, greedy mode

['better than ugly']

re.findall(r'\Bh.+?\b', text)  # from an h that is not at the start of a word up to the end of that word

['han']

import re
re.findall(r'\b\w+?\b', text)  # all words

['Beautiful', 'is', 'better', 'than', 'ugly']

re.findall(r'\w+', text)

['Beautiful', 'is', 'better', 'than', 'ugly']

re.match('^B.*l$', text)  # ^ and $ match the start and end of the whole string, not of the words inside it; returns None here because text ends with '.'

re.split(r'\s', text)  # split on any whitespace character

['Beautiful', 'is', 'better', 'than', 'ugly.']

re.findall(r'\d+\.\d+\.\d+', 'Python 3.6.1')  # numbers of the form x.x.x

['3.6.1']
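The searches above are case-sensitive, which is why 'Beautiful' never shows up among the words starting with b. A sketch of how the optional flags argument changes that (re.IGNORECASE and re.MULTILINE are standard re flags; the multi-line sample string is made up):

```python
import re

text = 'Beautiful is better than ugly.'

# re.IGNORECASE lets 'b' match both 'B' and 'b'
b_words = re.findall(r'\bb\w*', text, flags=re.IGNORECASE)

# The same flag can be passed to compile(); re.I is the short alias
starts_with_b = re.compile(r'\bb\w*', re.I)
b_words_compiled = starts_with_b.findall(text)

# re.MULTILINE makes ^ and $ match at the start and end of every line
lines = 'Beautiful\nbetter\nugly'
line_starts = re.findall(r'^b\w*', lines, flags=re.I | re.M)

print(b_words)           # ['Beautiful', 'better']
print(b_words_compiled)  # ['Beautiful', 'better']
print(line_starts)       # ['Beautiful', 'better']
```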

The match object

m = re.match(r'(\w+) (\w+)', 'chen jie will be a great man')  # the pattern also matches the space in between

m.group(0)  # the entire match

'chen jie'

m.group(1)  # the first subpattern

'chen'

m.group(1,2)

('chen', 'jie')
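Besides group(), a match object exposes a few more accessors. Here is a short sketch using the same sentence; the group names first and second are added purely for illustration.

```python
import re

m = re.match(r'(?P<first>\w+) (?P<second>\w+)', 'chen jie will be a great man')

whole = m.group(0)        # the entire match
groups = m.groups()       # all subpatterns as a tuple
named = m.groupdict()     # named subpatterns as a dict
span = m.span()           # (start, end) of the entire match
second_span = m.span(2)   # (start, end) of the second subpattern

print(whole)         # chen jie
print(groups)        # ('chen', 'jie')
print(named)         # {'first': 'chen', 'second': 'jie'}
print(span)          # (0, 8)
print(second_span)   # (5, 8)
```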

Examples

Extracting phone numbers from a string

import re 

tel_number = '''Suppose my Phone No. is 0606-1234666,
                Yours number is 010-123456,
                his number is 025-8799342.'''

pattern = re.compile(r'(\d{3,4})-(\d{7,8})')   # no space is allowed after the comma in {m,n}

result = pattern.search(tel_number)   # search() returns only the first match
if not result:
    print('no match')
else:
    print('=='*20)
    print('success', result.group())

========================================
success 0606-1234666
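search() stops at the first hit. To collect every phone number in the text, one option is finditer(), sketched below with the same sample text and pattern; the variable name found is arbitrary.

```python
import re

tel_number = '''Suppose my Phone No. is 0606-1234666,
                Yours number is 010-123456,
                his number is 025-8799342.'''

pattern = re.compile(r'(\d{3,4})-(\d{7,8})')

# Each match object carries the area code (group 1) and the number (group 2).
found = [(m.group(1), m.group(2)) for m in pattern.finditer(tel_number)]

for area, number in found:
    print(area, number)
# 0606 1234666
# 025 8799342
```

The middle entry, 010-123456, is skipped because its second part has only six digits and the pattern requires seven or eight.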

Batch-checking web page files for iframe (inline frame) elements with a regular expression

import os
import re 

def detect_iframe(file):

    content = []  # holds the lines of the web page
    with open(file, encoding='utf-8') as f:
        # read every line, strip surrounding whitespace, and append it to the list
        for line in f:
            content.append(line.strip())

    # join all the lines into a single string
    content = ''.join(content)
    # search with the regular expression
    result = re.findall(r'<iframe\s+src=.*?></iframe>', content)
    if result:
        return {file: result}
    return False

for file in (f for f in os.listdir('.') if f.endswith(('.html', '.htm'))):   # iterate over a generator of file names
    result = detect_iframe(file)
    if not result:
        continue
    # print the check results
    for k, v in result.items():
        print(k)
        for vv in v:
            print('\t', vv)
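The same pattern can be tried on an in-memory string without touching any files. This is only a sketch: the HTML snippet is hypothetical, and adding the re.I and re.S flags is an assumption about what a real page might need (uppercase tags, attributes spread over several lines).

```python
import re

# Hypothetical HTML snippet used only for illustration
html = '''<html><body>
<IFRAME src="a.html"
        width="0"></IFRAME>
<p>text</p>
<iframe src="b.html"></iframe>
</body></html>'''

# re.I ignores the case of the tag; re.S lets '.' cross line breaks inside a tag
iframe = re.compile(r'<iframe\s+src=.*?></iframe>', re.I | re.S)
hits = iframe.findall(html)

print(len(hits))   # 2
print(hits[1])     # <iframe src="b.html"></iframe>
```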


Original article: https://www.cnblogs.com/chenjieyouge/p/12337929.html

Time: 2025-01-10 04:08:55
