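Python Strings and Regular Expressions

Basic string methods were covered in chapters 01 and 03; this chapter focuses on regular expressions and more advanced string techniques.

1. Advanced Strings

1.1 String Immutability and Memory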

python

# 字符串是不可变对象，每次"修改"都会创建新对象
s = "hello"
print(id(s))    # 140234567890
s += " world"
print(id(s))    # 140234567950  （id 变了，是新对象）

# ⚠️ 循环拼接字符串效率低（每次产生新对象）
# ❌ 不推荐
result = ""
for i in range(5):
    result += str(i)
print(result)  # 01234

# ✅ 推荐：用 join
result = "".join(str(i) for i in range(5))
print(result)  # 01234

# ✅ 推荐：用列表收集再 join
parts = []
for i in range(5):
    parts.append(str(i))
result = "-".join(parts)
print(result)  # 0-1-2-3-4
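
When the pieces arrive one at a time and a list feels awkward, an in-memory text buffer is another way to avoid repeated `+=`. The following is a minimal sketch using `io.StringIO` from the standard library; it is an alternative to `join`, not a replacement for it.

```python
import io

# Accumulate pieces in an in-memory text buffer instead of using +=
buffer = io.StringIO()
for i in range(5):
    buffer.write(str(i))

result = buffer.getvalue()
print(result)  # 01234
```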

1.2 Strings and Bytes

python

# str：Unicode 字符序列（人类可读的文本）
# bytes：字节序列（计算机存储/传输的原始数据）

text = "你好Python"

# 编码：str → bytes
encoded = text.encode("utf-8")
print(encoded)       # b'\xe4\xbd\xa0\xe5\xa5\xbdPython'
print(type(encoded)) # <class 'bytes'>
print(len(text))     # 8   （8 个字符）
print(len(encoded))  # 12  （中文 3 字节 × 2 + 英文 6 字节）

# 解码：bytes → str
decoded = encoded.decode("utf-8")
print(decoded)  # 你好Python

# 不同编码的字节长度不同
print(len(text.encode("utf-8")))    # 12
print(len(text.encode("gbk")))      # 10  （GBK 中文 2 字节）
print(len(text.encode("utf-16")))   # 18  （含 2 字节 BOM）

# bytes 的操作与 str 类似
data = b"Hello, World!"
print(data.upper())        # b'HELLO, WORLD!'
print(data.split(b", "))   # [b'Hello', b'World!']
print(data.find(b"World")) # 7

# bytearray：可变的字节序列
ba = bytearray(b"hello")
ba[0] = ord("H")   # 修改单个字节
print(ba)           # bytearray(b'Hello')
ba.extend(b"!!!")
print(ba)           # bytearray(b'Hello!!!')
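
Decoding can fail when the bytes are truncated or come from a different encoding. The sketch below cuts a UTF-8 byte string in the middle of a character (the cut position is chosen purely for illustration) and shows how the `errors` argument of `bytes.decode` changes the outcome.

```python
text = "你好"
data = text.encode("utf-8")      # 6 bytes, 3 per character
truncated = data[:5]             # cuts the second character short

# The default, errors="strict", raises on malformed input
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = truncated.decode("utf-8", errors="ignore")    # drops the broken bytes
replaced = truncated.decode("utf-8", errors="replace")  # substitutes U+FFFD

print(strict_failed)         # True
print(ignored)               # 你
print("\ufffd" in replaced)  # True
```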

1.3 Advanced f-string Techniques

python

name = "Alice"
score = 95.678
items = [1, 2, 3]

# 表达式
print(f"长度：{len(items)}")          # 长度：3
print(f"大写：{name.upper()}")        # 大写：ALICE
print(f"判断：{'及格' if score >= 60 else '不及格'}")  # 判断：及格

# 数字格式化
print(f"保留2位：{score:.2f}")        # 保留2位：95.68
print(f"百分比：{0.856:.1%}")         # 百分比：85.6%
print(f"千分位：{1234567:,}")         # 千分位：1,234,567
print(f"科学计数：{123456:.2e}")      # 科学计数：1.23e+05
print(f"二进制：{255:b}")             # 二进制：11111111
print(f"十六进制：{255:#x}")          # 十六进制：0xff
print(f"八进制：{255:#o}")            # 八进制：0o377

# 对齐与填充
print(f"{'左对齐':<10}|")    # 左对齐       |
print(f"{'右对齐':>10}|")    #        右对齐|
print(f"{'居中':^10}|")      #    居中    |
print(f"{'填充':*^10}")      # ****填充****
print(f"{42:05d}")            # 00042

# 日期格式化
from datetime import datetime
now = datetime.now()
print(f"{now:%Y-%m-%d %H:%M:%S}")   # 2026-03-18 14:30:25
print(f"{now:%Y年%m月%d日}")         # 2026年03月18日

# 调试模式（Python 3.8+）：变量名=值
x = 42
y = "hello"
print(f"{x = }")       # x = 42
print(f"{y = !r}")     # y = 'hello'  （!r 调用 repr）
print(f"{x + 1 = }")   # x + 1 = 43

# 嵌套大括号
print(f"{{转义大括号}}")           # {转义大括号}
width = 10
print(f"{'test':>{width}}")       # 嵌套变量作为格式参数：      test

# 多行 f-string
user = {"name": "Bob", "age": 30}
msg = (
    f"用户：{user['name']}\n"
    f"年龄：{user['age']}\n"
    f"状态：{'成年' if user['age'] >= 18 else '未成年'}"
)
print(msg)
# 用户：Bob
# 年龄：30
# 状态：成年

1.4 String Templates (Template)

python

from string import Template

# 使用 $ 作为占位符（适合用户提供的模板，比 f-string/format 更安全）
t = Template("$name 你好，你的订单 $order_id 已发货")
result = t.substitute(name="Alice", order_id="A12345")
print(result)  # Alice 你好，你的订单 A12345 已发货

# safe_substitute：缺少变量时不报错
result = t.safe_substitute(name="Bob")
print(result)  # Bob 你好，你的订单 $order_id 已发货

# ${var} 语法用于变量名紧邻其他字符时
t = Template("文件名：${prefix}_data.csv")
print(t.substitute(prefix="2024"))  # 文件名：2024_data.csv
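
The safety claim is easiest to see when the template text itself comes from an untrusted user. The sketch below assumes a hypothetical `Account` class with a secret field (not part of the original examples): `str.format` follows attribute access written inside the placeholder, while `Template` only accepts plain identifiers.

```python
from string import Template

class Account:
    """Hypothetical object holding a field the user should never see."""
    def __init__(self, owner, token):
        self.owner = owner
        self.token = token

acct = Account("Alice", "s3cr3t")

# str.format evaluates attribute access supplied by whoever wrote the template
leaked = "Hello {acct.token}".format(acct=acct)
print(leaked)  # Hello s3cr3t

# Template placeholders must be plain identifiers, so "${acct.token}" is invalid:
# safe_substitute leaves it untouched, substitute raises ValueError
template = Template("Hello ${acct.token}")
kept = template.safe_substitute(acct=acct)
print(kept)    # Hello ${acct.token}

try:
    template.substitute(acct=acct)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```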

1.5 textwrap: Wrapping and Indenting Text

python

import textwrap

long_text = "Python 是一种广泛使用的高级编程语言，它的设计哲学强调代码的可读性和简洁性。Python 支持多种编程范式，包括面向对象、命令式、函数式和过程式编程。"

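# Wrap to a given width
# Note: textwrap counts characters (len), not display columns, and it prefers to
# break at whitespace. CJK text without spaces is simply cut at the width limit.
wrapped = textwrap.fill(long_text, width=30)
print(wrapped)
# Python 是一种广泛使用的高级编程语言，它的设计哲学强调
# 代码的可读性和简洁性。Python
# 支持多种编程范式，包括面向对象、命令式、函数式和过程式编程。

# Truncate long text
# shorten() drops whole whitespace-separated chunks, so text with few spaces
# loses almost everything
short = textwrap.shorten(long_text, width=30, placeholder="...")
print(short)  # Python...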

# 去除公共缩进（常用于多行字符串）
code = """
    def hello():
        print("hello")
        return True
"""
print(textwrap.dedent(code))
#
# def hello():
#     print("hello")
#     return True
#

# 添加缩进
text = "第一行\n第二行\n第三行"
print(textwrap.indent(text, "    "))
#     第一行
#     第二行
#     第三行

print(textwrap.indent(text, "> "))
# > 第一行
# > 第二行
# > 第三行

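2. Regular Expression Basics

2.1 What Is a Regular Expression

A regular expression (regex for short) is a small language for describing string patterns, used to search, match, and replace text.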

python

import re

# 最简单的正则：精确匹配字面字符串
text = "Hello, Python! Hello, World!"

result = re.findall("Hello", text)
print(result)  # ['Hello', 'Hello']

result = re.findall("Java", text)
print(result)  # []

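2.2 Metacharacter Quick Reference

Metacharacter	Meaning	Example	Matches
`.`	Any character (except newline)	`a.c`	abc, a1c, a c
`\d`	Digit (`[0-9]` in ASCII mode; any Unicode decimal digit by default)	`\d{3}`	123, 456
`\D`	Non-digit	`\D+`	abc, !@#
`\w`	Word character (`[a-zA-Z0-9_]` in ASCII mode; also Unicode letters such as Chinese by default)	`\w+`	hello, var_1
`\W`	Non-word character	`\W`	!, @, space
`\s`	Whitespace (`[ \t\n\r\f\v]` in ASCII mode; other Unicode whitespace too by default)	`\s+`	space, tab
`\S`	Non-whitespace character	`\S+`	hello
`^`	Start of string	`^Hello`	starts with Hello
`$`	End of string (or just before a trailing newline)	`world$`	ends with world
`\b`	Word boundary	`\bcat\b`	cat (does not match catch)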

python

import re

text = "电话 13812345678，邮编 100000，编号 AB-123"

# \d 匹配数字
print(re.findall(r"\d+", text))
# ['13812345678', '100000', '123']

# \w 匹配单词字符（含中文）
print(re.findall(r"\w+", text))
# ['电话', '13812345678', '邮编', '100000', '编号', 'AB', '123']

# \s 匹配空白
print(re.split(r"\s+", "hello   world   python"))
# ['hello', 'world', 'python']

# ^ 和 $
print(re.search(r"^电话", text))    # <re.Match object ...>（匹配到了）
print(re.search(r"^邮编", text))    # None（不在开头）

# \b 单词边界
text2 = "cat catch catfish scat"
print(re.findall(r"\bcat\b", text2))   # ['cat']  （只匹配独立的 cat）
print(re.findall(r"\bcat", text2))     # ['cat', 'cat', 'cat']（以 cat 开头的单词）
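
Because `\w` and `\d` are Unicode-aware for `str` patterns, their behaviour differs from the ASCII character classes they are often equated with. A short sketch with a made-up input string shows the difference the `re.ASCII` flag makes.

```python
import re

text = "电话 １２３ AB-456"   # contains the full-width digits １２３

# Default (Unicode) mode: \w covers Chinese characters, \d covers full-width digits
words_unicode = re.findall(r"\w+", text)
digits_unicode = re.findall(r"\d+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_] and \d to [0-9]
words_ascii = re.findall(r"\w+", text, re.ASCII)
digits_ascii = re.findall(r"\d+", text, re.ASCII)

print(words_unicode)   # ['电话', '１２３', 'AB', '456']
print(digits_unicode)  # ['１２３', '456']
print(words_ascii)     # ['AB', '456']
print(digits_ascii)    # ['456']
```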

2.3 Quantifiers

Quantifier	Meaning	Example	Matches
`*`	0 or more times	`ab*c`	ac, abc, abbc
`+`	1 or more times	`ab+c`	abc, abbc (does not match ac)
`?`	0 or 1 time	`ab?c`	ac, abc
`{n}`	Exactly n times	`\d{3}`	123
`{n,}`	At least n times	`\d{2,}`	12, 123, 1234
`{n,m}`	Between n and m times	`\d{2,4}`	12, 123, 1234

python

import re

# * 零次或多次
print(re.findall(r"go*gle", "ggle gogle google gooogle"))
# ['ggle', 'gogle', 'google', 'gooogle']

# + 一次或多次
print(re.findall(r"go+gle", "ggle gogle google gooogle"))
# ['gogle', 'google', 'gooogle']

# ? 零次或一次
print(re.findall(r"colou?r", "color colour"))
# ['color', 'colour']

# {n} 精确次数
print(re.findall(r"\d{3}", "1 12 123 1234"))
# ['123', '123']  （1234 中的前 3 位也被匹配）

# {n,m} 范围
print(re.findall(r"\d{2,4}", "1 12 123 1234 12345"))
# ['12', '123', '1234', '1234']

2.4 Greedy vs. Non-Greedy Matching

python

import re

text = "<h1>标题</h1><p>段落</p>"

# 贪婪模式（默认）：尽可能多地匹配
print(re.findall(r"<.+>", text))
# ['<h1>标题</h1><p>段落</p>']  （一次性匹配了全部）

# 非贪婪模式：在量词后加 ?，尽可能少地匹配
print(re.findall(r"<.+?>", text))
# ['<h1>', '</h1>', '<p>', '</p>']

# 各量词的非贪婪形式
# *?  +?  ??  {n,m}?

html = '<a href="url1">链接1</a><a href="url2">链接2</a>'
# 贪婪
print(re.findall(r'".*"', html))    # ['"url1">链接1</a><a href="url2"']
# 非贪婪
print(re.findall(r'".*?"', html))   # ['"url1"', '"url2"']
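
A non-greedy quantifier is not the only way to stop at the first closing delimiter. A negated character class states the intent directly ("anything that is not a quote") and is a common alternative; the sketch reuses the `html` string from the example above.

```python
import re

html = '<a href="url1">链接1</a><a href="url2">链接2</a>'

# "[^"]*" reads as: a quote, any run of non-quote characters, a quote
quoted = re.findall(r'"[^"]*"', html)
print(quoted)  # ['"url1"', '"url2"']
```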

2.5 Character Classes

python

import re

# [...] 匹配其中任意一个字符
print(re.findall(r"[aeiou]", "hello world"))
# ['e', 'o', 'o']

# [^...] 排除字符
print(re.findall(r"[^aeiou\s]", "hello world"))
# ['h', 'l', 'l', 'w', 'r', 'l', 'd']

# 范围
print(re.findall(r"[a-z]+", "Hello World 123"))   # ['ello', 'orld']
print(re.findall(r"[A-Za-z]+", "Hello World 123")) # ['Hello', 'World']
print(re.findall(r"[0-9]+", "Hello World 123"))    # ['123']

# 中文范围
print(re.findall(r"[\u4e00-\u9fff]+", "Hello 你好 World 世界"))
# ['你好', '世界']

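# Common shorthand classes (exact equivalents only in ASCII mode, i.e. with re.ASCII;
# by default \d, \w and \s also match Unicode digits, letters and whitespace)
# [0-9]        →  \d
# [^0-9]       →  \D
# [a-zA-Z0-9_] →  \w
# [^a-zA-Z0-9_]→  \W
# [ \t\n\r\f\v]→  \s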

3. Core Functions of the re Module

3.1 re.search(): Find the First Match

python

import re

text = "订单号：ORD-2024-001，金额：¥199.50"

# search 在整个字符串中搜索第一个匹配
match = re.search(r"ORD-\d{4}-\d{3}", text)
if match:
    print(match.group())   # ORD-2024-001   匹配到的文本
    print(match.start())   # 4              起始位置
    print(match.end())     # 16             结束位置
    print(match.span())    # (4, 16)        (起始, 结束)

# 未匹配返回 None
match = re.search(r"INV-\d+", text)
print(match)  # None

# ⚠️ 实战中始终检查是否匹配成功
if match := re.search(r"¥([\d.]+)", text):  # 海象运算符
    print(f"金额：{match.group(1)}")  # 金额：199.50

3.2 re.match(): Match at the Start

python

import re

# match 只在字符串开头匹配
text = "Python 3.12 发布了"

match = re.match(r"Python", text)
print(match.group())  # Python

match = re.match(r"\d+", text)  # 开头不是数字
print(match)  # None

# match vs search
text = "版本：Python 3.12"
print(re.match(r"Python", text))    # None  （开头不是 Python）
print(re.search(r"Python", text))   # <re.Match ...>  （找到了）

# match 适合验证字符串是否符合某种格式
def is_valid_phone(phone):
    return re.match(r"^1[3-9]\d{9}$", phone) is not None

print(is_valid_phone("13812345678"))  # True
print(is_valid_phone("12345678901"))  # False（第二位不合法）
print(is_valid_phone("1381234567"))   # False（位数不够）
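
One caveat when validating with `re.match` and `$`: `$` also matches just before a trailing newline, so an input ending in `"\n"` still passes. `re.fullmatch` avoids this. The sketch below uses a hypothetical helper name, `is_valid_phone_strict`, to keep it apart from the function above.

```python
import re

PHONE = r"1[3-9]\d{9}"

def is_valid_phone_strict(phone):
    # fullmatch requires the pattern to cover the whole string, so no ^ or $ is needed
    return re.fullmatch(PHONE, phone) is not None

with_newline = "13812345678\n"

# $ also matches just before a trailing newline, so match() accepts this input
loose = re.match(r"^1[3-9]\d{9}$", with_newline) is not None
strict = is_valid_phone_strict(with_newline)

print(loose)   # True
print(strict)  # False
print(is_valid_phone_strict("13812345678"))  # True
```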

3.3 re.findall(): Find All Matches

python

import re

text = "张三的电话是13800001111，李四的是13900002222，王五的是15000003333"

# 返回所有匹配的字符串列表
phones = re.findall(r"1[3-9]\d{9}", text)
print(phones)  # ['13800001111', '13900002222', '15000003333']

# If the pattern contains groups, findall returns the group contents
pairs = re.findall(r"(\w+)的电话是(\d+)", text)
print(pairs)
# [('张三', '13800001111')]  (only the first clause contains "的电话是"; the others say "的是")

# 单个分组返回字符串列表
names = re.findall(r"(\w+)的", text)
print(names)  # ['张三', '李四', '王五']

# 想返回完整匹配而非分组，使用非捕获分组 (?:...)
emails = "alice@gmail.com bob@163.com"
result = re.findall(r"\w+@(?:gmail|163)\.com", emails)
print(result)  # ['alice@gmail.com', 'bob@163.com']

3.4 re.finditer(): Iterate Over All Matches

python

import re

text = "2024-01-15 发布 v1.0，2024-06-20 发布 v2.0，2024-12-01 发布 v3.0"

# finditer 返回迭代器，每个元素是 Match 对象（比 findall 信息更丰富）
for match in re.finditer(r"\d{4}-\d{2}-\d{2}", text):
    print(f"日期：{match.group()}，位置：{match.span()}")
# 日期：2024-01-15，位置：(0, 10)
# 日期：2024-06-20，位置：(19, 29)
# 日期：2024-12-01，位置：(38, 48)

3.5 re.sub(): Substitution

python

import re

# 基本替换
text = "我的电话是13812345678，备用号13912345678"
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)
print(masked)  # 我的电话是138****5678，备用号139****5678

# \1 和 \2 引用分组

# 替换所有数字为 *
print(re.sub(r"\d", "*", "密码是abc123def456"))
# 密码是abc***def***

# 限制替换次数
print(re.sub(r"\d", "*", "abc123def456", count=3))
# abc***def456

# 使用函数替换（动态决定替换内容）
def double_match(match):
    num = int(match.group())
    return str(num * 2)

result = re.sub(r"\d+", double_match, "苹果3个，香蕉5个，橙子8个")
print(result)  # 苹果6个，香蕉10个，橙子16个

# 更复杂的函数替换：单词首字母大写
text = "hello world python programming"
result = re.sub(r"\b[a-z]", lambda m: m.group().upper(), text)
print(result)  # Hello World Python Programming

# subn —— 返回 (替换结果, 替换次数)
result, count = re.subn(r"\d+", "N", "a1b2c3d4")
print(result)  # aNbNcNdN
print(count)   # 4

3.6 re.split(): Splitting

python

import re

# 按正则模式分割
text = "苹果, 香蕉;  西瓜，葡萄; 橙子"
result = re.split(r"[,;，；]\s*", text)
print(result)  # ['苹果', '香蕉', '西瓜', '葡萄', '橙子']

# 按多个空白分割
print(re.split(r"\s+", "hello   world\tpython\n!"))
# ['hello', 'world', 'python', '!']

# 限制分割次数
print(re.split(r"\s+", "a b c d e", maxsplit=2))
# ['a', 'b', 'c d e']

# 保留分隔符（分组会被保留）
print(re.split(r"(\s+)", "hello world python"))
# ['hello', ' ', 'world', ' ', 'python']

# Split a camelCase name (splitting on a zero-width match requires Python 3.7+)
camel = "getUserNameFromDatabase"
words = re.split(r"(?=[A-Z])", camel)  # split before each uppercase letter
print(words)  # ['get', 'User', 'Name', 'From', 'Database']

snake = "_".join(w.lower() for w in words)
print(snake)  # get_user_name_from_database

3.7 re.compile(): Compiling Patterns

python

import re

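# A compiled pattern object can be reused: it skips the per-call cache lookup
# and keeps the pattern in one named place
phone_pattern = re.compile(r"1[3-9]\d{9}")
# re.ASCII keeps \w from matching Chinese characters such as "邮箱" directly before the address
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+", re.ASCII)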

texts = [
    "联系方式：13812345678",
    "邮箱：alice@example.com",
    "电话13900001111，邮箱bob@test.org",
]

for text in texts:
    phones = phone_pattern.findall(text)
    emails = email_pattern.findall(text)
    if phones:
        print(f"  电话：{phones}")
    if emails:
        print(f"  邮箱：{emails}")
#   电话：['13812345678']
#   邮箱：['alice@example.com']
#   电话：['13900001111']
#   邮箱：['bob@test.org']

# 编译对象拥有相同的方法
print(phone_pattern.search("call 13812345678").group())  # 13812345678
print(phone_pattern.findall("13800001111 和 13900002222"))  # ['13800001111', '13900002222']
print(phone_pattern.sub("***", "电话是13812345678"))  # 电话是***

# 查看正则的模式和标志
print(phone_pattern.pattern)  # 1[3-9]\d{9}
print(phone_pattern.flags)    # 32（对应的标志位）

4. Groups and References

4.1 Capturing Groups `()`

python

import re

# 用 () 创建分组，可以提取匹配的子串
text = "2024-06-15 发布了新版本"

match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if match:
    print(match.group())    # 2024-06-15   完整匹配
    print(match.group(0))   # 2024-06-15   同上
    print(match.group(1))   # 2024         第 1 组
    print(match.group(2))   # 06           第 2 组
    print(match.group(3))   # 15           第 3 组
    print(match.groups())   # ('2024', '06', '15')

# 一次获取多个分组
year, month, day = match.groups()
print(f"{year}年{month}月{day}日")  # 2024年06月15日

4.2 Named Groups `(?P<name>...)`

python

import re

text = "姓名：张三，年龄：25，城市：北京"

pattern = r"姓名：(?P<name>\w+)，年龄：(?P<age>\d+)，城市：(?P<city>\w+)"
match = re.search(pattern, text)

if match:
    print(match.group("name"))   # 张三
    print(match.group("age"))    # 25
    print(match.group("city"))   # 北京
    print(match.groupdict())     # {'name': '张三', 'age': '25', 'city': '北京'}

# 命名分组在 sub 中的引用
text = "2024-06-15"
result = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<y>年\g<m>月\g<d>日",
    text
)
print(result)  # 2024年06月15日

4.3 Non-Capturing Groups `(?:...)`

python

import re

# 普通分组会影响 findall 的返回值
text = "192.168.1.1 和 10.0.0.1"
# 有分组时 findall 只返回分组内容
print(re.findall(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", text))
# [('192', '168', '1', '1'), ('10', '0', '0', '1')]

# 使用非捕获分组 (?:...) —— 分组但不捕获
print(re.findall(r"(?:\d+\.){3}\d+", text))
# ['192.168.1.1', '10.0.0.1']  （返回完整匹配）

# 实用对比
# 匹配 http 或 https
urls = "http://a.com 和 https://b.com"
# 捕获分组
print(re.findall(r"(https?)://\S+", urls))  # ['http', 'https']（只返回分组）
# 非捕获分组
print(re.findall(r"(?:https?)://\S+", urls))  # ['http://a.com', 'https://b.com']

4.4 Backreferences

python

import re

# \1 引用第 1 个分组匹配到的内容（要求前后一致）

# 匹配重复的单词
text = "the the quick brown fox fox jumped"
print(re.findall(r"\b(\w+)\s+\1\b", text))
# ['the', 'fox']  （匹配到连续重复的单词）

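# Match paired quotes
# (triple double quotes let the string end with a single quote without a syntax error)
text = """他说"你好"，她说'再见'，这是"不匹配的'"""
print(re.findall(r'(["\'])(.+?)\1', text))
# [('"', '你好'), ("'", '再见')]
# \1 ensures the opening and closing quotes are the same kind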

# 匹配 HTML 标签对
html = "<h1>标题</h1><p>段落</p><h1>不匹配</h2>"
print(re.findall(r"<(\w+)>(.+?)</\1>", html))
# [('h1', '标题'), ('p', '段落')]
# 注意 <h1>不匹配</h2> 没有被匹配到，因为标签不一致
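
Numbered backreferences get hard to read once a pattern has several groups. The named form `(?P=name)` refers back to a named group instead. A minimal sketch of the quote-matching example rewritten that way (the group names `quote` and `body` are chosen here for illustration):

```python
import re

text = "他说\"你好\"，她说'再见'"

# (?P<quote>...) names the opening quote; (?P=quote) requires the same text again
pattern = re.compile(r"(?P<quote>[\"'])(?P<body>.+?)(?P=quote)")

bodies = [m.group("body") for m in pattern.finditer(text)]
print(bodies)  # ['你好', '再见']
```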

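5. Assertions (Zero-Width Matches)

An assertion matches a position rather than characters; it does not consume any of the string.

5.1 Lookahead and Lookbehind

Syntax	Name	Meaning
`(?=...)`	Positive lookahead	Followed by ...
`(?!...)`	Negative lookahead	Not followed by ...
`(?<=...)`	Positive lookbehind	Preceded by ...
`(?<!...)`	Negative lookbehind	Not preceded by ...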

python

import re

# ===== 正向前瞻 (?=...) =====
# 匹配后面跟着 "元" 的数字
text = "苹果5元，香蕉3个，西瓜8元"
print(re.findall(r"\d+(?=元)", text))
# ['5', '8']  （只匹配数字，不包含"元"）

# ===== 负向前瞻 (?!...) =====
# 匹配后面不跟 "元" 的数字
print(re.findall(r"\d+(?!元)", text))
# ['3']

# ===== 正向后顾 (?<=...) =====
# 匹配前面是 ¥ 的数字
text = "价格 ¥199 和 $299 以及 ¥59"
print(re.findall(r"(?<=¥)\d+", text))
# ['199', '59']

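# ===== Negative lookbehind (?<!...) =====
# Match numbers not preceded by ¥
# Excluding \d as well stops a match from starting in the middle of a number
# (a plain (?<!¥)\d+ would also return '99' from ¥199 and '9' from ¥59)
print(re.findall(r"(?<![¥\d])\d+", text))
# ['299']  (only the numeric part of $299)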

# 综合示例：密码强度检查
# 要求：至少8位，含大写、小写和数字
def check_password(pwd):
    checks = {
        "至少8位": r".{8,}",
        "含大写字母": r"(?=.*[A-Z])",
        "含小写字母": r"(?=.*[a-z])",
        "含数字": r"(?=.*\d)",
    }
    results = {}
    for name, pattern in checks.items():
        results[name] = bool(re.search(pattern, pwd))
    return results

print(check_password("Abc12345"))
# {'至少8位': True, '含大写字母': True, '含小写字母': True, '含数字': True}

print(check_password("abc123"))
# {'至少8位': False, '含大写字母': False, '含小写字母': True, '含数字': True}
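
When only a yes/no answer is needed, the separate checks can be folded into one pattern by chaining lookaheads. This is a sketch under the same rules as `check_password` above; `STRONG` and `is_strong` are names introduced here for illustration.

```python
import re

# Each lookahead checks one rule from the start of the string without consuming it;
# .{8,} then enforces the minimum length
STRONG = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}")

def is_strong(pwd):
    return STRONG.fullmatch(pwd) is not None

print(is_strong("Abc12345"))  # True
print(is_strong("abc12345"))  # False (no uppercase letter)
print(is_strong("Abc123"))    # False (too short)
```

The dictionary version remains the better choice when the user should be told which rule failed.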

5.2 Practical Assertion Examples

python

import re

# 千分位格式化：在合适的位置插入逗号
# 正向前瞻 + 正向后顾
def add_commas(n):
    return re.sub(r"(?<=\d)(?=(\d{3})+$)", ",", str(n))

print(add_commas(1234567890))  # 1,234,567,890
print(add_commas(1000))        # 1,000
print(add_commas(42))          # 42

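# Convert camelCase to snake_case
def camel_to_snake(name):
    # Insert an underscore before an uppercase letter that follows a lowercase letter
    # or digit, and before the last capital of an acronym that starts a new word
    # (HTMLParser → HTML_Parser); never at the start of the string
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    return s.lower()

print(camel_to_snake("getUserName"))       # get_user_name
print(camel_to_snake("HTMLParser"))        # html_parser
print(camel_to_snake("simpleXMLParser"))   # simple_xml_parser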

6. Flags

python

import re

# re.IGNORECASE (re.I) —— 忽略大小写
print(re.findall(r"python", "Python PYTHON python", re.I))
# ['Python', 'PYTHON', 'python']

# re.MULTILINE (re.M) —— 多行模式（^ 和 $ 匹配每行的开头和结尾）
text = """第一行 hello
第二行 world
第三行 python"""

print(re.findall(r"^\S+", text))         # ['第一行']（只匹配第一行开头）
print(re.findall(r"^\S+", text, re.M))   # ['第一行', '第二行', '第三行']

# re.DOTALL (re.S) —— 让 . 也匹配换行符
html = "<div>\n内容\n</div>"
print(re.findall(r"<div>(.+)</div>", html))         # []（. 默认不匹配换行）
print(re.findall(r"<div>(.+)</div>", html, re.S))   # ['\n内容\n']

# re.VERBOSE (re.X) —— 允许添加注释和空白，提高可读性
phone_pattern = re.compile(r"""
    ^1              # 以 1 开头
    [3-9]           # 第二位 3-9
    \d{9}           # 后面跟 9 位数字
    $               # 字符串结尾
""", re.VERBOSE)

print(phone_pattern.match("13812345678"))  # <re.Match ...>
print(phone_pattern.match("12345678901"))  # None

# 组合多个标志
pattern = re.compile(r"hello", re.I | re.M)

# 内联标志写法（在正则表达式内部）
print(re.findall(r"(?i)python", "Python PYTHON"))  # ['Python', 'PYTHON']

7. Common Regex Recipes

7.1 Data Extraction

python

import re

# ===== 提取邮箱 =====
text = "联系我：alice@gmail.com 或 bob.test@company.co.jp"
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails)  # ['alice@gmail.com', 'bob.test@company.co.jp']

# ===== 提取 URL =====
text = "访问 https://www.example.com/path?q=1 或 http://api.test.org"
urls = re.findall(r"https?://[\w./\-?=&]+", text)
print(urls)  # ['https://www.example.com/path?q=1', 'http://api.test.org']

# ===== 提取中文 =====
text = "Hello你好World世界123"
chinese = re.findall(r"[\u4e00-\u9fff]+", text)
print(chinese)  # ['你好', '世界']

# ===== 提取 HTML 标签内容 =====
html = '<p class="intro">段落内容</p><a href="url">链接文字</a>'
contents = re.findall(r">([^<]+)<", html)
print(contents)  # ['段落内容', '链接文字']

# 提取标签属性
attrs = re.findall(r'(\w+)="([^"]*)"', html)
print(attrs)  # [('class', 'intro'), ('href', 'url')]

# ===== 提取 key=value 格式 =====
config = "host=localhost port=8080 debug=true timeout=30"
pairs = re.findall(r"(\w+)=(\S+)", config)
config_dict = dict(pairs)
print(config_dict)
# {'host': 'localhost', 'port': '8080', 'debug': 'true', 'timeout': '30'}

7.2 Data Validation

python

import re

def validate(name, value, pattern):
    if re.match(pattern, value):
        print(f"  ✅ {name}：{value}")
    else:
        print(f"  ❌ {name}：{value}")

# 手机号
validate("手机号", "13812345678", r"^1[3-9]\d{9}$")      # ✅
validate("手机号", "12345678901", r"^1[3-9]\d{9}$")      # ❌

# 邮箱
email_re = r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"
validate("邮箱", "test@example.com", email_re)            # ✅
validate("邮箱", "invalid@", email_re)                    # ❌

# IP 地址（简单版）
ip_re = r"^(?:\d{1,3}\.){3}\d{1,3}$"
validate("IP", "192.168.1.1", ip_re)                      # ✅
validate("IP", "999.999.999.999", ip_re)                  # ✅（简单版不检查范围）

# IP 地址（精确版）
ip_exact = r"^(?:(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)$"
validate("IP精确", "192.168.1.1", ip_exact)               # ✅
validate("IP精确", "999.999.999.999", ip_exact)           # ❌

# 日期格式
validate("日期", "2024-06-15", r"^\d{4}-\d{2}-\d{2}$")   # ✅
validate("日期", "2024/06/15", r"^\d{4}-\d{2}-\d{2}$")   # ❌

# 身份证号（简化版）
id_re = r"^\d{17}[\dXx]$"
validate("身份证", "11010119900101001X", id_re)            # ✅
validate("身份证", "12345", id_re)                         # ❌
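
The date pattern above only checks the shape of the string, so a value like `2024-13-45` passes. One possible way to close that gap, sketched here with a hypothetical `is_valid_date` helper, is to let the regex check the format and `datetime.strptime` check that the date exists.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_valid_date(value):
    # Step 1: the regex checks the shape only
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    # Step 2: strptime checks that the date exists on the calendar
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_valid_date("2024-06-15"))  # True
print(is_valid_date("2024-13-45"))  # False (right shape, but month 13 does not exist)
print(is_valid_date("2024-02-30"))  # False
print(is_valid_date("2024/06/15"))  # False
```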

7.3 Text Cleaning

python

import re

# 去除 HTML 标签
html = "<p>Hello <b>World</b></p><br/><img src='test.png'/>"
clean = re.sub(r"<[^>]+>", "", html)
print(clean)  # Hello World

# 压缩多余空白
text = "hello   world  \t python  \n programming"
clean = re.sub(r"\s+", " ", text).strip()
print(clean)  # hello world python programming

# 去除非中文字符
text = "Hello你好！World世界！@#$123"
chinese_only = re.sub(r"[^\u4e00-\u9fff]", "", text)
print(chinese_only)  # 你好世界

# 敏感词过滤
sensitive_words = ["坏词1", "坏词2", "敏感词"]
pattern = re.compile("|".join(re.escape(w) for w in sensitive_words))
text = "这里有坏词1和敏感词需要过滤"
filtered = pattern.sub("***", text)
print(filtered)  # 这里有***和***需要过滤

# 规范化电话号码格式
phones = ["138-1234-5678", "138 1234 5678", "13812345678", "(138)12345678"]
for phone in phones:
    normalized = re.sub(r"[\s\-\(\)]", "", phone)
    print(f"  {phone:>20s} → {normalized}")
#          138-1234-5678 → 13812345678
#          138 1234 5678 → 13812345678
#            13812345678 → 13812345678
#          (138)12345678 → 13812345678

7.4 Log Parsing

python

import re

log_lines = [
    '2024-06-15 10:30:45 [INFO] User alice logged in from 192.168.1.100',
    '2024-06-15 10:31:02 [ERROR] Database connection failed: timeout after 30s',
    '2024-06-15 10:31:15 [WARNING] Memory usage at 85%',
    '2024-06-15 10:32:00 [INFO] Request processed in 0.045s',
]

# 解析日志格式
log_pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<level>\w+)\]\s+"
    r"(?P<message>.+)"
)

for line in log_lines:
    match = log_pattern.search(line)
    if match:
        d = match.groupdict()
        print(f"  [{d['level']:>7s}] {d['date']} {d['time']} | {d['message']}")
#   [   INFO] 2024-06-15 10:30:45 | User alice logged in from 192.168.1.100
#   [  ERROR] 2024-06-15 10:31:02 | Database connection failed: timeout after 30s
#   [WARNING] 2024-06-15 10:31:15 | Memory usage at 85%
#   [   INFO] 2024-06-15 10:32:00 | Request processed in 0.045s

# 统计各级别日志数量
from collections import Counter
levels = [log_pattern.search(l).group("level") for l in log_lines]
print(Counter(levels))  # Counter({'INFO': 2, 'ERROR': 1, 'WARNING': 1})
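
The counting line above calls `.group()` directly on the result of `search()`, which raises `AttributeError` as soon as one line does not match. A more defensive sketch, using made-up sample lines and a simplified level pattern, skips malformed lines instead:

```python
import re
from collections import Counter

LEVEL = re.compile(r"\[(?P<level>\w+)\]")

sample_lines = [
    "2024-06-15 10:30:45 [INFO] User alice logged in",
    "--- log rotated ---",                      # malformed line, no level
    "2024-06-15 10:31:02 [ERROR] Database connection failed",
]

# Keep only the lines where search() actually returned a Match
levels = [m.group("level") for line in sample_lines if (m := LEVEL.search(line))]
print(Counter(levels))  # Counter({'INFO': 1, 'ERROR': 1})
```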

8. Performance and Best Practices

8.1 Performance Optimization

python

import re

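# 1. Precompile patterns that are used many times
large_list = ["call 555-1234", "no number here", "fax 555-9876"]  # sample data for illustration

# ❌ Module-level functions look the pattern up in re's internal cache on every call
for text in large_list:
    re.search(r"\d{3}-\d{4}", text)

# ✅ Compile once, reuse the pattern object
pattern = re.compile(r"\d{3}-\d{4}")
for text in large_list:
    pattern.search(text)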

# 2. 能用字符串方法就不要用正则
text = "hello world"

# ❌ 杀鸡用牛刀
re.search(r"world", text)
re.sub(r"world", "python", text)

# ✅ 字符串方法更快更直观
"world" in text
text.replace("world", "python")

# 适合用字符串方法的场景：
# - 固定字符串的查找/替换 → str.find/replace
# - 前缀后缀判断 → str.startswith/endswith
# - 按固定分隔符分割 → str.split
# - 大小写转换 → str.upper/lower
# - 去除空白 → str.strip
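
How much precompiling helps depends on the workload, so it is worth measuring rather than assuming. The sketch below times both styles with `timeit` on synthetic data; the data and repetition counts are arbitrary, and the printed timings will vary by machine.

```python
import re
import timeit

texts = ["call 555-1234", "no number here"] * 1000  # synthetic sample data
compiled = re.compile(r"\d{3}-\d{4}")

def with_module_function():
    return sum(1 for t in texts if re.search(r"\d{3}-\d{4}", t))

def with_compiled_pattern():
    return sum(1 for t in texts if compiled.search(t))

# Both approaches find the same matches; only the per-call overhead differs
assert with_module_function() == with_compiled_pattern() == 1000

t_module = timeit.timeit(with_module_function, number=20)
t_compiled = timeit.timeit(with_compiled_pattern, number=20)
print(f"re.search:      {t_module:.4f}s")
print(f"pattern.search: {t_compiled:.4f}s")
```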

8.2 Common Mistakes

python

import re

# 1. 忘记使用原始字符串
# ❌ \b 被 Python 解释为退格符
# pattern = "\bword\b"

# ✅ 使用 r"" 原始字符串
pattern = r"\bword\b"

# 2. 特殊字符需要转义
# 这些字符在正则中有特殊含义：. * + ? ^ $ { } [ ] ( ) | \
text = "价格是 $9.99（含税）"

# ❌ . 匹配任意字符
print(re.findall(r"\$\d+.\d+", text))   # ['$9.99'] （凑巧对了，但 . 匹配了任意字符）

# ✅ 转义 .
print(re.findall(r"\$\d+\.\d+", text))  # ['$9.99']

# 使用 re.escape() 自动转义
user_input = "price is $9.99?"
escaped = re.escape(user_input)
print(escaped)  # price\ is\ \$9\.99\?

# 3. 贪婪匹配导致过度匹配（前面已讲过，用 *? +? 解决）
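
A typical place where `re.escape()` matters is searching for text typed by a user. In this sketch, `count_literal` is a hypothetical helper that treats its first argument as literal text rather than as a pattern.

```python
import re

def count_literal(needle, haystack):
    # re.escape turns every regex metacharacter in needle into a literal
    return len(re.findall(re.escape(needle), haystack))

text = "1+1=2, and again 1+1=2"

print(re.search("1+1=2", text))        # None: + is read as a quantifier
print(count_literal("1+1=2", text))    # 2
```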

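9. Summary

re Module Quick Reference

Function	Description	Return value
`re.search(p, s)`	Search for the first match	Match or None
`re.match(p, s)`	Match at the start	Match or None
`re.fullmatch(p, s)`	Match the entire string	Match or None
`re.findall(p, s)`	Find all matches	list
`re.finditer(p, s)`	Iterate over all matches	iterator of Match
`re.sub(p, repl, s)`	Replace	str
`re.subn(p, repl, s)`	Replace and count	(str, count)
`re.split(p, s)`	Split	list
`re.compile(p)`	Compile a pattern	Pattern object
`re.escape(s)`	Escape special characters	str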

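Regex Syntax Quick Reference

Character matching:
  .         any character      \d  digit          \w  word character
  \s        whitespace         \b  word boundary  \D \W \S  negations

Quantifiers:
  *         0+                 +   1+             ?   0 or 1
  {n}       exactly n          {n,m} n to m       *? +? ??  non-greedy

Anchors:
  ^         start              $   end

Character classes:
  [abc]     a, b, or c         [^abc]  exclusion  [a-z]  range

Groups:
  (...)     capturing          (?:...)  non-capturing   (?P<name>...)  named
  \1        backreference      (?P=name)  named backreference

Assertions:
  (?=...)   positive lookahead    (?!...)  negative lookahead
  (?<=...)  positive lookbehind   (?<!...)  negative lookbehind

Flags:
  re.I      ignore case           re.M  multiline
  re.S      dot matches newline   re.X  verbose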
