# Regular Expressions in Python

## 1. Regular Expression Syntax

import re


text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''


pattern = re.compile(r'abc')  # r'...' is a raw string: backslashes are passed to the regex engine unchanged
matches = pattern.finditer(text_to_search)


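`matches` is the iterator returned by `finditer()`; it yields one match object for each non-overlapping match. The results can be printed with the following loop: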

for match in matches:
    print(match)         # e.g. <re.Match object; span=(1, 4), match='abc'>
    print(match.span())  # (1, 4)


.       - Any Character Except New Line (\n)
\d      - Digit (0-9)
\D      - Not a Digit (0-9), e.g. pattern = re.compile(r'\D')
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character, e.g. pattern = re.compile(r'\W')
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)
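
As a quick illustration of the character classes above, the following minimal sketch runs three of them against a short string. The string `sample` is made up for this example and is not part of the original notes.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes above.
sample = "Room 42, floor_3"

digits = re.findall(r'\d', sample)     # each single digit
non_word = re.findall(r'\W', sample)   # characters that are not letters, digits, or _
words = re.findall(r'\w+', sample)     # runs of word characters

print(digits)
print(non_word)
print(words)
```

Note that the underscore counts as a word character, so `floor_3` stays in one piece.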


Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
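
The quantifiers can be tried on a few lines borrowed from `text_to_search`. This is only a sketch: the names `sample`, `phone`, `title`, and `runs` are introduced here for illustration.

```python
import re

# A few lines taken from text_to_search above.
sample = "321-555-4321\n123.555.1234\nMr. Schafer\nMr Smith\n"

# {3} and {4}: exactly three or four digits; the unescaped . matches any separator.
phone = re.compile(r'\d{3}.\d{3}.\d{4}')
phones = phone.findall(sample)

# ?: the escaped period may appear zero times or once.
title = re.compile(r'Mr\.?')
titles = title.findall(sample)

# {3,4}: three to four digits, taking as many as possible.
runs = re.findall(r'\d{3,4}', '12 123 1234 12345')

print(phones)
print(titles)
print(runs)
```

Quantifiers are greedy by default, which is why `12345` contributes `1234` and leaves the trailing `5` unmatched.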


[]      - Matches any single character listed in the brackets, e.g. pattern = re.compile(r'[abc]')
[^ ]    - Matches any single character NOT listed in the brackets
e.g. pattern = re.compile(r'[^b]at')  # three characters ending in "at" whose first character is not b
|       - Either Or
( )     - Group


pattern = re.compile(r'(Mr|Mrs|Ms)\.?\s[A-Z]\w*')
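
To see how character sets, alternation, and groups combine, here is a small sketch that applies the pattern above to the names from `text_to_search`. The variable names and the `'cat mat pat bat'` string are illustrative assumptions.

```python
import re

# The name lines from text_to_search above.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\n"

# (Mr|Mrs|Ms): one of three titles, captured as group 1
# \.?        : optional literal period
# \s[A-Z]\w* : whitespace, a capital letter, then any further word characters
pattern = re.compile(r'(Mr|Mrs|Ms)\.?\s[A-Z]\w*')

full_matches = [m.group(0) for m in pattern.finditer(names)]
titles = [m.group(1) for m in pattern.finditer(names)]

# Negated set: any character except b, followed by "at".
not_bat = re.findall(r'[^b]at', 'cat mat pat bat')

print(full_matches)
print(titles)
print(not_bat)
```

For `Mrs. Robinson` the engine first tries `Mr`, fails at the whitespace check, and backtracks to the `Mrs` alternative, so the order of alternatives does not cause a miss here.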


\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
e.g. pattern = re.compile(r'^start', re.I)  # re.I makes the match case-insensitive
$       - End of a String
e.g. pattern = re.compile(r'end$', re.I)  # re.I makes the match case-insensitive
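
The anchors do not consume characters; they only assert a position. A minimal sketch, assuming a made-up `sentence` and reusing the `Ha HaHa` line from `text_to_search`:

```python
import re

# Hypothetical sentence for the ^ and $ anchors.
sentence = 'Start a sentence and then bring it to an end'

starts = re.search(r'^start', sentence, re.I)   # matches 'Start' at index 0
ends = re.search(r'end$', sentence, re.I)       # matches 'end' at the very end
not_at_start = re.search(r'^a', sentence)       # None: the string does not begin with 'a'

# \b requires a word boundary before 'Ha'; \B requires that there is none.
at_boundary = re.findall(r'\bHa', 'Ha HaHa')
inside_word = re.findall(r'\BHa', 'Ha HaHa')

print(starts.group(0), ends.group(0), not_at_start)
print(len(at_boundary), len(inside_word))
```

In `Ha HaHa`, the first two occurrences of `Ha` begin a word, while the third follows a letter and is therefore only found by `\B`.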


## 2. Comparing Search Methods

### re.search() vs. re.match()

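Python offers two different primitive operations based on regular expressions: `re.match()` checks for a match only at the beginning of the string, while `re.search()` checks for a match anywhere in the string.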

pattern = re.compile(r'c')
text_to_search = 'abcdef'
re.match(pattern, text_to_search)    # returns None: 'c' is not at the start of the string
re.search(pattern, text_to_search)   # returns a match object for the 'c' at index 2
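
The difference can be checked directly. The sketch below reuses the pattern and string from above; the result variable names are introduced for illustration.

```python
import re

pattern = re.compile(r'c')
text = 'abcdef'

at_start = pattern.match(text)    # only tries position 0, where the character is 'a'
anywhere = pattern.search(text)   # scans forward until it finds 'c'

print(at_start)          # None
print(anywhere.span())   # (2, 3)

# match() does succeed when the pattern fits at the beginning of the string.
prefix = re.match(r'abc', text)
print(prefix.group(0))   # abc
```

Because both functions return `None` on failure, test the result before calling methods such as `.span()` or `.group()` on it.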


### re.findall(pattern, string, flags=0)

>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]


## 3. Examples

### 3.1 Matching Email Addresses

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''


pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
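
The same pattern can also be used to check individual strings. The sketch below assumes a hypothetical list `candidates` containing the three addresses above plus two malformed strings, and uses `fullmatch()` so that the whole string must fit the pattern. It illustrates the pattern from these notes and is not a complete email validator.

```python
import re

# local part @ domain . suffix, as in the pattern above
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Hypothetical inputs: the last two are deliberately malformed.
candidates = [
    'CoreyMSchafer@gmail.com',
    'corey.schafer@university.edu',
    'corey-321-schafer@my-work.net',
    'missing-at-sign.com',
    'two@@signs.org',
]

# fullmatch() succeeds only if the entire string matches the pattern.
valid = [c for c in candidates if pattern.fullmatch(c)]
print(valid)
```

The two malformed strings are rejected because neither the local-part set nor the domain set contains `@`, so exactly one `@` is required.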


### 3.2 Matching URLs

urls = '''
http://coreyms.com
https://www.nasa.gov
'''


pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

subbed_urls = pattern.sub(r'\2\3', urls)  # replace each URL with group 2 + group 3 (domain name and suffix)

print(subbed_urls)

# matches = pattern.finditer(urls)

# for match in matches:
#     # group(0) is the whole match; group(1) to group(3) are the parenthesized groups
#     print(match.group(3))  # prints the suffix captured by the third group, e.g. ".com"
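
To make the group numbering concrete, the following sketch lists the groups captured for each URL and then performs the substitution. The variable names `parts` and `subbed` are introduced for illustration.

```python
import re

urls = "http://coreyms.com\nhttps://www.nasa.gov\n"

# group 1: optional "www."   group 2: domain name   group 3: suffix such as ".com"
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# groups() returns every captured group; an optional group that did not
# participate in the match is reported as None.
parts = [m.groups() for m in pattern.finditer(urls)]
print(parts)

# \2\3 in the replacement refers back to groups 2 and 3 of each match.
subbed = pattern.sub(r'\2\3', urls)
print(subbed)
```

The first URL has no `www.` prefix, so its group 1 is `None`, while groups 2 and 3 are still available for the replacement.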


## 4. Summary

• Import the re module: import re
• Create a search pattern with re.compile()
• Run it with re.search() or re.finditer()
• Print the search results
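
The four steps can be put together as follows. This is a minimal sketch with a made-up input string; it also guards against the case where `search()` finds nothing.

```python
# Step 1: import the module.
import re

# Hypothetical input text for this sketch.
text = 'Call 321-555-4321 before noon.'

# Step 2: build the pattern (three digits, dash, three digits, dash, four digits).
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# Step 3: search() returns a match object, or None when nothing matches.
match = pattern.search(text)

# Step 4: report the result, handling the no-match case explicitly.
if match:
    found = match.group(0)
    print(found, match.span())
else:
    found = None
    print('no match')
```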

-- These are notes on the YouTube video "Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)".

