Python: matching multiple substrings in a string

Python

HUH函数 2021-10-26 18:45:01

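I am using Python and want to match a given string against multiple substrings. I have tried to solve this in two different ways. My first solution was to match the substrings against the string like this:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)

This results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"].

However, this is not the result I want. The substrings should match exactly, so "ring" should not match "string". That is why I tried to solve the problem with regular expressions, like this:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if regex.search(r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE) is not None])
print(temp)

This results in ["TEST", "MATCH", "MULTIPLE"], which is the correct result. However, this solution takes far too long to compute. I have to run this check on roughly 1 million strings, and the regex solution would take days to finish, compared with the 1.5 hours the first solution takes.

I would like to know whether there is a way to make the first solution work, or to make the second solution run faster. Thanks in advance.

Edit: value can also contain numbers, or a short phrase such as "test1 test2".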

3 Answers

慕森卡

1806 contributions · over 8 likes

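It is hard to suggest an optimal solution without seeing the actual data, but you can try the following:

Generate a single pattern that matches all the values. That way the string only has to be searched once (instead of once per value).
Skip escaping the values unless they contain special characters (such as '^' or '*').
Assign the result directly to temp, avoiding the unnecessary copy made by temp.extend().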

import regex

# 'str' is a built-in name, so use 'string' instead
string = 'This is a Test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']

pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))

# unique matches, lowercased
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))

# arrange the results as they appear in `values`
temp = [x.upper() for x in values if x.lower() in matches]
print(temp)  # ['TEST', 'MULTIPLE', 'MATCH']
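Because the question involves roughly a million strings, the combined pattern can also be compiled once and reused for every string. The following is only a minimal sketch: it assumes the standard-library `re` module in place of `regex`, and `build_matcher` and `find_values` are hypothetical helper names, not part of the original answer.

```python
import re

def build_matcher(values):
    """Compile one case-insensitive pattern matching any value as a whole word."""
    # Longer values first, so that phrases are tried before shorter words.
    ordered = sorted(values, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:{})\b".format(alternation), re.IGNORECASE)

def find_values(pattern, values, text):
    """Return the upper-cased values found in text, in the order of `values`."""
    found = {m.lower() for m in pattern.findall(text)}
    return [v.upper() for v in values if v.lower() in found]

values = ["test", "match", "multiple", "ring", "test1 test2"]
pattern = build_matcher(values)  # compiled once, reused for every string

strings = [
    "This is a test string from which I want to match multiple substrings",
    "Nothing relevant here, just a ring",
]
results = [find_values(pattern, values, s) for s in strings]
print(results)
```

Note that `\b` only behaves as a whole-word boundary when each value starts and ends with a word character (letter, digit, or underscore); values with leading or trailing punctuation would need a different delimiter.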

Answered 2021-10-26

吃鸡游戏

1829 contributions · over 7 likes

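Two possible optimizations come to mind:

Precompile the pattern with re.compile so that it is not recompiled on every call to match.
Instead of matching four separate regular expressions, create a single one that matches all the values.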

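import re

# 'str' is a built-in name, so the input is called 'string' here
string = "This is a test string from which I want to match test1 test2 multiple substrings"
values = ["test", "match", "multiple", "ring", "test1 test2"]

# Flags must be given to re.compile(); the second argument of a compiled
# pattern's findall() is a start position, not a flags value.
pattern = re.compile("|".join(r"\b" + re.escape(x) + r"\b" for x in values), re.IGNORECASE)

temp = [x.upper() for x in pattern.findall(string)]
print(temp)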

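Result:

['TEST', 'MATCH', 'TEST1 TEST2', 'MULTIPLE']

Potential downsides of this approach:

The output may be in a different order. Your original approach puts the results in the order they appear in values. This approach puts them in the order they appear in the input string.
The same value will appear multiple times in temp if it occurs multiple times in the input string. In your original approach, by contrast, each value appears in temp at most once.
search terminates as soon as it finds a match. findall always searches the entire string. If you expect most of your strings to match every word in value, and expect most matches to appear early in the string, then findall may be slower than search. On the other hand, if you expect search to return None often, then findall will likely be somewhat faster.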
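These three downsides can be softened together. The sketch below is one possible variation, not the answerer's code: it assumes the values are distinct when compared case-insensitively, collects matches into a set to drop repeats, stops scanning once every value has been seen, and reports the results in the order of `values`.

```python
import re

values = ["test", "match", "multiple"]
text = "Match the test twice: multiple test runs, then a long tail of other words"

pattern = re.compile(
    "|".join(r"\b" + re.escape(v) + r"\b" for v in values),
    re.IGNORECASE,
)

wanted = {v.lower() for v in values}
found = set()
for m in pattern.finditer(text):    # finditer yields matches lazily
    found.add(m.group().lower())    # the set absorbs repeated hits
    if found == wanted:
        break                       # everything seen; skip the rest of the text

# Iterating over `values` restores the original ordering.
temp = [v.upper() for v in values if v.lower() in found]
print(temp)
```

The early exit only helps when most strings contain every value; otherwise the loop scans the whole string just as findall does.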

Answered 2021-10-26

慕姐4208626

1852 contributions · over 7 likes

You can split str on whitespace and then compare its elements against the elements of value with ==.
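A minimal sketch of that idea, assuming single-word values and whitespace-separated text. A set lookup stands in for the pairwise == comparison, and multi-word phrases such as "test1 test2" would not be found this way.

```python
text = "This is a test string from which I want to match multiple substrings"
values = ["test", "match", "multiple", "ring"]

# Lower-case once and split on any whitespace. Set membership is an exact,
# whole-word comparison, so 'ring' does not match 'string'.
words = set(text.lower().split())
temp = [v.upper() for v in values if v.lower() in words]
print(temp)
```

Each string is split once and every lookup is a hash check, so no regular expression is involved.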

Edit:

So you said that some of the strings in values can have whitespace before or after them. You can solve that with this line:

values = [i.strip() for i in values]

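This removes all whitespace characters before and after the string (in your case, for every element).

You also mentioned that if you split str on spaces, some words are left with punctuation attached because of the split: 'Hi, how are you?' would result in ['Hi,', 'how', 'are', 'you?']. You can solve this by using the built-in string method startswith() to keep only the words that start with an element of values, like this: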

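# 'str' is a built-in name, so the split words are called 'words' here
words = ['Hi,', 'how', 'are', 'you?']
values = ['how', 'you', 'time', 'space']
new_str = []

for word in words:
    for j in values:
        if word.startswith(j):
            new_str.append(word)
            break  # keep each word at most once

# result -> ['how', 'you?']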

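You can then remove the punctuation from the resulting list with a regular expression, but now you have a much smaller list to iterate over. Once all the punctuation is removed, you can match whole strings as I suggested in my original answer.

I hope this is clearer now.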
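The clean-up step could also be done without a regular expression. A minimal sketch, assuming punctuation only needs to be removed from the ends of each word, using `str.strip` with `string.punctuation` (the variable names are illustrative):

```python
import string

words = ['Hi,', 'how', 'are', 'you?']
values = ['how', 'you', 'time', 'space']

# Strip leading/trailing punctuation, then compare whole words exactly.
cleaned = [w.strip(string.punctuation) for w in words]
wanted = {v.lower() for v in values}
matched = [w for w in cleaned if w.lower() in wanted]
print(matched)
```

Punctuation inside a word, such as the apostrophe in "don't", is left untouched by this approach.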

Answered 2021-10-26
