Check if multiple strings exist in another string
- 2024-11-25 08:49:00
- admin (original)
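Question:
How can I check whether any of the strings in an array exists in another string?
For example:
    a = ['a', 'b', 'c']
    s = "a123"
    if a in s:
        print("some of the strings found in s")
    else:
        print("no strings found in s")
What should I replace the `if a in s:` line with to get the appropriate result?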
Solution 1:
You can use `any`:
    a_string = "A string is more than its parts!"
    matches = ["more", "wholesome", "milk"]
    if any(x in a_string for x in matches):
        print("found a match")
Similarly, to check whether all the strings from the list are found, use `all` instead of `any`.
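As a minimal sketch of the difference between `any` and `all`, the snippet below reuses the strings from this answer. The helper names `contains_any` and `contains_all` are made up for illustration and are not part of the original answer.

```python
def contains_any(text, candidates):
    """Return True if at least one candidate is a substring of text."""
    return any(c in text for c in candidates)


def contains_all(text, candidates):
    """Return True only if every candidate is a substring of text."""
    return all(c in text for c in candidates)


a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

print(contains_any(a_string, matches))            # True: "more" is found
print(contains_all(a_string, matches))            # False: "wholesome" and "milk" are missing
print(contains_all(a_string, ["more", "parts"]))  # True
```

Both built-ins short-circuit, so the scan stops at the first decisive element. With an empty list of candidates, `any` returns False and `all` returns True, which may or may not be what a caller expects.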
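Solution 2:
`any()` is by far the best approach if all you want is `True` or `False`, but if you want to know specifically which string or strings match, there are a few options.
If you want the first match (with `False` as the default):
    match = next((x for x in a if x in a_string), False)
If you want all matches (including duplicates):
    matches = [x for x in a if x in a_string]
If you want all non-duplicate matches (disregarding order):
    matches = {x for x in a if x in a_string}
If you want all non-duplicate matches in the right order:
    matches = []
    for x in a:
        if x in a_string and x not in matches:
            matches.append(x)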
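The four variants above can be tried side by side. The following is only a sketch with made-up sample data; it uses `None` rather than `False` as the default for the first match, and `dict.fromkeys` (which keeps insertion order in Python 3.7+) as one possible alternative to the explicit loop for ordered, de-duplicated results.

```python
a = ["b", "a", "b", "z", "1"]  # sample candidates, including a duplicate
a_string = "a123b"

# First match, or None when nothing matches
first = next((x for x in a if x in a_string), None)

# All matches, duplicates kept
with_duplicates = [x for x in a if x in a_string]

# Unique matches, order not guaranteed
unordered_unique = {x for x in a if x in a_string}

# Unique matches in order of first appearance in `a`
ordered_unique = list(dict.fromkeys(x for x in a if x in a_string))

print(first)            # b
print(with_duplicates)  # ['b', 'a', 'b', '1']
print(ordered_unique)   # ['b', 'a', '1']
```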
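Solution 3:
You should be careful if the strings in `a` or `str` get long. The straightforward solutions take O(S*(A^2)), where S is the length of `str` and A is the sum of the lengths of all strings in `a`. For a faster solution, look at the Aho-Corasick algorithm for string matching, which runs in linear time O(S+A).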
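To make the idea concrete, here is a compact pure-Python sketch of an Aho-Corasick automaton. It is written for illustration only and is not taken from any particular library; the function names `build_automaton` and `find_keywords` are assumptions of this sketch.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix
    out = [[]]    # out[node] -> keywords that end at this node
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix node
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start_index, keyword) pairs for every occurrence in text."""
    goto, fail, out = build_automaton(keywords)
    node = 0
    found = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            found.append((i - len(word) + 1, word))
    return found


print(find_keywords("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The text is scanned once, regardless of how many keywords there are. For a plain yes/no answer, the scan could return as soon as the first non-empty output list is reached.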
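Solution 4:
Just to add some diversity with `regex`:
    import re
    a = ['a', 'b', 'c']
    s = "a123"  # named `str` in the original answer, which shadows the built-in
    if any(re.findall(r'a|b|c', s, re.IGNORECASE)):
        print('possible matches thanks to regex')
    else:
        print('no matches')
Or, if your list is too long: any(re.findall(r'|'.join(a), s, re.IGNORECASE))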
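One caveat with joining the list into a pattern is that any regex metacharacter in the substrings (such as `.` or `$`) changes the meaning of the pattern. A hedged sketch that escapes each substring first is shown below; the function name `contains_any_regex` and its `ignore_case` parameter are hypothetical.

```python
import re


def contains_any_regex(text, substrings, ignore_case=True):
    """Return True if any of the literal substrings occurs in text."""
    if not substrings:
        return False  # an empty pattern would match every text
    # re.escape makes every substring match literally
    pattern = "|".join(re.escape(s) for s in substrings)
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None


print(contains_any_regex("A123", ["a", "b", "c"]))           # True (case-insensitive)
print(contains_any_regex("price: 10", ["1.5", "$"]))         # False: "." and "$" are literal
print(contains_any_regex("A123", ["a"], ignore_case=False))  # False
```

`re.search` is used instead of `re.findall` because it stops at the first hit, which is all an existence check needs.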
Solution 5:
A surprisingly fast approach is to use `set`:
    a = ['a', 'b', 'c']
    a_string = "a123"
    if set(a) & set(a_string):
        print("some of the strings found in a_string")
    else:
        print("no strings found in a_string")
This works if `a` does not contain any multiple-character values (in which case use `any` as listed above). If so, it is simpler to specify `a` as a string: `a = 'abc'`.
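The reason for the single-character restriction is that `set(a_string)` is a set of individual characters, so a multi-character value can never be a member of it. A small sketch with made-up sample values:

```python
a_string = "a123"

single_chars = ["a", "b", "c"]
single_hit = bool(set(single_chars) & set(a_string))
print(single_hit)   # True: "a" is one of the characters

multi_chars = ["12", "xyz"]
multi_hit = bool(set(multi_chars) & set(a_string))
print(multi_hit)    # False, although "12" is a substring of a_string

substring_hit = any(x in a_string for x in multi_chars)
print(substring_hit)  # True: the substring test finds "12"

# isdisjoint accepts any iterable and avoids building the intersection
disjoint_hit = not set(single_chars).isdisjoint(a_string)
print(disjoint_hit)   # True
```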
Solution 6:
You need to iterate over the elements of a.
    a = ['a', 'b', 'c']
    a_string = "a123"
    found_a_string = False
    for item in a:
        if item in a_string:
            found_a_string = True
            break  # one match is enough
    if found_a_string:
        print("found a match")
    else:
        print("no match found")
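Solution 7:
jbernadas already mentioned the Aho-Corasick algorithm as a way to reduce complexity.
Here is one way to use it in Python:
Download aho_corasick.py from here
Place it in the same directory as your main Python file and name it aho_corasick.py
Try the algorithm with the following code:
    from aho_corasick import aho_corasick  # (string, keywords)
    print(aho_corasick(string, ["keyword1", "keyword2"]))
Note that the search is case-sensitive.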
Solution 8:
The regex module recommended in the Python docs supports this:
    import regex
    words = {'he', 'or', 'low'}
    p = regex.compile(r"\L<name>", name=words)
    m = p.findall('helloworld')
    print(m)
Output:
    ['he', 'low', 'or']
Some implementation details: link
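Solution 9:
I needed to do this in a performance-critical environment, so I benchmarked every possible variant I could find and think of with Python 3.11. Here are the results:
    import re

    words = ['test', 'èk', 'user_me', '<markup>', '[^1]']

    def find_words(words):
        # function 1: chained `or`
        for word in words:
            if "_" in word or "<" in word or ">" in word or "^" in word:
                pass

    def find_words_2(words):
        # function 2: basic nested loop
        for word in words:
            for elem in [">", "<", "_", "^"]:
                if elem in word:
                    pass

    def find_words_3(words):
        # function 3: re.search (the caret is escaped to match literally)
        for word in words:
            if re.search(r"_|<|>|\^", word):
                pass

    def find_words_4(words):
        # function 4: re.match, padded so it can match anywhere in the word
        for word in words:
            if re.match(r"\S*(_|<|>|\^)\S*", word):
                pass

    def find_words_5(words):
        # function 5: any() with a generator expression
        for word in words:
            if any(elem in word for elem in [">", "<", "_", "^"]):
                pass

    def find_words_6(words):
        # function 6: any() with map()
        for word in words:
            if any(map(word.__contains__, [">", "<", "_", "^"])):
                pass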
> %timeit find_words(words)
351 ns ± 6.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit find_words_2(words)
689 ns ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit find_words_3(words)
2.42 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_4(words)
2.75 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_5(words)
2.65 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_6(words)
1.64 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
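The simple chained `or` approach wins (function 1). The basic iteration over each element to test (function 2) is at least 50% faster than using `any()`, and even a regex search is faster than the basic `any()` without `map()`, so I don't really get why it exists at all. Not to mention, the syntax is purely algorithmic, so any programmer will understand what it does, even without a Python background.
`re.match()` only searches for patterns starting at the beginning of the line (which may be confusing if you come from PHP/Perl regex), so to make it work like PHP/Perl you need to use `re.search()` or tweak the regex to include the preceding characters, which comes with a performance penalty.
If the list of substrings to search for is known at programming time, the ugly chained `or` is definitely the way to go. Otherwise, use a basic `for` loop over the list of substrings to search. `any()` and regex are a waste of time in this context.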
For a more down-to-earth application (checking whether a file is an image by looking for its extension in a list):
    import re

    def is_image(word: str) -> bool:
        # chained `or`, same order as the regex and the list below
        if (".bmp" in word or ".jpg" in word or ".jpeg" in word or ".jpe" in word or
                ".jp2" in word or ".j2c" in word or ".j2k" in word or ".jpc" in word or
                ".jpf" in word or ".jpx" in word or ".png" in word or ".ico" in word or
                ".svg" in word or ".webp" in word or ".heif" in word or ".heic" in word or
                ".tif" in word or ".tiff" in word or ".hdr" in word or ".exr" in word or
                ".ppm" in word or ".pfm" in word or ".nef" in word or ".rw2" in word or
                ".cr2" in word or ".cr3" in word or ".crw" in word or ".dng" in word or
                ".raf" in word or ".arw" in word or ".srf" in word or ".sr2" in word or
                ".iiq" in word or ".3fr" in word or ".dcr" in word or ".ari" in word or
                ".pef" in word or ".x3f" in word or ".erf" in word or ".raw" in word or
                ".rwz" in word):
            return True
        return False

    IMAGE_PATTERN = re.compile(r"\.(bmp|jpg|jpeg|jpe|jp2|j2c|j2k|jpc|jpf|jpx|png|ico|svg|webp|heif|heic|tif|tiff|hdr|exr|ppm|pfm|nef|rw2|cr2|cr3|crw|dng|raf|arw|srf|sr2|iiq|3fr|dcr|ari|pef|x3f|erf|raw|rwz)")
    extensions = [".bmp", ".jpg", ".jpeg", ".jpe", ".jp2", ".j2c", ".j2k", ".jpc", ".jpf", ".jpx", ".png", ".ico", ".svg", ".webp", ".heif", ".heic", ".tif", ".tiff", ".hdr", ".exr", ".ppm", ".pfm", ".nef", ".rw2", ".cr2", ".cr3", ".crw", ".dng", ".raf", ".arw", ".srf", ".sr2", ".iiq", ".3fr", ".dcr", ".ari", ".pef", ".x3f", ".erf", ".raw", ".rwz"]
(Note that the extensions are declared in the same order in all variants.)
> %timeit is_image("DSC_blablabla_001256.nef") # found
536 ns ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit is_image("DSC_blablabla_001256.noop") # not found
923 ns ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.nef")
221 ns ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.noop") # not found
207 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit any(ext in "DSC_blablabla_001256.nef" for ext in extensions) # found
1.53 µs ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit any(ext in "DSC_blablabla_001256.noop" for ext in extensions) # not found
2.2 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
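With many more options to test, regex is actually faster and more legible (for once…) than the chained `or`. `any()` is still the worst.
Empirical tests show that the performance threshold is at 9 elements to test:
below 9 elements, the chained `or` is faster; above 9 elements, the regex `search()` is faster; at exactly 9 elements, both run in around 225 ns.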
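Timings like these depend heavily on the interpreter version, the machine and the inputs, so it may be worth re-measuring locally. Below is a small harness sketch that uses the standard `timeit` module instead of IPython's `%timeit`. The shortened extension list, the function names and the `benchmark` helper are all assumptions made for illustration, and no particular outcome is implied.

```python
import re
import timeit

EXTENSIONS = [".bmp", ".jpg", ".jpeg", ".png", ".nef"]  # shortened, illustrative list
PATTERN = re.compile("|".join(re.escape(ext) for ext in EXTENSIONS))


def with_any(name):
    return any(ext in name for ext in EXTENSIONS)


def with_loop(name):
    for ext in EXTENSIONS:
        if ext in name:
            return True
    return False


def with_regex(name):
    return PATTERN.search(name) is not None


def benchmark(name, number=20_000):
    """Return the average time per call, in nanoseconds, for each variant."""
    results = {}
    for func in (with_any, with_loop, with_regex):
        total = timeit.timeit(lambda: func(name), number=number)
        results[func.__name__] = total / number * 1e9
    return results


for sample in ("DSC_001256.nef", "DSC_001256.noop"):
    print(sample, benchmark(sample))
```

Testing both a matching and a non-matching name matters, because the non-matching case forces every variant to examine all candidates.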
Solution 10:
    a = ['a', 'b', 'c']
    str = "a123"  # note: this name shadows the built-in str
    a_match = [True for match in a if match in str]
    if True in a_match:
        print("some of the strings found in str")
    else:
        print("no strings found in str")
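Solution 11:
A compact way to find multiple strings in another list of strings is to use set.intersection. This executes much faster than a list comprehension on large sets or lists.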
>>> astring = ['abc','def','ghi','jkl','mno']
>>> bstring = ['def', 'jkl']
>>> a_set = set(astring) # convert list to set
>>> b_set = set(bstring)
>>> matches = a_set.intersection(b_set)
>>> matches
{'def', 'jkl'}
>>> list(matches) # if you want a list instead of a set
['def', 'jkl']
>>>
Solution 12:
Some more information on how to get all the list elements that are present in the string:
    a = ['a', 'b', 'c']
    str = "a123"
    list(filter(lambda x: x in str, a))
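Solution 13:
If you want exact matches of whole words, consider word-tokenizing the target string. I use the recommended word_tokenize from nltk:
    from nltk.tokenize import word_tokenize
Here is the tokenized string from the accepted answer:
    a_string = "A string is more than its parts!"
    tokens = word_tokenize(a_string)
    tokens
    Out[46]: ['A', 'string', 'is', 'more', 'than', 'its', 'parts', '!']
The accepted answer gets modified as follows:
    matches_1 = ["more", "wholesome", "milk"]
    [x in tokens for x in matches_1]
    Out[42]: [True, False, False]
As in the accepted answer, the word "more" is still matched. If "mo" becomes a match string, however, the accepted answer still finds a match. That is a behavior I did not want.
    matches_2 = ["mo", "wholesome", "milk"]
    [x in a_string for x in matches_2]
    Out[43]: [True, False, False]
Using word tokenization, "mo" is no longer matched:
    [x in tokens for x in matches_2]
    Out[44]: [False, False, False]
That is the additional behavior that I wanted. This answer also responds to the duplicate question here.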
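When installing nltk is not an option, a rough approximation of the same whole-word behavior can be had with the standard library. The sketch below swaps in a plain `\w+` regex for `word_tokenize`; the `tokenize` helper is hypothetical and, unlike nltk, it drops punctuation tokens such as "!".

```python
import re


def tokenize(text):
    """Split text into word tokens; a simple stand-in for nltk's word_tokenize."""
    return re.findall(r"\w+", text)


a_string = "A string is more than its parts!"
tokens = set(tokenize(a_string))  # a set gives fast membership tests

whole_word = [x in tokens for x in ["more", "wholesome", "milk"]]
partial_word = [x in tokens for x in ["mo", "wholesome", "milk"]]

print(whole_word)    # [True, False, False]
print(partial_word)  # [False, False, False]
```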
Solution 14:
It depends on the context. If you want to check for a single literal (any single character such as a, e, w, etc.), `in` is enough:
    original_word = "hackerearcth"
    if 'h' in original_word:
        print("YES")
If you want to check for any of the characters in original_word, use:
    if any(your_required in yourinput for your_required in original_word):
        print("yes")
If you want all of the characters in original_word to be present in the input, simply use all:
    original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
    yourinput = str(input()).lower()
    if all(requested_word in yourinput for requested_word in original_word):
        print("yes")
Solution 15:
    strlist = ['SUCCESS', 'Done', 'SUCCESSFUL']
    res = False
    # read all lines up front; the with block closes the file afterwards
    with open('test.txt', 'r') as flog:
        flogLines = flog.readlines()
    for line in flogLines:
        for fstr in strlist:
            if line.find(fstr) != -1:
                print('found')
                res = True
    if res:
        print('res true')
    else:
        print('res false')
Solution 16:
I would use this kind of function for speed:
    def check_string(string, substring_list):
        for substring in substring_list:
            if substring in string:
                return True  # stop at the first match
        return False
Solution 17:
Yet another solution, using set.intersection. It fits in one line.
    subset = {"some", "words"}
    text = "some words to be searched here"
    if len(subset & set(text.split())) == len(subset):
        print("All values present in text")
    if subset & set(text.split()):
        print("At least one value present in text")
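The same two checks can also be written with the set comparison operators, which some readers may find more direct. This is only a sketch on the sample data above; note that it matches whole whitespace-separated words, so a word with attached punctuation (for example "words,") would not count.

```python
subset = {"some", "words"}
text = "some words to be searched here"
words = set(text.split())

all_present = subset <= words                # subset test: every word is present
any_present = not subset.isdisjoint(words)   # at least one word is present
partly_missing = {"some", "missing"} <= words

print(all_present)     # True
print(any_present)     # True
print(partly_missing)  # False
```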
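Solution 18:
I came to this question from a link in another closed question:
"Python: How to check a string for substrings from a list?" but did not see an explicit solution to that question in the answers above.
Given a list of substrings and a list of strings, return a unique list of the strings that contain any of the substrings.
    substrings = ['hello', 'world', 'python']
    strings = ['blah blah.hello_everyone', 'this is a-crazy_world.here',
               'one more string', 'ok, one more string with hello world python']
    # one-liner
    list(set([strings_of_interest for strings_of_interest in strings for substring in substrings if substring in strings_of_interest]))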
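The one-liner above goes through a set, so the order of the result is not guaranteed. If the original order matters, one possible variant (a sketch, not the answer author's code) is to filter with `any`, which keeps each string at most once without needing a set, provided the input list itself has no duplicates.

```python
substrings = ['hello', 'world', 'python']
strings = ['blah blah.hello_everyone', 'this is a-crazy_world.here',
           'one more string', 'ok, one more string with hello world python']

# Keep a string if at least one substring occurs in it; input order is preserved
matching = [s for s in strings if any(sub in s for sub in substrings)]

print(matching)
# ['blah blah.hello_everyone', 'this is a-crazy_world.here',
#  'ok, one more string with hello world python']
```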
Solution 19:
    data = "firstName and favoriteFood"
    mandatory_fields = ['firstName', 'lastName', 'age']
    # for each
    for field in mandatory_fields:
        if field not in data:
            print("Error, missing req field {0}".format(field))
    # still fine, multiple if statements
    if ('firstName' not in data or
            'lastName' not in data or
            'age' not in data):
        print("Error, missing a req field")
    # not very readable, list comprehension
    missing_fields = [x for x in mandatory_fields if x not in data]
    if len(missing_fields) > 0:
        print("Error, missing fields {0}".format(", ".join(missing_fields)))