How do I do a case-insensitive string comparison?

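Question:

How can I compare strings in a case-insensitive way in Python?

I would like to encapsulate the comparison of regular strings to a repository string using simple and Pythonic code. I would also like to be able to look up values in a dict hashed by strings using regular Python strings.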

Solution 1:

Assuming ASCII strings:

string1 = 'Hello'
string2 = 'hello'

if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

As of Python 3.3, casefold() is a better alternative:

string1 = 'Hello'
string2 = 'hello'

if string1.casefold() == string2.casefold():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

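For a more comprehensive solution that handles more complex Unicode comparisons, see the other answers.

Solution 2:

Comparing strings in a case-insensitive way seems trivial, but it is not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode are not trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":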

>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'

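But let's say you want to compare "BUSSE" and "Buße" caselessly. Heck, you probably also want "BUSSE" and "BUẞE" to compare equal (that is the newer capital form). The recommended way is to use casefold:

str.casefold()
Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. [...]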

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
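A small sketch of why lower() alone, and even .upper().lower(), falls short for the German examples above. The variable names are illustrative only; the escapes \u00df and \u1e9e stand for ß and ẞ.

```python
words = ["BUSSE", "Bu\u00dfe", "BU\u1e9eE"]  # "BUSSE", "Buße", "BUẞE"

# Collect the distinct results of each strategy in a set
lowered = {w.lower() for w in words}
round_tripped = {w.upper().lower() for w in words}
folded = {w.casefold() for w in words}

print(len(lowered))        # 2: 'busse' and 'buße' stay different
print(len(round_tripped))  # 2: 'BUẞE' is already uppercase, so it still ends up as 'buße'
print(folded)              # {'busse'}: casefold() maps all three to one form
```

Only casefold() brings all three spellings to the same string.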

Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê", but it is not:

>>> "ê" == "ê"
False

This is because the accent on the latter is a combining character.

>>> import unicodedata
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "ê"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then:

>>> unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
True
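To make the choice of normalization form a little more concrete, here is a minimal sketch comparing the four forms accepted by unicodedata.normalize. The strings are written with escapes so the code points are unambiguous; the variable names are illustrative.

```python
import unicodedata

composed = "\u00ea"     # ê as a single code point
decomposed = "e\u0302"  # e followed by COMBINING CIRCUMFLEX ACCENT

# Every normalization form makes the two spellings of ê compare equal
results = {
    form: unicodedata.normalize(form, composed) == unicodedata.normalize(form, decomposed)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(results)

# The "K" forms additionally replace compatibility characters
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfd_ligature = unicodedata.normalize("NFD", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)
print(nfd_ligature == "fi")   # False: NFD keeps the ligature
print(nfkd_ligature == "fi")  # True: NFKD turns it into "fi"
```

Whether the compatibility ("K") forms are appropriate depends on whether characters such as ligatures should count as equal to their plain spellings in your data.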

To finish up, here it is expressed as functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
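A short usage sketch for the two functions above, repeated here so the snippet runs on its own. The test strings are written with escapes so the code points are unambiguous.

```python
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

print(caseless_equal("BUSSE", "Bu\u00dfe"))  # True: "Buße"
print(caseless_equal("BUSSE", "BU\u1e9eE"))  # True: "BUẞE"
print(caseless_equal("\u00ca", "e\u0302"))   # True: Ê vs. e + combining circumflex
print(caseless_equal("BUSSE", "Busen"))      # False
```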

Solution 3:

Using Python 2, calling .lower() on each string or Unicode object...

string1.lower() == string2.lower()

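...will work most of the time, but indeed does not work in the situations @tchrist has described.

Suppose we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2: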

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

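The Σ character has two lowercase forms, ς and σ, and in Python 2 .lower() will not help compare them case-insensitively.

However, as of Python 3, lower() handles all three forms: a word-final Σ is lowercased to ς, so calling lower() on both strings works correctly: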

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge cases like the three sigmas in Greek, use Python 3.

(For reference, the interpreter printouts above are from Python 2.7.3 and Python 3.3.0b1.)
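For readers without the unicode.txt file, the Python 3 behaviour can be reproduced with string literals. This is a minimal sketch; the words are written with escapes so the code points match the bytes shown earlier, and the last two lines add an observation about casefold() that is not part of the original answer.

```python
first = "\u03a3\u03af\u03c3\u03c5\u03c6\u03bf\u03c2"   # Σίσυφος
second = "\u03a3\u038a\u03a3\u03a5\u03a6\u039f\u03a3"  # ΣΊΣΥΦΟΣ

print(first.lower() == second.lower())        # True on Python 3.3+
print(first.casefold() == second.casefold())  # True

# casefold() maps both lowercase sigmas to the same character,
# whereas lower() leaves the two forms distinct
print("\u03c2".casefold() == "\u03c3".casefold())  # True
print("\u03c2".lower() == "\u03c3".lower())        # False
```

Because casefold() does not depend on where the sigma sits in the word, it is the safer choice when the strings being compared may already be in lowercase.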

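Solution 4:

Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements "default caseless matching" (D144).

Case folding does not preserve the normalization of strings in all instances, so normalization needs to be done as well ('å' vs. 'å'). D145 introduces "canonical caseless matching":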

import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice because of very infrequent edge cases involving the U+0345 character.

Example:

>>> 'å'.casefold() == 'å'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('å')
True

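There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.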
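A minimal sketch of compatibility caseless matching, following my reading of D146, which compares NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) for both strings. The function name compatibility_caseless is hypothetical and chosen to mirror canonical_caseless above.

```python
import unicodedata

def compatibility_caseless(text):
    # NFD first, then two rounds of casefold + NFKD, as read from D146
    text = unicodedata.normalize("NFD", text)
    text = unicodedata.normalize("NFKD", text.casefold())
    return unicodedata.normalize("NFKD", text.casefold())

square_mhz = "\u3392"  # '㎒', SQUARE MHZ

print(square_mhz.casefold() == "mhz".casefold())                             # False
print(compatibility_caseless(square_mhz) == compatibility_caseless("mhz"))   # True
```

The second round of case folding matters here: NFKD expands U+3392 to "MHz", which still contains uppercase letters.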

Solution 5:

You can use the casefold() method. It returns a case-folded copy of the string, so comparing the folded copies ignores case.

firstString = "Hi EVERYONE"
secondString = "Hi everyone"

if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

Output:

The strings are equal.

Solution 6:

I saw this solution here using regex.

import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
    print("match")  # the search succeeds, so the condition is true

It works well with accents:

In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

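However, it does not work for case-insensitive matching of Unicode characters such as "ß". Thanks to @Rhymoid for pointing this out; my understanding is that it needs the exact symbol for the case-insensitive match to succeed. The output is as follows: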

In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:
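One possible workaround, sketched here under the assumption that the pattern is a literal string rather than a real regular expression: case-fold both sides before searching. The helper name caseless_search is hypothetical.

```python
import re

def caseless_search(needle, haystack):
    # Literal substring search on case-folded text; re.escape keeps
    # any regex metacharacters in needle from being interpreted
    return re.search(re.escape(needle.casefold()), haystack.casefold()) is not None

print(caseless_search("SS", "\u00df\u00df"))  # True: "ßß" folds to "ssss"
print(caseless_search("\u00df", "SS"))        # True: "ß" folds to "ss"

# Plain re.IGNORECASE finds nothing in the same situation
print(re.search("SS", "\u00df\u00df", re.IGNORECASE) is None)  # True
```

For purely literal needles, needle.casefold() in haystack.casefold() gives the same answer without the re module.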

Solution 7:

The usual approach is to uppercase or lowercase the strings for lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>>

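Solution 8:

How about converting to lowercase first? You can use string.lower().

Solution 9:

I found a clean solution for a case where I am working with some constant file extensions.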

from pathlib import Path


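class CaseInsensitiveString(str):
    def __eq__(self, __o: str) -> bool:
        return self.casefold() == __o.casefold()

    # Defining __eq__ disables the inherited __hash__, so restore one
    # that agrees with it; this keeps instances usable as dict keys.
    def __hash__(self) -> int:
        return hash(self.casefold())

GZ = CaseInsensitiveString(".gz")
ZIP = CaseInsensitiveString(".zip")
TAR = CaseInsensitiveString(".tar")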

path = Path("/tmp/ALL_CAPS.TAR.GZ")

GZ in path.suffixes, ZIP in path.suffixes, TAR in path.suffixes, TAR == ".tAr"

# (True, False, True, True)

Solution 10:

With pandas, you can pass case=False to str.contains():

data['Column_name'].str.contains('abcd', case=False)

Solution 11:

def search_specificword(key, stng):
    key = key.lower()
    stng = stng.lower()
    flag_present = False
    if stng.startswith(key+" "):
        flag_present = True
    symb = [',','.']
    for i in symb:
        if stng.find(" "+key+i) != -1:
            flag_present = True
    if key == stng:
        flag_present = True
    if stng.endswith(" "+key):
        flag_present = True
    if stng.find(" "+key+" ") != -1:
        flag_present = True
    print(flag_present)
    return flag_present

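Output:

search_specificword("Affordable housing", "to the core of affordable outHousing in europe")
False

search_specificword("Affordable housing", "to the core of affordable Housing, in europe")
True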
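As a possible alternative to the hand-written delimiter checks above, a regular expression with word boundaries can express a similar idea more compactly. This is a sketch, not a drop-in replacement: contains_phrase is a hypothetical name, and \b treats any non-word character as a delimiter, not just spaces, commas and periods.

```python
import re

def contains_phrase(key, text):
    # \b marks word boundaries; re.escape makes key match literally
    pattern = r"\b" + re.escape(key) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains_phrase("Affordable housing", "to the core of affordable Housing, in europe"))    # True
print(contains_phrase("Affordable housing", "to the core of affordable outHousing in europe"))  # False
```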

Solution 12:

from re import fullmatch, escape, IGNORECASE

def is_string_match(word1, word2):
    # Case-insensitively checks whether two words are the same
    # word1: string
    # word2: string | list

    # escape word1 so regex metacharacters in it are matched literally
    pattern = escape(word1)

    # if word2 is a list, check whether word1 equals any word in it
    if isinstance(word2, list):
        for word in word2:
            # fullmatch (not search) so that substrings do not count as equal
            if fullmatch(pattern, word, IGNORECASE):
                return True
        return False

    # if word1 is the same as word2
    if fullmatch(pattern, word2, IGNORECASE):
        return True
    return False

is_match_word = is_string_match("Hello", "hELLO") 
True

is_match_word = is_string_match("Hello", ["Bye", "hELLO", "@vagavela"])
True

is_match_word = is_string_match("Hello", "Bye")
False

Solution 13:

Consider using FoldedCase from jaraco.text:

>>> from jaraco.text import FoldedCase
>>> FoldedCase('Hello World') in ['hello world']
True

If you want a dictionary keyed on text irrespective of case, use FoldedCaseKeyedDict from jaraco.collections:

>>> from jaraco.collections import FoldedCaseKeyedDict
>>> d = FoldedCaseKeyedDict()
>>> d['heLlo'] = 'world'
>>> list(d.keys()) == ['heLlo']
True
>>> d['hello'] == 'world'
True
>>> 'hello' in d
True
>>> 'HELLO' in d
True
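If adding a dependency is not an option, the same idea can be approximated with the standard library alone. The class below is a hypothetical minimal sketch, not the jaraco implementation: it case-folds string keys on write and on lookup, so unlike FoldedCaseKeyedDict it does not preserve the original spelling of the key.

```python
class CaseFoldDict(dict):
    """Hypothetical minimal dict that case-folds string keys."""

    @staticmethod
    def _fold(key):
        # Only strings are folded; other key types pass through unchanged
        return key.casefold() if isinstance(key, str) else key

    def __setitem__(self, key, value):
        super().__setitem__(self._fold(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._fold(key))

    def __contains__(self, key):
        return super().__contains__(self._fold(key))

    def get(self, key, default=None):
        return super().get(self._fold(key), default)

d = CaseFoldDict()
d["heLlo"] = "world"
print(d["HELLO"])      # world
print("hello" in d)    # True
print(list(d.keys()))  # ['hello']
```

Note that the dict constructor and update() bypass __setitem__, so keys passed in that way would not be folded; a fuller version would override those too.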

Solution 14:

def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print("Please only pass strings into this method.")
        print("You passed a %s and %s" % (s1.__class__, s2.__class__))

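Solution 15:

This is another regex approach, which I have learned to love/hate over the last week, so I usually import re as something that reflects how I am feeling (in this case, yes). Make a normal function, ask for input, then use something = re.compile(r'foo|spam', yes.I). re.I (yes.I below) is the same as IGNORECASE, but you can't make as many mistakes writing it!

You then search your message using the regex. Honestly, that deserves a few pages of its own, but the point is that foo and spam are joined with | as alternatives and case is ignored. If either is found, lost_n_found holds the one that matched; if neither is found, lost_n_found is None. If it is not None, the user input is returned in lowercase using "return lost_n_found.lower()".

This lets you much more easily match up anything that would otherwise be case-sensitive. Lastly, (NCS) stands for "no one cares seriously...!" or "not case sensitive", whichever you prefer.

If anyone has any questions, let me know.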

    import re as yes

    def bar_or_spam():
        # Python 3: input() replaces Python 2's raw_input()
        message = input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")

        # yes.I is re.IGNORECASE; match "foo" or "spam" in any case
        message_in_coconut = yes.compile(r'foo|spam', yes.I)

        # search() returns None when nothing matches, so check before .group()
        match = message_in_coconut.search(message)
        lost_n_found = match.group() if match else None

        if lost_n_found is not None:
            return lost_n_found.lower()
        else:
            print("Make tea not love")
            return

    whatz_for_breakfast = bar_or_spam()

    if whatz_for_breakfast == "foo":
        print("BaR")

    elif whatz_for_breakfast == "spam":
        print("EgGs")