In Python, how do I split a string and keep the separators?
- 2024-11-29 08:42:00
- Original post by admin
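Problem description:
The simplest way to explain this is with an example. Here is what I use:
>>> re.split(r'\W', 'foo/bar spam\neggs')
['foo', 'bar', 'spam', 'eggs']
Here is what I want:
>>> someMethod(r'\W', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
The reason is that I want to split a string into tokens, manipulate them, and then put the string back together again.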
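Solution 1:
The documentation for re.split mentions:
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
So you only need to wrap the separator in a capturing group:
>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']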
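Since the stated goal is to split into tokens, manipulate them, and reassemble the string, here is a minimal sketch of that round trip. The function name upper_words and the choice of uppercasing are illustrative assumptions, not part of the original answer.

```python
import re

def upper_words(text):
    """Uppercase every word while leaving the separators untouched."""
    # The capturing group makes re.split return the separators too.
    tokens = re.split(r'(\W)', text)
    # With one capturing group, even indexes hold the text between
    # separators and odd indexes hold the separators themselves.
    tokens[::2] = [tok.upper() for tok in tokens[::2]]
    return ''.join(tokens)

original = 'foo/bar spam\neggs'
# Joining the unmodified tokens reproduces the input exactly.
assert ''.join(re.split(r'(\W)', original)) == original
print(repr(upper_words(original)))  # 'FOO/BAR SPAM\nEGGS'
```

Because the separators sit at predictable positions, the manipulation step can target only the words and the join step needs no extra bookkeeping.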
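Solution 2:
If you are splitting on newlines, use splitlines(True).
>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']
(This is not a general solution, but I am adding it here in case someone arrives without realizing this method exists.)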
Solution 3:
Another example: split on non-alphanumeric characters and keep the separators
import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])', a)
Output:
['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']
Explanation of re.split('([^a-zA-Z0-9])', a):
() <- keep the separators
[] <- match any single character from the set inside
^a-zA-Z0-9 <- negate the set: anything except letters (upper or lower case) and digits
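One detail the pattern above leaves open is what happens when several separator characters are adjacent. The sketch below compares the pattern from this answer with a variant that adds a + quantifier; the sample string and variable names are assumptions chosen for illustration.

```python
import re

a = "foo, bar@@candy"

# Without '+', each separator character is its own element, and an
# empty string appears between two adjacent separators.
single = re.split(r'([^a-zA-Z0-9])', a)
print(single)  # ['foo', ',', '', ' ', 'bar', '@', '', '@', 'candy']

# With '+', a run of adjacent separator characters is kept as one element.
runs = re.split(r'([^a-zA-Z0-9]+)', a)
print(runs)  # ['foo', ', ', 'bar', '@@', 'candy']
```

Both variants are lossless: joining either list gives back the original string.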
Solution 4:
If you only have one separator, you can use list comprehensions:
text = 'foo,bar,baz,qux'
sep = ','
Appending/prepending the separator:
result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of the trailing separator
result[-1] = result[-1][:-len(sep)]
#['foo,', 'bar,', 'baz,', 'qux']
result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of the leading separator
result[0] = result[0][len(sep):]
#['foo', ',bar', ',baz', ',qux']
Separator as its own element:
result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1] # to get rid of the trailing separator
Solution 5:
Another no-regex solution that works well on Python 3
# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']
def split_and_keep(s, sep):
    if not s: return [''] # consistent with string.split()
    # Find replacement character that is not used in string
    # i.e. just use the highest available character plus one
    # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
    p=chr(ord(max(s))+1)
    return s.replace(sep, sep+p).split(p)
for s in test_strings:
    print(split_and_keep(s, '<'))
# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))  # raises ValueError
Solution 6:
Here is a simple .split solution that needs no regex.
This is an answer to "Python split() without removing the delimiter", so it is not exactly what the original post asks, but that other question was closed as a duplicate of this one.
def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]
Random tests:
import random
CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""] # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s
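Solution 7:
A lazy and simple solution
Assume your regex pattern is split_pattern = r'(!|\?)'
First, insert some marker text after each separator to act as a new separator, for example '[cut]':
new_string = re.sub(split_pattern, r'\1[cut]', your_string)
Then split on the new separator: new_string.split('[cut]')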
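The two steps above can be combined into one helper. This is a minimal sketch: the function name split_after, the marker default, and the guard against the marker already appearing in the text are assumptions added for illustration.

```python
import re

def split_after(pattern, text, marker='[cut]'):
    """Split text after each match of pattern, keeping the match with the piece before it."""
    # Assumption: the marker must not already occur in the text,
    # otherwise the final split would cut in the wrong places.
    if marker in text:
        raise ValueError('marker occurs in the input text')
    # m.group(0) is the whole match, so the pattern needs no capturing group.
    marked = re.sub(pattern, lambda m: m.group(0) + marker, text)
    return marked.split(marker)

print(split_after(r'[!?]', 'Hi! How are you? Fine'))
# ['Hi!', ' How are you?', ' Fine']
```

Using a function as the replacement avoids having to escape backslashes in the marker text.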
Solution 8:
You can also split a string with an array of strings instead of a regular expression, like this:
def tokenizeString(aString, separators):
    # separators is an array of strings that are being used to split the string.
    # Sort separators by ascending length: the loop below keeps the last match,
    # so the longest matching separator wins.
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\\"', "-=", "-", " ", '"""', "(", ")"]))
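Solution 9:
Replace every separator (\W) with separator + new_separator (\W;), then split on new_separator (;). This assumes that ';' does not otherwise occur in the string.
import re
def split_and_keep(separator, s):
    return re.split(';', re.sub(separator, lambda match: match.group() + ';', s))
print(split_and_keep(r'\W', 'foo/bar spam\neggs'))
# ['foo/', 'bar ', 'spam\n', 'eggs']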
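A related way to get the same "separator stays attached to the preceding piece" result without a placeholder character is to split on a zero-width lookbehind. This is a different technique from the one in the answer above, shown here only as a sketch; it relies on re.split accepting patterns that match the empty string, which requires Python 3.7 or later.

```python
import re

s = 'foo/bar spam\neggs'

# (?<=\W) matches the empty position right after a non-word character,
# so nothing is consumed and every separator stays with the piece before it.
parts = re.split(r'(?<=\W)', s)
print(parts)  # ['foo/', 'bar ', 'spam\n', 'eggs']

# Nothing was removed, so joining restores the input.
assert ''.join(parts) == s
```

Because no placeholder is inserted, this variant has no problem with inputs that already contain ';'.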
Solution 10:
# This keeps all separators in result
##########################################################################
import re
st = "%%(c+dd+e+f-1523)%%7"
sh = re.compile('[+-//*<>%()]')
def splitStringFull(sh, st):
    ls = sh.split(st)
    lo = []
    start = 0
    for l in ls:
        if not l:
            continue
        # Search from `start` so a repeated token is not found at an earlier position
        k = st.find(l, start)
        if k > start:
            # The text between the previous token and this one is a run of separators
            lo.append(st[start:k])
        lo.append(l)
        start = k + len(l)
    if start < len(st):
        # Keep any trailing separators as well
        lo.append(st[start:])
    return lo
#############################
li = splitStringFull(sh, st)
print(li)
# ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']
Solution 11:
If you want to split a string and keep the separators using a regex that has no capturing group:
import re
def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match_start])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches
s = 'foo(bar)baz'  # example input, not part of the original answer
regex = re.compile(r"[()]")
matches = finditer_with_separators(regex, s)
If you can assume the regex is wrapped in a capturing group:
def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches
regex = re.compile(r"([()])")
matches = split_with_separators(regex, s)
Both approaches also drop the empty groups, which are useless and annoying in most cases.
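To make the "empty groups" concrete, the following sketch shows what re.split returns when two separators are adjacent or when the string starts with one, and how filtering removes the empty strings. The sample input is an assumption chosen to trigger both cases.

```python
import re

s = '(a)(b)c'
regex = re.compile(r'([()])')

# A leading separator and two adjacent separators each produce
# an empty string in the raw result.
raw = regex.split(s)
print(raw)  # ['', '(', 'a', ')', '', '(', 'b', ')', 'c']

# Dropping the empty strings keeps every separator and every token.
cleaned = [piece for piece in raw if piece]
print(cleaned)  # ['(', 'a', ')', '(', 'b', ')', 'c']
```

Note that after filtering, the even/odd alternation between tokens and separators no longer holds, so code that relies on positions should work with the raw list instead.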
Solution 12:
Install wrs ("WITHOUT REMOVING SPLITOR")
pip install wrs
(developed by Rao Hamza)
import wrs
text = "Now inbox “how to make spam ad” Invest in hard email marketing."
splitor = 'email | spam | inbox'
list = wrs.wr_split(splitor, text)
print(list)
Result:
['Now', 'inbox “how to make', 'spam ad', 'Invest in hard', 'email marketing.']
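Solution 13:
I had a similar problem when trying to split a file path and struggled to find a simple answer. This worked for me and does not require substituting the separators back into the split text:
my_path = 'folder1/folder2/folder3/file1'
import re
re.findall('[^/]+/|[^/]+', my_path)
Returns:
['folder1/', 'folder2/', 'folder3/', 'file1']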
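One caveat with this pattern is that a separator with no text before it (a leading slash or a doubled slash) matches neither alternative and is silently dropped. The sketch below shows the difference and a variant that allows an empty component before the slash; the sample path and variable names are assumptions for illustration.

```python
import re

my_path = '/folder1//file1'

# The pattern from the answer needs at least one non-slash character,
# so the leading '/' and the second '/' of '//' are skipped.
parts = re.findall('[^/]+/|[^/]+', my_path)
print(parts)  # ['folder1/', 'file1']

# Allowing zero characters before the slash keeps every separator.
lossless = re.findall('[^/]*/|[^/]+', my_path)
print(lossless)  # ['/', 'folder1/', '/', 'file1']
assert ''.join(lossless) == my_path
```

For well-formed relative paths like the one in the answer, both patterns give the same result.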
Solution 14:
May I just leave this here
s = 'foo/bar spam\neggs'
print(s.replace('/', '+++/+++').replace(' ', '+++ +++').replace('\n', '+++\n+++').split('+++'))
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
Solution 15:
How can we split a string in Python and keep single or consecutive spaces?
def splitWithSpace(string):
    list_strings = list(string)
    split_list = []
    new_word = ""
    for character in list_strings:
        if character == " ":
            split_list.extend([new_word, " "]) if new_word else split_list.append(" ")
            new_word = ""
        else:
            new_word += character
    split_list.append(new_word)
    print(split_list)
Single spaces:
splitWithSpace("this is a simple text")
Output: ['this', ' ', 'is', ' ', 'a', ' ', 'simple', ' ', 'text']
Multiple spaces:
splitWithSpace("this is  a  simple text")
Output: ['this', ' ', 'is', ' ', ' ', 'a', ' ', ' ', 'simple', ' ', 'text']
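Solution 16:
I find this generator-based approach more satisfying:
def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]
It avoids the need to work out the correct regex, and in theory it should be fairly cheap: apart from the slices it yields, it creates no intermediate strings, and it delegates most of the iteration work to the efficient find method. Note that it assumes a single-character separator, because end is computed as the match position plus one.
...and in Python 3.8 it can be as short as:
def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]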
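The generator above advances by one character past the match position, which fits a single-character separator. For separators longer than one character, a variant along the following lines could be used; the name split_keep_multi and the empty-separator check are assumptions, not part of the original answer.

```python
def split_keep_multi(string, sep):
    """Yield pieces of string, each ending with sep except the last one."""
    if not sep:
        # str.find('') always succeeds, which would loop forever.
        raise ValueError('empty separator')
    start = 0
    while True:
        idx = string.find(sep, start)
        if idx == -1:
            break
        # Include the whole separator, not just its first character.
        end = idx + len(sep)
        yield string[start:end]
        start = end
    yield string[start:]

print(list(split_keep_multi('a::b::c', '::')))  # ['a::', 'b::', 'c']
```

As in the original generator, a string that ends with the separator yields a final empty string, which keeps the join of all pieces equal to the input.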
Solution 17:
Some of the previously posted answers repeat the separator or have other bugs that I ran into. You can use this function instead:
def split_and_keep_delimiter(input, delimiter):
    result = []
    idx = 0
    while delimiter in input:
        idx = input.index(delimiter)
        result.append(input[0:idx+len(delimiter)])
        input = input[idx+len(delimiter):]
    result.append(input)
    return result
Solution 18:
>>> line = 'hello_toto_is_there'
>>> sep = '_'
>>> [sep + x[1] if x[0] != 0 else x[1] for x in enumerate(line.split(sep))]
['hello', '_toto', '_is', '_there']
Solution 19:
An implementation that uses only list (with the help of str.partition()):
import typing as t
def partition(s: str, seps: t.Iterable[str]):
    if not s or not seps:
        return [s]
    st1, st2 = [s], []
    for sep in set(seps):
        if st1:
            while st1:
                st2.append(st1.pop())
                while True:
                    x1, x2, x3 = st2.pop().rpartition(sep)
                    if not x2:  # `sep` not found
                        st2.append(x3)
                        break
                    if not x1:
                        st2.extend([x3, x2] if x3 else [x2])
                        break
                    st2.extend([x3, x2, x1] if x3 else [x2, x1])
        else:
            while st2:
                st1.append(st2.pop())
                while True:
                    x1, x2, x3 = st1.pop().partition(sep)
                    if not x2:  # `sep` not found
                        st1.append(x1)
                        break
                    if not x3:
                        st1.extend([x1, x2] if x1 else [x2])
                        break
                    st1.extend([x1, x2, x3] if x1 else [x2, x3])
    return st1 or list(reversed(st2))
assert partition('abcdbcd', ['a']) == ['a', 'bcdbcd']
assert partition('abcdbcd', ['b']) == ['a', 'b', 'cd', 'b', 'cd']
assert partition('abcdbcd', ['d']) == ['abc', 'd', 'bc', 'd']
assert partition('abcdbcd', ['e']) == ['abcdbcd']
assert partition('abcdbcd', ['b', 'd']) == ['a', 'b', 'c', 'd', 'b', 'c', 'd']
assert partition('abcdbcd', ['db']) == ['abc', 'db', 'cd']
Solution 20:
Use re.split, even when your regex comes from a variable and you have multiple separators. You can do it like this:
import re
# BashSpecialParamList is the list of special parameters in Bash,
# used here as the separators
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the string to be split
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"
reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])
re.split(f'({reStr})', aStr)
# Then you can get the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']
Reference: GNU Bash Special Parameters
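When one separator is a prefix of another (for example '+' and '+='), the order of the alternatives matters, because the regex engine takes the first alternative that matches. The sketch below wraps the approach in a function and sorts the separators longest first; the name split_keep_any and the sample input are assumptions, and it presumes at least one non-empty separator.

```python
import re

def split_keep_any(text, separators):
    """Split text on any of the literal separators and keep them in the result."""
    # Longest first, so that '+=' is tried before '+' in the alternation.
    ordered = sorted(separators, key=len, reverse=True)
    # re.escape makes characters such as '+', '$' and '*' match literally.
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(f'({pattern})', text)

print(split_keep_any('a+=b+c', ['+', '+=']))  # ['a', '+=', 'b', '+', 'c']
```

Without the sort, the pattern would match the '+' of '+=' first and leave a stray '=' at the start of the next token.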
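Solution 21:
The code below is a simple, efficient, and well-tested answer to this question. The comments in the code explain everything in it.
I promise it is not as scary as it looks: it is really only 13 lines of code. The rest is comments, documentation, and assertions.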
def split_including_delimiters(input: str, delimiter: str):
    """
    Splits an input string, while including the delimiters in the output

    Unlike str.split, we can use an empty string as a delimiter
    Unlike str.split, the output will not have any extra empty strings
    Consequently, len(split_including_delimiters('', delimiter)) == 0 for all delimiters,
    whereas len(input.split(delimiter)) > 0 for all inputs and delimiters

    INPUTS:
        input: Can be any string
        delimiter: Can be any string

    EXAMPLES:
        >>> split_including_delimiters('Hello World ! ', ' ')
        ans = ['Hello', ' ', 'World', ' ', '!', ' ']
        >>> split_including_delimiters("Hello**World**!***", "**")
        ans = ['Hello', '**', 'World', '**', '!', '**', '*']

    TESTS:
        assert split_including_delimiters('-xx-xx-', 'xx') == ['-', 'xx', '-', 'xx', '-'] # length 5
        assert split_including_delimiters('xx-xx-', 'xx') == ['xx', '-', 'xx', '-'] # length 4
        assert split_including_delimiters('-xx-xx', 'xx') == ['-', 'xx', '-', 'xx'] # length 4
        assert split_including_delimiters('xx-xx', 'xx') == ['xx', '-', 'xx'] # length 3
        assert split_including_delimiters('xxxx', 'xx') == ['xx', 'xx'] # length 2
        assert split_including_delimiters('xxx', 'xx') == ['xx', 'x'] # length 2
        assert split_including_delimiters('x', 'xx') == ['x'] # length 1
        assert split_including_delimiters('', 'xx') == [] # length 0
        assert split_including_delimiters('aaa', 'xx') == ['aaa'] # length 1
        assert split_including_delimiters('aa', 'xx') == ['aa'] # length 1
        assert split_including_delimiters('a', 'xx') == ['a'] # length 1
        assert split_including_delimiters('', '') == [] # length 0
        assert split_including_delimiters('a', '') == ['a'] # length 1
        assert split_including_delimiters('aa', '') == ['a', '', 'a'] # length 3
        assert split_including_delimiters('aaa', '') == ['a', '', 'a', '', 'a'] # length 5
    """
    # Input assertions
    assert isinstance(input, str), "input must be a string"
    assert isinstance(delimiter, str), "delimiter must be a string"
    if delimiter:
        # These tokens do not include the delimiter, but are computed quickly
        tokens = input.split(delimiter)
    else:
        # Edge case: if the delimiter is the empty string, split between the characters
        tokens = list(input)
    # The following assertion is always true for any string input and delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter.join(tokens) == input
    output = tokens[:1]
    for token in tokens[1:]:
        output.append(delimiter)
        if token:
            output.append(token)
    # Don't let the first element be an empty string
    if output[:1] == ['']:
        del output[0]
    # The only case where we should have an empty string in the output is if it is our delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter == '' or '' not in output
    # The resulting strings should be combinable back into the original string
    # For speed's sake, we disable this assertion
    # assert ''.join(output) == input
    return output