How do I split text into sentences?
Problem description:
I have a text file, and I need to get a list of sentences from it.
How can this be done? There are many subtleties, such as periods used in abbreviations.
My old regular expression works badly:
re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
Solution 1:
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates it can do the job:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
Solution 2:
This function splits the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the trickier edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."
# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'
def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.
    If the text contains substrings "<prd>" or "<stop>", they would lead
    to incorrect splitting because they are used as markers for splitting.
    :param text: text to be split into sentences
    :type text: str
    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + " "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences
Comparison with nltk:
>>> from nltk.tokenize import sent_tokenize
Example 1: split_into_sentences is better here (because it explicitly covers many of these cases):
>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '
>>> split_into_sentences(text)
['Some sentence.',
'Mr. Holmes...',
'This is a new sentence!',
'And This is another one..',
'Hi']
>>> sent_tokenize(text)
['Some sentence.',
'Mr.',
'Holmes...This is a new sentence!And This is another one.. Hi']
Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):
>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'
>>> split_into_sentences(text)
['The U.S.',
'Drug Enforcement Administration (DEA) says hello.',
'And have a nice day.']
>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
'And have a nice day.']
Solution 3:
Instead of using a regular expression to split the text into sentences, you can also use the nltk library.
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
Reference: https://stackoverflow.com/a/9474645/2877052
Solution 4:
You can try using spaCy instead of a regex. I have used it and it works well.
import spacy
nlp = spacy.load('en_core_web_sm')  # recent spaCy versions need the full model name instead of 'en'
text = '''Your text here'''
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # sent.text replaces the removed sent.string attribute
Solution 5:
I like spaCy, but I recently discovered two new approaches to sentence tokenization: BlingFire from Microsoft (blazingly fast) and PySBD from AI2 (extremely accurate).
text = ...
from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')
from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)
I split 20k sentences with five different methods. Here are the elapsed times on an AMD Threadripper Linux machine:
spaCy sentencizer: 1.16934 s
spaCy parse: 25.97063 s
PySBD: 9.03505 s
NLTK: 0.30512 s
BlingFire: 0.07933 s
UPDATE: I tried BlingFire on all-lowercase text and it failed. For now I am going to use PySBD in my project.
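For reference, a minimal sketch of how such a comparison might be timed (the file name and the choice of NLTK as the example splitter are my own assumptions; the other libraries would be wrapped the same way):
import time
from nltk.tokenize import sent_tokenize
# import nltk; nltk.download('punkt')  # needed once before sent_tokenize will run

def time_splitter(split_fn, text, label):
    # run one splitter and report wall-clock time and sentence count
    start = time.perf_counter()
    sentences = split_fn(text)
    print(f"{label}: {time.perf_counter() - start:.5f} s, {len(sentences)} sentences")

text = open("sample.txt", encoding="utf-8").read()  # hypothetical input file
time_splitter(sent_tokenize, text, "NLTK")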
Solution 6:
Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example '.' vs. '."'.
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry:
Find all occurrences of a substring in Python
Solution 7:
You can also use the sentence tokenization function in NLTK:
from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes. Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."
sent_tokenize(sentence)
Solution 8:
For simple cases (where sentences are terminated normally), this should work:
import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
The regex is essentially ' *\. +', which matches a period surrounded by zero or more spaces on the left and one or more spaces on the right (so that something like the period in re.split is not counted as a sentence break).
Obviously this isn't the most robust solution, but it does the job well enough in most cases. The only case it won't cover is abbreviations (perhaps you could run through the list of sentences and check that each string in sentences starts with a capital letter?).
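Here is a rough sketch of that capitalization check (my own illustration, not part of the original answer): any chunk that does not start with an uppercase letter is assumed to be the tail of an abbreviation and is glued back onto the previous chunk. Note that re.split discards the terminators, so this stays a heuristic.
import re

text = "I like fruit, e.g. apples and pears. They are sweet."
chunks = [c for c in re.split(r' *[\.\?!][\'"\)\]]* *', text) if c]
sentences = []
for chunk in chunks:
    if sentences and not chunk[0].isupper():
        # probably a fragment created by an abbreviation; merge it back
        sentences[-1] += " " + chunk
    else:
        sentences.append(chunk)
print(sentences)
# ['I like fruit, e g apples and pears', 'They are sweet']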
Solution 9:
Using spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())
Solution 10:
Might as well throw this in, since this is the first post here that covers splitting the output into chunks of n sentences.
It works for a variable split length, which indicates how many sentences end up joined together.
import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # 3 sentences per chunk, for example
elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
Solution 11:
Use Stanza, a natural language processing library that works for many human languages.
import stanza
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp(t_en)
for sentence in doc.sentences:
    print(sentence.text)
Solution 12:
If NLTK's sent_tokenize is not an option (for example, long texts would require a lot of GPU RAM) and regex doesn't work well across languages, a sentence splitter library may be worth a try.
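Assuming this refers to the sentence-splitter package on PyPI (my assumption), a minimal usage sketch would look like this; it is a rule-based splitter with per-language lists of non-breaking prefixes, so it runs on the CPU:
# pip install sentence-splitter
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')
print(splitter.split('Mr. Smith went to Washington. He arrived on Monday.'))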
Solution 13:
Also, be wary of additional top-level domains that aren't covered in some of the answers above.
For example, .info, .biz, .ru and .online will trip up some sentence parsers but aren't included above.
Here is some information on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/
That can be addressed by editing the code above to read:
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|Hes|Shes|Its|Theys|Theirs|Ours|Wes|Buts|Howevers|Thats|Thiss|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"
Solution 14:
Using spaCy v3.5:
import spacy
nlp_sentencizer = spacy.blank("en")
nlp_sentencizer.add_pipe("sentencizer")
text = "How are you today? I hope you have a great day"
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]
Solution 15:
There is no doubt that NLTK is the best fit for this purpose. But getting started with NLTK is quite painful (though once you install it, you reap the rewards).
Here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:
# split up a paragraph into sentences
# using regular expressions
def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = """This is a sentence. This is an excited sentence! And do you think this is a question?"""
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())
#output:
# This is a sentence
# This is an excited sentence
# And do you think this is a question
Solution 16:
I hope this helps you with Latin, Chinese and Arabic text.
import re
punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []
with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
Solution 17:
I was working on a similar task and came across this question. After following a few links and working through some nltk exercises, the code below worked for me like magic.
from nltk.tokenize import sent_tokenize
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article']
Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/
Solution 18:
I had to read a subtitle file and split it into sentences. After pre-processing (e.g. removing the time information from the .srt file), the variable fullFile contained the full text of the subtitle file. The crude approach below splits it neatly into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.
# Very approximate way to split the text into sentences - Break after ? . and !
import re
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()
Oh! Well. I now realize that since my content was in Spanish, I didn't run into the issues of dealing with "Mr. Smith" and the like. Still, if someone wants a quick and dirty parser...
Solution 19:
You can use this function to create a new tokenizer for Russian (and some other languages):
def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result
Then call it like this:
text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
Solution 20:
Using spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is Thired ')
for sentence in doc.sents:
    print(sentence)
But if you want to get a sentence by its index, for example:
# doesn't work
doc.sents[0]
use:
list(doc.sents)[0]
Solution 21:
(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])
We can use this regex to avoid cases where certain abbreviations are mistaken for the ends of sentences.
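A quick sketch of using that regex with re.split (the escaping above is my reconstruction of what was likely intended):
import re

# split on whitespace that follows '.' or '?' but not an initialism such as
# "U.S." or a single capital letter followed by a period
pattern = r'(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])'
text = "He works at the U.S. Department of State. Did he enjoy it? Yes."
print(re.split(pattern, text))
# ['He works at the U.S. Department of State.', 'Did he enjoy it?', 'Yes.']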