How do I split text into sentences?

Problem description:

I have a text file. I need to get a list of sentences.

How can this be done? There are many subtleties, such as periods used in abbreviations.

My old regular expression works poorly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

Solution 1:

The Natural Language Toolkit (nltk.org) has what you need. This group post indicates that it does the job:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

Solution 2:

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds, and it handles many of the trickier edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

Comparison with nltk:

>>> from nltk.tokenize import sent_tokenize

Example 1: split_into_sentences is better here (because it explicitly covers many such cases):

>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '

>>> split_into_sentences(text)
['Some sentence.',
 'Mr. Holmes...',
 'This is a new sentence!',
 'And This is another one..',
 'Hi']

>>> sent_tokenize(text)
['Some sentence.',
 'Mr.',
 'Holmes...This is a new sentence!And This is another one.. Hi']

Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):

>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'

>>> split_into_sentences(text)
['The U.S.',
 'Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

Solution 3:

Instead of using a regular expression to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Reference: https://stackoverflow.com/a/9474645/2877052
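
Note that sent_tokenize relies on NLTK's pre-trained Punkt models; if they are missing, it raises a LookupError. A one-time download fixes that (on recent NLTK releases the resource may be named punkt_tab instead):

import nltk
nltk.download('punkt')        # Punkt sentence tokenizer models used by sent_tokenize
# nltk.download('punkt_tab')  # needed instead on newer NLTK versions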

Solution 4:

You can try using spaCy instead of a regex. I've used it and it works well.

import spacy
nlp = spacy.load('en_core_web_sm')  # older spaCy shortcut links like spacy.load('en') no longer work

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())  # sent.string was removed in spaCy 3.x; use sent.text

Solution 5:

I like spaCy, but recently I discovered two new approaches to sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD from AI2 (extremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I split 20k sentences using five different methods. Here are the timings on an AMD Threadripper Linux machine (a sketch of the two spaCy setups follows the list):

  • spaCy sentencizer: 1.16934s

  • spaCy parse: 25.97063s

  • PySBD: 9.03505s

  • NLTK: 0.30512s

  • BlingFire: 0.07933s
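
The answer doesn't show how the two spaCy timings were produced; here is a minimal sketch of the two setups presumably being compared (the model name en_core_web_sm and the exact pipeline configuration are assumptions on my part):

import spacy

# 1) "spaCy sentencizer": rule-based boundaries only, no dependency parse (fast).
nlp_fast = spacy.load("en_core_web_sm", exclude=["parser"])
nlp_fast.add_pipe("sentencizer")

# 2) "spaCy parse": full pipeline; doc.sents is derived from the dependency parse (slow).
nlp_full = spacy.load("en_core_web_sm")

text = "This is one sentence. Here is another one."
print([s.text for s in nlp_fast(text).sents])
print([s.text for s in nlp_full(text).sents])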

UPDATE: I tried using BlingFire on all-lowercase text, and it failed badly. For now I'm going to use PySBD in my projects.

Solution 6:

Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, e.g. '.' vs. '."'.

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry:
Find all occurrences of a substring in Python

Solution 7:

You can also use the sentence tokenization function from NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

Solution 8:

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is " *\. +", which matches a period surrounded by zero or more spaces on the left and one or more on the right (to prevent something like the period in re.split from being counted as a sentence break).

Obviously this isn't the most robust solution, but it does fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).
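
As a rough sketch of that capital-letter idea (a heuristic I'm adding here, not part of the original answer): re-join any fragment that doesn't start with an uppercase letter onto the previous sentence.

import re

def merge_lowercase_fragments(sentences):
    # Heuristic: a fragment that doesn't start with an uppercase letter was
    # probably split off after an abbreviation, so glue it back on.
    merged = []
    for s in sentences:
        if merged and s and not s[0].isupper():
            merged[-1] = merged[-1] + " " + s
        else:
            merged.append(s)
    return merged

text = "The package weighs approx. five kilograms. It ships tomorrow."
sentences = [s for s in re.split(r' *[\.\?!][\'"\)\]]* *', text) if s]
print(merge_lowercase_fragments(sentences))
# ['The package weighs approx five kilograms', 'It ships tomorrow']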

Solution 9:

Using spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # sent.string was removed in spaCy 3.x; use sent.text

Solution 10:

Might as well throw this in, since this is the first post that comes up for splitting text into groups of n sentences.

It works with a variable split length, which indicates how many sentences end up joined together in each chunk.

import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # e.g. 3 sentences per chunk

elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)

Solution 11:

Using Stanza, a natural language processing library that works for many human languages:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

doc = nlp(t_en)  # t_en holds the (English) text to be split
for sentence in doc.sentences:
    print(sentence.text)

Solution 12:

In case NLTK's sent_tokenize is not an option (e.g. it needs a lot of GPU RAM on long text) and regex doesn't work well across languages, the sentence-splitter library may be worth a try.
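
A minimal sketch, assuming the sentence-splitter package from PyPI (pip install sentence-splitter); check its documentation for the languages it supports:

from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')
print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))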

Solution 13:

Also, be wary of additional top-level domains that aren't covered in some of the answers above.

For example, .info, .biz, .ru, and .online will trip up some sentence parsers but aren't included above.

Here's some information on the frequency of top-level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

This can be addressed by editing the code above:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"

Solution 14:

Using spaCy v3.5:

import spacy

nlp_sentencizer = spacy.blank("en")
nlp_sentencizer.add_pipe("sentencizer")

text = "How are you today? I hope you have a great day"
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]

Solution 15:

No doubt NLTK is the most suitable for this purpose. But getting started with NLTK is quite painful (though once you've installed it, you reap the rewards).

Here is a simple regex-based example, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html:

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 

Solution 16:

I hope this helps you with Latin, Chinese, and Arabic text.

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|！|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())  # keep the character and its punctuation, then mark the split point
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

Solution 17:

I was working on a similar task and came across this question; after following a few links and doing some NLTK exercises, the code below worked like magic for me.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Solution 18:

I had to read subtitle files and split them into sentences. After pre-processing (such as removing the time information from the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below splits them neatly into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.

# Very approximate way to split the text into sentences - break after ? . and !
fullFile = re.sub("(!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh! Well. I now realize that since my content was in Spanish, I didn't run into the problem of dealing with "Mr. Smith" and the like. Still, if someone wants a quick and dirty parser...

Solution 19:

You can use this function to create a new tokenizer for Russian (and some other languages):

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

Then call it like this:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Solution 20:

Using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is Thired ')
for sentence in doc.sents:
  print(sentence)

But if you want to get a sentence by index, for example:

# doesn't work: doc.sents is a generator, not a sequence
doc.sents[0]

use:

list(doc.sents)[0]

Solution 21:

(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])

We should use this regular expression to avoid situations in which certain abbreviations are treated as the end of a sentence.
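
A quick sketch of how this pattern might be applied with re.split (the sample text is illustrative only):

import re

boundary = re.compile(r'(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])')

text = "He bought cheapsite.com for 1.5 million dollars. Did he mind? He did not."
print(boundary.split(text))
# ['He bought cheapsite.com for 1.5 million dollars.', 'Did he mind?', 'He did not.']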
