How do I scrape only the visible text of a web page with BeautifulSoup?

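Problem description:

Basically, I want to use BeautifulSoup to grab strictly the visible text on a web page. For instance, this web page is my test case. I mainly want to get the body text (the article) and maybe even a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments I need to pass to findAll() in order to get only the visible text on a web page.

So, how should I find all visible text, excluding scripts, comments, CSS and so on?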

Solution 1:

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node, including comments and script/style bodies
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
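
To try the same filtering idea without a network request, here is a minimal sketch that runs on an inline document. The sample markup and the names SAMPLE_HTML, HIDDEN_PARENTS and visible_text are made up for this illustration; only the parent-name and Comment checks come from the answer above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head><title>Page title</title><style>p { color: red; }</style></head>
  <body>
    <!-- a comment -->
    <p>Hello <b>visible</b> world</p>
    <script>var hidden = 1;</script>
  </body>
</html>
"""

# Parents whose text nodes a browser does not render as page content.
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def visible_text(html):
    """Join the text nodes that are neither comments nor inside hidden tags."""
    soup = BeautifulSoup(html, 'html.parser')
    parts = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment) or node.parent.name in HIDDEN_PARENTS:
            continue
        stripped = node.strip()
        if stripped:  # skip whitespace-only nodes between tags
            parts.append(stripped)
    return ' '.join(parts)


print(visible_text(SAMPLE_HTML))
```

Dropping whitespace-only nodes is an extra step that the answer does not take; it just keeps the joined string free of runs of spaces.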

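Solution 2:

The approved answer from @jbochi does not work for me. The str() call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page down to its visible text.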

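from bs4 import BeautifulSoup

with open('21storm.html', 'rb') as f:   # bytes in, so bs4 detects the encoding
    soup = BeautifulSoup(f, 'html.parser')
# remove the elements whose contents are never rendered
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()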

Solution 3:

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
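
The whitespace clean-up at the end of that snippet does not depend on BeautifulSoup, so it can be checked on a plain string. A small sketch follows; the helper name tidy and the sample string are hypothetical.

```python
def tidy(text):
    """Strip each line, split it on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)


# Made-up input: two headlines on one line, blank lines, then padded body text.
raw = "  Top story  Second story \n\n\n   Body text here.  \n"
print(tidy(raw))
```

Splitting on two spaces is a heuristic: it separates items that were laid out side by side, while single spaces inside a sentence are left alone.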

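Solution 4:

I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content of a page.

I had a similar problem getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to deal with, such as the simple example below. Here the non-displayable tag is nested inside a style tag and is not visible in the many browsers I checked. Other variations exist, such as defining a class that sets display to none and then using that class on a div.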

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

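This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up: one in "BeautifulSoup get_text does not strip all tags and JavaScript" and one in "Rendered HTML to plain text using Python".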
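
For comparison, here is a hedged sketch of how the same sample can be handled with bs4 and Python's built-in html.parser. It assumes that html.parser keeps the body of a <style> element as raw text whose parent is the style tag, so a parent-name check can drop it. That is a statement about bs4 with html.parser, not about the BeautifulSoup 3 run quoted above, and the names SIMPLE_HTML and kept are illustrative.

```python
from bs4 import BeautifulSoup

# The sample document from this answer, inlined as a string.
SIMPLE_HTML = """<html>
  <title>  Title here</title>
  <body>
    lots of text here <p> <br>
    <h1> even headings </h1>
    <style type="text/css">
        <div > this will not be visible </div>
    </style>
  </body>
</html>"""

soup = BeautifulSoup(SIMPLE_HTML, 'html.parser')

# Keep non-empty text nodes whose direct parent is not a non-rendered element.
kept = [
    s.strip()
    for s in soup.find_all(string=True)
    if s.parent.name not in ('style', 'script', 'title') and s.strip()
]
print(kept)
```

This only covers text that sits directly inside style, script or title. The other variation mentioned above, a CSS class that sets display to none, needs the stylesheet to be evaluated and is outside what this check can detect.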

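I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

One answer here, from @Helge, was about using nltk, of all things.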

import nltk

%timeit nltk.clean_html(html)
# was returning 153 us per loop

It worked really well to return a string with the rendered HTML. This nltk module was faster than even html2text, though html2text is perhaps more robust.

import html2text

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
# 3.09 ms per loop

Solution 5:

Using BeautifulSoup is the easiest way, with less code, to get just the strings, without empty lines and other junk.

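from bs4 import BeautifulSoup

# placeholder from the original answer: the markup of the parent tag that contains the data
tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print(repr(i))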
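
As a runnable illustration of stripped_strings, here is a short sketch in which a made-up fragment stands in for the placeholder parent tag. The variable names and the markup are assumptions, not part of the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the parent tag's markup.
fragment = (
    "<div>\n"
    "  <h2> Heading </h2>\n"
    "\n"
    "  <p> First <em>emphasised</em> line </p>\n"
    "</div>"
)
soup = BeautifulSoup(fragment, 'html.parser')

# stripped_strings yields each text node with surrounding whitespace removed
# and skips the nodes that are whitespace only.
strings = list(soup.stripped_strings)
for s in strings:
    print(repr(s))
```

Note that the text is returned per node, so a sentence interrupted by an inline tag such as <em> arrives in several pieces.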

Solution 6:

If you care about performance, here is another, more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator that returns NavigableString objects, so you can check the parent's tag name directly without going through multiple loops.

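Solution 7:

While I would completely suggest using beautiful-soup in general, if anyone wants to display the visible parts of malformed HTML for whatever reason (for example, when you have just a segment or a single line of a web page), the following removes everything between < and >, that is, the tags themselves: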

import re   ## only use with malformed html - this is not efficient
def display_visible_html_using_re(text):             
    return(re.sub("(<.*?>)", "",text))
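
A quick sketch of what that substitution does and does not remove may help; the function name strip_tags and the sample strings are made up for illustration.

```python
import re

# Non-greedy match of a single <...> span.
TAG_RE = re.compile(r"<.*?>")


def strip_tags(text):
    """Delete every <...> span; the text between tags is left untouched."""
    return TAG_RE.sub("", text)


plain = strip_tags("<p>Hello <b>world</b></p>")
leftover = strip_tags("<script>var x = 1;</script>done")
print(plain)
print(leftover)
```

The second call shows the limitation: only the tags are removed, so the body of a script or style element stays in the output.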

Solution 8:

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with the id "article".

soup.findAll('nyt_headline', limit=1)

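should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with the id "articleBody". Inside the <nyt_text> element, the text itself is contained in <p> tags. Images are not inside those <p> tags. It is hard for me to experiment with the syntax, but I expect a working scrape to look something like this.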

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
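
Since the answer could not be tested against the live page, here is a sketch against hypothetical markup that mimics the structure described above. ARTICLE_HTML and its contents are invented; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described in this answer.
ARTICLE_HTML = """
<div id="article">
  <h1><nyt_headline>Storm hits coast</nyt_headline></h1>
  <div id="articleBody">
    <nyt_text>
      <p>First paragraph.</p>
      <img src="photo.jpg">
      <p>Second paragraph.</p>
    </nyt_text>
  </div>
</div>
"""

soup = BeautifulSoup(ARTICLE_HTML, 'html.parser')

# find() returns the first match, the same as find_all(..., limit=1)[0].
headline = soup.find('nyt_headline').get_text(strip=True)
body = soup.find('nyt_text')
paragraphs = [p.get_text(strip=True) for p in body.find_all('p')]

print(headline)
print(paragraphs)
```

Selecting the <p> tags explicitly is what keeps the image out of the result.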

Solution 9:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    if re.match(r"[\n]+",str(element)): return False
    return True
def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))

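Solution 10:

Update

From the docs: As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are generally not considered to be "text", since those tags are not part of the human-visible content of the page.

To get all the human-readable text of the HTML <body>, you can use .get_text(). To get rid of redundant whitespace, set the strip parameter and join/separate everything with a single space: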

import bs4, requests

response = requests.get('https://www.nytimes.com/interactive/2022/09/13/us/politics/congress-stock-trading-investigation.html',headers={'User-Agent': 'Mozilla/5.0','cache-control': 'max-age=0'}, cookies={'cookies':''})
soup = bs4.BeautifulSoup(response.text, 'html.parser')

soup.article.get_text(' ', strip=True)

In newer code, avoid the old findAll() syntax and use find_all(), or select() with CSS selectors, instead. For more, take a minute to check the docs.
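
The two calls mentioned in this answer can be tried offline with a sketch like the one below. The PAGE markup and the class name "lead" are made up, and the sample deliberately includes a <script> so that the behaviour quoted from the docs can be observed on a 4.9.0 or later install.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this illustration.
PAGE = """
<html><body>
  <article>
    <h1> Title </h1>
    <p class="lead"> Lead paragraph </p>
    <script>track();</script>
  </article>
</body></html>
"""

soup = BeautifulSoup(PAGE, 'html.parser')

# Separator ' ' plus strip=True: each text node is stripped, then joined by one space.
article_text = soup.article.get_text(' ', strip=True)

# select() takes a CSS selector and returns a list of matching tags.
lead = [p.get_text(strip=True) for p in soup.select('article p.lead')]

print(article_text)
print(lead)
```

The strip parameter trims each text node but does not collapse whitespace inside a node, so runs of spaces in the middle of a sentence survive.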

Solution 11:

The simplest way to handle this case is to use getattr(). You can adapt this example to your needs:

from bs4 import BeautifulSoup

source_html = """
<span class="ratingsDisplay">
    <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
        <span class="ratingsContent">3.7</span>
    </a>
</span>
"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)

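When the element exists, this finds the text "3.7" inside the tag object <span class="ratingsContent">3.7</span>; when it does not exist, the result defaults to None.

getattr(object, name[, default])
Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object's attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided; otherwise AttributeError is raised.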
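
To see the fallback in action, here is a small sketch that runs the same lookup on markup with and without the target element. The helper name rating_of and both snippets are hypothetical, and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def rating_of(html):
    """Return the text of span.ratingsContent, or None when it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    # find() returns None when nothing matches; None has no .text attribute,
    # so getattr falls back to the supplied default.
    return getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)


present = rating_of('<span class="ratingsContent">3.7</span>')
absent = rating_of('<span class="other">n/a</span>')
print(present, absent)
```

Without the default argument, the second call would raise AttributeError instead of returning None.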