Extracting text from HTML file using Python

2024-12-16 08:35:00
admin

Problem description:

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
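On the entity point raised in the question, Python 3's standard library can decode character references on its own. The following is only a minimal sketch, assuming Python 3.4+ and a made-up `sample` string (not taken from the question):

```python
import html

# Hypothetical input: a numeric reference, a named entity, and a non-breaking space.
sample = "It&#39;s Tom &amp; Jerry&nbsp;time"

decoded = html.unescape(sample)
# &nbsp; decodes to U+00A0; replace it with a plain space, as a notepad paste would show.
decoded = decoded.replace("\u00a0", " ")
print(decoded)
```

This only decodes entities; it does not remove tags or script content, so it would be one step in a larger pipeline.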


Related questions:

  • Filter out HTML tags and resolve entities in python

  • Convert XML/HTML Entities into Unicode String in Python


Solution 1:

The best piece of code I found for extracting text without picking up JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4
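The whitespace clean-up at the end of that snippet does not depend on BeautifulSoup, so it can be tried in isolation. A small sketch, assuming a made-up `raw` string standing in for the output of `soup.get_text()` (the function name `tidy_whitespace` is hypothetical):

```python
def tidy_whitespace(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample resembling raw get_text() output.
raw = "  Headline one  Headline two \n\n\n   body text\n"
print(tidy_whitespace(raw))
```

Splitting on two spaces is a heuristic: it separates items that were laid out side by side, while leaving ordinary single-spaced sentences intact.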

Solution 2:

html2text is a Python program that does a pretty good job at this.

Solution 3:

NOTE: NLTK no longer supports the clean_html function.

The original answer is below, and an alternative is in the comments section.


Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily, I came across NLTK.

It works magically.

import nltk
from urllib.request import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# Works only on old NLTK releases; see the note above about clean_html.
raw = nltk.clean_html(html)
print(raw)

Solution 4:

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
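For readers on Python 3, the same idea can be sketched with `html.parser`, which decodes entities automatically when `convert_charrefs` is left at its default of True. This is an illustrative variant, not the answer author's code; the class name `TextExtractor`, the helper `extract_text`, and the choice of tags that trigger a line break are all assumptions:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}
    BREAK = {"p", "br", "div"}  # assumed set of tags that start a new line

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(extract_text("<p>It&#39;s <b>here</b></p><script>var x = 1;</script><p>Tom &amp; Jerry</p>"))
```

Like the original, this keeps only a minimum of formatting and makes no attempt to handle tables, lists, or hidden elements.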

Solution 5:

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, "html.parser").find_all(string=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

Solution 6:

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
from html.parser import HTMLParser
from html.entities import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so that handle_entityref/handle_charref are called
        HTMLParser.__init__(self, convert_charrefs=False)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = chr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(chr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, and skips script and style content.
    """
    parser = _HTMLToText()
    parser.feed(html)
    parser.close()
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
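The docstring of text_to_html mentions converting newlines to <br> tags. One possible way to cover the escaping and newline parts with the standard library is sketched below. It is not the answer author's method: it relies on `html.escape`, skips the URL-linking step entirely, and the function name `text_to_html_simple` is made up for illustration:

```python
import html


def text_to_html_simple(text):
    """Escape HTML-special characters and turn newlines into <br> tags."""
    escaped = html.escape(text, quote=True)  # escapes & < > " '
    return escaped.replace("\n", "<br>\n")


print(text_to_html_simple("a < b\nc & d"))
```

Escaping happens before the <br> tags are inserted, so the inserted tags themselves are not escaped.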

Solution 7:

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

Solution 8:

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram run sudo easy_install stripogram

Solution 9:

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

Solution 10:

If you need more speed and less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

Solution 11:

PyParsing does a great job. The PyParsing wiki was killed, so here is another location where there are examples of the use of PyParsing (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entity issues: you can convert them before you run BeautifulSoup.

Good luck

Solution 12:

If you want to automatically extract text passages from a webpage, there are some Python packages available, such as Trafilatura. As part of its benchmarking, several Python packages have been compared:

https://github.com/adbar/trafilatura#evaluation-and-alternatives

Solution 13:

This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

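import subprocess
import tempfile

# html_source is the HTML string to convert
with tempfile.NamedTemporaryFile('w', suffix='.html') as f:
    f.write(html_source)
    f.flush()
    proc = subprocess.run(['links', '-dump', f.name],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL)
text = proc.stdout.decode()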

Solution 14:

Instead of the HTMLParser module, check out htmllib (Python 2 only; it was removed in Python 3). It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

Solution 15:

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

  • Main text of an article
  • Main image of article
  • Any Youtube/Vimeo movies embedded in article
  • Meta Description
  • Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

Solution 16:

Has anyone tried bleach.clean(html,tags=[],strip=True) with bleach? It's working for me.

Solution 17:

Install html2text using:

pip install html2text

Then:

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

Solution 18:

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

Solution 19:

I had a similar question and actually used one of the answers with BeautifulSoup.
The problem was that it was really slow. I ended up using a library called selectolax.
It's pretty limited, but it works for this task.
The only issue was that I had to manually remove unnecessary whitespace.
But it seems to work much faster than the BeautifulSoup solution.

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split()) # collapse every run of whitespace into a single space
    return text

Solution 20:

Beautiful soup does convert html entities. It's probably your best bet considering HTML is often buggy and filled with unicode and html encoding issues. This is the code I use to convert html to raw text:

import copy
import re

import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile(r'<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

Solution 21:

Another non-Python solution is LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats.

Solution 22:

Another option is to run the html through a text based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
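If the dump should be captured as a string rather than written to a file, a wrapper along these lines might work. It is a sketch under assumptions: the function name `dump_with_browser` is hypothetical, the browser is assumed to accept `-dump <file>` as Lynx does, and the output is assumed to be UTF-8:

```python
import shutil
import subprocess


def dump_with_browser(path, browser="lynx"):
    """Return the text dump of an HTML file, or None if the browser is missing."""
    if shutil.which(browser) is None:
        return None
    result = subprocess.run(
        [browser, "-dump", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    # Assumes UTF-8 output; undecodable bytes are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")


text = dump_with_browser("html_to_convert.html")
print("browser not installed" if text is None else text)
```

Checking with `shutil.which` first avoids a FileNotFoundError on machines where the text browser is not installed.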

Solution 23:

@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract but it still didn't work. So I created my own which also formats the text using the <p> tags and replaces <a> tags with the href link. Also copes with links inside text. Available at this gist with a test doc embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

Solution 24:

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.

Solution 25:

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

Hello world
I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

Solution 26:

A simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.
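It helps to see what this approach leaves behind. The sketch below applies the same pattern to a made-up sample (the `sample` string and the function name `strip_tags_naive` are illustrative assumptions):

```python
import re


def strip_tags_naive(markup):
    """Remove anything that looks like a tag; nothing else is interpreted."""
    return re.sub(r"<(.*?)>", "", markup)


# Hypothetical sample with a script block and an HTML entity.
sample = "<script>alert(1)</script><p>Tom &amp; Jerry</p>"
print(strip_tags_naive(sample))
```

The script body survives and the entity stays encoded, which are exactly the two problems raised in the question, so this is best kept for trusted, very simple markup.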

Solution 27:

In Python 3.x you can do it very easily by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plaintext format :) If it is good enough for you, it would be nice to select it as the accepted answer.
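The same selection of the text/plain part can be tried without an IMAP server. The sketch below builds a made-up two-part message locally and parses it back; the helper name `plain_text_body` and the message contents are assumptions, and it uses the newer `EmailMessage` API rather than the `walk()` loop above:

```python
import email
from email import policy
from email.message import EmailMessage


def plain_text_body(msg):
    """Return the text/plain body of a message, or None if there is none."""
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else None


# Build a hypothetical two-part message instead of fetching one over IMAP.
original = EmailMessage()
original["Subject"] = "demo"
original.set_content("Hello in plain text")
original.add_alternative("<p>Hello in <b>HTML</b></p>", subtype="html")

# policy.default makes the parser return EmailMessage objects with get_body().
parsed = email.message_from_bytes(original.as_bytes(), policy=policy.default)
print(plain_text_body(parsed))
```

Note that this only helps when the sender included a plain-text alternative; an HTML-only message still needs one of the HTML-to-text approaches from the other answers.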

Solution 28:

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

Solution 29:

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

Solution 30:

Another example using BeautifulSoup4 in Python 2.7.9+

Imports:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.

Notes:
