Extracting data from an HTML table

Question:

I am looking for a way to get certain information out of HTML in a Linux shell environment.

This is the part I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I would like to store this in shell variables, or echo it as key-value pairs extracted from the HTML above. For example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

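What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overhead, since the runnable jar would have to be bundled with the "wrapper" script that is to be executed.

I'm sure there must be "shell" languages that can do the same thing, e.g. Perl, Python, bash, etc.

My problem is that I have no experience with any of these. Could somebody help me solve this "fairly easy" problem?

Quick update:

I forgot to mention that the .html document contains more tables and more rows. Sorry about that (early morning).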

Update #2:

Since I don't have root access, I tried to install BeautifulSoup like this:

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py   # pasted the code from Tichodromas' answer; for reference, this (http://pastebin.com/4Je11Y9q) is what I pasted
$ python htmlParse.py   # run the file

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

Any ideas?

Solution 1:

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit 3: using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

Edit 2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
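The question also asks about storing the values in shell variables. The following is only a minimal sketch of one way to do that, written for Python 3 and the standard library: it assumes you already have (heading, value) pairs like the ones in `datasets` above, and the helper name `to_shell_assignments` is made up for this illustration. It does not handle headings that start with a digit, which would not be valid shell variable names.

```python
import re
import shlex

def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into shell-safe NAME=value lines."""
    lines = []
    for heading, value in pairs:
        # "Success Rate" -> "SUCCESS_RATE": runs of non-word characters become "_".
        name = re.sub(r"\W+", "_", heading.strip()).upper()
        # shlex.quote adds quotes only when the value needs them (e.g. spaces).
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines

# Sample pairs copied from the table in the question.
pairs = [("Tests", "103"), ("Failures", "24"), ("Average Time", "71 ms")]
print("\n".join(to_shell_assignments(pairs)))
```

If this were saved in a script (say `extract.py`, a hypothetical name), a wrapper script could probably load the variables with `eval "$(python3 extract.py)"` and then refer to `$TESTS`, `$FAILURES` and so on.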

Solution 2:

Using pandas.read_html:

import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
                   0
Tests            103
Failures          24
Success Rate  76.70%
Average Time   71 ms

Solution 3:

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace in the cells:

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)

Solution 4:

Assuming your HTML is stored in the file mycode.html, here is a bash approach:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

Note: the output is not perfectly aligned.
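If the alignment matters, the `key:value` lines produced by the pipeline above could be post-processed. This is just a sketch in Python 3 under the assumption that each line contains the heading, a colon, and the value, possibly surrounded by the indentation carried over from the HTML; `align_pairs` is a name invented for this example.

```python
def align_pairs(lines, sep=":"):
    """Reformat 'key:value' lines so that the separators line up."""
    pairs = []
    for line in lines:
        if sep not in line:
            continue  # skip anything that is not a key/value line
        # Split on the first separator only, then drop surrounding whitespace.
        key, value = line.split(sep, 1)
        pairs.append((key.strip(), value.strip()))
    # Pad every key to the length of the longest one.
    width = max((len(key) for key, _ in pairs), default=0)
    return ["{0:<{1}} : {2}".format(key, width, value) for key, value in pairs]

# Sample lines resembling the output of the paste/grep/sed pipeline.
raw = ["    Tests:    103", "    Failures:    24", "    Success Rate:    76.70%"]
for line in align_pairs(raw):
    print(line)
```

The same function should also work on piped input, for example by passing `sys.stdin` as `lines`, because trailing newlines are stripped along with the other whitespace.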

Solution 5:

Here is a Python regex-based solution that I have tested on Python 2.7. It doesn't rely on the xml module, so it works even when the markup is not well-formed XML.

import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
  tables=[]
  maxlen=0
  rex1=r'<table.*?/table>'
  rex2=r'<tr.*?/tr>'
  rex3=r'<(td|th).*?/(td|th)>'
  s = re.search(rex1,html,re.DOTALL)
  while s:
    t = s.group()  # the table
    s2 = re.search(rex2,t,re.DOTALL)
    table = []
    while s2:
      r = s2.group() # the row 
      s3 = re.search(rex3,r,re.DOTALL)
      row=[]
      while s3:
        d = s3.group() # the cell
        #row.append(strip_tags(d).strip() )
        row.append(d.strip() )

        r = re.sub(rex3,'',r,1,re.DOTALL)
        s3 = re.search(rex3,r,re.DOTALL)

      table.append( row )
      if maxlen<len(row):
        maxlen = len(row)

      t = re.sub(rex2,'',t,1,re.DOTALL)
      s2 = re.search(rex2,t,re.DOTALL)

    html = re.sub(rex1,'',html,1,re.DOTALL)
    tables.append(table)
    s = re.search(rex1,html,re.DOTALL)
  return tables, maxlen

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""
print extract_html_tables(html)
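Note that the cells returned above still contain their `<td>`/`<th>` markup, because the `strip_tags` call is commented out and no such function is defined in the answer. One possible version is sketched below. It is written for Python 3 (unlike the code above) and is an assumption about what the helper was meant to do, not the answer author's implementation.

```python
import re
from html import unescape

def strip_tags(fragment):
    """Remove anything that looks like a tag, then decode entities such as &amp;."""
    text = re.sub(r"<[^>]*>", "", fragment)
    return unescape(text).strip()

print(strip_tags("<td>829 ms</td>"))
print(strip_tags('<th class="x">A &amp; B</th>'))
```

Like the rest of this regex approach, it is easy to fool: a `>` inside an attribute value, for example, would cut the tag short.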

Solution 6:

undef $/;
$text = <DATA>;

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
}
for $i (0..$#th) {
    printf "%-16s    : %s\n", $th[$i], $td[$i];
}

__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

The output looks like this:

Tests               : 103
Failures            : 24
Success Rate        : 76.70%
Average Time        : 71 ms
Min Time            : 0 ms
Max Time            : 829 ms

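Solution 7:

A Python solution that uses only the standard library (it takes advantage of the fact that this HTML happens to be well-formed XML). It can handle multiple rows of data.

(Tested with Python 2.6 and 2.7. The question was updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)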

from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
  <tr valign="top" class="whatever">
    <td>A</td>
    <td>B</td>
    <td>C</td>
    <td>D</td>
    <td>E</td>
    <td>F</td>
  </tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F
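The ElementTree approach breaks as soon as the page is not well-formed XML (an unclosed `<br>`, for instance), and the question mentions that the real document contains several tables. As a rough alternative that still needs nothing outside the standard library, here is a sketch based on `html.parser.HTMLParser` in Python 3. The class name `TableExtractor` and the sample document are invented for this illustration, and the sketch assumes that tables are not nested and that every `<tr>`, `<th>` and `<td>` has a closing tag.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the rows of every <table> that carries the wanted class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.tables = []       # one list of rows per matching table
        self._in_table = False
        self._row = None       # cells of the row being read, if any
        self._cell = None      # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self._in_table = self.wanted_class in classes
            if self._in_table:
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif self._in_table and tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif self._in_table and tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside an open cell of a matching table is kept.
        if self._cell is not None:
            self._cell.append(data)

# Made-up sample: one table to skip, one to extract, with an unclosed <br>.
html_doc = """
<table class="summary"><tr><td>ignored</td></tr></table>
<table class="details">
  <tr><th>Tests</th><th>Failures</th></tr>
  <tr><td>103</td><td>24</td></tr>
  <tr><td>A</td><td>B<br></td></tr>
</table>"""

parser = TableExtractor("details")
parser.feed(html_doc)
for table in parser.tables:
    headings, rows = table[0], table[1:]
    for row in rows:
        for heading, value in zip(headings, row):
            print("{0:<16}: {1}".format(heading, value))
```

Unlike the ElementTree version, this prints one heading/value block per data row rather than joining the rows with commas; which layout is more useful depends on how the wrapper script consumes the output.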