Can awk patterns match multiple lines?

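Question:

I have some complex log files and need to write some tools to process them. I have been playing with awk, but I am not sure awk is the right tool for this.

My log files are printouts of OSPF protocol decodes. They contain a text log of the various protocol pkts and their contents, with each protocol field identified along with its value. I want to process these files and print only the log lines that pertain to specific pkts. Each pkt log entry can span a varying number of lines.

awk seems able to process a single line that matches a pattern. I can locate the desired pkt, but then I need to match patterns in the lines that follow to determine whether it is a pkt I want to print.

Another way to look at it: I want to isolate several lines in the log file and print the details of a particular pkt based on pattern matches across those lines.

Since awk seems to be line-based, I am not sure it is the best tool to use.

If awk can do this, how is it done? If not, which tool would you suggest for this?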

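Solution 1:

Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.

Consider this input: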

how
second half #1
now
first half
second half #2
brown
second half #3
cow

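As you can see, recognizing a single pattern is easy. Now we can write an awk program that recognizes a "second half" line only when it is directly preceded by a "first half" line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)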

/second half/ {
  if (lastLine == "first half") {
    print
  }
}

{ lastLine = $0 }

If you run this, you will see:

second half #2

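Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement, and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and move from state to state depending on both the current state and the current input. But you may not need that much control mechanism.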
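
For comparison, here is a rough sketch of what the more canonical version with an explicit state variable could look like. It is not part of the original answer: the three-step sequence ("now", then "first half", then a "second half" line) and the numeric state values are assumptions chosen only to fit the sample input above.

```shell
# Sample input from the answer above.
input='how
second half #1
now
first half
second half #2
brown
second half #3
cow'

# state 0: waiting for "now"
# state 1: saw "now", waiting for "first half"
# state 2: saw "first half", waiting for a "second half" line
result=$(printf '%s\n' "$input" | awk '
BEGIN                            { state = 0 }
state == 2 && /^second half/     { print; state = 0; next }
state == 1 && $0 == "first half" { state = 2; next }
$0 == "now"                      { state = 1; next }
                                 { state = 0 }   # anything else resets
')
printf '%s\n' "$result"
```

With the sample input this prints only "second half #2", the one line that completes the full sequence. Each rule ends with next so that at most one transition fires per input line.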

Solution 2:

awk can process everything from a start pattern through an end pattern:

/start-pattern/,/end-pattern/ {
  print
}

I was looking for a way to match

 * Implements hook_entity_info_alter().
 */
function file_test_entity_type_alter(&$entity_types) {

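so I wrote

/\* Implements hook_/,/function / {
  print
}

which gave me the content I needed. A more complex example skips lines and scrubs off the non-space parts. Note that awk is a record (line) and word (split on whitespace) tool.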

# start,end pattern match using comma
/ \* Implements hook_.*\./,/function [^ ]/ {
  # skip PHP multi line comment end
  if ($0 ~ /^ \*\//) next

  # Only print 3rd word
  if ($0 ~ /Implements/) {
    hook=$3
    # scrub off the opening parenthesis and everything after it.
    sub(/\(.*$/, "", hook)
    print hook
  }

  # Only print function name without parenthesis
  if ($0 ~ /function/) {
    name=$2

    # scrub off the opening parenthesis and everything after it.
    sub(/\(.*$/, "", name)

    print name
    print ""
  }
}

Hope this helps too.

See also the GAWK documentation on range patterns for more information.
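
A range pattern always includes both boundary lines. When only the lines between the two markers are wanted, a flag variable is a common alternative. The following is a minimal sketch of that idea rather than part of the original answer; the variable name inside and the sample input are made up for illustration.

```shell
# Hypothetical sample input with one start/end pair.
input='before
start-pattern
alpha
beta
end-pattern
after'

# Rule order matters: the flag is cleared before the print rule sees the
# end marker, and set only after the print rule has seen the start marker,
# so neither marker line is printed.
result=$(printf '%s\n' "$input" | awk '
/end-pattern/   { inside = 0 }
inside          { print }
/start-pattern/ { inside = 1 }
')
printf '%s\n' "$result"
```

This prints alpha and beta only. Moving the print rule to the top or bottom of the program would include one of the marker lines.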

Solution 3:

Awk is really record-based. By default it treats each line as a record, but you can change this with the RS (record separator) variable.
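
As a small illustration of RS, setting it to the empty string puts awk into paragraph mode, where blank lines separate records. The sketch below assumes the records are already separated by blank lines, and the sample data is made up for the example. Setting FS to a newline makes each line of a record its own field.

```shell
# Hypothetical data: two records separated by a blank line.
data='animal 0
name: joe
type: dog

animal 1
name: bill
type: cat'

# RS="" selects paragraph mode; with FS="\n", $1 is the first line of the record.
result=$(printf '%s\n' "$data" | awk '
BEGIN { RS = ""; FS = "\n" }
/type: cat/ { print $1 }
')
printf '%s\n' "$result"
```

This prints "animal 1", the first line of the only record that contains "type: cat".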

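One way to approach this is to make a first pass with sed (you could do this with awk too, if you prefer) that separates the records with a different character, such as a form feed. Then you can write an awk script that treats each group of lines as a single record.

For example, if this is your data: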

animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat

To separate the records with form feeds (\f):

$ cat data | sed -E $'s|^(animal.*)|\f\\1|'

Now we pipe that into awk. Here is an example that conditionally prints a record:

$ cat data | sed -E $'s|^(animal.*)|\f\\1|' | awk '
      BEGIN { RS="\f" }
      /type: cat/ { print }'

Output:

animal 1
name: bill
type: cat

animal 2
name: ed
type: cat

EDIT: as a bonus, here is how to do it with awk-ward ruby (-014 means use a form feed, octal code 014, as the record separator):

$ cat data | sed -E $'s|^(animal.*)|\f\\1|' |
      ruby -014 -ne 'print if /type: cat/'
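
The sed pass is not strictly required. As a sketch of the awk-only route mentioned above, the program below buffers lines until the next "animal" header and prints the buffered record if it matches. The names rec and emit are invented for this example.

```shell
data='animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat'

result=$(printf '%s\n' "$data" | awk '
# Print the buffered record if it mentions a cat, then clear the buffer.
function emit() {
    if (rec != "" && rec ~ /type: cat/) print rec
    rec = ""
}
/^animal/ { emit() }                               # a header starts a new record
          { rec = (rec == "" ? $0 : rec "\n" $0) } # append the current line
END       { emit() }                               # handle the final record
')
printf '%s\n' "$result"
```

This prints the two cat records back to back, without the blank line that the RS-based version adds after each record.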

Solution 4:

I do this sort of thing with sendmail logs from time to time.

Given:

Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

I use a script something like this:

#!/usr/bin/awk -f

BEGIN {
  search=ARGV[1];  # Grab the first command line option
  delete ARGV[1];  # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
  line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
  show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
  for(qid in show) {
    print line[qid];
  }
}

This gives the following output:

$ mqsearch airtel /var/log/maillog

Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

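The idea here is that I print every line whose Sendmail queue ID matches that of a line containing the search string. The structure of the code is of course a product of the structure of the log file, so you will need to customize the solution for the data you are trying to analyse and extract.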
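
The script's own comments note that storing every line only works for smallish log segments. One possible way around that, sketched here as an assumption rather than as the author's method, is to read the file twice: the first pass records only the matching queue IDs, and the second pass prints the lines that carry them. The temporary file and the search string "lost input" are chosen for the demo; the log lines are taken from the sample above.

```shell
# Write three of the sample log lines to a temporary file for the demo.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
EOF

# The file is named twice. NR == FNR is true only while reading the first copy.
result=$(awk -v search="lost input" '
NR == FNR { if (index($0, search)) show[$6]; next }  # pass 1: remember queue IDs
$6 in show                                           # pass 2: print matching lines
' "$log" "$log")
printf '%s\n' "$result"
rm -f "$log"
```

Only the set of queue IDs is kept in memory, and the output stays in log order. The trade-off is that the input must be a regular file that can be read twice, so this does not work on a pipe.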

Solution 5:

`pcregrep -M` works pretty well for this.

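From pcregrep(1):

-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence, the output ends at the end of that line.
When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered).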

Solution 6:

awk '/pattern-start/,/pattern-end/'
