在 Bash 中从另一个较大的文件中查找文件行的最快方法-IT科技

摘要：问题描述：我有两个文件，file1.txt.大约有 14K 行，大约有 20 亿个字符。每行有一个字段file2.txt，而有 3 个字段，通过，以分隔。file1.txt`file2.txtfile1.txtf1file2.txtf1f3|`我想找到与匹配的所有行（file2.txt或者如果我们不想...

问题描述：

我有两个文件，file1.txt.大约有 14K 行，大约有 20 亿个字符。每行有一个字段file2.txt，而有 3 个字段，通过，以分隔。file1.txt`file2.txtfile1.txtf1file2.txtf1f3|`

我想找到与匹配的所有行（file2.txt或者如果我们不想花费额外的时间来分割的值，则可以找到行上的任何位置）。f1`file1.txtf2file2.txt`file2.txt

file1.txt (约14K行，未排序):

foo1
foo2
...
bar1
bar2
...

file2.txt (约20亿行，未排序):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

预期输出：

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

这是我尝试过的并且似乎要花几个小时才能运行：

fgrep -F -f file1.txt file2.txt > file.matched

我想知道是否有更好更快的方法使用常见的 Unix 命令或小脚本来执行此操作。

解决方案 1：

Perl 解决方案。[请参阅下面的注释。]

对第一个文件使用哈希。逐行读取大文件时，通过正则表达式提取字段（捕获之间的第一个模式||）或split（获取第二个单词）并打印是否exists。它们的速度可能略有不同（计时）。defined正则表达式中不需要检查，而split使用//短路的（定义或）。

use warnings;
use strict;

# If &#039;prog smallfile bigfile&#039; is the preferred use
die &quot;Usage: $0 smallfile bigfile
&quot;  if @ARGV != 2;
my ($smallfile, $bigfile) = @ARGV;

open my $fh, &#039;&lt;&#039;, $smallfile or die &quot;Can&#039;t open $smallfile: $!&quot;;    
my %word = map { chomp; $_ => 1 } &lt;$fh>;

open    $fh, &#039;&lt;&#039;, $bigfile or die &quot;Can&#039;t open $bigfile: $!&quot;;       
while (&lt;$fh>) 
{
    exists $word{ (/|([^|]+)/)[0] } &amp;&amp; print;  

    # Or
    #exists $word{ (split /|/)[1] // &#039;&#039; } &amp;&amp; print;
}
close $fh;

避免if分支并使用短路会更快，但速度非常慢。在数十亿行上，这些调整会累积起来，但同样不会太多。逐行读取小文件可能会（也可能不会）比上面的列表上下文更快一点，但这不应该被注意到。

更新写入可STDOUT节省两次操作，我反复计时，发现它比写入文件要快一点。这种用法也与大多数 UNIX 工具一致，因此我将其改为写入STDOUT。接下来，exists不需要测试，删除它可以节省一次操作。但是，我始终能获得更好的运行时间，同时它也能更好地传达目的。总之，我将其保留。感谢ikegami的评论。

注意：根据下面的基准测试，注释掉的版本比另一个版本快 50% 左右。之所以给出这两个版本，是因为它们不同，一个查找第一个匹配项，另一个查找第二个字段。我将其保留为更通用的选择，因为这个问题对此不明确。

一些比较（基准）[已更新STDOUT，请参阅上面的“更新”]

HåkonHægland在答案中进行了广泛的分析，对大多数解决方案的一次运行进行了计时。这是另一种看法，对上述两个解决方案、OP 自己的答案以及发布的fgrep答案进行了基准测试，预计该答案会很快并在问题和许多答案中使用。

我以以下方式构建测试数据。对于两个文件，用随机单词组成几行，长度大致如图所示，以便在第二个字段中匹配。然后，我用不匹配的行填充数据样本的这个“种子”，以模拟 OP 引用的大小和匹配之间的比率：小文件中的14K行在大文件中有1.3M行，产生126K 个匹配。然后重复写入这些样本以构建完整的数据文件，就像 OP 一样，每次都使用List::Util :: -ed。shuffle

下面比较的所有运行都106_120对上述文件大小进行了匹配（diff-ed 进行检查），因此匹配频率足够接近。它们是通过使用调用完整程序来对它们进行基准测试的。在 v5.16 上my $res = timethese(60 ...)的结果是cmpthese($res)

        评价正则表达式 cfor split fgrep
正则表达式 1.05/秒 -- -23% -35% -44%
c为1.36/秒 30% -- -16% -28%
分割 1.62/秒 54% 19% -- -14%
fgrep 1.89/秒 80% 39% 17% --

优化后的 C 程序fgrep名列前茅这一事实并不令人惊讶。“ regex ”落后于“ split ”可能是由于多次启动引擎进行小匹配的开销。考虑到不断发展的正则表达式引擎优化，这可能会因 Perl 版本而异。我包括了 @codeforester（“ cfor ”）的答案，因为它据称是最快的，它24%落后于非常相似的“ split ”可能是由于零星的小效率低下（请参阅此答案下方的评论）。†

虽然硬件和软件以及数据细节存在一定差异，但这并不是天壤之别。我在不同的 Perl 和机器上运行了它，显著的区别是在某些情况下fgrep确实快了一个数量级。

楼主的体验非常慢，fgrep这让人很惊讶。考虑到他们引用的运行时间，比上面的慢了几个数量级，我猜是旧系统“惹的祸”。

尽管这完全基于 I/O，但将其放在多个核心上仍然具有并发优势，我期望获得良好的加速，最高可达几倍。

†唉，评论被删了（？）。简而言之：不必要地使用标量（成本）、分支if、defined代替printf（print慢！）。这些对 20 亿行的运行时间很重要。

解决方案 2：

我尝试对这里介绍的一些方法进行比较。

首先我创建了一个 Perl 脚本来生成输入文件file1.txt和file2.txt。为了比较一些解决方案，我确保中的单词file1.txt只能出现在中的第二个字段中file2.txt。另外，为了能够使用join@GeorgeVasiliou 提供的解决方案，我对file1.txt和进行了排序file2.txt。目前我仅基于 75 个随机单词（取自https://www.randomlists.com/random-words）生成输入文件。在剩下的 70 个单词中，只有 5 个用于file1.txt填充中的字段file2.txt。可能需要大幅增加单词数量才能获得真实结果（根据 OP，原文file1.txt包含 14000 个单词）。在下面的测试中，我使用了有 1000000（1 百万）行的。该脚本还生成@BOC 的 grep 解决方案所需的file2.txt文件。regexp1.txt

gen_input_files.pl：

#! /usr/bin/env perl
use feature qw(say);
use strict;
use warnings;

use Data::Printer;
use Getopt::Long;

GetOptions (&quot;num_lines=i&quot; => my $nlines )
  or die(&quot;Error in command line arguments
&quot;);

# Generated random words from site: https://www.randomlists.com/random-words
my $word_filename        = &#039;words.txt&#039;; # 75 random words
my $num_match_words      = 5;
my $num_file2_lines      = $nlines || 1_000_000;
my $file2_words_per_line = 3;
my $file2_match_field_no = 2;
my $file1_filename       = &#039;file1.txt&#039;;
my $file2_filename       = &#039;file2.txt&#039;;
my $file1_regex_fn       = &#039;regexp1.txt&#039;;

say &quot;generating $num_file2_lines lines..&quot;;
my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );

write_file1( $file1_filename, $words2 );
write_file2(
    $file2_filename, $words1, $words2, $num_file2_lines,
    $file2_words_per_line, $file2_match_field_no
);
write_BOC_regexp_file( $file1_regex_fn, $words2 );


sub write_BOC_regexp_file {
    my ( $fn, $words ) = @_;

    open( my $fh, &#039;>&#039;, $fn ) or die &quot;Could not open file &#039;$fn&#039;: $!&quot;;
    print $fh &#039;\|&#039; . (join &quot;|&quot;, @$words) . &#039;\|&#039;;
    close $fh;
}

sub write_file2 {
    my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;

    my $nwords1 = scalar @$words1;
    my $nwords2 = scalar @$words2;
    my @lines;
    for (1..$nlines) {
        my @words_line;
        my $key;
        for (1..$words_per_line) {
            my $word;
            if ( $_ != $field_no ) {
                my $index = int (rand $nwords1);
                $word = @{ $words1 }[$index];
            }
            else {
                my $index = int (rand($nwords1 + $nwords2) );
                if ( $index &lt; $nwords2 ) {
                    $word = @{ $words2 }[$index];
                }
                else {
                    $word =  @{ $words1 }[$index - $nwords2];
                }
                $key = $word;
            }
            push @words_line, $word;
        }
        push @lines, [$key, (join &quot;|&quot;, @words_line)];
    }
    @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines; 
    open( my $fh, &#039;>&#039;, $fn ) or die &quot;Could not open file &#039;$fn&#039;: $!&quot;;
    print $fh (join &quot;
&quot;, @lines);
    close $fh;
}

sub write_file1 {
    my ( $fn, $words ) = @_;

    open( my $fh, &#039;>&#039;, $fn ) or die &quot;Could not open file &#039;$fn&#039;: $!&quot;;
    print $fh (join &quot;
&quot;, sort @$words);
    close $fh;
}

sub get_words {
    my ( $fn, $N ) = @_;

    open( my $fh, &#039;&lt;&#039;, $fn ) or die &quot;Could not open file &#039;$fn&#039;: $!&quot;;
    my @words = map {chomp $_; $_} &lt;$fh>;
    close $fh;

    my @words1 = @words[$N..$#words];
    my @words2 = @words[0..($N - 1)];
    return ( @words1, @words2 );
}

接下来，我创建了一个solutions包含所有测试用例的子文件夹：

$ tree solutions/
solutions/
├── BOC1
│   ├── out.txt
│   └── run.sh
├── BOC2
│   ├── out.txt
│   └── run.sh
├── codeforester
│   ├── out.txt
│   ├── run.pl
│   └── run.sh
[...]

这里的文件out.txt是每个解决方案的 greps 的输出。脚本run.sh运行给定测试用例的解决方案。

关于不同解决方案的说明

BOC1：@BOC 提出的第一个解决方案

grep -E -f regexp1.txt file2.txt

BOC2：@BOC 建议的第二种解决方案：

LC_ALL=C grep -E -f regexp1.txt file2.txt

codeforester：@codeforester 接受的 Perl 解决方案（参见源代码）
codeforester_orig：@codeforested 提出的原始解决方案：

fgrep -f file1.txt file2.txt

dawg：@dawg 提出的使用字典和分割线的 Python 解决方案（参见源代码）
gregory1：@gregory 建议使用 Gnu Parallel 的解决方案

parallel -k --pipepart -a file2.txt --block &quot;$block_size&quot; fgrep -F -f file1.txt

关于如何选择请参阅下面的注释$block_size。

hakon1：@HåkonHægland 提供的 Perl 解决方案（参见源代码）。此解决方案要求在第一次运行代码时编译 c 扩展。当file1.txt或file2.txt更改时，它不需要重新编译。注意：首次运行时编译 c 扩展所用的时间不包括在下面显示的运行时间中。
ikegami：使用组装的正则表达式并使用grep -P@ikegami 给出的解决方案。注意：组装的正则表达式被写入单独的文件regexp_ikegami.txt，因此生成正则表达式的运行时不包括在下面的比较中。这是使用的代码：

regexp=$(&lt; &quot;regexp_ikegami.txt&quot;)
grep -P &quot;$regexp&quot; file2.txt

inian1：第一个解决方案来自@Inian 使用match()

awk &#039;FNR==NR{
    hash[$1]; next
}
{
   for (i in hash) if (match($0,i)) {print; break}
}&#039; file1.txt FS=&#039;|&#039; file2.txt

inian2：@Inian 的第二个解决方案使用index()

awk &#039;FNR==NR{
    hash[$1]; next
}
{
   for (i in hash) if (index($0,i)) {print; break}
}&#039; file1.txt FS=&#039;|&#039; file2.txt

inian3：@Inian 的第三种解决方案仅检查$2字段：

awk &#039;FNR==NR{
    hash[$1]; next
}
$2 in hash&#039; file1.txt FS=&#039;|&#039; file2.txt

inian4：@Inian 的第四个想法（基本上与相同codeforester_orig）LC_ALL：

LC_ALL=C fgrep -f file1.txt file2.txt

inian5：@Inian 提出的第 5 个解决方案（与相同，inian1但不同LC_ALL）：

LC_ALL=C awk &#039;FNR==NR{
    hash[$1]; next
}
{
   for (i in hash) if (match($0,i)) {print; break}
}&#039; file1.txt FS=&#039;|&#039; file2.txt

inian6：与相同，inian3但带有LC_ALL=C。感谢@GeorgeVasiliou 的建议。
jjoao：按照@JJoao 的建议编译 flex 生成的 C 代码（参见源代码）。注意：每次更改时都必须重新编译可执行文件file1.txt。编译可执行文件所用的时间不包括在下面显示的运行时间中。
oliv：@oliv 提供的 Python 脚本（参见源代码）
Vasilioujoin：按照@GeorgeVasiliou的建议使用：

join --nocheck-order -11 -22 -t&#039;|&#039; -o 2.1 2.2 2.3 file1.txt file2.txt

Vasiliou2：与相同，Vasiliou但带有LC_ALL=C。
zdim：使用@zdim 提供的 Perl 脚本（参见源代码）。注意：这使用正则表达式搜索版本（而不是分割线解决方案）。
zdim2：与之相同，只是zdim它使用split函数而不是正则表达式来搜索中的字段file2.txt。

笔记

我用 Gnu parallel 做了一些实验（参见gregory1上面的解决方案）来确定我的 CPU 的最佳块大小。我有 4 个核心，目前看来最佳选择是将文件（file2.txt）分成 4 个大小相等的块，并在 4 个处理器上分别运行一个作业。这里可能需要进行更多测试。因此，对于第一个测试用例，其中file2.txt是 20M，我将其设置$block_size为 5M（参见gregory1上面的解决方案），而对于下面介绍的更现实的情况，其中file2.txt是 268M，$block_size使用了 67M。
解决方案BOC1、BOC2、codeforester_orig、inian1、和都使用了松散匹配。这意味着单词不必与的字段 #2 完全匹配。行上任何位置的匹配均可接受。由于此行为使得与其他方法进行比较更加困难，因此还引入了一些修改后的方法。前两种方法调用inian4并使用修改后的文件。原始文件中的行位于表单上，可匹配任何字段边界处的单词。修改后的文件则使用表单将匹配固定到字段 #2 。inian5`gregory1file1.txtfile2.txtBOC1BBOC2Bregexp1.txtregexp1.txt|foo1|foo2|...|fooN|regexp1b.txt`^[^|]*|foo1|foo2|...|fooN|

然后，其余的修改后的方法codeforester_origB、inian1B、inian4B、inian5B和gregory1B使用了修改后的。修改后的文件不再每行使用一个文字单词，而是file1.txt每行使用一个正则表达式，形式如下：file1b.txt

 ^[^|]*|word1|
 ^[^|]*|word2|
 ^[^|]*|word3|
 [...]

此外，这些方法fgrep -f已被取代。grep -E -f

运行测试

这是用于运行所有测试的脚本。它使用 Bashtime命令记录每个脚本所花费的时间。请注意，该time命令返回三个不同的时间real，分别调用user、和sys。首先我使用了user+ sys，但意识到在使用 Gnu parallel 命令时这是不正确的，因此下面报告的时间现在是real返回的部分time。有关返回的不同时间的更多信息，请参阅此问题time。

第一个测试运行了file1.txt包含 5 行，file2.txt包含1000000行。这是脚本的前 52 行run_all.pl，其余脚本可在此处获取。

运行所有程序

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Cwd;
use Getopt::Long;
use Data::Printer;
use FGB::Common;
use List::Util qw(max shuffle);
use Number::Bytes::Human qw(format_bytes);
use Sys::Info;

GetOptions (
    &quot;verbose&quot;       => my $verbose,
    &quot;check&quot;         => my $check,
    &quot;single-case=s&quot; => my $case,
    &quot;expected=i&quot;    => my $expected_no_lines,
) or die(&quot;Error in command line arguments
&quot;);

my $test_dir    = &#039;solutions&#039;;
my $output_file = &#039;out.txt&#039;;
my $wc_expected = $expected_no_lines; # expected number of output lines

my $tests       = get_test_names( $test_dir, $case );

my $file2_size  = get_file2_size();
my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

chdir $test_dir;
my $cmd = &#039;run.sh&#039;;
my @times;
for my $case (@$tests) {
    my $savedir = getcwd();
    chdir $case;
    say &quot;Running &#039;$case&#039;..&quot;;
    my $arg = get_cmd_args( $case, $file2_size, $num_cpus );
    my $output = `bash -c &quot;{ time -p $cmd $arg; } 2>&amp;1&quot;`;
    my ($user, $sys, $real ) = get_run_times( $output );
    print_timings( $user, $sys, $real ) if $verbose;
    check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
    print &quot;
&quot; if $verbose;
    push @times, $real;
    #push @times, $user + $sys; # this is wrong when using Gnu parallel
    chdir $savedir;
}

say &quot;Done.
&quot;;

print_summary( $tests, @times );

结果

以下是运行测试的输出：

$  run_all.pl --verbose
Running &#039;inian3&#039;..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711

Running &#039;inian2&#039;..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711

Running &#039;Vasiliou&#039;..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711

Running &#039;codeforester_orig&#039;..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711

Running &#039;codeforester&#039;..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711

[...]

概括

[@Vasiliou 得到的结果显示在中间一列。]

                               |Vasiliou
My Benchmark                   |Results  |   Details
-------------------------------|---------|----------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose] 
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose] 
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose] 
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict] 
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict] 
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]] 
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]] 
ikegami            : 0.1s      |         | grep -P 
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]] 
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict] 
jjoao              : 0.21s     |         | [compiled flex generated c code] 
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict] 
codeforester_origB : 0.28s     |         | grep -E -f [strict] 
dawg               : 0.35s     |         | [python + split+dict] 
inian3             : 0.44s     |1.1s     | [awk + split+dict] 
zdim2              : 0.4s      |         | [perl + split+dict] 
codeforester       : 0.45s     |         | [perl + split+dict] 
oliv               : 0.5s      |         | [python + compiled regex + re.search()] 
zdim               : 0.61s     |         | [perl + regexp+dict] 
inian2             : 0.73s     |1.7s     | [awk + index($0,i)] 
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]] 
inian1             : 19.46s    |         | [awk + match($0,i) [loose]] 
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]] 
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]] 

Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM-Debian Testing 64bit- kernel 4.9.0.1 - no cpu freq scaling.

更现实的测试用例

然后，我创建了一个更现实的案例，其中file1.txt有 100 个单词和file2.txt1000 万行（文件大小为 268Mb）。我从字典中提取了 1000 个随机单词，/usr/share/dict/american-english然后shuf -n1000 /usr/share/dict/american-english > words.txt将其中 100 个单词提取到中file1.txt，然后file2.txt按照上面第一个测试用例中描述的方式构建。请注意，字典文件是 UTF-8 编码的，我从中删除了所有非 ASCII 字符words.txt。

然后，我运行测试时，没有使用前一种情况下最慢的三种方法。即inian1，，inian2和inian5被排除在外。以下是新的结果：

gregory1           : 0.86s     | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s     | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s     | LC_ALL grep -E -f [strict] 
BOC2B              : 1.13s     | LC_ALL grep -E [strict] 
BOC2               : 1.15s     | LC_ALL grep -E [loose] 
BOC1               : 1.18s     | grep -E [loose] 
ikegami            : 1.33s     | grep -P 
Vasiliou           : 1.37s     | [join [requires sorted files]]
hakon1             : 1.44s     | [perl + c]
inian4             : 2.18s     | LC_ALL fgrep -f [loose] 
codeforester_orig  : 2.2s      | fgrep -f [loose] 
inian6             : 2.82s     | [LC_ALL awk + split+dict] 
jjoao              : 3.09s     | [compiled flex generated c code] 
dawg               : 3.54s     | [python + split+dict] 
zdim2              : 4.21s     | [perl + split+dict]
codeforester       : 4.67s     | [perl + split+dict] 
inian3             : 5.52s     | [awk + split+dict] 
zdim               : 6.55s     | [perl + regexp+dict] 
gregory1B          : 45.36s    | [parallel + grep -E -f [strict]] 
oliv               : 60.35s    | [python + compiled regex + re.search()] 
BOC1B              : 74.71s    | grep -E [strict] 
codeforester_origB : 75.52s    | grep -E -f [strict]

笔记

基于的解决方案grep正在寻找整行匹配，因此在这种情况下它们包含一些错误匹配：方法codeforester_orig、BOC1、BOC2、gregory1、inian4和oliv从 10,000,000 行中提取了 1,087,609 行，而其他方法从中提取了正确的 997,993 行file2.txt。

笔记

我在我的 Ubuntu 16.10 笔记本电脑（Intel Core i7-7500U CPU @ 2.70GHz）上进行了测试
整个基准研究可在此处查看。

解决方案 3：

您是否尝试Awk过可以加快速度一点：

awk &#039;FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}&#039; file1.txt FS=&#039;|&#039; file2.txt

（或）按照Benjamin W.的评论建议使用index()函数，如下所示Awk

awk &#039;FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}&#039; file1.txt FS=&#039;|&#039; file2.txt

（或）更直接的正则表达式匹配，如Ed Morton在评论中所建议的，

awk &#039;FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}&#039; file1.txt FS=&#039;|&#039; file2.txt

就是你所需要的。我猜这会更快，但对于包含数百万个条目的文件，我不太确定。这里的问题在于匹配的可能性。如果匹配出现在任何特定列中（例如$2单独出现），更快的方法可能是

awk &#039;FNR==NR{hash[$1]; next}$2 in hash&#039; file1.txt FS=&#039;|&#039; file2.txt

locale您还可以通过在系统中设置来加快速度。套用Stéphane Chazelas 对这个问题的精彩回答，您可以通过将语言环境传递给本地运行的LC_ALL=C命令来快速加快速度。

在任何GNU基于的系统上，默认的locale

$ locale
LANG=en_US.UTF-8
LC_CTYPE=&quot;en_US.UTF-8&quot;
LC_NUMERIC=&quot;en_US.UTF-8&quot;
LC_TIME=&quot;en_US.UTF-8&quot;
LC_COLLATE=&quot;en_US.UTF-8&quot;
LC_MONETARY=&quot;en_US.UTF-8&quot;
LC_MESSAGES=&quot;en_US.UTF-8&quot;
LC_PAPER=&quot;en_US.UTF-8&quot;
LC_NAME=&quot;en_US.UTF-8&quot;
LC_ADDRESS=&quot;en_US.UTF-8&quot;
LC_TELEPHONE=&quot;en_US.UTF-8&quot;
LC_MEASUREMENT=&quot;en_US.UTF-8&quot;
LC_IDENTIFICATION=&quot;en_US.UTF-8&quot;
LC_ALL=

使用一个变量LC_ALL，您可以LC_一次性将所有类型变量设置为指定的语言环境

$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE=&quot;C&quot;
LC_NUMERIC=&quot;C&quot;
LC_TIME=&quot;C&quot;
LC_COLLATE=&quot;C&quot;
LC_MONETARY=&quot;C&quot;
LC_MESSAGES=&quot;C&quot;
LC_PAPER=&quot;C&quot;
LC_NAME=&quot;C&quot;
LC_ADDRESS=&quot;C&quot;
LC_TELEPHONE=&quot;C&quot;
LC_MEASUREMENT=&quot;C&quot;
LC_IDENTIFICATION=&quot;C&quot;       
LC_ALL=C

那么这会产生什么影响呢？

简单来说，当使用时，locale C它将默认为服务器的基本Unix / Linux语言ASCII。基本上，当您执行grep某些操作时，默认情况下您的语言环境将被国际化并设置为UTF-8，它可以代表Unicode字符集中的每个字符，以帮助显示世界上任何书写系统，目前超过100个110,000唯一字符，而ASCII每个字符都以单字节序列编码，其字符集不超过100个128唯一字符。

因此，这意味着，当在字符集grep编码的文件上使用时UTF-8，它需要将每个字符与十万个唯一字符中的任何一个进行匹配，但仅限128于ASCII，因此请使用您的fgrepas

LC_ALL=C fgrep -F -f file1.txt file2.txt

此外，同样可以适用于Awk，因为它使用regex与match($0,i)调用的匹配，设置C语言环境可以加快字符串匹配。

LC_ALL=C awk &#039;FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}&#039; file1.txt FS=&#039;|&#039; file2.txt

解决方案 4：

一小段 Perl 代码解决了这个问题。这是采取的方法：

将行存储file1.txt在哈希中
file2.txt逐行读取，解析并提取第二个字段
检查提取的字段是否在哈希中；如果是，则打印该行

以下是代码：

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
  printf STDERR &quot;Usage: fgrep.pl smallfile bigfile
&quot;;
  exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, &quot;&lt;&quot;, $small_file) || die &quot;Can&#039;t open $small_file: &quot; . $!;
open($big_fp, &quot;&lt;&quot;, $big_file)     || die &quot;Can&#039;t open $big_file: &quot;   . $!;

# store contents of small file in a hash
while (&lt;$small_fp>) {
  chomp;
  $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (&lt;$big_fp>) {
  # no need for chomp
  $field = (split(/|/, $_))[1];
  if (defined($field) &amp;&amp; exists($small_hash{$field})) {
    printf(&quot;%s&quot;, $_);
  }
}

close($big_fp);
exit(0);

我运行了上述脚本，file1.txt 中有 14K 行，file2.txt 中有 1.3M 行。它大约用了 13 秒完成，产生了 126K 个匹配项。以下是time相同的输出：

real    0m11.694s
user    0m11.507s
sys 0m0.174s

我运行了@Inian 的awk代码：

awk &#039;FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}&#039; file1.txt FS=&#039;|&#039; file2.txt

它比 Perl 解决方案慢得多，因为它对 file2.txt 中的每一行循环 14K 次 - 这确实很昂贵。它在处理 592K 条记录file2.txt并生成 40K 条匹配行后中止。以下是它所花的时间：

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
 input record number 675280, file file2.txt
 source line number 1

real    55m5.539s
user    54m53.080s
sys 0m5.095s

使用@Inian 的其他awk解决方案，消除循环问题：

time awk -F &#039;|&#039; &#039;FNR==NR{hash[$1]; next}$2 in hash&#039; file1.txt FS=&#039;|&#039; file2.txt > awk1.out

real    0m39.966s
user    0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F &#039;|&#039; &#039;FNR==NR{hash[$1]; next}$2 in hash&#039; file1.txt FS=&#039;|&#039; file2.txt > awk.out

real    0m41.057s
user    0m38.475s
sys 0m0.904s

awk非常令人印象深刻，因为我们不需要编写整个程序来完成它。

我也运行了@oliv 的 Python 代码。它花了大约 15 个小时才完成这项工作，看起来它产生了正确的结果。构建一个巨大的正则表达式并不像使用哈希查找那样高效。这里是输出time：

real    895m14.862s
user    806m59.219s
sys 1m12.147s

我尝试按照建议使用parallel。但是，fgrep: memory exhausted即使块大小非常小，它也会因错误而失败。

令我惊讶的是，这fgrep完全不适合这种情况。我在 22 小时后中止了它，它产生了大约 100K 个匹配项。 我希望fgrep有一个选项可以强制将的内容-f file保存在哈希中，就像 Perl 代码所做的那样。

我没有检查join方法——我不想要对文件进行排序的额外开销。此外，考虑到fgrep的性能不佳，我认为它不会join比 Perl 代码做得更好。

感谢大家的关注和回应。

解决方案 5：

假设：1. 您只想在本地工作站上运行此搜索。2. 您有多个核心/CPU 来利用并行搜索。

parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt

根据上下文进行一些进一步的调整：A. 使用 LANG=C 禁用 NLS（这在另一个答案中已经提到）B. 使用 -m 标志设置最大匹配数。

注意：我猜测文件 2 约为 4GB，10M 块大小没问题，但您可能需要优化块大小以获得最快的运行速度。

解决方案 6：

此 Perl 脚本 ( a) 生成一个正则表达式模式：

#!/usr/bin/perl

use strict;
use warnings;

use Regexp::Assemble qw( );

chomp( my @ids = &lt;> );
my $ra = Regexp::Assemble->new();
$ra->add(quotemeta($_)) for @ids;
print(&quot;^[^|]*\|(?:&quot; . (re::regexp_pattern($ra->re()))[0] . &quot;)\|&quot;);

使用方法如下：

$ LC_ALL=C grep -P &quot;$( a file1.txt )&quot; file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2

请注意，该脚本使用 Regexp::Assemble，因此您可能需要安装它。

sudo su
cpan Regexp::Assemble

笔记：

与 BOC1、BOC2、codeforester_orig、gregory1、inian2、inian4 和 oliv 的解决方案不同，我的解决方案正确处理

file1.txt
foo1

file2.txt
date1|foo12|number5

我的方案应该比 @BOC 的类似方案更好，因为该模式经过了优化，可以减少回溯。（如果中有超过三个字段，我的方案也能正常工作file2.txt，而链接方案可能会失败。）
我不知道它与拆分+字典解决方案相比如何。

解决方案 7：

Inline::C以下是用于加速在大型文件中搜索匹配字段的Perl 解决方案：

use strict;
use warnings;
use Inline C => &#039;./search.c&#039;;

my $smallfile = &#039;file1.txt&#039;;
my $bigfile   = &#039;file2.txt&#039;;

open my $fh, &#039;&lt;&#039;, $smallfile or die &quot;Can&#039;t open $smallfile: $!&quot;;
my %word = map { chomp; $_ => 1 } &lt;$fh>;
search( $bigfile, %word );

子程序search()用纯 C 实现，用于perlapi在小文件字典中查找键%words：

搜索.c：

#include &lt;stdio.h>
#include &lt;sys/stat.h> 
#include &lt;fcntl.h>
#include &lt;unistd.h>
#include &lt;errno.h>


#define BLOCK_SIZE 8192       /* how much to read from file each time */
static char read_buf[BLOCK_SIZE + 1];

/*  reads a block from file, returns -1 on error, 0 on EOF, 
     else returns chars read, pointer to buf, and pointer to end of buf  */
size_t read_block( int fd, char **ret_buf, char **end_buf ) {
    int ret;
    char *buf = read_buf;
    size_t len = BLOCK_SIZE;
    while (len != 0 &amp;&amp; (ret = read(fd, buf, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            perror( &quot;read&quot; );
            return ret;
        }
        len -= ret;
        buf += ret;
    }
    *end_buf = buf;
    *ret_buf = read_buf;
    return (size_t) (*end_buf - *ret_buf);
}

/* updates the line buffer with the char pointed to by cur,
   also updates cur
    */
int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) {
    if ( *llen > max_line_len ) {
        fprintf( stderr, &quot;Too long line. Maximimum allowed line length is %ld
&quot;,
                 max_line_len );
        return 0;
    }
    **line = **cur;
    (*line)++;
    (*llen)++;
    (*cur)++; 
    return 1;
}


/*    search for first pipe on a line (or next line if this is empty),
    assume line ptr points to beginning of line buffer.
  return 1 on success
  Return 0 if pipe could not be found for some reason, or if 
    line buffer length was exceeded  */
int search_field_start(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len
) {
    char *line_start = *line;

    while (1) {
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res &lt;= 0) return 0;
        }
        if ( **cur == &#039;|&#039; ) break;
        /* Currently we just ignore malformed lines ( lines that do not have a pipe,
           and empty lines in the input */
        if ( **cur == &#039;
&#039; ) {
            *line = line_start;
            *llen = 0;
            (*cur)++;
        }
        else {
            if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        }
    }
    return 1;
}

/* assume cur points at starting pipe of field
  return -1 on read error, 
  return 0 if field len was too large for buffer or line buffer length exceed,
  else return 1
  and field, and  length of field
 */
int copy_field(
    int fd, char **cur, char **end_buf, char *field,
    size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len
) {
    *flen = 0;
    while( 1 ) {
        if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res &lt;= 0) return -1;
        }
        if ( **cur == &#039;|&#039; ) break;
        if ( *flen > max_field_len ) {
            printf( &quot;Field width too large. Maximum allowed field width: %ld
&quot;,
                    max_field_len );
            return 0;
        }
        *field++ = **cur;
        (*flen)++;
    }
    /* It is really not necessary to null-terminate the field 
       since we return length of field and also field could 
       contain internal null characters as well
    */
    //*field = &#039;&#039;;
    return 1;
}

/* search to beginning of next line,
  return 0 on error,
  else return 1 */
int search_eol(
    int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len)
{
    while (1) {
        if ( *cur >= *end_buf ) {
            size_t res = read_block( fd, cur, end_buf );        
            if (res &lt;= 0) return 0;
        }
        if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0;
        if ( *(*cur-1) == &#039;
&#039; ) {
            break;
        }
    }
    //**line = &#039;&#039;; // not necessary
    return 1;
}

#define MAX_FIELD_LEN 80  /* max number of characters allowed in a field  */
#define MAX_LINE_LEN 80   /* max number of characters allowed on a line */

/* 
   Get next field ( i.e. field #2 on a line). Fields are
   separated by pipes &#039;|&#039; in the input file.
   Also get the line of the field.
   Return 0 on error,
   on success: Move internal pointer to beginning of next line
     return 1 and the field.
 */
size_t get_field_and_line_fast(
    int fd, char *field, size_t *flen, char *line, size_t *llen
) {
    static char *cur = NULL;
    static char *end_buf = NULL;

    size_t res;
    if (cur == NULL) {
        res = read_block( fd, &amp;cur, &amp;end_buf );        
        if ( res &lt;= 0 ) return 0;
    }
    *llen = 0;
    if ( !search_field_start( fd, &amp;cur, &amp;end_buf, &amp;line, llen, MAX_LINE_LEN )) return 0;
    if ( (res = copy_field(
        fd, &amp;cur, &amp;end_buf, field, flen, &amp;line, llen, MAX_FIELD_LEN, MAX_LINE_LEN
    ) ) &lt;= 0)
        return 0;
    if ( !search_eol( fd, &amp;cur, &amp;end_buf, &amp;line, llen, MAX_LINE_LEN ) ) return 0;
    return 1;
}

void search( char *filename, SV *href) 
{
    if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) {
        croak( &quot;Not a hash reference&quot; );
    }

    int fd = open (filename, O_RDONLY);
    if (fd == -1) {
        croak( &quot;Could not open file &#039;%s&#039;&quot;, filename );
    }
    char field[MAX_FIELD_LEN+1];
    char line[MAX_LINE_LEN+1];
    size_t flen, llen;
    HV *hash = (HV *)SvRV( href );
    while ( get_field_and_line_fast( fd, field, &amp;flen, line, &amp;llen ) ) {
        if( hv_exists( hash, field, flen ) )
            fwrite( line, sizeof(char), llen, stdout);
    }
    if (close(fd) == -1)
        croak( &quot;Close failed&quot; );

}

测试表明，它比这里介绍的最快的纯 Perl 解决方案（请参阅我的其他答案zdim2中的方法）快大约 3 倍。

解决方案 8：

这是一个使用集合的 Python 解决方案——在概念上大致相当于 Perl 仅有键的哈希或 awk 数组。

#!/usr/bin/python

import sys 

with open(sys.argv[1]) as f:
    tgt={e.rstrip() for e in f}

with open(sys.argv[2]) as f:
    for line in f:
        cells=line.split(&quot;|&quot;)
        if cells[1] in tgt:
            print line.rstrip()

当我对类似大小的文件运行此程序时，它大约需要 8 秒。

速度相同：

$ awk &#039;FNR==NR{arr[$1]; next} $2 in arr{print $0}&#039; FS=&quot;|&quot; /tmp/f1 /tmp/f2

这里的 Python 和 awk 解决方案都只是完整字符串匹配；而不是部分正则表达式样式匹配。

由于 awk 解决方案速度快且符合 POSIX 标准，因此这是更好的答案。

解决方案 9：

虽然这个帖子已经结束了，但是两个文件之间所有类似的 grep 方法都收集在了这篇文章中，为什么不添加这个 awk 替代方案，类似于（甚至改进）获得赏金的 Inian 的 awk 解决方案呢：

awk &#039;NR==FNR{a[$0]=1;next}a[$2]&#039; patterns.txt FS=&quot;|&quot; datafile.txt >matches.txt # For matches restricted on Field2 of datafile

这相当于 Inian awk$2 in hash解决方案，但由于我们不要求 awk 检查整个哈希数组是否包含 file2 的 $2 - 我们只检查 a[$2] 是否有值，因此它可能更快。

在读取第一个模式文件时，除了创建哈希数组之外，我们还分配一个值。

如果$2之前在模式文件中找到数据文件，那么a[$2]就会有一个值，因此会被打印出来，因为它不为空。

如果a[$2]数据文件不返回值（null），则转换为false => 不打印。

扩展以匹配数据文件的三个字段中的任意一个：

awk &#039;NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])&#039; patterns.txt FS=&quot;|&quot; datafile.txt >matches.txt. #Printed if any of the three fields of datafile match pattern.

在这两种情况下，在 awk 前面应用LC_ALL=C似乎可以加快速度。

PS1：当然，这个解决方案也存在所有 awk 解决方案的缺陷。不是模式匹配。是两个文件之间的直接/固定匹配，就像这里的大多数解决方案一样。

PS2：在我的较差的机器上使用Håkon Hægland的小型基准测试文件进行基准测试时，与awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt

解决方案 10：

你能试试吗join？不过文件必须排序……

$ cat d.txt
bar1
bar2
foo1
foo2

$ cat e.txt
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3

$ join --nocheck-order -11 -22 -t&#039;|&#039; -o 2.1 2.2 2.3 d.txt e.txt
date1|bar1|number1
date2|bar2|number2
date1|foo1|number1
date2|foo2|number2

小更新：

通过在 join 前面使用 LC_ALL=C，速度确实加快了，这可以从Håkon Hægland的基准测试中看出

PS1：我怀疑 join 是否比 grep -f 更快……

解决方案 11：

一种可能的方法是使用python：

$ cat test.py
import sys,re

with open(sys.argv[1], &quot;r&quot;) as f1:
    patterns = f1.read().splitlines() # read pattern from file1 without the trailing newline

m = re.compile(&quot;|&quot;.join(patterns))    # create the regex

with open(sys.argv[2], &quot;r&quot;) as f2:
    for line in f2: 
        if m.search(line) : 
            print line,               # print line from file2 if this one matches the regex

像这样使用它：

python test.py file1.txt file2.txt

解决方案 12：

您也可以使用 Perl 来实现此目的：

请注意，这将占用内存，您的机器/服务器最好有一些内存。

示例数据：

%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt &lt;==
foo1
foo2
...
bar1
bar2
...

==> file2.txt &lt;==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>

脚本输出：脚本将在名为的文件中产生最终output_comp输出。

%_STATION@gaurav * /root/ga/pl> ./comp.pl  file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>

脚本：

%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = &quot;output_comp&quot; ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
(scalar @ARGV != 2 ? (print &quot;Need 2 files!
&quot;) : ()) ? exit 1 : () ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, &quot;&lt;$_&quot; or die &quot;Cannot open $_
&quot; ;
  while  (my $line = &lt;FH>) {chomp $line ;$hash{$_}{$line} = &quot;$line&quot;}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, &quot;>>$output&quot; or die &quot;Cannot open outfile!
&quot; ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    if ($k1 =~ m/^.+?$k2.+?$/) {
      if (!defined $tmp{&quot;$hash{$file2}{$k2}&quot;}) {
        print FH &quot;$hash{$file2}{$k2}
&quot; ;
        $tmp{&quot;$hash{$file2}{$k2}&quot;} = 1 ;
      }
    }
  }
}
close FH  ;
%_STATION@gaurav * /root/ga/pl>

谢谢。

解决方案 13：

恕我直言，grep 是一款针对巨大的 file2.txt 高度优化的好工具，但可能不适合搜索如此多的模式。我建议将 file1.txt 的所有字符串组合成一个单一的巨大正则表达式，如 |bar1|bar2|foo1|foo2|

echo  &#039;|&#039;$(paste -s -d &#039;|&#039; file1.txt)&#039;|&#039; > regexp1.txt

grep -E -f regexp1.txt file2.txt > file.matched

当然 LANG=C 可能会有帮助。请提供反馈或发送您的文件，以便我可以自己测试。

解决方案 14：

我会使用 SQLite3 :) 也许是内存数据库或其他什么。导入文件并使用 SQL 查询。

解决方案 15：

使用flex：

1：构建弹性处理器：

$ awk &#039;NR==1{ printf &quot;%%%%

.*\|(%s&quot;,$0 } 
            { printf &quot;|%s&quot;,$0 } 
       END  { print &quot;)\|.*\n ECHO;
.*\n ;
%%
&quot; }&#039; file1.txt > a.fl

2：编译

$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl

3：然后运行

$ a.out &lt; file2.txt  > out

编译（cc ...）是一个缓慢的过程；这种方法只适用于稳定的 file1.txt 的情况。

（在我的计算机上）使用此方法运行搜索“10_000_000 中的 100”测试所需的时间比LC_ALL=C fgrep...

解决方案 16：

设置语言等也许有一点帮助。

否则我无法想出一个神奇的解决方案来解决您的基本问题：数据不是结构化的，因此您将进行搜索，结果为文件 1 中的行数乘以文件 2 中的行数。

将十亿行数据放入数据库中，并以智能方式对其进行索引，这是我能想到的唯一加速方法。但是，该索引必须非常智能......

简单的解决方案是：拥有足够的内存来容纳所有内容。否则你对此无能为力......