How to merge and sum two dictionaries in Python?

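Question:

I have the dictionary below, and I want to add another dictionary to it, one whose keys are not necessarily distinct from the first, and merge the results.

Is there a built-in function for this, or do I need to write my own?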

{
  '6d6e7bf221ae24e07ab90bba4452267b05db7824cd3fd1ea94b2c9a8': 6,
  '7c4a462a6ed4a3070b6d78d97c90ac230330603d24a58cafa79caf42': 7,
  '9c37bdc9f4750dd7ee2b558d6c06400c921f4d74aabd02ed5b4ddb38': 9,
  'd3abb28d5776aef6b728920b5d7ff86fa3a71521a06538d2ad59375a': 15,
  '2ca9e1f9cbcd76a5ce1772f9b59995fd32cbcffa8a3b01b5c9c8afc2': 11
}

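The number of elements in the dictionary is also unknown.

When the merge encounters the same key in both dictionaries, the values for that key should be summed rather than overwritten.

Solution 1:

You didn't say exactly how you want to merge, so take your pick: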

x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}

print({k: x.get(k, 0) + y.get(k, 0) for k in set(x)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) & set(y)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)})

Results:

{'both2': 22, 'only_x': 100, 'both1': 11}
{'both2': 22, 'both1': 11}
{'only_y': 200, 'both2': 22, 'both1': 11, 'only_x': 100}
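
The three variants above differ only in which keys they keep. A minimal sketch that wraps them in one function follows; the name merge_sum and its keys parameter are hypothetical, chosen here for illustration and not part of the original answer.

```python
def merge_sum(x, y, keys="union"):
    """Sum the values of x and y over a chosen key set (hypothetical helper)."""
    if keys == "left":
        selected = set(x)            # only keys present in x
    elif keys == "intersection":
        selected = set(x) & set(y)   # only keys present in both
    elif keys == "union":
        selected = set(x) | set(y)   # keys present in either
    else:
        raise ValueError(f"unknown key selection: {keys!r}")
    # A key missing from one side contributes 0 to the sum.
    return {k: x.get(k, 0) + y.get(k, 0) for k in selected}


x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}

print(merge_sum(x, y, "left"))
print(merge_sum(x, y, "intersection"))
print(merge_sum(x, y, "union"))
```

For the question as asked (sum shared keys, keep everything else), the "union" variant is probably the one to reach for.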

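Solution 2:

You can use collections.Counter(), which supports +, -, & and | (the last two being intersection and union).

We can do the following (note that only positive count values remain in the resulting dictionary):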

from collections import Counter

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

z = dict(Counter(x) + Counter(y))

print(z)
[out]:
{'both2': 22, 'only_x': 100, 'both1': 11, 'only_y': 200}

To handle sums whose result may be zero or negative, use Counter.update() for addition and Counter.subtract() for subtraction:

x = {'both1':0, 'both2':2, 'only_x': 100 }
y = {'both1':0, 'both2': -20, 'only_y':200 }
xx = Counter(x)
yy = Counter(y)
xx.update(yy)
dict(xx)
[out]:
{'both2': -18, 'only_x': 100, 'both1': 0, 'only_y': 200}
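
To make the difference concrete, here is a small side-by-side sketch using the same data as above. It only relies on documented Counter behaviour: the + operator discards totals that are zero or negative, while update() keeps them.

```python
from collections import Counter

x = {'both1': 0, 'both2': 2, 'only_x': 100}
y = {'both1': 0, 'both2': -20, 'only_y': 200}

# Counter addition keeps only keys whose summed value is positive.
added = dict(Counter(x) + Counter(y))

# Counter.update() adds in place and keeps zero and negative totals.
running = Counter(x)
running.update(y)
updated = dict(running)

print(added)    # {'only_x': 100, 'only_y': 200}
print(updated)  # {'both1': 0, 'both2': -18, 'only_x': 100, 'only_y': 200}
```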

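Solution 3:

Additional notes based on the answers by georg, NPE, Scott and Havok.

I was trying to do this on collections of 2 or more dictionaries and was interested in how long each approach took. Because I wanted to handle any number of dictionaries, I had to change some of the answers slightly. If anyone has better suggestions, feel free to edit.

Here is my test method. I updated it recently to include tests with much larger dictionaries, and again to include Havok's and Scott's newer methods:

First I used the following data: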

import random

x = {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100}
y = {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200}
z = {'xyz': 300, 'only_z': 300}

small_tests = [x, y, z]

# 200,000 random 8 letter keys
keys = [''.join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8)) for _ in range(200000)]

a, b, c = {}, {}, {}

# 50/50 chance of a value being assigned to each dictionary, some keys will be missed but meh
for key in keys:
    if random.getrandbits(1):
        a[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        b[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        c[key] = random.randint(0, 1000)

large_tests = [a, b, c]

print("a:", len(a), "b:", len(b), "c:", len(c))
#: a: 100069 b: 100385 c: 99989

Now each of the methods:

from collections import defaultdict, Counter
from functools import reduce

def georg_method(tests):
    return {k: sum(t.get(k, 0) for t in tests) for k in set.union(*[set(t) for t in tests])}

def georg_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return {k: tests[0].get(k, 0) + tests[1].get(k, 0) + tests[2].get(k, 0) for k in set.union(*[set(t) for t in tests])}

def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

# Note: There is a bug with scott's method. See below for details.
# Scott included a similar version using counters that is fixed
# See the scott_update_method below
def scott_method(tests):
    return dict(sum((Counter(t) for t in tests), Counter()))

def scott_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return dict(Counter(tests[0]) + Counter(tests[1]) + Counter(tests[2]))

def scott_update_method(tests):
    ret = Counter()
    for test in tests:
        ret.update(test)
    return dict(ret)

def scott_update_method_static(tests):
    # If you know you will have exactly 3 dicts
    xx = Counter(tests[0])
    yy = Counter(tests[1])
    zz = Counter(tests[2])
    xx.update(yy)
    xx.update(zz)
    return dict(xx)

def havok_method(tests):
    def reducer(accumulator, element):
        for key, value in element.items():
            accumulator[key] = accumulator.get(key, 0) + value
        return accumulator
    return reduce(reducer, tests, {})

methods = {
    "georg_method": georg_method, "georg_method_nosum": georg_method_nosum,
    "npe_method": npe_method,
    "scott_method": scott_method, "scott_method_nosum": scott_method_nosum,
    "scott_update_method": scott_update_method, "scott_update_method_static": scott_update_method_static,
    "havok_method": havok_method
}

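I also wrote a quick function to find the differences between the results. Unfortunately, that is when I found the problem in Scott's method: if a key's values across the dictionaries total 0 (or less), that key will not be included at all, because of how Counter() behaves when adding.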
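
The comparison function itself is not shown in the answer. The following is a hypothetical sketch of what such a check could look like (diff_results and the two sum_with_* names are invented here), used to reproduce the dropped-key behaviour on a tiny input.

```python
from collections import Counter, defaultdict


def sum_with_defaultdict(dicts):
    # Reference result: plain accumulation that keeps every key.
    total = defaultdict(int)
    for d in dicts:
        for key, value in d.items():
            total[key] += value
    return dict(total)


def sum_with_counter_add(dicts):
    # Same idea as scott_method: Counter addition via sum().
    return dict(sum((Counter(d) for d in dicts), Counter()))


def diff_results(expected, actual):
    """Map each disagreeing key to (expected value, actual value); None means missing."""
    return {
        key: (expected.get(key), actual.get(key))
        for key in set(expected) | set(actual)
        if expected.get(key) != actual.get(key)
    }


tests = [{'a': 1, 'b': 5}, {'a': -1, 'b': 2}]
print(diff_results(sum_with_defaultdict(tests), sum_with_counter_add(tests)))
# {'a': (0, None)}
```

Key 'a' sums to 0, so the Counter-addition result omits it while the reference result keeps it.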

Test setup:

MacBook Pro (15-inch, Late 2016), 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3 RAM, running macOS Mojave version 10.14.5
Python 3.6.5 run via IPython 6.1.0

Finally, the results:

Results: small tests

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10000 method(small_tests)
#: Method: georg_method
#: 7.81 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: georg_method_nosum
#: 4.6 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: npe_method
#: 3.2 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method
#: 24.9 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method_nosum
#: 18.9 µs ± 64.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method
#: 9.1 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method_static
#: 14.4 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: havok_method
#: 3.09 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Results: large tests

Naturally, I couldn't run anywhere near as many loops

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10 method(large_tests)
#: Method: georg_method
#: 347 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: georg_method_nosum
#: 280 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: npe_method
#: 119 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method
#: 324 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method_nosum
#: 289 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method
#: 123 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method_static
#: 136 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: havok_method
#: 103 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

╔═══════════════════════════╦═══════╦═════════════════════════════╗
║                           ║       ║    Best of Time Per Loop    ║
║         Algorithm         ║  By   ╠══════════════╦══════════════╣
║                           ║       ║  small_tests ║  large_tests ║
╠═══════════════════════════╬═══════╬══════════════╬══════════════╣
║ functools reduce          ║ Havok ║       3.1 µs ║   103,000 µs ║
║ defaultdict sum           ║ NPE   ║       3.2 µs ║   119,000 µs ║
║ Counter().update loop     ║ Scott ║       9.1 µs ║   123,000 µs ║
║ Counter().update static   ║ Scott ║      14.4 µs ║   136,000 µs ║
║ set unions without sum()  ║ georg ║       4.6 µs ║   280,000 µs ║
║ set unions with sum()     ║ georg ║       7.8 µs ║   347,000 µs ║
║ Counter() without sum()   ║ Scott ║      18.9 µs ║   289,000 µs ║
║ Counter() with sum()      ║ Scott ║      24.9 µs ║   324,000 µs ║
╚═══════════════════════════╩═══════╩══════════════╩══════════════╝

Important: YMMV.
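
Since %timeit is only available inside IPython, here is a rough sketch of how a comparable measurement could be taken with the standard timeit module. The loop and repeat counts are illustrative assumptions, not the settings behind the table, and this version reports the best run rather than the mean ± std. dev.

```python
import timeit
from collections import defaultdict
from functools import reduce


def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)


def havok_method(tests):
    def reducer(accumulator, element):
        for key, value in element.items():
            accumulator[key] = accumulator.get(key, 0) + value
        return accumulator
    return reduce(reducer, tests, {})


small_tests = [
    {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100},
    {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200},
    {'xyz': 300, 'only_z': 300},
]

LOOPS = 1000   # illustrative values, not those used for the table
REPEATS = 5

for method in (npe_method, havok_method):
    # timeit.repeat returns one total time (in seconds) per repeat.
    runs = timeit.repeat(lambda: method(small_tests), number=LOOPS, repeat=REPEATS)
    best_us = min(runs) / LOOPS * 1e6
    print(f"{method.__name__}: best of {REPEATS}: {best_us:.2f} us per loop")
```

Absolute numbers will differ by machine and Python version, so only the relative ordering is worth comparing.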

Solution 4:

You can use defaultdict for this:

from collections import defaultdict

def dsum(*dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

print(dsum(x, y))

This produces

{'both1': 11, 'both2': 22, 'only_x': 100, 'only_y': 200}

Solution 5:

Another option is to use the reduce function. This allows summing and merging an arbitrary collection of dictionaries:

from functools import reduce

collection = [
    {'a': 1, 'b': 1},
    {'a': 2, 'b': 2},
    {'a': 3, 'b': 3},
    {'a': 4, 'b': 4, 'c': 1},
    {'a': 5, 'b': 5, 'c': 1},
    {'a': 6, 'b': 6, 'c': 1},
    {'a': 7, 'b': 7},
    {'a': 8, 'b': 8},
    {'a': 9, 'b': 9},
]


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


total = reduce(reducer, collection, {})


assert total['a'] == sum(d.get('a', 0) for d in collection)
assert total['b'] == sum(d.get('b', 0) for d in collection)
assert total['c'] == sum(d.get('c', 0) for d in collection)

print(total)

Output:

{'a': 45, 'b': 45, 'c': 3}

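Pros:

Simple, clear, Pythonic.
Schema-less, as long as all values are "summable".
Time complexity is O(n); auxiliary memory is O(1) beyond the result dictionary itself, which grows with the number of distinct keys.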
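
One detail worth spelling out is the {} passed as the third argument to reduce. The sketch below, using made-up sample data, shows that with the initializer the input dictionaries are left alone, whereas without it reduce() uses the first dictionary as the accumulator and modifies it in place.

```python
from functools import reduce


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


collection = [{'a': 1, 'b': 1}, {'a': 2, 'c': 5}]

# With a fresh {} as the initializer, the inputs stay untouched.
total = reduce(reducer, collection, {})
print(total)          # {'a': 3, 'b': 1, 'c': 5}
print(collection[0])  # {'a': 1, 'b': 1}

# Without an initializer, collection[0] becomes the accumulator,
# so the first input dictionary is changed as a side effect.
mutated = reduce(reducer, collection)
print(collection[0])  # {'a': 3, 'b': 1, 'c': 5}
```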

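Solution 6:

from functools import reduce

d1 = {'apples': 2, 'banana': 1}
d2 = {'apples': 3, 'banana': 2}
merged = reduce(
    lambda d, i: (
        d.update(((i[0], d.get(i[0], 0) + i[1]),)) or d
    ),
    d2.items(),
    d1.copy(),
)

There is also a fairly simple replacement for dict.update() (note that it overwrites the values of shared keys rather than summing them):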

merged = dict(d1, **d2)

Solution 7:

class dict_merge(dict):
    def __add__(self, other):
        result = dict_merge({})
        for key in self.keys():
            if key in other.keys():
                result[key] = self[key] + other[key]
            else:
                result[key] = self[key]
        for key in other.keys():
            if key in self.keys():
                pass
            else:
                result[key] = other[key]
        return result


a = dict_merge({"a":2, "b":3, "d":4})
b = dict_merge({"a":1, "b":2})
c = dict_merge({"a":5, "b":6, "c":5})
d = dict_merge({"a":8, "b":6, "e":5})

print((a + b + c +d))


>>> {'a': 16, 'b': 17, 'd': 4, 'c': 5, 'e': 5}

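This is operator overloading. Using __add__, we have defined how the + operator works for our dict_merge class, which inherits from Python's built-in dict. You can make it more flexible by defining other operators in the same class in a similar way, e.g. * with __mul__ for multiplication, / with __truediv__ (__div__ in Python 2) for division, or even % with __mod__ for modulo, replacing the + in self[key] + other[key] with the corresponding operator, if you ever find yourself needing such a merge. I have only tested this as it is, without other operators, but I don't foresee problems with them. Just learn by trying.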
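
As a sketch of that generalisation, the variant below routes every operator through one helper and takes the actual operation from the standard operator module. The class name DictMerge and the helper _combine are hypothetical and not part of the original answer; only + and * are wired up here.

```python
import operator


class DictMerge(dict):
    """Hypothetical variant of dict_merge with a shared combining helper."""

    def _combine(self, other, op):
        # Start from a copy of self, then fold in each item of other.
        result = DictMerge(self)
        for key, value in other.items():
            result[key] = op(self[key], value) if key in self else value
        return result

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)


a = DictMerge({"a": 2, "b": 3, "d": 4})
b = DictMerge({"a": 1, "b": 2, "e": 5})

print(a + b)  # {'a': 3, 'b': 5, 'd': 4, 'e': 5}
print(a * b)  # {'a': 2, 'b': 6, 'd': 4, 'e': 5}
```

Keys present on only one side are copied through unchanged, which matches the behaviour of the __add__ shown above.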

Solution 8:

A fairly simple approach:

from collections import Counter
from functools import reduce

data = [
  {'x': 10, 'y': 1, 'z': 100},
  {'x': 20, 'y': 2, 'z': 200},
  {'a': 10, 'z': 300}
]

result = dict(reduce(lambda x, y: Counter(x) + Counter(y), data))

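Solution 9:

TL;DR

This code works for both a list of dicts and a pandas Series (when the row items are dicts). It is very fast.

@Havok's method is by far the best according to my tests, and since some other tests confirm this as well, I am not going to post test results here; instead, I share my code in addition to Havok's method. The following code works for a list of dictionaries and also for a pandas Series where each row holds a dictionary.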

from functools import reduce
def reducer(accumulator, element):
    """Take the union of two dictionaries' keys and sum the values of keys they share,
    see explanation here https://stackoverflow.com/a/46128481/2234161"""
    for key, value in element.items():
        if accumulator.get(key, 0)!=0 and not accumulator.get(key, 0):
            print("why not", accumulator.get(key, 0))
        elif not value:
            print("why not value",value)
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator

def sum_dicts(dicts_collection, init_dict=None):
    """
    For a given collection of dictionaries, sum the values of matching keys.
    :param dicts_collection: list of dictionaries; it can also be a pandas Series where each row holds a dictionary
    :param init_dict: an initial dictionary that dicts_collection is added onto, defaults to dict()
    """
    res = None
    if not init_dict:
        init_dict = dict()
    try:
        res = reduce(reducer, dicts_collection, init_dict)
    except Exception as ex:
        print(f"Error while reducing dict: {dicts_collection}", ex)
        raise ex
    return res



result_dict = sum_dicts(list_of_dicts_or_pandasSeries)
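
The call above relies on a placeholder variable. Here is a self-contained usage sketch with made-up sample data; it is a simplified variant of the function (no debug prints or error wrapping, and it copies the starting dictionary so the caller's object is not modified), not the answer's exact code.

```python
from functools import reduce


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


def sum_dicts(dicts_collection, init_dict=None):
    # Copy the starting dictionary so the caller's object is left unchanged.
    start = dict(init_dict) if init_dict else {}
    return reduce(reducer, dicts_collection, start)


# Made-up sample data for illustration.
list_of_dicts = [{'x': 10, 'y': 1}, {'x': 20, 'z': 200}, {'y': 2}]
baseline = {'x': 1000}

print(sum_dicts(list_of_dicts))            # {'x': 30, 'y': 3, 'z': 200}
print(sum_dicts(list_of_dicts, baseline))  # {'x': 1030, 'y': 3, 'z': 200}
print(baseline)                            # {'x': 1000}
```

Iterating over a pandas Series yields its values, so passing a Series of dictionaries should behave like the list here, as the answer states.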

Solution 10:

Create two dictionaries with random int values

Several keys (columns) have the same name in both

import random
import pandas as pd

def create_random_dict(txt):
    my_dict = {}
    for c in txt:
        my_dict[c] = random.randint(1,30219)
    return my_dict

dict1 = create_random_dict('abcdefg')
dict2 = create_random_dict('cxzdywuf')
print(dict1)
print(dict2)

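Your printed output will differ because the values are random

{'a': 21804, 'b': 19749, 'c': 16837, 'd': 10134, 'e': 26181, 'f': 8343, 'g': 10268}

{'z': 12763, 'x': 23583, 'c': 20710, 'd': 22395, 'y': 25782, 'f': 23376, 'w': 25857, 'u': 9154}

Collect all the keys of both dictionaries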

cols = list(dict1.keys())+list(dict2.keys())

Remove duplicates from the column names

cols = list(dict.fromkeys(cols))

Create the data frames corresponding to the dictionaries

df1 = pd.DataFrame(dict1, columns=cols, index=[0]).fillna(0)
df2 = pd.DataFrame(dict2, columns=cols, index=[0]).fillna(0)

Sum the data frames and convert the result back to a dictionary

result = (df1+df2).T.to_dict()[0]
print(result)

{'a': 21804, 'b': 19749, 'c': 37547, 'd': 32529, 'e': 26181, 'f': 31719, 'g': 10268, 'z': 12763, 'x': 23583, 'y': 25782, 'w': 25857, 'u': 9154}

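Solution 11:

a = {}
b = {}

# Pair up the values positionally; this assumes both dicts hold
# the same keys in the same order.
a = a.values()
b = b.values()

answer = [(x + y) for x, y in zip(a,b)]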
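
Because zip pairs values by position, the result depends on insertion order. A small sketch with made-up data shows the difference between positional pairing and pairing by key, assuming both dictionaries hold exactly the same keys.

```python
a = {'x': 1, 'y': 2}
b = {'y': 20, 'x': 10}   # same keys as a, different insertion order

# Positional pairing adds a's 'x' to b's 'y' and a's 'y' to b's 'x'.
positional = [p + q for p, q in zip(a.values(), b.values())]
print(positional)  # [21, 12]

# Pairing by key gives the intended per-key sums.
by_key = {k: a[k] + b[k] for k in a}
print(by_key)      # {'x': 11, 'y': 22}
```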

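Solution 12:

If you want to create a new dict the way | does (for shared keys the right-hand value wins; nothing is summed), use: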

>>> dict({'a': 1,'c': 2}, **{'c': 1})
{'a': 1, 'c': 1}
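
For contrast with the summing behaviour the question asks for, here is a short sketch on the same data. It uses dictionary unpacking for the overwrite case, which gives the same result as the | operator available on dicts from Python 3.9 onward; the variable names are made up.

```python
left = {'a': 1, 'c': 2}
right = {'c': 1}

# Overwrite semantics: for shared keys the right-hand value replaces the left.
# On Python 3.9+ this is equivalent to: left | right
overwritten = {**left, **right}

# Sum semantics: shared keys are added, all other keys are carried over.
summed = {k: left.get(k, 0) + right.get(k, 0) for k in left.keys() | right.keys()}

print(overwritten)             # {'a': 1, 'c': 1}
print(sorted(summed.items()))  # [('a', 1), ('c', 3)]
```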

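Solution 13:

Scott's approach using collections.Counter is nice, but it has the disadvantage of not being directly usable with sum; also, having to take care of negative or zero values is a bit counterintuitive to me when you simply want to add values component-wise.

So I think it might be a good idea to write a custom class for this. This was also John Mutuma's idea. However, I want to add my solution:

I create a class that behaves very much like a dict, basically passing all member calls to the underlying _data in the __getattr__ method. The only two things that are different are:

It has a DEFAULT_VALUE (similar to collections.defaultdict) that is used as the value for non-existing keys.
It implements an __add__() method which (together with the __radd__() method) takes care of adding the dicts component-wise.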

from typing import Union, Any


class AddableDict:
    DEFAULT_VALUE = 0

    def __init__(self, data: dict) -> None:
        self._data = data

    def __getattr__(self, attr: str) -> Any:
        return getattr(self._data, attr)

    def __getitem__(self, item) -> Any:
        try:
            return self._data[item]
        except KeyError:
            return self.DEFAULT_VALUE

    def __repr__(self):
        return self._data.__repr__()

    def __add__(self, other) -> "AddableDict":
        return AddableDict({
            key: self[key] + other[key]
            for key in set(self.keys()) | set(other.keys())
        })

    def __radd__(
        self, other: Union[int, "AddableDict"]
    ) -> "AddableDict":
        # sum() starts from 0, so 0 + AddableDict must return the dict itself.
        if other == 0:
            return self
        return NotImplemented

That way we can add two of these objects and also sum iterables of them:

>>> alpha = AddableDict({"a": 1})
>>> beta = AddableDict({"a": 10, "b": 5})
>>> alpha + beta
{'a': 11, 'b': 5}

>>> sum([beta]*10)
{'a': 100, 'b': 50}

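In my view, the advantage of this solution is that it gives the developer a simple and clear interface to use. Of course, you can also inherit from dict instead of using composition.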
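
A minimal sketch of the inheritance variant mentioned above could look like the following. It is an illustration under stated assumptions, not the answer's own code: the class reuses the AddableDict name, relies on dict's __missing__ hook for the default value, and reads the other operand with get() so that a plain dict also works on the right-hand side.

```python
class AddableDict(dict):
    """Inheritance-based sketch: a dict that supports component-wise addition."""

    DEFAULT_VALUE = 0

    def __missing__(self, key):
        # Called by dict.__getitem__ for absent keys; the key is not stored.
        return self.DEFAULT_VALUE

    def __add__(self, other):
        return AddableDict({
            key: self[key] + other.get(key, self.DEFAULT_VALUE)
            for key in self.keys() | other.keys()
        })

    def __radd__(self, other):
        # sum() starts from 0, so 0 + AddableDict must return the dict itself.
        if other == 0:
            return self
        return NotImplemented


alpha = AddableDict({"a": 1})
beta = AddableDict({"a": 10, "b": 5})

print(alpha + beta == {"a": 11, "b": 5})           # True
print(sum([beta] * 10) == {"a": 100, "b": 50})     # True
print(alpha["missing"], "missing" in alpha)        # 0 False
```

Inheriting gives real dict behaviour (isinstance checks, len, iteration, equality) without forwarding through __getattr__, at the cost of also inheriting every dict method whether or not it makes sense for this type.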