GroupBy pandas DataFrame and select the most common value

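Question:

I have a data frame with three string columns. I know that only one value in the third column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code: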

import pandas as pd
from scipy import stats

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'], 
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work: it raises KeyError: 'Short name', and if I try to group only by City, I get an AssertionError. What can I do to fix it?

Solution 1:

Pandas >= 0.16

`pd.Series.mode` is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

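The useful thing about Series.mode is that it always returns a Series, which makes it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.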

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
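
The claim that Series.mode always returns a Series can be checked on two small samples, one with a single mode and one with a tie. This is a minimal sketch; the sample values are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# One clear winner: the result is still a Series, of length 1.
single = pd.Series(['NY', 'New', 'NY']).mode()

# A tie: every tied value is returned, in sorted order.
tied = pd.Series(['NY', 'New', 'NY', 'New']).mode()

print(type(single).__name__, single.tolist())
print(type(tied).__name__, tied.tolist())
```

Because the return type is the same in both cases, the caller decides how to reduce it: take element 0, keep the whole list, or expand it into rows.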

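Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

# DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)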

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned, as long as it is one of them, you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from Python, but...

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

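...it does not work well when it has to deal with multiple modes; a StatisticsError is raised. (This describes Python versions before 3.8; from 3.8 on, statistics.mode returns the first mode encountered instead of raising.) The older behavior is mentioned in the docs:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself...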

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
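
On Python 3.8 or later, the standard library also offers statistics.multimode, which returns every tied value as a list instead of raising. The sketch below applies it per group with a plain dict comprehension; the variable name modes_by_group is only illustrative, and the frame repeats the five-row source2 example from above.

```python
import statistics

import pandas as pd

source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New']})

# Iterating a SeriesGroupBy yields (key, Series) pairs; with two grouping
# columns, each key is a (Country, City) tuple.
modes_by_group = {
    key: statistics.multimode(group)
    for key, group in source2.groupby(['Country', 'City'])['Short name']
}

print(modes_by_group)
```

This stays outside pandas' aggregation machinery, so it is probably not the fastest option, but it sidesteps the question of how agg handles list-valued results.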

Solution 2:

You can use value_counts() to get a count series, and take the first row:

source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])

In case you are wondering how to combine this with other aggregation functions in .agg(), try this.

# Let's add a new col, "account"
source['account'] = [1, 2, 3, 3]

source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))

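Solution 3:

A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

It works by finding the frequency of each key-value pair and then, for each key, keeping only the value that appears with it most often.

There is also an additional solution that supports multiple modes.

On a scale test that is representative of the data I am working with, this reduced the runtime from 37.4 s to 0.5 s!

Here is the code for the solution, some example usage, and the scale test: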

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

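Hope this helps!

Solution 4:

With agg, the lambda function receives a Series, which does not have a 'Short name' key.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes: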

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
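
To see what agg actually hands to the callable, one can record the type of each argument. The helper first_value and the list seen below are hypothetical names used only for this sketch; the point is that each call receives one column of one group as a Series, which is why indexing it with 'Short name' fails in the question's code.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

seen = []

def first_value(x):
    # Record the type of object that agg passes in, then reduce it to a scalar.
    seen.append(type(x).__name__)
    return x.iloc[0]

result = source.groupby(['Country', 'City']).agg(first_value)

print(set(seen))
print(result)
```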

Solution 5:

The top two answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

df.groupby(cols).agg(pd.Series.mode)

However, both of these fail in simple edge cases, as demonstrated here:

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

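returns ValueError: Function does not reduce, since the first group returns a list of two (because there are two modes). (As documented here, if the first group returned a single mode this would work!)

Two possible solutions for this case are:

import scipy.stats
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x):
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited to large datasets. The solution I ended up using, which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):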

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.nan))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

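Essentially, the method works on one column at a time and outputs a df, so instead of concat, which is expensive, you treat the first result as a df and then iteratively add each output array (values.flatten()) as a column of that df.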
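
As a possible alternative to the -1 placeholder, the sketch below keeps all-NaN groups by passing dropna=False to groupby. It assumes pandas 1.1 or later, where that parameter exists; the helper name mode_keep_nan_groups is hypothetical and is not taken from any of the answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date': ['2019-01-01'] * 8,
    'location': ['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan],
})

def mode_keep_nan_groups(frame, group_cols, col):
    # Hypothetical helper: count each (group, value) pair, keeping NaN as a value.
    counts = (frame.groupby(group_cols + [col], dropna=False)
                   .size()
                   .reset_index(name='count'))
    # Sort by count (a stable sort keeps ties in their existing order),
    # then keep the first, i.e. most frequent, row of every group.
    return (counts.sort_values('count', ascending=False, kind='mergesort')
                  .drop_duplicates(subset=group_cols)
                  .drop(columns='count')
                  .sort_values(group_cols)
                  .reset_index(drop=True))

modes = mode_keep_nan_groups(df, ['client_id', 'date'], 'location')
print(modes)
```

Group A has two tied modes, so only one of them is kept; group C has no non-null value and comes back as NaN rather than disappearing.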

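Solution 6:

Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1,2,3,4], the result is wrong, i.e., there is no single mode. Example: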

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute it the way @HYRY does, you obtain:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

This is clearly wrong (see the A value, which should be 1 and not 4) because it cannot handle unique values.

Thus, the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

Solution 7:

Use `DataFrame.value_counts` for a fast solution

The top 3 answers here:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0])

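are incredibly slow for large datasets.

A solution using collections.Counter is much faster (20-40 times faster than the top 3 methods)

from collections import Counter

source.groupby(['Country', 'City'])['Short name'].agg(lambda srs: Counter(list(srs)).most_common(1)[0][0])

but still very slow.

The solutions by abw333 and Josh Friedlander are much faster (about 10 times faster than the method using Counter). These solutions can be further optimized by using value_counts instead (`DataFrame.value_counts` is available since pandas 1.1.0).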

source.value_counts(['Country', 'City', 'Short name']).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')

To make the function account for NaNs, as Josh Friedlander's function does, simply turn off the dropna parameter:

source.value_counts(['Country', 'City', 'Short name'], dropna=False).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')
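
The one-liner is dense, so here is the same idea split into named steps on the small frame from the question. The intermediate names (counts, key_index, is_top) are introduced only for this sketch, and pandas 1.1 or later is assumed for DataFrame.value_counts.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

# 1. Count every (Country, City, Short name) combination, highest count first.
counts = source.value_counts(['Country', 'City', 'Short name'])

# 2. Drop the value level so only the group key remains in the index.
key_index = counts.index.droplevel('Short name')

# 3. The first occurrence of each key carries that key's highest count.
is_top = ~key_index.duplicated()

# 4. Keep those rows and turn the index levels back into columns.
modes = counts[is_top].reset_index(name='Count')
print(modes)
```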

Using abw333's setup, if we test the runtime difference for a DataFrame with 1 million rows, value_counts is about 10% faster than abw333's solution.

scale_test_data = [[random.randint(1, 100),
                    str(random.randint(100, 900)), 
                    str(random.randint(0,2))] for i in range(1000000)]
source = pd.DataFrame(data=scale_test_data, columns=['Country', 'City', 'Short name'])
keys = ['Country', 'City']
vals = ['Short name']

%timeit source.value_counts(keys+vals).pipe(lambda x: x[~x.droplevel(vals).index.duplicated()]).reset_index(name='Count')
# 376 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit mode(source, ['Country', 'City'], 'Short name', 'Count')
# 415 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

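To make it easier to use, I wrapped this solution in a function that you can copy-paste and use in your own environment. This function can also find the group modes of multiple columns.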

def get_groupby_modes(source, keys, values, dropna=True, return_counts=False):
    """
    A function that groups a pandas dataframe by some of its columns (keys) and 
    returns the most common value of each group for some of its columns (values).
    The output is sorted by the counts of the first column in values (because it
    uses pd.DataFrame.value_counts internally).
    An equivalent one-liner if values is a singleton list is:
    (
        source
        .value_counts(keys+values)
        .pipe(lambda x: x[~x.droplevel(values).index.duplicated()])
        .reset_index(name=f"{values[0]}_count")
    )
    If there are multiple modes for some group, it returns the value with the 
    lowest Unicode value (because under the hood, it drops duplicate indexes in a 
    sorted dataframe), unlike, e.g. df.groupby(keys)[values].agg(pd.Series.mode).
    Must have Pandas 1.1.0 or later for the function to work and must have 
    Pandas 1.3.0 or later for the dropna parameter to work.
    -----------------------------------------------------------------------------
    Parameters:
    -----------
    source: pandas dataframe.
        A pandas dataframe with at least two columns.
    keys: list.
        A list of column names of the pandas dataframe passed as source. It is 
        used to determine the groups for the groupby.
    values: list.
        A list of column names of the pandas dataframe passed as source. 
        If it is a singleton list, the output contains the mode of each group 
        for this column. If it is a list longer than 1, then the modes of each 
        group for the additional columns are assigned as new columns.
    dropna: bool, default: True.
        Whether to count NaN values as the same or not. If True, NaN values are 
        treated by their default property, NaN != NaN. If False, NaN values in 
        each group are counted as the same values (NaN could potentially be a 
        most common value).
    return_counts: bool, default: False.
        Whether to include the counts of each group's mode. If True, the output 
        contains a column for the counts of each mode for every column in values. 
        If False, the output only contains the modes of each group for each 
        column in values.
    -----------------------------------------------------------------------------
    Returns:
    --------
    a pandas dataframe.
    -----------------------------------------------------------------------------
    Example:
    --------
    get_groupby_modes(source=df, 
                      keys=df.columns[:2].tolist(), 
                      values=df.columns[-2:].tolist(), 
                      dropna=True,
                      return_counts=False)
    """
    
    def _get_counts(df, keys, v, dropna):
        c = df.value_counts(keys+v, dropna=dropna)
        return c[~c.droplevel(v).index.duplicated()]
    
    counts = _get_counts(source, keys, values[:1], dropna)
    
    if len(values) == 1:
        if return_counts:
            final = counts.reset_index(name=f"{values[0]}_count")
        else:
            final = counts.reset_index()[keys+values[:1]]
    else:
        final = counts.reset_index(name=f"{values[0]}_count", level=values[0])
        if not return_counts:
            final = final.drop(columns=f"{values[0]}_count")
        for v in values:
            counts = _get_counts(source, keys, [v], dropna).reset_index(level=v)
            if return_counts:
                final[[v, f"{v}_count"]] = counts
            else:
                final[v] = counts[v]
        final = final.reset_index()
    return final

Solution 8:

If you don't want to include NaN values, using Counter is much faster than pd.Series.mode or `pd.Series.value_counts()[0]`:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN will be counted separately.
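
The NaN problem can be reproduced on a small float Series, and one way around it is to drop NaN before counting. In the sketch below, most_common_ignoring_nan is a hypothetical variant of get_most_common, not part of the original answer.

```python
from collections import Counter

import numpy as np
import pandas as pd

srs = pd.Series([1.0, np.nan, np.nan, np.nan])

# Every NaN pulled out of a float Series is a separate object that does not
# compare equal to any other, so Counter stores each one under its own key.
naive = Counter(list(srs))
print(len(naive))

def most_common_ignoring_nan(values):
    # Hypothetical variant: remove NaN first, fall back to NaN if nothing is left.
    cleaned = values.dropna().tolist()
    if not cleaned:
        return np.nan
    return Counter(cleaned).most_common(1)[0][0]

result = most_common_ignoring_nan(pd.Series([1.0, np.nan, 2.0, 2.0]))
print(result)
```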

Solution 9:

Do not use ".agg"; try ".apply" instead, as it is faster and gives results across columns.

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short name' : ['NY','New','Spb','NY']})
source.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode).reset_index()

Solution 10:

If you want another approach that does not depend on value_counts or scipy.stats, you can use Counter from collections

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

which can be applied to the above example like this

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)

Solution 11:

The problem here is performance: if you have a lot of rows, it will be a problem.

If that is your case, please try this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()

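Solution 12:

A slightly clumsier but faster approach for larger datasets is to get the counts for the column of interest, sort the counts from highest to lowest, and then de-duplicate on a subset so that only the largest cases are retained. The code example follows: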

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

Solution 13:

To always return all modes (single or multiple alike) with .agg, you can create a function that returns the mode(s) as a list.

df.agg(lambda x: x.mode().to_list())

def lmode(x): return x.mode().to_list()
df.agg(lmode)

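If you prefer single modes to be returned as scalars, you can use the following function:

def lmode(x): a = x.mode(); return a.to_list() if len(a) > 1 else a.squeeze()

Benefits:

- Returns all modes
  - A single mode as a scalar, multiple modes as a list
- Works with groupby and agg
- Can be combined with other aggregates (e.g. df.agg([lmode, 'nunique']))
- Returns lmode instead of lambda as the aggregate name
- Does not trigger an error when a group's mode is np.nan, but returns [] instead

Example with multiple aggregates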

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [1, 3, 2, np.nan, 3, 1],
    's': ['a', 'a', 'b', 'c', 'c', np.nan],
})

def lmode(x): a = x.mode(); return a.to_list() if len(a) > 1 else a.squeeze()

# Combined aggregates with multiple modes
print(df.agg([lmode, 'nunique']))

                  i  s
lmode    [1.0, 3.0]  a
nunique           3  4

Example from the OP

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country','City']).agg(lmode)

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

Solution 14:

This should be quick and simple. Using .value_counts() returns the values sorted by highest frequency; then take only the first occurrence:

df = source.groupby(['Country', 'City'])['Short name'].value_counts().reset_index()
df = df.groupby(['Country', 'City'])['Short name'].first().reset_index()

>>> print(df)

  Country              City Short name
0  Russia  Sankt-Petersburg        Spb
1     USA          New-York         NY