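Problem description:

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the state, the office ID, and the sales for that office.

I want to calculate the percentage of sales per office in a given state (the percentages within each state total 100%).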

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total the sales for the entire state and calculate the fraction.

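Solution 1:

Update 2022-03

This answer by caner using transform looks much better than my original answer!

df['sales'] / df.groupby('state')['sales'].transform('sum')

Thanks to Paul Rougieux for pointing this out in a comment.

Original answer (2014)

Paul H's answer is right that you have to make a second groupby object, but you can calculate the percentage in a simpler way: just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer: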

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

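Solution 2:

(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)

I find the following solution to be the simplest (and probably the fastest), using transform:

Transformation: while aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.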
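To make the difference in shape concrete, here is a minimal sketch on a tiny hand-made frame (the data and the names aggregated / transformed are my own, chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"state": ["CA", "CA", "WA"], "sales": [100, 300, 50]})
grouped = df.groupby("state")["sales"]

# Aggregation: one value per group, indexed by the group key
aggregated = grouped.sum()

# Transformation: one value per input row, aligned with df's own index
transformed = grouped.transform("sum")

print(aggregated.shape, transformed.shape)
print(transformed.tolist())
```

Because the transformed Series lines up row by row with df, it can be used directly as the denominator for df['sales'].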

So using transform, the solution is a one-liner:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509

Solution 3:

You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
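The same broadcast can be checked by hand on a small example. This is only a sketch on made-up numbers, using the Series form of div so the per-state totals are easy to follow:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# Series indexed by (state, office_id)
state_office = df.groupby(["state", "office_id"])["sales"].sum()
# Series indexed by state only
state = df.groupby("state")["sales"].sum()

# level="state" matches each (state, office_id) row with its state's total
pct = state_office.div(state, level="state") * 100
print(pct)
```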

Solution 4:

For conciseness I'd use the SeriesGroupBy:

In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level=0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64

For multiple groups you have to use transform (using Radical's df):

In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
...
Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just under twice the speed of Radical's answer; for me about 0.08 s).
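For a version of the multi-level case that can be verified by hand, here is a small sketch. The region column and the numbers are assumptions of mine, not Radical's data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["W", "W", "W", "W"],
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

c = df.groupby(["region", "state", "office_id"])["sales"].sum()

# transform keeps the full three-level index, so the division is row by row
share = c / c.groupby(level=[0, 1]).transform("sum")
print(share)
```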

Solution 5:

I think this needs benchmarking. Using OP's original DataFrame,

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': list(range(1, 7)) * 2,
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

0th: Caner

The new pandas transform looks faster.

df['sales'] / df.groupby('state')['sales'].transform('sum')

1.32 ms ± 352 µs per loop

(mean ± std. dev. of 7 runs, 100 loops each)

1st: Andy Hayden

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()

3.42 ms ± 16.7 µs per loop

(mean ± std. dev. of 7 runs, 100 loops each)

2nd: Paul H

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100

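4.66 ms ± 24.4 µs per loop

(mean ± std. dev. of 7 runs, 100 loops each)

3rd: exp1orer

This is the slowest answer, as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply lets you use method chaining to write this in a single line. That removes the need to decide on a variable name, which is actually very computationally expensive for your most valuable resource (your brain!!).

Here is the modification,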

(
    df.groupby(['state', 'office_id'])
    .agg({'sales': 'sum'})
    .groupby(level=0)
    .apply(lambda x: 100 * x / float(x.sum()))
)

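10.6 ms ± 81.5 µs per loop

(mean ± std. dev. of 7 runs, 100 loops each)

So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up (10.6 ms for apply versus 3.42 ms for Andy's approach), and on a larger dataset with high-cardinality groupbys this is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,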

import string

import numpy as np
import pandas as pd
np.random.seed(0)

groups = [
    ''.join(i) for i in zip(
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
                       )
]

df = pd.DataFrame({'state': groups * 400,
               'office_id': list(range(1, 601)) * 20000,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)] * 1000000
})

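Using Caner's,

0.791 s ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using Andy's,

2 s ± 10.4 ms per loop

(mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer's

19 s ± 77.1 ms per loop

(mean ± std. dev. of 7 runs, 1 loop each)

So now we see a 10x speed-up on large, high-cardinality datasets with Andy's, and a very impressive 20x speed-up with Caner's.

Be sure to upvote these three answers if you upvote this one!!

Edit: added Caner benchmark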
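If you want to rerun a comparison like this on your own machine, a rough sketch with timeit could look as follows. The data size, the cardinalities and the helper names with_transform / with_level_sum are my own assumptions for illustration, not the setup behind the timings above, so expect different absolute numbers.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumed size, far smaller than the 12M-row frame above
df = pd.DataFrame({
    "state": rng.integers(0, 1_000, n),
    "office_id": rng.integers(0, 50, n),
    "sales": rng.integers(100_000, 999_999, n),
})

def with_transform():
    # Caner's approach: per-row share of the state total
    return df["sales"] / df.groupby("state")["sales"].transform("sum")

def with_level_sum():
    # Andy's approach: aggregate first, then divide by the level-0 totals
    c = df.groupby(["state", "office_id"])["sales"].sum()
    return c / c.groupby(level=0).sum()

for fn in (with_transform, with_level_sum):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

Note that the two functions return results at different granularities (per row versus per state/office pair), so compare the timings, not the outputs.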

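Solution 6:

I realize there are already good answers here.

I nevertheless want to contribute my own, because I feel that an elementary, simple question like this should have a short solution that is understandable at a glance.

It should also work in such a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case where there is more than one grouping level (e.g., state and country instead of only state).

The following snippet fulfills these criteria: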

df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())

Note that if you are still using Python 2, you will have to replace the x in the denominator of the lambda term with float(x).
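As a sketch of how this generalizes to more than one grouping level, here is the same snippet with a list of two keys. The country column and the numbers are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "US", "CA", "CA"],
    "state": ["CA", "CA", "WA", "ON", "ON"],
    "sales": [100, 300, 50, 20, 60],
})

# Same one-liner, grouping by (country, state) instead of state alone
df["sales_ratio"] = (
    df.groupby(["country", "state"])["sales"].transform(lambda x: x / x.sum())
)
print(df)
```

The other columns of df are left untouched; only sales_ratio is added.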

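Solution 7:

I know this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built on their answer and turned it into an array calculation, so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups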

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
# Integer division (//) is needed on Python 3, where range() rejects floats
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

Array method of finding the percentage:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about 0.15 seconds.

Top answer method (using a lambda function):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about 21 seconds to produce the same result.

The result:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494
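As a quick sanity check of the merge-based approach, the sketch below compares it with groupby(...).transform('sum') on a tiny frame. The shortened column name value and the data are my own assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Group 1": ["A", "A", "A", "B"],
    "Group 2": ["x", "x", "y", "x"],
    "Final Group": ["p", "q", "r", "s"],
    "value": [1, 3, 5, 2],
})
keys = ["Group 1", "Group 2"]

# Totals per (Group 1, Group 2), merged back onto every row
totals = (
    df.groupby(keys, as_index=False)["value"].sum()
      .rename(columns={"value": "value_sum"})
)
merged = df.merge(totals, on=keys, how="left")
merged["pct_merge"] = 100 * merged["value"] / merged["value_sum"]

# Same quantity without the intermediate merge
merged["pct_transform"] = (
    100 * merged["value"] / merged.groupby(keys)["value"].transform("sum")
)
print(merged)
```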

Solution 8:

The most elegant way to find percentages across columns or the index is to use pd.crosstab.

Sample data

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe is as below

print(df)

   state  office_id   sales
0     CA          1  764505
1     WA          2  313980
2     CO          3  558645
3     AZ          4  883433
4     CA          5  301244
5     WA          6  752009
6     CO          1  457208
7     AZ          2  259657
8     CA          3  584471
9     WA          4  122358
10    CO          5  721845
11    AZ          6  136928

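Just specify the index, the columns and the values to aggregate. The normalize keyword calculates the shares across the index or the columns, depending on the value you pass.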

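result = pd.crosstab(index=df['state'],
                     columns=df['office_id'],
                     values=df['sales'],
                     aggfunc='sum',
                     normalize='index')
# normalize='index' yields row fractions in [0, 1]; '{:.2%}' multiplies by 100 when formatting
result = result.applymap('{:.2%}'.format)  # on pandas >= 2.1 use result.map(...), applymap is deprecated

print(result)
office_id       1       2       3       4       5       6
state
AZ          0.00%  20.29%   0.00%  69.02%   0.00%  10.70%
CA         46.33%   0.00%  35.42%   0.00%  18.25%   0.00%
CO         26.31%   0.00%  32.15%   0.00%  41.54%   0.00%
WA          0.00%  26.42%   0.00%  10.30%   0.00%  63.28%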
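Here is a small sketch on made-up numbers showing what normalize='index' returns before any string formatting: each row holds fractions that sum to 1, so multiply by 100 if you want percentages.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

shares = pd.crosstab(
    index=df["state"],
    columns=df["office_id"],
    values=df["sales"],
    aggfunc="sum",
    normalize="index",   # divide each row by its row total
)
percent = shares * 100
print(percent)
```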

Solution 9:

You can sum the whole DataFrame and divide by the state total:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

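But note that this only works because all columns other than state are numeric, which makes it possible to sum the entire DataFrame. For example, if office_id is a string instead, you get an error: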

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: 'str' and 'str'
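One way around this, sketched here under the assumption that only the sales column needs normalising, is to select that column before calling transform, so the non-numeric columns are never summed or divided:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["1", "2", "1", "2"],   # strings on purpose
    "sales": [100, 300, 50, 150],
})

# Only the sales column takes part in the groupby sum and the division
state_total = df.groupby("state")["sales"].transform("sum")
df["sales_ratio"] = df["sales"] / state_total
print(df)
```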

Solution 10:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

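# Keep the result: the groupby chain returns a new Series and leaves df unchanged
weightage = df.groupby(['state', 'office_id'])['sales'].sum().rename("weightage").groupby(level = 0).transform(lambda x: x/x.sum())
weightage.reset_index()

Output: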

   state  office_id  weightage
0     AZ          2   0.169814
1     AZ          4   0.192500
2     AZ          6   0.637686
3     CA          1   0.193319
4     CA          3   0.338587
5     CA          5   0.468094
6     CO          1   0.368519
7     CO          3   0.198743
8     CO          5   0.432739
9     WA          2   0.347072
10    WA          4   0.355113
11    WA          6   0.297815

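Solution 11:

I think this can be done in one line:

# Group again by state (index level 0) so each office is divided by its own state's total, not the grand total
df.groupby(['state', 'office_id']).sum().groupby(level=0).transform(lambda x: x/np.sum(x)*100)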

Solution 12:

A simple way I have used is to merge the two groupby results and then do a simple division.

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

Solution 13:

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

                     sales
state office_id           
AZ    2          54.587910
      4          33.009225
      6          12.402865
CA    1          32.046582
      3          44.937684
      5          23.015735
CO    1          21.099989
      3          31.848658
      5          47.051353
WA    2          43.882790
      4          10.265275
      6          45.851935

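Solution 14:

As someone who is also learning pandas, I found the other answers a bit implicit, because pandas hides most of the work behind the scenes, namely how the operation works by automatically matching up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.

With the df, I'll refer to it by the alias state_office_sales: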

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

state_total_sales is state_office_sales grouped and summed over index level 0 (the leftmost level).

In:   state_total_sales = df.groupby(level=0).sum()
      state_total_sales

Out: 
       sales
state   
AZ     2448009
CA     2832270
CO     1495486
WA     595859

Because the two dataframes share an index name and a column name, pandas will find the appropriate locations through the shared index, like so:

In:   state_office_sales / state_total_sales

Out:  

                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          0.288022
        3          0.322169
        5          0.389809
CO      1          0.206684
        3          0.357891
        5          0.435425
WA      2          0.321689
        4          0.346325
        6          0.331986

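To illustrate this even better, here is a partial total with an XX entry that has no equivalent. Pandas matches locations based on index and column names; where there is no overlap, pandas ignores it: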

In:   partial_total = pd.DataFrame(
                      data   =  {'sales' : [2448009, 595859, 99999]},
                      index  =             ['AZ',    'WA',   'XX' ]
                      )
      partial_total.index.name = 'state'


Out:  
         sales
state
AZ       2448009
WA       595859
XX       99999

In:   state_office_sales / partial_total

Out: 
                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          NaN
        3          NaN
        5          NaN
CO      1          NaN
        3          NaN
        5          NaN
WA      2          0.321689
        4          0.346325
        6          0.331986

This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales, except that it has no index name.

In:   missing_index_totals = state_total_sales.rename_axis("")
      missing_index_totals

Out:  
       sales
AZ     2448009
CA     2832270
CO     1495486
WA     595859

In:   state_office_sales / missing_index_totals 

Out:  ValueError: cannot join with no overlapping index names
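The name-based matching can be reproduced on a frame small enough to check by hand. This is just a sketch with made-up numbers; the variable names follow the ones used above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("AZ", 2), ("AZ", 4), ("CA", 1)], names=["state", "office_id"]
)
state_office_sales = pd.DataFrame({"sales": [100, 300, 50]}, index=index)

# Totals indexed by a level that shares the name "state"; CA is left out on purpose
partial_total = pd.DataFrame(
    {"sales": [400]}, index=pd.Index(["AZ"], name="state")
)

# Rows are matched on the shared "state" level; unmatched states become NaN
ratio = state_office_sales / partial_total
print(ratio)
```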

Solution 15:

df.groupby('state').office_id.value_counts(normalize = True)

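I used the value_counts method, but it returns fractions like 0.70 and 0.30 rather than percentages like 70 and 30. Note also that it measures the share of rows per office_id within each state, not the share of sales.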
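A small sketch of the difference, on made-up data where one office has several rows: multiplying by 100 gives percentages, but they are percentages of row counts, which need not match the sales shares.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "WA"],
    "office_id": [1, 1, 2, 1],
    "sales": [100, 100, 600, 50],
})

# Share of rows per office within each state, scaled to percent
row_share = df.groupby("state")["office_id"].value_counts(normalize=True) * 100

# Share of sales per office within each state, for comparison
state_office = df.groupby(["state", "office_id"])["sales"].sum()
state_total = df.groupby("state")["sales"].sum()
sales_share = state_office.div(state_total, level="state") * 100

print(row_share)
print(sales_share)
```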

Solution 16:

One-line solution:

df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

This returns a Series of per-office ratios, which can be used on its own or assigned to the original dataframe.
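For instance, assigning the result back as a new column could look like this (a sketch on made-up data; the column name sales_ratio is my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

ratio = df.join(
    df.groupby("state").agg(state_total=("sales", "sum")),
    on="state",
).eval("sales / state_total")

# The Series keeps df's index, so it can be assigned directly
df["sales_ratio"] = ratio
print(df)
```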

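Solution 17:

This is what you need.

Explanation:

groupby() on multiple attributes
No need to sort, but it makes the result easier to view
Assign a column: in the lambda we capture the whole DF and group by Category, or use groupby(level=0), which means the first index level.

I also timed it, and groupby(level=0) is faster because it takes advantage of the groupby() from the previous step.

You can pick whichever suits you best. Either of the following works: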

(
    df_xls.groupby(['Category','Sub-Category']).agg({'Sales': 'sum'})

    .sort_values(['Category','Sales'],ascending=[True,False])

    .assign(Sales_pct=lambda x: 100*x / (x.groupby(level=0).transform('sum')))
)

Or:

(
    df_xls.groupby(['Category','Sub-Category']).agg({'Sales': 'sum'})

    .sort_values(['Category','Sales'],ascending=[True,False])

    .assign(Sales_pct=lambda x: 100*x / (x.groupby(['Category']).transform('sum')))
)
