How do I assign values based on multiple conditions on existing columns?
- 2025-02-13 08:35:00
- admin (original post)
Problem description:
I want to create a new column with numeric values based on the following conditions:
a. if gender is male & pet1==pet2, points = 5
b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5
All other combinations, points = 0
     gender      pet1      pet2
0      male       dog       dog
1      male       cat       cat
2      male       dog       cat
3    female       cat  squirrel
4    female       dog       dog
5    female  squirrel       cat
6  squirrel       dog       cat
I would like the end result to look like this:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
How can I achieve this?
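For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the sample data from the question. The variable name df matches what the answers use; the import alias pd is the usual convention and is an assumption here, since the question does not show its imports.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})
print(df)
```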
Solution 1:
numpy.select
This is a perfect case for np.select.
It lets us create a column from multiple conditions, and it stays readable as the number of conditions grows:
import numpy as np

# one boolean mask per rule, in order of priority
conditions = [
    df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
    df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]
# value assigned where the corresponding mask is True
choices = [5, 5]
df['points'] = np.select(conditions, choices, default=0)
print(df)
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
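One detail worth knowing when the rules overlap: np.select uses the first condition in the list that is True for a row. The sketch below illustrates this with two overlapping rules and two different point values. The values 5 and 3 and the second rule are hypothetical, chosen only so the two rules can be told apart in the output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

conditions = [
    # rule 1: female with a cat or dog as pet1
    df["gender"].eq("female") & df["pet1"].isin(["cat", "dog"]),
    # rule 2 (hypothetical): both pets are the same, regardless of gender
    df["pet1"].eq(df["pet2"]),
]
choices = [5, 3]  # hypothetical values, one per rule

# Row 4 (female, dog, dog) satisfies both rules; rule 1 comes first, so it gets 5.
points = np.select(conditions, choices, default=0)
print(points.tolist())  # [3, 3, 0, 5, 5, 0, 0]
```

If the order of the two conditions (and choices) were swapped, row 4 would get 3 instead, so it pays to list the most specific rule first.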
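Solution 2:
You can do this using np.where. The conditions use the bitwise operators & and | in place of and and or, and because of operator precedence each comparison has to be wrapped in parentheses. Where the combined condition is true, 5 is returned, otherwise 0: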
In [29]:
df['points'] = np.where(
    ((df['gender'] == 'male') & (df['pet1'] == df['pet2']))
    | ((df['gender'] == 'female') & (df['pet1'].isin(['cat', 'dog']))),
    5, 0)
df
Out[29]:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
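The nested parentheses get hard to read quickly. One way to keep the same np.where logic readable is to give each mask a name first. This is only a sketch; the names male_match and female_cat_dog are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Each comparison is parenthesised because & binds more tightly than ==.
male_match = (df["gender"] == "male") & (df["pet1"] == df["pet2"])
female_cat_dog = (df["gender"] == "female") & df["pet1"].isin(["cat", "dog"])

# 5 where either rule holds, 0 everywhere else.
df["points"] = np.where(male_match | female_cat_dog, 5, 0)
print(df)
```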
Solution 3:
Use apply.
def f(x):
    # x is one row of the dataframe
    if x['gender'] == 'male' and x['pet1'] == x['pet2']:
        return 5
    elif x['gender'] == 'female' and (x['pet1'] == 'cat' or x['pet1'] == 'dog'):
        return 5
    else:
        return 0

df['points'] = df.apply(f, axis=1)
Solution 4:
You can also use the apply function. For example:
def myfunc(gender, pet1, pet2):
    if gender=='male' and pet1==pet2:
        myvalue=5
    elif gender=='female' and (pet1=='cat' or pet1=='dog'):
        myvalue=5
    else:
        myvalue=0
    return myvalue
Then call apply with axis=1 so the function is applied row by row:
df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)
This gives:
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
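As a variation on the row-wise approach (a swapped-in technique, not part of the answer above), the same plain Python function can be fed from zip over the columns instead of apply with axis=1. This avoids building a Series object for every row; no timing was measured for it here, so treat any speed benefit as an assumption to check on your own data.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

def myfunc(gender, pet1, pet2):
    """Return 5 when one of the two rules from the question holds, else 0."""
    if gender == "male" and pet1 == pet2:
        return 5
    if gender == "female" and pet1 in ("cat", "dog"):
        return 5
    return 0

# Iterate over the three columns in lockstep, one tuple of scalars per row.
df["points"] = [
    myfunc(gender, pet1, pet2)
    for gender, pet1, pet2 in zip(df["gender"], df["pet1"], df["pet2"])
]
print(df)
```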
Solution 5:
The apply method described by @RuggeroTurra takes much longer on 500k rows. I ended up using something like
df['result'] = (
    ((df.a == 0) & (df.b != 1)).astype(int) * 2
    + ((df.a != 0) & (df.b != 1)).astype(int) * 3
    + ((df.a == 0) & (df.b == 1)).astype(int) * 4
    + ((df.a != 0) & (df.b == 1)).astype(int) * 5
)
where the apply method took 25 seconds and the approach above took about 18 milliseconds.
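The trick above works because a boolean mask cast to int is 1 where the rule holds and 0 elsewhere, so multiplying by the value and summing gives the right number as long as the rules are mutually exclusive. Below is a small sketch of both the a/b version (with a made-up four-row frame named ab) and the same idea applied to the pets data from the question.

```python
import pandas as pd

# Hypothetical frame covering the four a/b combinations from the answer.
ab = pd.DataFrame({"a": [0, 1, 0, 1], "b": [0, 0, 1, 1]})
ab["result"] = (
    ((ab.a == 0) & (ab.b != 1)).astype(int) * 2
    + ((ab.a != 0) & (ab.b != 1)).astype(int) * 3
    + ((ab.a == 0) & (ab.b == 1)).astype(int) * 4
    + ((ab.a != 0) & (ab.b == 1)).astype(int) * 5
)
print(ab["result"].tolist())  # [2, 3, 4, 5]

# Same idea for the question: both rules give 5, so one combined mask is enough.
df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})
mask = (
    ((df.gender == "male") & (df.pet1 == df.pet2))
    | ((df.gender == "female") & df.pet1.isin(["cat", "dog"]))
)
df["points"] = mask.astype(int) * 5
print(df)
```

Note that if two rules could be true for the same row, the sum would add their values together, which is usually not what you want; np.select does not have that problem.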
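Solution 6:
Writing the conditions as a string expression and evaluating it with eval() is another way to evaluate the conditions and assign values to the column using numpy.where().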
# evaluate the condition
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)
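The expression above relies on and binding more tightly than or. A self-contained sketch of the same approach with explicit parentheses, which some readers may find easier to follow, is shown here; the sample frame is the one from the question and the parentheses are the only change to the expression.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# Inside eval, comparing a column to a list with == behaves like a membership test.
condition = df.eval(
    "(gender == 'male' and pet1 == pet2) "
    "or (gender == 'female' and pet1 == ['cat', 'dog'])"
)

# 5 where the expression is True, 0 elsewhere.
df["points"] = np.where(condition, 5, 0)
print(df)
```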
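If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas way to construct a boolean mask (see note 1).
Another advantage of this method over chained & and/or | operators (used in the other vectorized answers here) is that it is (arguably) more readable.
Note 1: For a dataframe with 105k rows and 4 conditions that each chain two comparisons, eval() builds the boolean mask faster than chaining bitwise operators (about 38 ms versus roughly 49 to 54 ms in the timings below).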
df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)
%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
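Before comparing timings it is worth confirming that the three variants actually return the same values. The sketch below runs the same four conditions as the benchmark on the seven-row sample frame and checks that the results agree; it does not reproduce the timings, which depend on the machine and library versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# 1) String expression evaluated by DataFrame.eval.
via_eval = np.where(
    df.eval(
        "gender == 'male' and pet1 == pet2 "
        "or gender == 'female' and pet1 == ['cat','dog'] "
        "or gender == 'female' and pet2 == ['squirrel','dog'] "
        "or pet1 == 'cat' and pet2 == 'cat'"
    ),
    5, 0,
)

# 2) Chained bitwise operators.
via_where = np.where(
    ((df["gender"] == "male") & (df["pet1"] == df["pet2"]))
    | ((df["gender"] == "female") & df["pet1"].isin(["cat", "dog"]))
    | ((df["gender"] == "female") & df["pet2"].isin(["squirrel", "dog"]))
    | ((df["pet1"] == "cat") & (df["pet2"] == "cat")),
    5, 0,
)

# 3) np.select with one mask per rule.
via_select = np.select(
    [
        df["gender"].eq("male") & df["pet1"].eq(df["pet2"]),
        df["gender"].eq("female") & df["pet1"].isin(["cat", "dog"]),
        df["gender"].eq("female") & df["pet2"].isin(["squirrel", "dog"]),
        df["pet1"].eq("cat") & df["pet2"].eq("cat"),
    ],
    [5, 5, 5, 5],
    default=0,
)

print(np.array_equal(via_eval, via_where), np.array_equal(via_where, via_select))
```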
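Solution 7:
One option is pyjanitor's case_when, which uses pd.Series.mask under the hood.
The basic idea is a pairing of conditions and expected values; you can pass as many pairs as needed, followed by a default value and the target column name: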
# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
    # condition, value
    df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
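To get a feel for what "condition/value pairs plus a default" means in terms of pd.Series.mask, here is a pandas-only sketch. It is not pyjanitor's actual implementation, just an illustration of the idea; the names pairs and points are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

# (condition, value) pairs, highest priority first.
pairs = [
    (df.gender.eq("male") & df.pet1.eq(df.pet2), 5),
    (df.gender.eq("female") & df.pet1.isin(["cat", "dog"]), 5),
]

# Start from the default value, then overwrite wherever a condition is True.
# Applying the pairs in reverse order lets the first pair win on overlaps.
points = pd.Series(0, index=df.index)
for condition, value in reversed(pairs):
    points = points.mask(condition, value)

df["points"] = points
print(df)
```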
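You can also use strings as conditions, as long as they can be evaluated by pd.eval on the parent dataframe. Note that, speed-wise, this may be slower on small datasets: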
df.case_when(
    "gender == 'male' and pet1 == pet2", 5,
    "gender == 'female' and pet1 == ['cat', 'dog']", 5,
    0,
    column_name = 'points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
Anonymous functions also work, which is handy in chained operations:
df.case_when(
    lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
    lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
    0, # default
    column_name = 'points')
     gender      pet1      pet2  points
0      male       dog       dog       5
1      male       cat       cat       5
2      male       dog       cat       0
3    female       cat  squirrel       5
4    female       dog       dog       5
5    female  squirrel       cat       0
6  squirrel       dog       cat       0
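If pyjanitor is not available, a similar chain-friendly pattern can be sketched with plain pandas by passing a lambda to DataFrame.assign and using np.select inside it. This is a pandas/NumPy analogue rather than the case_when API; the follow-up query step is only there to show that the new column can be used later in the same chain.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female", "squirrel"],
    "pet1": ["dog", "cat", "dog", "cat", "dog", "squirrel", "dog"],
    "pet2": ["dog", "cat", "cat", "squirrel", "dog", "cat", "cat"],
})

result = (
    df
    # The lambda receives the dataframe at this point in the chain.
    .assign(points=lambda d: np.select(
        [
            d.gender.eq("male") & d.pet1.eq(d.pet2),
            d.gender.eq("female") & d.pet1.isin(["cat", "dog"]),
        ],
        [5, 5],
        default=0,
    ))
    # Keep only the rows that scored, using the column created above.
    .query("points > 0")
)
print(result)
```

assign returns a new dataframe, so the original df is left without a points column.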