熊猫:删除连续的重复项

2024-12-26 08:43:00
admin
原创
122
摘要:问题描述:在熊猫中删除连续重复项的最有效方法是什么?drop_duplicates 给出的是:In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5]) In [4]: a.drop_duplicates() Out[4]: 1 1 2 ...

问题描述:

在熊猫中删除连续重复项的最有效方法是什么?

drop_duplicates 给出的是:

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]: 
1    1
2    2
4    3
dtype: int64

但我想要的是这个:

In [4]: a.something()
Out[4]: 
1    1
2    2
4    3
5    2
dtype: int64

解决方案 1:

使用shift

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

因此,上面使用了布尔标准,我们将数据框与移动 -1 行的数据框进行比较以创建掩码

另一种方法是使用diff

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

但如果行数很多的话,这会比原始方法慢。

更新

感谢 Bjarke Ebert 指出了一个细微的错误,我实际上应该使用shift(1)或只是shift()因为默认值是 1,这将返回第一个连续的值:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

注意索引值的差异,感谢@BjarkeEbert!

解决方案 2:

以下是一个更新,可使其适用于多列。使用“.any(axis=1)”合并每列的结果:

cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]

解决方案 3:

由于我们追求的most efficient way是性能,让我们使用数组数据来利用 NumPy。我们将对一次性切片并进行比较,类似于前面在中讨论的移位方法@EdChum's post。但是使用 NumPy 切片,我们最终会得到一个无元素的数组,因此我们需要在开始时连接一个True元素以选择第一个元素,因此我们将得到如下实现 -

def drop_consecutive_duplicates(a):
    ar = a.values
    return a[np.concatenate(([True],ar[:-1]!= ar[1:]))]

样本运行 -

In [149]: a
Out[149]: 
1    1
2    2
3    2
4    3
5    2
dtype: int64

In [150]: drop_consecutive_duplicates(a)
Out[150]: 
1    1
2    2
4    3
5    2
dtype: int64

大型阵列上的时间比较@EdChum's solution-

In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))

In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop

In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop

In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop

In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop

因此,有一些进步!

仅获得价值的重大提升!

如果只需要值,我们可以通过简单地索引数组数据来获得很大的提升,就像这样 -

def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True],ar[:-1]!= ar[1:]))]

样本运行 -

In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])

时间安排 -

In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop

In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop

解决方案 4:

pd.Series这是一个处理和的函数pd.Dataframes。您可以屏蔽/删除,选择轴,最后选择使用“任何”或“所有”‘NaN’删除。它在计算时间方面没有优化,但它的优点是稳健且非常清晰。

import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                             keep_first=True,
                                             axis=0, how='all'):

    '''
    #Function built with the help of:
    # 1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    # 2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates
    
    Input:
    df should be a pandas.DataFrame of a a pandas.Series
    Output:
    df of ts with masked or dropped values
    '''
    
    # Mask keeping the first occurrence
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))

    # Drop the values (e.g. rows are deleted)    
    if drop:
        return df.dropna(axis=axis, how=how)        
    # Only mask the values (e.g. become 'NaN')
    else:
        return df   

以下是包含在脚本中的测试代码:


if __name__ == "__main__":
    
    # With time series
    print("With time series:
")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')], 
                    index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
    
    print("#Original ts:")    
    print(ts)

    print("
## 1) Mask keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                   keep_first=True))

    print("
## 2) Mask including the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                   keep_first=False))
    
    print("
## 3) Drop keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                   keep_first=True))
    
    print("
## 4) Drop including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                   keep_first=False))
    
    
    # With dataframes
    print("With dataframe:
")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23
        
    print("#Original df:")
    print(df)

    print("
## 5) Mask keeping the first occurrence:") 
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                   keep_first=True))

    print("
## 6) Mask including the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                   keep_first=False))
    
    print("
## 7) Drop 'any' keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=True,
                                                   how='any'))
    
    print("
## 8) Drop 'all' keeping the first occurrence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=True,
                                                   how='all'))
    
    print("
## 9) Drop 'any' including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=False,
                                                   how='any'))

    print("
## 10) Drop 'all' including the first occurrence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                   keep_first=False,
                                                   how='all'))

预期结果如下:

With time series:

#Original ts:
0     1.0
1     1.0
2     2.0
3     2.0
4     3.0
5     2.0
6     6.0
7     6.0
8     NaN
9     6.0
10    6.0
11    NaN
12    NaN
dtype: float64

## 1) Mask keeping the first occurrence:
0     1.0
1     NaN
2     2.0
3     NaN
4     3.0
5     2.0
6     6.0
7     NaN
8     NaN
9     6.0
10    NaN
11    NaN
12    NaN
dtype: float64

## 2) Mask including the first occurrence:
0     NaN
1     NaN
2     NaN
3     NaN
4     3.0
5     2.0
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
dtype: float64

## 3) Drop keeping the first occurrence:
0    1.0
2    2.0
4    3.0
5    2.0
6    6.0
9    6.0
dtype: float64

## 4) Drop including the first occurrence:
4    3.0
5    2.0
dtype: float64
With dataframe:

#Original df:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5   40.000000  -0.470958 -0.339213
6   40.000000   1.613524  0.271641
7   40.000000  -1.810958 -1.568372
8   40.000000  22.000000  0.230000
9   -0.296557  22.000000  0.230000
10  -0.921238  22.000000  0.230000
11  -0.170195  22.000000  0.230000
12   1.460457  22.000000 -0.295418
13   0.307825  22.000000 -0.759131
14   0.287392  22.000000  0.378315

## 5) Mask keeping the first occurrence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 6) Mask including the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
8        NaN       NaN       NaN
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

## 7) Drop 'any' keeping the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4  40.000000  1.889781 -1.394573

## 8) Drop 'all' keeping the first occurrence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 9) Drop 'any' including the first occurrence:
          0         1         2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742  1.891365
2  1.009388  0.589445  0.927405
3  0.212746 -0.392314 -0.781851

## 10) Drop 'all' including the first occurrence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

解决方案 5:

对于其他 Stack 浏览器,请参考上述 johnml1135 的回答。这将从多个列中删除下一个重复项,但不会删除所有列。当对数据框进行排序时,它将保留第一行,但如果“cols”匹配,则会删除第二行,即使有更多列包含不匹配的信息。

cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]

解决方案 6:

还有另一种方法:

a.loc[a.ne(a.shift())]

方法pandas.Series.ne不等于运算符,因此a.ne(a.shift())相当于a != a.shift()。文档在这里。

解决方案 7:

以下是EdChum 答案的一个变体,它也将连续的 NaN 视为重复项:

def remove_consecutive_duplicates_and_nans(s):
    # By default, `shift` uses NaN as a fill value, which breaks our
    # removal of consecutive NaNs. Hence we use a different sentinel
    # object instead.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[
        (shifted != s)
        & ~(shifted.isna() & s.isna())
    ]

解决方案 8:

创建新列。

df['match'] = df.col1.eq(df.col1.shift())

然后:

df = df[df['match']==False]

解决方案 9:

@johnml1135 推荐的方法对我来说不起作用。

我想出了一个类似的方法:

cols = ['Position', 'Offset']
df = df[df[cols] != df[cols].shift(-1)].dropna()

shift(-1)将保留最后重复的行,并且shift(1)将保留第一行。

相关推荐
  政府信创国产化的10大政策解读一、信创国产化的背景与意义信创国产化,即信息技术应用创新国产化,是当前中国信息技术领域的一个重要发展方向。其核心在于通过自主研发和创新,实现信息技术应用的自主可控,减少对外部技术的依赖,并规避潜在的技术制裁和风险。随着全球信息技术竞争的加剧,以及某些国家对中国在科技领域的打压,信创国产化显...
工程项目管理   1565  
  为什么项目管理通常仍然耗时且低效?您是否还在反复更新电子表格、淹没在便利贴中并参加每周更新会议?这确实是耗费时间和精力。借助软件工具的帮助,您可以一目了然地全面了解您的项目。如今,国内外有足够多优秀的项目管理软件可以帮助您掌控每个项目。什么是项目管理软件?项目管理软件是广泛行业用于项目规划、资源分配和调度的软件。它使项...
项目管理软件   1354  
  信创国产芯片作为信息技术创新的核心领域,对于推动国家自主可控生态建设具有至关重要的意义。在全球科技竞争日益激烈的背景下,实现信息技术的自主可控,摆脱对国外技术的依赖,已成为保障国家信息安全和产业可持续发展的关键。国产芯片作为信创产业的基石,其发展水平直接影响着整个信创生态的构建与完善。通过不断提升国产芯片的技术实力、产...
国产信创系统   21  
  信创生态建设旨在实现信息技术领域的自主创新和安全可控,涵盖了从硬件到软件的全产业链。随着数字化转型的加速,信创生态建设的重要性日益凸显,它不仅关乎国家的信息安全,更是推动产业升级和经济高质量发展的关键力量。然而,在推进信创生态建设的过程中,面临着诸多复杂且严峻的挑战,需要深入剖析并寻找切实可行的解决方案。技术创新难题技...
信创操作系统   27  
  信创产业作为国家信息技术创新发展的重要领域,对于保障国家信息安全、推动产业升级具有关键意义。而国产芯片作为信创产业的核心基石,其研发进展备受关注。在信创国产芯片的研发征程中,面临着诸多复杂且艰巨的难点,这些难点犹如一道道关卡,阻碍着国产芯片的快速发展。然而,科研人员和相关企业并未退缩,积极探索并提出了一系列切实可行的解...
国产化替代产品目录   28  
热门文章
项目管理软件有哪些?
云禅道AD
禅道项目管理软件

云端的项目管理软件

尊享禅道项目软件收费版功能

无需维护,随时随地协同办公

内置subversion和git源码管理

每天备份,随时转为私有部署

免费试用