How to replace NaN values in a DataFrame column

Question:

I have a Pandas DataFrame like this:

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

When I try to apply a function to the Amount column, I get the following error:

ValueError: cannot convert float NaN to integer

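I have tried applying a function using math.isnan, the pandas .replace method, the .sparse data attribute from pandas 0.9, and an if NaN == NaN statement inside a function; I have also looked at this Q/A; none of them works.

How do I do it?

Solution 1:

DataFrame.fillna() or Series.fillna() will do this for you.

Example: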

In [7]: df
Out[7]: 
          0         1
0       NaN       NaN
1 -0.494375  0.570994
2       NaN       NaN
3  1.876360 -0.229738
4       NaN       NaN

In [8]: df.fillna(0)
Out[8]: 
          0         1
0  0.000000  0.000000
1 -0.494375  0.570994
2  0.000000  0.000000
3  1.876360 -0.229738
4  0.000000  0.000000

To fill the NaNs in only one column, select just that column.

In [12]: df[1] = df[1].fillna(0)

In [13]: df
Out[13]: 
          0         1
0       NaN  0.000000
1 -0.494375  0.570994
2       NaN  0.000000
3  1.876360 -0.229738
4       NaN  0.000000

Or you could use the built-in column-specific functionality:

df = df.fillna({1: 0})

Solution 2:

Slicing is not guaranteed to return a view or a copy, so assign the filled column back:

df['column'] = df['column'].fillna(value)

Solution 3:

You can use replace to change NaN to 0:

import pandas as pd
import numpy as np

# for column
df['column'] = df['column'].replace(np.nan, 0)

# for whole dataframe
df = df.replace(np.nan, 0)

# inplace
df.replace(np.nan, 0, inplace=True)

Solution 4:

The code below worked for me.

import pandas

df = pandas.read_csv('somefile.txt')

df = df.fillna(0)

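Solution 5:

I just want to offer a special case. If you are using a multi-index or otherwise using an index slicer, the inplace=True option may not be enough to update the slice you have chosen. For example, in a 2x2 level multi-index, this will not change any values (as of pandas 0.15):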

idx = pd.IndexSlice
df.loc[idx[:,mask_1], idx[mask_2,:]].fillna(value=0, inplace=True)

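The "problem" is that the chaining breaks fillna's ability to update the original dataframe. I put "problem" in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really did run into it), but the same may apply to fewer levels of indexes, depending on how you slice.

The solution is DataFrame.update: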

df.update(df.loc[idx[:,mask_1], idx[[mask_2],:]].fillna(value=0))

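It is one line, reads reasonably well (sort of), and avoids any unnecessary clutter from intermediate variables or loops, while letting you apply fillna to any multi-level slice you like!

If anybody can find a place where this does not work, please post in the comments. I have been experimenting with it and reading the source, and it seems to solve at least my multi-index slice problems.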
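A minimal, self-contained sketch of the pattern described above, on a small made-up frame. The level labels ("a", "b", "x", "y", "m", "n", "p", "q") and the chosen slice are assumptions for illustration; they stand in for the mask_1 and mask_2 selectors in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2-level MultiIndex on both rows and columns
rows = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
cols = pd.MultiIndex.from_product([["m", "n"], ["p", "q"]])
df = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4), index=rows, columns=cols)
df.iloc[0, 0] = np.nan  # row ("a", "x"), column ("m", "p"): inside the slice below
df.iloc[3, 3] = np.nan  # row ("b", "y"), column ("n", "q"): outside the slice

idx = pd.IndexSlice
# Fill NaN only where the second row level is "x" and the first column level is "m"
filled = df.loc[idx[:, "x"], idx["m", :]].fillna(0)

# update() aligns on index and columns and writes the non-NaN values back into df
df.update(filled)
```

After the update, only the NaN inside the selected slice has been replaced; the one outside it is left alone.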

Solution 6:

You can also use a dictionary to fill the NaN values of specific columns in the DataFrame, rather than filling the whole DataFrame with a single value.

import pandas as pd

df = pd.read_excel('example.xlsx')
df.fillna( {
        'column1': 'Write your values here',
        'column2': 'Write your values here',
        'column3': 'Write your values here',
        'column4': 'Write your values here',
        # ... one entry per remaining column ...
        'column-n': 'Write your values here'} , inplace=True)

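Solution 7:

A simple way to fill missing values:

Filling string columns: when a string column has missing (NaN) values, fill them with the most frequent value.

df['string column name'] = df['string column name'].fillna(df['string column name'].mode().values[0])

Filling numeric columns: when a numeric column has missing (NaN) values, fill them with the mean.

df['numeric column name'] = df['numeric column name'].fillna(df['numeric column name'].mean())

Filling NaN with zero:

df['column name'] = df['column name'].fillna(0)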
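As a rough illustration of the mode and mean fills, here is a small sketch on invented data. The column names "city" and "price" and their values are hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column and one numeric column, each with a gap
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", np.nan, "Rome"],
    "price": [10.0, np.nan, 30.0, 20.0],
})

# Most frequent value for the string column; mode() returns a Series, so take its first entry
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

# Mean of the non-missing values for the numeric column
df["price"] = df["price"].fillna(df["price"].mean())
```

Whether the mode or the mean is a sensible fill value depends on the data; this only shows the mechanics.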

Solution 8:

Replacing NA values in pandas

df['column_name'].fillna(value_to_be_replaced, inplace=True)

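With inplace=False (the default), df is not updated; the call returns the filled result instead. On recent pandas versions, calling fillna(..., inplace=True) on a selected column such as df['column_name'] may leave df unchanged, so assigning the returned value back is the more dependable form.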
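The difference between returning a result and updating the frame can be seen in a short sketch. The data here is made up; only the column name is borrowed from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": [1.0, np.nan, 3.0]})

# fillna returns a new Series by default; df itself is untouched at this point
filled = df["column_name"].fillna(0)
unchanged = int(df["column_name"].isna().sum())  # df still has one missing value

# Assigning the result back is what updates the frame
df["column_name"] = filled
```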

Solution 9:

Given that the Amount column in the table above is of integer type, the following would be a solution:

df['Amount'] = df['Amount'].fillna(0).astype(int)

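Similarly, you can fill it with other data types such as float, str, and so on.

In particular, I would consider the data type when comparing values within the same column.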
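The sketch below ties this back to the error in the question, using three rows shaped like the question's table. The nullable "Int64" dtype in the second option is an addition that the answer does not mention; it is shown only as one possible way to keep the gaps instead of turning them into 0.

```python
import numpy as np
import pandas as pd

# A few rows shaped like the question's table
df = pd.DataFrame({"itm": [485, 485, 489], "Amount": [71781, np.nan, 64731]})

# int() cannot represent NaN, which is what the reported ValueError is about
try:
    df["Amount"].apply(int)
    raised = False
except ValueError:
    raised = True

# Option 1: replace NaN with 0, then cast to a plain integer dtype
as_int = df["Amount"].fillna(0).astype(int)

# Option 2 (assumption, not from the answer): keep the gaps as <NA>
# by using pandas' nullable integer dtype
as_nullable = df["Amount"].astype("Int64")
```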

Solution 10:

To replace NaN in different columns with different values:

replacement = {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df = df.fillna(value=replacement)
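A small runnable sketch of the dictionary form follows. The frame and the extra column "column_D" are invented to show that columns missing from the dictionary are left as they are.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_A": [1.0, np.nan],
    "column_B": [np.nan, 2.0],
    "column_D": [np.nan, 5.0],  # not listed in the dict, so it keeps its NaN
})

replacement = {"column_A": 0, "column_B": -999}
result = df.fillna(value=replacement)
```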

Solution 11:

This works for me, but no one has mentioned it. Could there be something wrong with it?

df.loc[df['column_name'].isnull(), 'column_name'] = 0
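One way to probe that question is to compare the boolean-mask assignment with a plain column-level fillna on a small made-up column. This is only a sanity check on one example, not a proof that the two are interchangeable in every situation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": [1.5, np.nan, 3.0, np.nan]})

# Boolean-mask assignment: writes 0 into the rows where the column is NaN
by_loc = df.copy()
by_loc.loc[by_loc["column_name"].isnull(), "column_name"] = 0

# Column-level fillna for comparison
by_fillna = df.copy()
by_fillna["column_name"] = by_fillna["column_name"].fillna(0)

same = by_loc.equals(by_fillna)
```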

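Solution 12:

There are mainly two options available. For imputing or filling missing values (NaN / np.nan) with numeric replacements only (across columns):

df['Amount'].fillna(value=None, method= ,axis=1,) is sufficient:

From the documentation:

value: scalar, dict, Series, or DataFrame. Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (Values not in the dict/Series/DataFrame will not be filled.) This value cannot be a list.

This answer takes that to mean that "strings" or "constants" are no longer permissible to be imputed, although the quoted text only rules out lists; a scalar string is still a valid value.

For more specialized imputation, use SimpleImputer():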

import numpy as np
from sklearn.impute import SimpleImputer

# Fill NaN in the two selected columns with a constant value
si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['Col-1', 'Col-2']])

Solution 13:

If you want to fill NaN for a specific column, you can use loc:

import numpy as np
import pandas as pd

d1 = {"Col1": ['A', 'B', 'C'],
      "fruits": ['Avocado', 'Banana', np.nan]}
d1 = pd.DataFrame(d1)

Output:

  Col1   fruits
0    A  Avocado
1    B   Banana
2    C      NaN

d1.loc[d1.Col1=='C', 'fruits'] = 'Carrot'

Output:

  Col1   fruits
0    A  Avocado
1    B   Banana
2    C   Carrot

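Solution 14:

I think it is also worth mentioning and explaining the parameter configuration of fillna(), such as method, axis, limit, and so on.

From the documentation we have: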

Series.fillna(value=None, method=None, axis=None, 
                 inplace=False, limit=None, downcast=None)
Fill NA/NaN values using the specified method.

Parameters

value [scalar, dict, Series, or DataFrame] Value to use to 
 fill holes (e.g. 0), alternately a dict/Series/DataFrame 
 of values specifying which value to use for each index 
 (for a Series) or column (for a DataFrame). Values not in 
 the dict/Series/DataFrame will not be filled. This 
 value cannot be a list.

method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, 
 default None] Method to use for filling holes in 
 reindexed Series. pad / ffill: propagate last valid 
 observation forward to next valid. backfill / bfill: 
 use next valid observation to fill gap.

axis [{0 or ‘index’}] Axis along which to fill missing values.

inplace [bool, default False] If True, fill 
 in-place. Note: this will modify any other views
 on this object (e.g., a no-copy slice for a 
 column in a DataFrame).

limit [int, default None] If method is specified, 
 this is the maximum number of consecutive NaN 
 values to forward/backward fill. In other words, 
 if there is a gap with more than this number of 
 consecutive NaNs, it will only be partially filled. 
 If method is not specified, this is the maximum 
 number of entries along the entire axis where NaNs
 will be filled. Must be greater than 0 if not None.

downcast [dict, default is None] A dict of item->dtype 
 of what to downcast if possible, or the string ‘infer’ 
 which will try to downcast to an appropriate equal 
 type (e.g. float64 to int64 if possible).

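OK. Let's start with the method= parameter, which offers forward fill (ffill) and backward fill (bfill); ffill copies the previous non-missing value forward.

For example: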

import pandas as pd
import numpy as np
inp = [{'c1':10, 'c2':np.nan, 'c3':200}, {'c1':np.nan,'c2':110, 'c3':210}, {'c1':12,'c2':np.nan, 'c3':220},{'c1':12,'c2':130, 'c3':np.nan},{'c1':12,'c2':np.nan, 'c3':240}]
df = pd.DataFrame(inp)

     c1     c2     c3
0  10.0    NaN  200.0
1   NaN  110.0  210.0
2  12.0    NaN  220.0
3  12.0  130.0    NaN
4  12.0    NaN  240.0

Forward fill:

df.fillna(method="ffill")

     c1     c2     c3
0  10.0    NaN  200.0
1  10.0  110.0  210.0
2  12.0  110.0  220.0
3  12.0  130.0  220.0
4  12.0  130.0  240.0

Backward fill:

df.fillna(method="bfill")

     c1     c2     c3
0  10.0  110.0  200.0
1  12.0  110.0  210.0
2  12.0  130.0  220.0
3  12.0  130.0  240.0
4  12.0    NaN  240.0

The axis parameter lets us choose the direction of the fill:

Fill directions:

ffill:

Axis = 1 
Method = 'ffill'
----------->
  direction 

df.fillna(method="ffill", axis=1)

     c1     c2     c3
0  10.0   10.0  200.0
1   NaN  110.0  210.0
2  12.0   12.0  220.0
3  12.0  130.0  130.0
4  12.0   12.0  240.0

Axis = 0 # by default 
Method = 'ffill'
|
|       # direction 
|
V
e.g: # This is the ffill default
df.fillna(method="ffill", axis=0)

     c1     c2     c3
0  10.0    NaN  200.0
1  10.0  110.0  210.0
2  12.0  110.0  220.0
3  12.0  130.0  220.0
4  12.0  130.0  240.0

bfill:

axis= 0
method = 'bfill'
^
|
|
|
df.fillna(method="bfill", axis=0)

    c1     c2      c3
0   10.0    110.0   200.0
1   12.0    110.0   210.0
2   12.0    130.0   220.0
3   12.0    130.0   240.0
4   12.0      NaN   240.0

axis = 1
method = 'bfill'
<-----------
df.fillna(method="bfill", axis=1)
        c1     c2       c3
0    10.0   200.0   200.0
1   110.0   110.0   210.0
2    12.0   220.0   220.0
3    12.0   130.0     NaN
4    12.0   240.0   240.0

# alias:
#  'ffill' == 'pad'
#  'bfill' == 'backfill'
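Recent pandas releases mark fillna(method=...) as deprecated and steer users toward the dedicated ffill() and bfill() methods, so the same four fill directions can be sketched with those methods instead. The variable names down, up, right, and left are only labels chosen for this example.

```python
import numpy as np
import pandas as pd

inp = [{'c1': 10, 'c2': np.nan, 'c3': 200}, {'c1': np.nan, 'c2': 110, 'c3': 210},
       {'c1': 12, 'c2': np.nan, 'c3': 220}, {'c1': 12, 'c2': 130, 'c3': np.nan},
       {'c1': 12, 'c2': np.nan, 'c3': 240}]
df = pd.DataFrame(inp)

down = df.ffill()          # counterpart of fillna(method="ffill", axis=0)
up = df.bfill()            # counterpart of fillna(method="bfill", axis=0)
right = df.ffill(axis=1)   # counterpart of fillna(method="ffill", axis=1)
left = df.bfill(axis=1)    # counterpart of fillna(method="bfill", axis=1)
```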

The limit parameter:

df
    c1     c2      c3
0   10.0      NaN   200.0
1    NaN    110.0   210.0
2   12.0      NaN   220.0
3   12.0    130.0     NaN
4   12.0      NaN   240.0

Replace only the first NaN element in each column:

df.fillna(value = 'Unavailable', limit=1)
            c1           c2          c3
0          10.0 Unavailable       200.0
1   Unavailable       110.0       210.0
2          12.0         NaN       220.0
3          12.0       130.0 Unavailable
4          12.0         NaN       240.0

df.fillna(value = 'Unavailable', limit=2)

           c1            c2          c3
0          10.0 Unavailable       200.0
1   Unavailable       110.0       210.0
2          12.0 Unavailable       220.0
3          12.0       130.0 Unavailable
4          12.0         NaN       240.0

The downcast parameter:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      4 non-null      float64
 1   c2      2 non-null      float64
 2   c3      4 non-null      float64
dtypes: float64(3)
memory usage: 248.0 bytes

df.fillna(method="ffill",downcast='infer').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      5 non-null      int64  
 1   c2      4 non-null      float64
 2   c3      5 non-null      int64  
dtypes: float64(1), int64(2)
memory usage: 248.0 bytes
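Recent pandas versions also flag the downcast keyword as deprecated. One possible substitute, sketched here as an assumption rather than an official replacement, is to forward-fill first and then cast explicitly those columns that have no NaN left and contain only whole numbers.

```python
import numpy as np
import pandas as pd

inp = [{'c1': 10, 'c2': np.nan, 'c3': 200}, {'c1': np.nan, 'c2': 110, 'c3': 210},
       {'c1': 12, 'c2': np.nan, 'c3': 220}, {'c1': 12, 'c2': 130, 'c3': np.nan},
       {'c1': 12, 'c2': np.nan, 'c3': 240}]
df = pd.DataFrame(inp)

filled = df.ffill()

# Columns that no longer contain NaN and hold only whole numbers can be cast safely
whole = [c for c in filled.columns
         if filled[c].notna().all() and (filled[c] % 1 == 0).all()]
result = filled.astype({c: "int64" for c in whole})
```

Here c2 stays float64 because its first row is still NaN after the forward fill.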

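Solution 15:

If you are reading data with missing values from a file using read_csv or similar, you can pass keep_default_na=False to read the missing values as empty strings (""). Depending on the use case, this is useful because it achieves what fillna or replace does in a single function call (one less copy in memory).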

df = pd.read_csv(filepath, keep_default_na=False)

# the above is same as
df = pd.read_csv(filepath).fillna("")
# or
df = pd.read_csv(filepath).replace(np.nan, "")
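The effect of keep_default_na=False can be checked without a file by reading from an in-memory string. The CSV text and its column names below are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with two empty fields
csv_text = "name,city\nAnn,\n,Rome\n"

default = pd.read_csv(io.StringIO(csv_text))
as_strings = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Count missing values under each setting
n_missing_default = int(default.isna().sum().sum())
n_missing_strings = int(as_strings.isna().sum().sum())
```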

If the dataframe contains numbers, you can pass dtypes to read_csv to build a dataframe whose columns have the desired dtypes.

df = pd.read_csv(filepath, keep_default_na=False, dtype={"col1": "Int64", "col2": "string", "col3": "Float64"})

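Another way to replace NaN is via the mask()/where() methods. They are similar methods: mask replaces the values that satisfy the condition, whereas where replaces the values that do not satisfy it. So to use them, we just need to filter the NaN values and replace them with the desired value.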

import pandas as pd

df = pd.DataFrame({'a': [1, float('nan'), float('nan')], 'b': [float('nan'), 'a', 'b']})

df = df.where(df.notna(), 10)                 # for the entire dataframe
df['a'] = df['a'].where(df['a'].notna(), 10)  # for a single column

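One advantage of this method is that we can replace NaN values conditionally. In the following example, NaN values in df are replaced by 10 only where the condition cond is satisfied.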

cond = pd.DataFrame({'a': [True, True, False], 'b':[False, True, True]})
df = df.mask(df.isna() & cond, 10)

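Under the hood, fillna() calls where(), which in turn calls numpy.where() if the dataframe is small and numexpr.evaluate if it is large. So fillna, mask, and where are essentially the same method for replacing NaN values. On the other hand, replace() (another method given on this page) is a numpy.putmask operation. Since numexpr is faster than numpy for large arrays, the other methods may outperform replace on very large dataframes.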
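Whatever the internals, the four spellings can be checked against each other on a small numeric frame. This sketch only compares results; it says nothing about their relative speed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

via_fillna = df.fillna(10)
via_where = df.where(df.notna(), 10)   # keep values where the condition holds
via_mask = df.mask(df.isna(), 10)      # replace values where the condition holds
via_replace = df.replace(np.nan, 10)
```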

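As a side note, dataframes often contain the literal string 'NaN' instead of actual NaN values. To make sure a dataframe really has NaN values, check with df.isna().any(). If it returns False (when it should contain NaN), then you probably have 'NaN' strings, in which case use replace to convert them into NaN or, even better, replace them with the value you meant to use. For example: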

df = pd.DataFrame({'a': ['a', 'b', 'NaN']})
df = df.replace('NaN', 'c')
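The check described above can be sketched as follows; the flag names has_real_nan and has_text_nan are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", "NaN"]})

has_real_nan = bool(df.isna().any().any())      # 'NaN' here is just text
has_text_nan = bool((df == "NaN").any().any())  # looks for the literal string

# Turn the text into a real missing value so that isna() and fillna() can see it
cleaned = df.replace("NaN", np.nan)
```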

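Solution 16:

You can also replace NaN with 0 using a lambda expression.

Here is an example:

import pandas as pd

# pd.isnull(x) tests each element; the bare attribute dss2['Score'].isnull is always truthy
dss3 = dss2['Score'].apply(lambda x: 0 if pd.isnull(x) else x)
print(dss3)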