比较两个 DataFrame 并并排输出它们的差异
- 2025-01-22 08:45:00
- admin 原创
- 80
问题描述:
我试图准确突出两个数据框之间发生的变化。
假设我有两个 Python Pandas 数据框:
"StudentRoster Jan-1":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True
"StudentRoster Jan-2":
id Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation
我的目标是输出一个 HTML 表:
标识已更改的行(可以是整数、浮点数、布尔值或字符串)
输出具有相同、OLD 和 NEW 值的行(最好是输出到 HTML 表中),以便消费者可以清楚地看到两个数据框之间发生了什么变化:
"StudentRoster Difference Jan-1 - Jan-2":
id Name score isEnrolled Comment
112 Nick was 1.11| now 1.21 False Graduated
113 Zoe 4.12 was True | now False was "" | now "On vacation"
我想我可以逐行、逐列进行比较,但有没有更简单的方法呢?
解决方案 1:
第一部分与Constantine类似,可以获取哪些行为空的布尔值*:
In [21]: ne = (df1 != df2).any(1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
然后我们可以看到哪些条目发生了变化:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id col
1 score True
2 isEnrolled True
Comment True
dtype: bool
这里第一个条目是索引,第二个条目是已更改的列。
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
from to
id col
1 score 1.11 1.21
2 isEnrolled True False
Comment None On vacation
df1
* 注意:和在这里共享相同的索引非常重要df2
。为了克服这种歧义,您可以确保只使用查看共享标签df1.index & df2.index
,但我想我会把它留作练习。
解决方案 2:
突出显示两个 DataFrame 之间的差异
可以使用 DataFrame 样式属性来突出显示存在差异的单元格的背景颜色。
使用原始问题中的示例数据
第一步是使用concat
函数水平连接 DataFrames,并使用参数区分每个帧keys
:
df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
axis='columns', keys=['First', 'Second'])
df_all
交换列级别并将相同的列名放在一起可能会更容易:
df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final
现在,更容易发现框架中的差异。但是,我们可以更进一步,使用该style
属性突出显示不同的单元格。我们定义了一个自定义函数来执行此操作,您可以在文档的这一部分中看到。
def highlight_diff(data, color='yellow'):
attr = 'background-color: {}'.format(color)
other = data.xs('First', axis='columns', level=-1)
return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
index=data.index, columns=data.columns)
df_final.style.apply(highlight_diff, axis=None)
这将突出显示具有缺失值的单元格。您可以填充它们或提供额外的逻辑,以便它们不会被突出显示。
解决方案 3:
这个答案只是扩展了@Andy Hayden 的答案,使其能够适应数字字段的情况nan
,并将其包装成一个函数。
import pandas as pd
import numpy as np
def diff_pd(df1, df2):
"""Identify differences between two pandas DataFrames"""
assert (df1.columns == df2.columns).all(), \n "DataFrame column names are different"
if any(df1.dtypes != df2.dtypes):
"Data Types are different, trying to convert"
df2 = df2.astype(df1.dtypes)
if df1.equals(df2):
return None
else:
# need to account for np.nan != np.nan returning True
diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
ne_stacked = diff_mask.stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(diff_mask)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
return pd.DataFrame({'from': changed_from, 'to': changed_to},
index=changed.index)
因此,对于您的数据(稍微编辑一下,在分数列中有一个 NaN):
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
DF1 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.11 False "Graduated"
113 Zoe NaN True " "
""")
DF2 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.21 False "Graduated"
113 Zoe NaN False "On vacation" """)
df1 = pd.read_table(DF1, sep='s+', index_col='id')
df2 = pd.read_table(DF2, sep='s+', index_col='id')
diff_pd(df1, df2)
输出:
from to
id col
112 score 1.11 1.21
113 isEnrolled True False
Comment On vacation
解决方案 4:
import pandas as pd
import io
texts = ['''\nid Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.11 False Graduated
113 Zoe 4.12 True ''',
'''\nid Name score isEnrolled Comment
111 Jack 2.17 True He was late to class
112 Nick 1.21 False Graduated
113 Zoe 4.12 False On vacation''']
df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2])
print(df)
# id Name score isEnrolled Comment
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.11 False Graduated
# 2 113 Zoe 4.12 True NaN
# 0 111 Jack 2.17 True He was late to class
# 1 112 Nick 1.21 False Graduated
# 2 113 Zoe 4.12 False On vacation
df.set_index(['id', 'Name'], inplace=True)
print(df)
# score isEnrolled Comment
# id Name
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.11 False Graduated
# 113 Zoe 4.12 True NaN
# 111 Jack 2.17 True He was late to class
# 112 Nick 1.21 False Graduated
# 113 Zoe 4.12 False On vacation
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)
印刷
score isEnrolled Comment
id Name
111 Jack 2.17 True He was late to class
112 Nick 1.11 | 1.21 False Graduated
113 Zoe 4.12 True | False nan | On vacation
解决方案 5:
熊猫> = 1.1:DataFrame.compare
使用 pandas 1.1,您只需一次函数调用即可复制 Ted Petrou 的输出。示例取自文档:
pd.__version__
# '1.1.0'
df1.compare(df2)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 NaN NaN NaN NaN
2 NaN NaN 1.0 0.0 NaN On vacation
此处,“self” 指的是 LHS 数据帧,而“other” 指的是 RHS 数据帧。默认情况下,相等的值会替换为 NaN,因此您可以只关注差异。如果您还想显示相等的值,请使用
df1.compare(df2, keep_equal=True, keep_shape=True)
score isEnrolled Comment
self other self other self other
1 1.11 1.21 False False Graduated Graduated
2 4.12 4.12 True False NaN On vacation
您还可以使用以下方法更改比较轴align_axis
:
df1.compare(df2, align_axis='index')
score isEnrolled Comment
1 self 1.11 NaN NaN
other 1.21 NaN NaN
2 self NaN 1.0 NaN
other NaN 0.0 On vacation
这将按行而不是按列比较值。
解决方案 6:
我遇到过这个问题,但在找到这篇文章之前找到了答案:
根据 unutbu 的回答,加载您的数据......
import pandas as pd
import io
texts = ['''\nid Name score isEnrolled Date
111 Jack True 2013-05-01 12:00:00
112 Nick 1.11 False 2013-05-12 15:05:23
Zoe 4.12 True ''',
'''\nid Name score isEnrolled Date
111 Jack 2.17 True 2013-05-01 12:00:00
112 Nick 1.21 False
Zoe 4.12 False 2013-05-01 12:00:00''']
df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])
...定义你的diff函数...
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
然后您可以简单地使用面板来得出结论:
my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)
# id Name score isEnrolled Date
#0 111 Jack nan | 2.17 True 2013-05-01 12:00:00
#1 112 Nick 1.11 | 1.21 False 2013-05-12 15:05:23 | NaT
#2 nan | nan Zoe 4.12 True | False NaT | 2013-05-01 12:00:00
顺便说一句,如果你使用 IPython Notebook,你可能喜欢使用彩色差异函数来根据单元格是不同、相等还是左/右为空来赋予颜色:
from IPython.display import HTML
pd.options.display.max_colwidth = 500 # You need this, otherwise pandas
# will limit your HTML strings to 50 characters
def report_diff(x):
if x[0]==x[1]:
return unicode(x[0].__str__())
elif pd.isnull(x[0]) and pd.isnull(x[1]):
return u'<table style="background-color:#00ff00;font-weight:bold;">'+\n '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
return u'<table style="background-color:#ffff00;font-weight:bold;">'+\n '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
return u'<table style="background-color:#0000ff;font-weight:bold;">'+\n '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
else:
return u'<table style="background-color:#ff0000;font-weight:bold;">'+\n '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])
HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))
解决方案 7:
使用 concat 和 drop_duplicates 的不同方法:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
DF1 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.11 False "Graduated"
113 Zoe NaN True " "
""")
DF2 = StringIO("""id Name score isEnrolled Comment
111 Jack 2.17 True "He was late to class"
112 Nick 1.21 False "Graduated"
113 Zoe NaN False "On vacation" """)
df1 = pd.read_table(DF1, sep='s+', index_col='id')
df2 = pd.read_table(DF2, sep='s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)
输出:
Name score isEnrolled Comment
id
1 112 Nick 1.11 False Graduated
113 Zoe NaN True
2 112 Nick 1.21 False Graduated
113 Zoe NaN False On vacation
解决方案 8:
如果两个数据框中的 ID 相同,那么找出更改的内容实际上非常容易。只需执行frame1 != frame2
即可获得一个布尔数据框,其中每个True
数据都是已更改的数据。从中,您可以通过 轻松获取每个已更改行的索引changedids = frame1.index[np.any(frame1 != frame2,axis=1)]
。
解决方案 9:
在摆弄了@journois的答案之后,我能够使用 MultiIndex 而不是 Panel 来使其工作,因为Panel 已被贬低。
首先,创建一些虚拟数据:
df1 = pd.DataFrame({
'id': ['111', '222', '333', '444', '555'],
'let': ['a', 'b', 'c', 'd', 'e'],
'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
'id': ['111', '222', '333', '444', '666'],
'let': ['a', 'b', 'c', 'D', 'f'],
'num': ['1', '2', 'Three', '4', '6'],
})
然后,定义你的diff函数,在这种情况下,我将使用他的答案中的函数report_diff
保持不变:
def report_diff(x):
return x[0] if x[0] == x[1] else '{} | {}'.format(*x)
然后,我将把数据连接到 MultiIndex 数据框中:
df_all = pd.concat(
[df1.set_index('id'), df2.set_index('id')],
axis='columns',
keys=['df1', 'df2'],
join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]
最后,我将应用report_diff
每个列组:
df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
输出:
let num
111 a 1
222 b 2
333 c 3 | Three
444 d | D 4
555 e | nan 5 | nan
666 nan | f nan | 6
就这样!
解决方案 10:
扩展@cge 的答案,这对于提高结果的可读性来说非常酷:
a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
完整演示示例:
import numpy as np, pandas as pd
a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777
a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
b[a != b][np.any(a != b, axis=1)]
,rsuffix='_b', how='outer'
).fillna('')
示例结果:
在线演示
解决方案 11:
以下是使用选择和合并的另一种方法:
In [6]: # first lets create some dummy dataframes with some column(s) different
...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})
In [7]: df1
Out[7]:
a b c
0 -5 10 20
1 -4 11 21
2 -3 12 22
3 -2 13 23
4 -1 14 24
In [8]: df2
Out[8]:
a b c
0 -5 10 20
1 -4 11 101
2 -3 12 102
3 -2 13 103
4 -1 14 104
In [10]: # make condition over the columns you want to comapre
...: condition = df1['c'] != df2['c']
...:
...: # select rows from each dataframe where the condition holds
...: diff1 = df1[condition]
...: diff2 = df2[condition]
In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
a b c_before c_after
0 -4 11 21 101
1 -3 12 22 102
2 -2 13 23 103
3 -1 14 24 104
以下是 Jupyter 截图中的相同内容:
解决方案 12:
如果您发现此线程试图比较测试中的数据帧,请查看assert_frame_equal
方法:https ://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html
解决方案 13:
下面实现了一个查找两个数据框之间不对称差异的函数:(基于pandas 的集合差异) GIST: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03
def diff_df(df1, df2, how="left"):
"""
Find Difference of rows for given two dataframes
this function is not symmetric, means
diff(x, y) != diff(y, x)
however
diff(x, y, how='left') == diff(y, x, how='right')
Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
"""
if (df1.columns != df2.columns).any():
raise ValueError("Two dataframe columns must match")
if df1.equals(df2):
return None
elif how == 'right':
return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
elif how == 'left':
return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
else:
raise ValueError('how parameter supports only "left" or "right keywords"')
例子:
df1 = pd.DataFrame(d1)
Out[1]:
Comment Name isEnrolled score
0 He was late to class Jack True 2.17
1 Graduated Nick False 1.11
2 Zoe True 4.12
df2 = pd.DataFrame(d2)
Out[2]:
Comment Name isEnrolled score
0 He was late to class Jack True 2.17
1 On vacation Zoe True 4.12
diff_df(df1, df2)
Out[3]:
Comment Name isEnrolled score
1 Graduated Nick False 1.11
2 Zoe True 4.12
diff_df(df2, df1)
Out[4]:
Comment Name isEnrolled score
1 On vacation Zoe True 4.12
# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]:
Comment Name isEnrolled score
1 On vacation Zoe True 4.12
解决方案 14:
从 v1.1.0 开始,您应该能够使用 Dataframe.compare 来执行此操作
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
解决方案 15:
这个答案是关于如何获取一个简单的数据框,这是与其他两个答案的区别,这与操作问题类似,但并不完全相同。
方法
具体来说,您有 2 个选择:
您将获得一个数据框,其中包含第一个数据框中的所有行,但不包含在第二个数据框中。
您将获得一个数据框,其中包含第二个数据框中的所有行,这些行不包含在第一个数据框中
您将获得一个包含选项 1 和选项 2 的所有行的数据框
代码
# option 1
def diff_pd(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
common_rows = df1[df1.isin(df2.to_dict(orient='list')).all(axis=1)]
return = df1[~df1.isin(common_rows.to_dict(orient='list')).all(axis=1)]
# option 2
def diff_pd(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
common_rows = df1[df1.isin(df2.to_dict(orient='list')).all(axis=1)]
return = df2[~df2.isin(common_rows.to_dict(orient='list')).all(axis=1)]
对于第三个也是最复杂的选项,您只需将选项 1 和 2 的结果连接起来。
# option 3
def diff_pd(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
common_rows = df1[df1.isin(df2.to_dict(orient='list')).all(axis=1)]
df1 = df1[~df1.isin(common_rows.to_dict(orient='list')).all(axis=1)]
df2 = df2[~df2.isin(common_rows.to_dict(orient='list')).all(axis=1)]
return pd.concat([df1, df2])
解决方案 16:
import pandas as pd
import numpy as np
df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')
df1= srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]
srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]
csk = csk.reset_index(drop=True)
csk
srh = srh.reset_index(drop=True)
srh
new = pd.concat([srh, csk], axis=1)
new.head()
** PLAYER TYPE PLAYER TYPE 0 David Warner Batsman ... MS Dhoni Captain 1 Bhuvaneshwar Kumar Bowler ... Ravindra Jadeja All-Rounder 2 Manish Pandey Batsman ... Suresh Raina All-Rounder 3 Rashid Khan Arman Bowler ... Kedar Jadhav All-Rounder 4 Shikhar Dhawan Batsman .... Dwayne Bravo All-Rounder