Creating a DataFrame from a dictionary whose entries have different lengths
- 2024-12-30 08:42:00
- admin, original post
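Problem description:
Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the arrays are not all the same length.
How can I create a DataFrame in which each column holds a different entry?
When I try: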
import pandas as pd
import numpy as np
from string import ascii_uppercase # from the standard library
# repeatable sample data
np.random.seed(2023)
data = {k: np.random.randn(v) for k, v in zip(ascii_uppercase[:10], range(10, 20))}
df = pd.DataFrame(data)
I get:
ValueError: arrays must all be the same length
Is there any way around this? I am happy to have pandas use NaN to pad the columns for the shorter entries.
Desired result
A B C D E F G H I J
0 0.711674 -1.076522 -1.502178 -1.519748 0.340619 0.051132 0.036537 0.367296 1.056500 -1.186943
1 -0.324485 -0.325682 -1.379593 2.097329 -1.253501 -0.238061 2.431822 -0.576828 -0.733918 -0.540638
2 -1.001871 -1.035498 -0.204455 0.892562 0.370788 -0.208009 0.422599 -0.416005 -0.083968 -0.638495
3 0.236251 -0.426320 0.642125 1.596488 0.455254 0.401304 1.843922 -0.137542 0.127288 0.150411
4 -0.102160 -1.029361 -0.181176 -0.638762 -2.283720 0.183169 -0.221562 1.294987 0.344423 0.919450
5 -1.141293 -0.521774 0.771749 -1.133047 -0.000822 1.235830 0.337117 0.520589 0.685970 0.910146
6 2.654407 -0.422758 0.741523 0.656597 2.398876 -0.291800 -0.557180 -0.194273 0.399908 1.605234
7 1.440605 -0.099244 1.324763 0.595787 -2.583105 0.029992 0.053141 -0.385593 0.893458 0.667165
8 0.098902 -1.380258 0.439287 -0.811120 1.311009 -0.868404 1.053804 -3.065784 0.384793 0.950338
9 -3.121532 0.301903 -0.557873 -0.300535 -1.579478 0.604346 -0.658515 -0.668181 0.641113 0.734329
10 NaN -1.033599 0.927080 1.008391 -0.840683 0.728554 1.844449 0.056965 -0.577314 1.015465
11 NaN NaN -0.600727 -1.087762 -0.165509 1.364820 -0.075514 -0.909368 -0.819947 0.627386
12 NaN NaN NaN -1.787079 -2.068410 1.342694 0.264263 -1.487910 0.746819 1.062655
13 NaN NaN NaN NaN 0.452739 -1.456708 -1.395359 1.169611 1.836805 0.262885
14 NaN NaN NaN NaN NaN 0.969357 0.708416 0.393677 -1.455490 -2.086486
15 NaN NaN NaN NaN NaN NaN 0.762756 0.530569 -0.828721 -1.076369
16 NaN NaN NaN NaN NaN NaN NaN -0.586429 -0.609144 -0.507519
17 NaN NaN NaN NaN NaN NaN NaN NaN -1.071297 -0.274501
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.848811
Solution 1:
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x:
Replace d.items() with d.iteritems().
Solution 2:
Here is a simple way to do it:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
Solution 3:
Here is a way to tidy up the syntax, although it essentially does the same thing as the other answers:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
An alternative syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may also need to transpose the result and/or change the column data types (float, integer, etc.).
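One reason a dtype change can be needed: NaN is a float, so padding an integer column turns it into float64. The following is a minimal sketch, assuming a pandas version that supports the nullable "Int64" dtype; the dict d is hypothetical sample data and not part of the original answers.

```python
import numpy as np
import pandas as pd

# hypothetical sample data, not from the original answers
d = {"A": np.array([1, 2]), "B": np.array([1, 2, 3, 4])}

# wrap each array in a Series so pandas pads the shorter column with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df.dtypes)  # column A is float64 because of the NaN padding

# nullable integers keep whole numbers and mark missing cells as <NA>
df_int = df.astype("Int64")
print(df_int)
```

Whether float64 with NaN or Int64 with <NA> is preferable depends on what the downstream code expects.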
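Solution 4:
Using pandas.DataFrame and pandas.concat
The code below uses a list comprehension to build a list of DataFrames from the dict of uneven arrays with pandas.DataFrame, then combines the DataFrames with concat.
axis=1 concatenates along the columns for a wide dataframe, while the default axis=0 concatenates along the index for a long dataframe (see wide vs. narrow data).
For a dict whose arrays all have equal length, use df = pd.DataFrame(data).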
import pandas as pd
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
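To make the difference between the two axis values concrete, here is a small sketch. The two-key dict is hypothetical sample data chosen so the shapes are easy to check by hand.

```python
import numpy as np
import pandas as pd

# hypothetical sample data: two arrays of lengths 2 and 3
data = {"A": np.array([1.0, 2.0]), "B": np.array([3.0, 4.0, 5.0])}
frames = [pd.DataFrame(v, columns=[k]) for k, v in data.items()]

wide = pd.concat(frames, axis=1)  # one column per key, rows aligned on the index
long = pd.concat(frames, axis=0)  # frames stacked on top of each other

print(wide.shape)  # (3, 2)
print(long.shape)  # (5, 2)
```

With axis=1 the shorter column is padded at the bottom; with axis=0 every row has a value in only one of the two columns.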
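Using pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills the missing values with fillvalue. The zip generator needs to be unpacked, because the DataFrame constructor will not unpack it.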
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
Plot
ax = df.plot(marker='o', figsize=[10, 5])
df
Result
A B C D E F G H I J
0 0.711674 -1.076522 -1.502178 -1.519748 0.340619 0.051132 0.036537 0.367296 1.056500 -1.186943
1 -0.324485 -0.325682 -1.379593 2.097329 -1.253501 -0.238061 2.431822 -0.576828 -0.733918 -0.540638
2 -1.001871 -1.035498 -0.204455 0.892562 0.370788 -0.208009 0.422599 -0.416005 -0.083968 -0.638495
3 0.236251 -0.426320 0.642125 1.596488 0.455254 0.401304 1.843922 -0.137542 0.127288 0.150411
4 -0.102160 -1.029361 -0.181176 -0.638762 -2.283720 0.183169 -0.221562 1.294987 0.344423 0.919450
5 -1.141293 -0.521774 0.771749 -1.133047 -0.000822 1.235830 0.337117 0.520589 0.685970 0.910146
6 2.654407 -0.422758 0.741523 0.656597 2.398876 -0.291800 -0.557180 -0.194273 0.399908 1.605234
7 1.440605 -0.099244 1.324763 0.595787 -2.583105 0.029992 0.053141 -0.385593 0.893458 0.667165
8 0.098902 -1.380258 0.439287 -0.811120 1.311009 -0.868404 1.053804 -3.065784 0.384793 0.950338
9 -3.121532 0.301903 -0.557873 -0.300535 -1.579478 0.604346 -0.658515 -0.668181 0.641113 0.734329
10 NaN -1.033599 0.927080 1.008391 -0.840683 0.728554 1.844449 0.056965 -0.577314 1.015465
11 NaN NaN -0.600727 -1.087762 -0.165509 1.364820 -0.075514 -0.909368 -0.819947 0.627386
12 NaN NaN NaN -1.787079 -2.068410 1.342694 0.264263 -1.487910 0.746819 1.062655
13 NaN NaN NaN NaN 0.452739 -1.456708 -1.395359 1.169611 1.836805 0.262885
14 NaN NaN NaN NaN NaN 0.969357 0.708416 0.393677 -1.455490 -2.086486
15 NaN NaN NaN NaN NaN NaN 0.762756 0.530569 -0.828721 -1.076369
16 NaN NaN NaN NaN NaN NaN NaN -0.586429 -0.609144 -0.507519
17 NaN NaN NaN NaN NaN NaN NaN NaN -1.071297 -0.274501
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.848811
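The zip_longest approach can also be written with an explicit fillvalue, which makes the padding value visible in the code. This is a sketch on hypothetical sample data, not the answer's original code.

```python
from itertools import zip_longest

import pandas as pd

# hypothetical sample data: two lists of lengths 2 and 4
data = {"A": [1, 2], "B": [1, 2, 3, 4]}

# each tuple is one row; missing positions are filled with NaN explicitly
rows = list(zip_longest(*data.values(), fillvalue=float("nan")))
df = pd.DataFrame(rows, columns=list(data.keys()))
print(df)
```

Passing a different fillvalue (for example 0) is a matter of changing that one argument, though a non-NaN filler can no longer be told apart from real data.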
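Solution 5:
Although this does not directly answer the OP's question, I found it to be a good solution when I had arrays of unequal length, and I wanted to share it:
From the pandas documentation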
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
Solution 6:
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Solution 7:
Both of the following lines work perfectly (here df is the dictionary, not a DataFrame):
pd.DataFrame.from_dict(df, orient='index').transpose() #A
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)
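However, timing both with %timeit in Jupyter, I found B to be about 4 times faster than A, which is quite impressive, especially when working with huge datasets (mainly ones with a large number of columns/features).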
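Outside Jupyter, a comparison like this can be sketched with the standard timeit module. The data size below is an arbitrary assumption, and the measured ratio will vary with the pandas version and the shape of the data, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical data: 200 arrays with lengths from 1 to 200
rng = np.random.default_rng(0)
data = {f"col{i}": rng.standard_normal(i + 1) for i in range(200)}

def method_a():
    # approach A: build row-wise, then transpose
    return pd.DataFrame.from_dict(data, orient="index").transpose()

def method_b():
    # approach B: wrap every array in a Series
    return pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

# both approaches should hold the same values, NaN padding included
a, b = method_a(), method_b()
same = np.allclose(a.to_numpy(dtype=float), b.to_numpy(dtype=float), equal_nan=True)
print("same values:", same)

print("A:", timeit.timeit(method_a, number=5))
print("B:", timeit.timeit(method_b, number=5))
```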
Solution 8:
If you do not want NaN to be displayed and you have exactly two lengths, padding each remaining cell with a space also works.
import pandas as pd
long = [6, 4, 7, 3]
short = [5, 6]
# pad the shorter list with spaces until both lists have the same length
for n in range(len(long) - len(short)):
    short.append(' ')
df = pd.DataFrame({'A': long, 'B': short})
# write the result to an Excel file in the working directory
# (requires the xlsxwriter package; the context manager saves and closes the file)
with pd.ExcelWriter('example1.xlsx', engine='xlsxwriter') as datatoexcel:
    df.to_excel(datatoexcel, sheet_name='Sheet1')
A B
0 6 5
1 4 6
2 7
3 3
If you have entries of more than two different lengths, it is advisable to write a function that uses a similar approach.
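Such a function could look roughly like the sketch below. The name pad_columns and its fill parameter are hypothetical, introduced here only for illustration.

```python
import pandas as pd

def pad_columns(columns, fill=' '):
    """Hypothetical helper: pad every list to the longest length with `fill`."""
    longest = max(len(values) for values in columns.values())
    return pd.DataFrame(
        {name: list(values) + [fill] * (longest - len(values))
         for name, values in columns.items()}
    )

# three lists of lengths 4, 2 and 1
df = pad_columns({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [9]})
print(df)
```

Note that mixing numbers with string padding makes the padded columns object dtype, so this suits display or export better than further numeric work.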
Solution 9:
Here is a different solution that has no NaN values and instead adds an extra column identifying the source of each value:
pd.concat([pd.DataFrame({"score":v, "type":k}) for k, v in d.items()])
For example
import pandas as pd
x1 = [2,3,4]
x2 = [5,6]
x3 = [100]
data = {'x1': x1, 'x2': x2, 'x3': x3}
pd.concat([pd.DataFrame({"score":v, "type":k}) for k, v in data.items()])
This gives the following dataframe:
score type
0 2 x1
1 3 x1
2 4 x1
0 5 x2
1 6 x2
0 100 x3
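Because each piece keeps its own 0-based index, the combined frame has repeated index labels. A possible variation, shown here as a sketch rather than as part of the original answer, passes ignore_index=True to get a fresh index; the long layout then works directly with groupby.

```python
import pandas as pd

data = {'x1': [2, 3, 4], 'x2': [5, 6], 'x3': [100]}

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2
long_df = pd.concat(
    [pd.DataFrame({"score": v, "type": k}) for k, v in data.items()],
    ignore_index=True,
)
print(long_df)

# per-source statistics need no NaN handling in this layout
means = long_df.groupby("type")["score"].mean()
print(means)
```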