使用散点数据集生成热图-IT科技

摘要：问题描述：我有一组 X、Y 数据点（约 10k），它们很容易绘制为散点图，但我想将其表示为热图。我查看了 Matplotlib 中的示例，它们似乎都已经从热图单元格值开始生成图像。是否有一种方法可以将一堆不同的 x、y 转换为热图（其中 x、y 频率较高的区域会“更温暖”）？解决方案 1：如果你不想要六边形，...

问题描述：

我有一组 X、Y 数据点（约 10k），它们很容易绘制为散点图，但我想将其表示为热图。

我查看了 Matplotlib 中的示例，它们似乎都已经从热图单元格值开始生成图像。

是否有一种方法可以将一堆不同的 x、y 转换为热图（其中 x、y 频率较高的区域会“更温暖”）？

解决方案 1：

如果你不想要六边形，你可以使用numpy的histogram2d函数：

import numpy as np
import numpy.random
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.clf()
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()

这将生成一个 50x50 的热图。如果您想要 512x384，则可以bins=(512, 384)调用histogram2d。

例子： Matplotlib 热图示例

解决方案 2：

在Matplotlib词典中，我认为您想要一个六边形图。

如果您不熟悉这种类型的图，它只是一个双变量直方图，其中 xy 平面由规则的六边形网格镶嵌。

因此，从直方图中，您只需计算每个六边形中的点数，将绘图区域离散化为一组窗口，将每个点分配给其中一个窗口；最后，将窗口映射到颜色数组上，您就会得到一个六边形图。

尽管六边形不如圆形或正方形那么常用，但直观地看，六边形是装箱容器几何形状的更好选择：

六边形具有最近邻对称性（例如，正方形箱体不具有这种对称性，例如，正方形边界上的点到正方形内部的点的距离并不处处相等），并且
六边形是提供规则平面镶嵌的最高 n 多边形（即，您可以安全地用六边形瓷砖重新塑造您的厨房地板，因为当您完成后瓷砖之间不会有任何空隙 - 但对于所有其他更高 n、n >= 7 的多边形，情况并非如此）。

（Matplotlib使用术语hexbin plot；据我所知，所有R的绘图库也都这样；但我仍然不知道这是否是此类绘图的普遍接受的术语，但我怀疑它很可能是，因为hexbin是六边形分箱的缩写，它描述了准备数据以供显示的基本步骤。）

from matplotlib import pyplot as PLT
from matplotlib import cm as CM
from matplotlib import mlab as ML
import numpy as NP

n = 1e5
x = y = NP.linspace(-5, 5, 100)
X, Y = NP.meshgrid(x, y)
Z1 = ML.bivariate_normal(X, Y, 2, 2, 0, 0)
Z2 = ML.bivariate_normal(X, Y, 4, 1, 1, 1)
ZD = Z2 - Z1
x = X.ravel()
y = Y.ravel()
z = ZD.ravel()
gridsize=30
PLT.subplot(111)

# if 'bins=None', then color of each hexagon corresponds directly to its count
# 'C' is optional--it maps values to x-y coordinates; if 'C' is None (default) then 
# the result is a pure 2D histogram 

PLT.hexbin(x, y, C=z, gridsize=gridsize, cmap=CM.jet, bins=None)
PLT.axis([x.min(), x.max(), y.min(), y.max()])

cb = PLT.colorbar()
cb.set_label('mean value')
PLT.show()

在此处输入图片描述

解决方案 3：

编辑：为了更好地近似 Alejandro 的答案，请参见下文。

我知道这是一个老问题，但我想在 Alejandro 的回答中添加一些内容：如果您想要一个漂亮的平滑图像而不使用 py-sphviewer，您可以改用np.histogram2d并将高斯滤波器（来自scipy.ndimage.filters）应用于热图：

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage.filters import gaussian_filter


def myplot(x, y, s, bins=1000):
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins)
    heatmap = gaussian_filter(heatmap, sigma=s)

    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    return heatmap.T, extent


fig, axs = plt.subplots(2, 2)

# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)

sigmas = [0, 16, 32, 64]

for ax, s in zip(axs.flatten(), sigmas):
    if s == 0:
        ax.plot(x, y, 'k.', markersize=5)
        ax.set_title("Scatter plot")
    else:
        img, extent = myplot(x, y, s)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title("Smoothing with  $sigma$ = %d" % s)

plt.show()

生成：

输出图像

散点图和 s=16 绘制在 Agape Gal'lo 的上方（单击可获得更佳视图）：

互相叠合

我注意到我的高斯滤波方法和 Alejandro 的方法之间的一个区别是，他的方法比我的方法更好地显示了局部结构。因此，我在像素级别实现了一个简单的最近邻方法。此方法为每个像素计算n数据中最近点距离的倒数和。此方法在高分辨率下计算成本相当高，我认为还有一种更快的方法，所以如果您有任何改进，请告诉我。

更新：正如我所怀疑的，使用 Scipy 的方法要快得多scipy.cKDTree。有关实现，请参阅Gabriel 的答案。

无论如何，这是我的代码：

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm


def data_coord2view_coord(p, vlen, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * vlen
    return dv


def nearest_neighbours(xs, ys, reso, n_neighbours):
    im = np.zeros([reso, reso])
    extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]

    xv = data_coord2view_coord(xs, reso, extent[0], extent[1])
    yv = data_coord2view_coord(ys, reso, extent[2], extent[3])
    for x in range(reso):
        for y in range(reso):
            xp = (xv - x)
            yp = (yv - y)

            d = np.sqrt(xp**2 + yp**2)

            im[y][x] = 1 / np.sum(d[np.argpartition(d.ravel(), n_neighbours)[:n_neighbours]])

    return im, extent


n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)
resolution = 250

fig, axes = plt.subplots(2, 2)

for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 64]):
    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=2)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:
        im, extent = nearest_neighbours(xs, ys, resolution, neighbours)
        ax.imshow(im, origin='lower', extent=extent, cmap=cm.jet)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])
plt.show()

结果：

解决方案 4：

我不想使用 np.hist2d （它通常会生成非常难看的直方图），而是想重复使用py-sphviewer，这是一个使用自适应平滑核渲染粒子模拟的 Python 包，可以从 pip 轻松安装（请参阅网页文档）。考虑以下基于示例的代码：

import numpy as np
import numpy.random
import matplotlib.pyplot as plt
import sphviewer as sph

def myplot(x, y, nb=32, xsize=500, ysize=500):   
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)

    x0 = (xmin+xmax)/2.
    y0 = (ymin+ymax)/2.

    pos = np.zeros([len(x),3])
    pos[:,0] = x
    pos[:,1] = y
    w = np.ones(len(x))

    P = sph.Particles(pos, w, nb=nb)
    S = sph.Scene(P)
    S.update_camera(r='infinity', x=x0, y=y0, z=0, 
                    xsize=xsize, ysize=ysize)
    R = sph.Render(S)
    R.set_logscale()
    img = R.get_image()
    extent = R.get_extent()
    for i, j in zip(xrange(4), [x0,x0,y0,y0]):
        extent[i] += j
    print extent
    return img, extent
    
fig = plt.figure(1, figsize=(10,10))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)


# Generate some test data
x = np.random.randn(1000)
y = np.random.randn(1000)

#Plotting a regular scatter plot
ax1.plot(x,y,'k.', markersize=5)
ax1.set_xlim(-3,3)
ax1.set_ylim(-3,3)

heatmap_16, extent_16 = myplot(x,y, nb=16)
heatmap_32, extent_32 = myplot(x,y, nb=32)
heatmap_64, extent_64 = myplot(x,y, nb=64)

ax2.imshow(heatmap_16, extent=extent_16, origin='lower', aspect='auto')
ax2.set_title("Smoothing over 16 neighbors")

ax3.imshow(heatmap_32, extent=extent_32, origin='lower', aspect='auto')
ax3.set_title("Smoothing over 32 neighbors")

#Make the heatmap using a smoothing over 64 neighbors
ax4.imshow(heatmap_64, extent=extent_64, origin='lower', aspect='auto')
ax4.set_title("Smoothing over 64 neighbors")

plt.show()

生成以下图像：

在此处输入图片描述

如您所见，这些图像看起来非常漂亮，我们能够识别出其中的不同子结构。这些图像的构建方式是将给定权重分散到特定域中的每个点，该域由平滑长度定义，而平滑长度又由与较近的nb邻居的距离给出（我选择了 16、32 和 64 作为示例）。因此，与密度较低的区域相比，密度较高的区域通常分散在较小的区域上。

函数 myplot 只是一个非常简单的函数，我编写它的目的是将 x、y 数据提供给 py-sphviewer 来发挥神奇的作用。

解决方案 5：

如果你使用的是 1.2.x

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100000)
y = np.random.randn(100000)
plt.hist2d(x,y,bins=100)
plt.show()

高斯二维热图

解决方案 6：

Seaborn 现在具有jointplot 函数，可以很好地发挥作用：

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

sns.jointplot(x=x, y=y, kind='hex')
plt.show()

演示图像

解决方案 7：

这是Jurgy 的出色最近邻方法，但使用scipy.cKDTree实现。在我的测试中，它的速度大约快 100 倍。

在此处输入图片描述

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.spatial import cKDTree


def data_coord2view_coord(p, resolution, pmin, pmax):
    dp = pmax - pmin
    dv = (p - pmin) / dp * resolution
    return dv


n = 1000
xs = np.random.randn(n)
ys = np.random.randn(n)

resolution = 250

extent = [np.min(xs), np.max(xs), np.min(ys), np.max(ys)]
xv = data_coord2view_coord(xs, resolution, extent[0], extent[1])
yv = data_coord2view_coord(ys, resolution, extent[2], extent[3])


def kNN2DDens(xv, yv, resolution, neighbours, dim=2):
    """
    """
    # Create the tree
    tree = cKDTree(np.array([xv, yv]).T)
    # Find the closest nnmax-1 neighbors (first entry is the point itself)
    grid = np.mgrid[0:resolution, 0:resolution].T.reshape(resolution**2, dim)
    dists = tree.query(grid, neighbours)
    # Inverse of the sum of distances to each grid point.
    inv_sum_dists = 1. / dists[0].sum(1)

    # Reshape
    im = inv_sum_dists.reshape(resolution, resolution)
    return im


fig, axes = plt.subplots(2, 2, figsize=(15, 15))
for ax, neighbours in zip(axes.flatten(), [0, 16, 32, 63]):

    if neighbours == 0:
        ax.plot(xs, ys, 'k.', markersize=5)
        ax.set_aspect('equal')
        ax.set_title("Scatter Plot")
    else:

        im = kNN2DDens(xv, yv, resolution, neighbours)

        ax.imshow(im, origin='lower', extent=extent, cmap=cm.Blues)
        ax.set_title("Smoothing over %d neighbours" % neighbours)
        ax.set_xlim(extent[0], extent[1])
        ax.set_ylim(extent[2], extent[3])

plt.savefig('new.png', dpi=150, bbox_inches='tight')

解决方案 8：

最初的问题是......如何将散点值转换为网格值，对吗？
histogram2d确实计算每个细胞的频率，但是，如果每个细胞除了频率之外还有其他数据，那么您将需要做一些额外的工作。

x = data_x # between -10 and 4, log-gamma of an svc
y = data_y # between -4 and 11, log-C of an svc
z = data_z #between 0 and 0.78, f1-values from a difficult dataset

因此，我有一个包含 X 和 Y 坐标的 Z 结果的数据集。但是，我计算了感兴趣区域之外的几个点（较大的间隙），以及小感兴趣区域中的大量点。

是的，这里变得更难，但也更有趣。一些库（抱歉）：

from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
from scipy.interpolate import griddata

pyplot 是我的今天的图形引擎，cm 是一系列具有一些有趣选择的颜色图。numpy 用于计算，griddata 用于将值附加到固定网格。

最后一个很重要，特别是因为 xy 点的频率在我的数据中分布不均。首先，让我们从适合我的数据的一些边界和任意网格大小开始。原始数据也有超出这些 x 和 y 边界的数据点。

#determine grid boundaries
gridsize = 500
x_min = -8
x_max = 2.5
y_min = -2
y_max = 7

因此，我们在 x 和 y 的最小值和最大值之间定义了一个 500 像素的网格。

在我的数据中，在高关注区域中可用的值远多于 500 个；而在低关注区域中，整个网格中的值甚至不到 200 个；在x_min和的图形边界之间x_max甚至更少。

因此，为了获得一张漂亮的图片，任务是获取高兴趣值的平均值并填补其他地方的空白。

我现在定义我的网格。对于每个 xx-yy 对，我希望有一个颜色。

xx = np.linspace(x_min, x_max, gridsize) # array of x values
yy = np.linspace(y_min, y_max, gridsize) # array of y values
grid = np.array(np.meshgrid(xx, yy.T))
grid = grid.reshape(2, grid.shape[1]*grid.shape[2]).T

为什么形状奇怪？scipy.griddata想要一个 (n, D) 的形状。

Griddata 通过预定义的方法计算网格中每个点的一个值。我选择“最近”——空网格点将用最近邻居的值填充。这看起来好像信息较少的区域有更大的单元格（即使事实并非如此）。可以选择插入“线性”，那么信息较少的区域看起来就不那么清晰了。这真的是个人喜好问题。

points = np.array([x, y]).T # because griddata wants it that way
z_grid2 = griddata(points, z, grid, method='nearest')
# you get a 1D vector as result. Reshape to picture format!
z_grid2 = z_grid2.reshape(xx.shape[0], yy.shape[0])

现在，我们交给 matplotlib 来显示图表

fig = plt.figure(1, figsize=(10, 10))
ax1 = fig.add_subplot(111)
ax1.imshow(z_grid2, extent=[x_min, x_max,y_min, y_max,  ],
            origin='lower', cmap=cm.magma)
ax1.set_title("SVC: empty spots filled by nearest neighbours")
ax1.set_xlabel('log gamma')
ax1.set_ylabel('log C')
plt.show()

在 V 形的尖端部分周围，您会看到，我在寻找最佳位置的过程中做了很多计算，而几乎所有其他不太有趣的部分的分辨率都较低。

高分辨率 SVC 热图

解决方案 9：

创建一个与最终图像中的单元格相对应的二维数组，称为 sayheatmap_cells并将其实例化为全零。

选择两个缩放因子，以实际单位定义每个数组元素之间的差异，对于每个维度，例如x_scale和y_scale。选择这些，以便所有数据点都落在热图数组的范围内。

x_value对于每个具有和的原始数据点y_value：

heatmap_cells[floor(x_value/x_scale),floor(y_value/y_scale)]+=1

解决方案 10：

与@Piti 的答案非常相似，但使用 1 次调用而不是 2 次来生成点：

import numpy as np
import matplotlib.pyplot as plt

pts = 1000000
mean = [0.0, 0.0]
cov = [[1.0,0.0],[0.0,1.0]]

x,y = np.random.multivariate_normal(mean, cov, pts).T
plt.hist2d(x, y, bins=50, cmap=plt.cm.jet)
plt.show()

输出：

2d_gaussian_热图

解决方案 11：

在此处输入图片描述

这是我在 100 万个点集上制作的，有 3 个类别（红色、绿色和蓝色）。如果您想尝试此功能，这里有一个存储库链接。Github Repo

histplot(
    X,
    Y,
    labels,
    bins=2000,
    range=((-3,3),(-3,3)),
    normalize_each_label=True,
    colors = [
        [1,0,0],
        [0,1,0],
        [0,0,1]],
    gain=50)

解决方案 12：

恐怕我来晚了，但不久前我也遇到过类似的问题。被接受的答案（来自@ptomato）帮助了我，但我也想发布这个，以防它对某人有用。


''' I wanted to create a heatmap resembling a football pitch which would show the different actions performed '''

import numpy as np
import matplotlib.pyplot as plt
import random

#fixing random state for reproducibility
np.random.seed(1234324)

fig = plt.figure(12)
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

#Ratio of the pitch with respect to UEFA standards 
hmap= np.full((6, 10), 0)
#print(hmap)

xlist = np.random.uniform(low=0.0, high=100.0, size=(20))
ylist = np.random.uniform(low=0.0, high =100.0, size =(20))

#UEFA Pitch Standards are 105m x 68m
xlist = (xlist/100)*10.5
ylist = (ylist/100)*6.5

ax1.scatter(xlist,ylist)

#int of the co-ordinates to populate the array
xlist_int = xlist.astype (int)
ylist_int = ylist.astype (int)

#print(xlist_int, ylist_int)

for i, j in zip(xlist_int, ylist_int):
    #this populates the array according to the x,y co-ordinate values it encounters 
    hmap[j][i]= hmap[j][i] + 1   

#Reversing the rows is necessary 
hmap = hmap[::-1]

#print(hmap)
im = ax2.imshow(hmap)

结果如下
在此处输入图片描述

解决方案 13：

这些解决方案都不适用于我的应用程序，所以我想出了这个。本质上我在每个点上都放置了一个 2D 高斯：

import cv2
import numpy as np
import matplotlib.pyplot as plt

def getGaussian2D(ksize, sigma, norm=True):
    oneD = cv2.getGaussianKernel(ksize=ksize, sigma=sigma)
    twoD = np.outer(oneD.T, oneD)
    return twoD / np.sum(twoD) if norm else twoD

def pt2heat(pts, shape, kernel=16, sigma=5):
    heat = np.zeros(shape)
    k = getGaussian2D(kernel, sigma)
    for y,x in pts:
        x, y = int(x), int(y)
        for i in range(-kernel//2, kernel//2):
            for j in range(-kernel//2, kernel//2):
                if 0 <= x+i < shape[0] and 0 <= y+j < shape[1]:
                    heat[x+i, y+j] = heat[x+i, y+j] + k[i+kernel//2, j+kernel//2]
    return heat


heat = pts2heat(pts, img.shape[:2])
plt.imshow(heat, cmap='heat')

以下是覆盖在相关图像上的点以及生成的热图：