Multiple Linear Regression in Python
Problem description:
I can't seem to find any Python libraries that do multiple regression. The only ones I find do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with the following data:
print('y x1 x2 x3 x4 x5 x6 x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))
(Output of the above:)
y x1 x2 x3 x4 x5 x6 x7
-6.0 -4.95 -5.87 -0.76 14.73 4.02 0.20 0.45
-5.0 -4.55 -4.52 -0.71 13.74 4.47 0.16 0.50
-10.0 -10.96 -11.64 -0.98 15.49 4.18 0.19 0.53
-5.0 -1.08 -3.36 0.75 24.72 4.96 0.16 0.60
-8.0 -6.52 -7.45 -0.86 16.59 4.29 0.10 0.48
-3.0 -0.81 -2.36 -0.50 22.44 4.81 0.15 0.53
-6.0 -7.01 -7.33 -0.33 13.93 4.32 0.21 0.50
-8.0 -4.46 -7.65 -0.94 11.40 4.43 0.16 0.49
-8.0 -11.54 -10.03 -1.03 18.18 4.28 0.21 0.55
How can I regress these in Python, to get the linear regression formula:
Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c
Solution 1:
sklearn.linear_model.LinearRegression will do it:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
[t.y for t in texts])
Then clf.coef_ will give the regression coefficients.
sklearn.linear_model also has similar interfaces for applying various kinds of regularization to the regression.
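For instance, a minimal sketch with made-up toy data (not from the question) showing how coef_ and intercept_ map onto the formula above:
from sklearn import linear_model

# Hypothetical toy data: three observations, two features each
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]
y = [3.0, 4.0, 8.0]

clf = linear_model.LinearRegression()
clf.fit(X, y)
print(clf.coef_)                  # the slopes a1, a2, ...
print(clf.intercept_)             # the constant term c
print(clf.predict([[4.0, 5.0]]))  # prediction for a new point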
Solution 2:
Here is a little workaround that I created. I checked it against R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
Result:
print(reg_m(y, x).summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.535
Model: OLS Adj. R-squared: 0.461
Method: Least Squares F-statistic: 7.281
Date: Tue, 19 Feb 2013 Prob (F-statistic): 0.00191
Time: 21:51:28 Log-Likelihood: -26.025
No. Observations: 23 AIC: 60.05
Df Residuals: 19 BIC: 64.59
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.2424 0.139 1.739 0.098 -0.049 0.534
x2 0.2360 0.149 1.587 0.129 -0.075 0.547
x3 -0.0618 0.145 -0.427 0.674 -0.365 0.241
const 1.5704 0.633 2.481 0.023 0.245 2.895
==============================================================================
Omnibus: 6.904 Durbin-Watson: 1.905
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.708
Skew: -0.849 Prob(JB): 0.0950
Kurtosis: 4.426 Cond. No. 38.6
pandas provides a convenient way to run OLS, as shown in this answer:
Run an OLS regression with a Pandas data frame
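For comparison, here is a sketch (not part of the original answer) of the same fit written more directly: sm.add_constant can prepend the intercept column in one step, making the stacking loop unnecessary:
import numpy as np
import statsmodels.api as sm

y = [1, 2, 3, 4, 3, 4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 6, 5, 4, 5, 4, 3, 4]
x = [
    [4, 2, 3, 4, 5, 4, 5, 6, 7, 4, 8, 9, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 7, 7, 7, 7, 7, 6, 5],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 7, 7, 7, 7, 7, 6, 6, 4, 4, 4],
]

# Stack the predictors as columns; add_constant prepends the intercept
X = sm.add_constant(np.column_stack(x))
results = sm.OLS(y, X).fit()
print(results.params)  # const first, then one coefficient per predictor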
Solution 3:
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized, as it causes much confusion and misunderstanding in the literature.
In short:
Multiple linear regression: the response y is a scalar.
Multivariate linear regression: the response y is a vector.
(Another source.)
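The distinction shows up directly in array shapes. A small illustrative sketch (not from the original answer) using np.linalg.lstsq, which accepts both forms:
import numpy as np

X = np.random.rand(10, 3)               # 10 observations, 3 predictors
y_multiple = np.random.rand(10)         # multiple regression: scalar response per observation
y_multivariate = np.random.rand(10, 2)  # multivariate regression: vector response per observation

# lstsq handles both: the solution has one column per response component
print(np.linalg.lstsq(X, y_multiple, rcond=None)[0].shape)      # (3,)
print(np.linalg.lstsq(X, y_multivariate, rcond=None)[0].shape)  # (3, 2)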
Solution 4:
You can use numpy.linalg.lstsq:
import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
[
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
]
)
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)
Result:
[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]
You can check the estimated output with:
print(np.dot(X,beta_hat))
Result:
[ -5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915, -3.06176313, -6.01818435, -7.90878145, -7.86720264]
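As a quick sanity check (an illustrative addition, not in the original answer), the fitted values can be compared with y via the coefficient of determination:
residuals = y - X @ beta_hat          # per-observation errors
ss_res = np.sum(residuals**2)         # residual sum of squares
ss_tot = np.sum((y - y.mean())**2)    # total sum of squares
print("R^2 =", 1 - ss_res / ss_tot)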
Solution 5:
Use scipy.optimize.curve_fit. And it works for nonlinear fits too, not just linear ones.
from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
# x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4
x = np.array([[0,1,2,0,1,2,0,1,2],[0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
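To illustrate the point that it is not limited to linear fits, here is a sketch (with a made-up exponential model, not from the original answer) where only the model function changes:
import numpy as np
from scipy.optimize import curve_fit

def fn_nonlinear(x, a, b, c):
    # a model that is nonlinear in the inputs
    return a * np.exp(b * x[0]) + c * x[1]

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]], dtype=float)
y = 0.5 * np.exp(0.3 * x[0]) + 2.0 * x[1]  # synthetic data from known parameters

popt, pcov = curve_fit(fn_nonlinear, x, y)
print(popt)  # should recover approximately [0.5, 0.3, 2.0]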
Solution 6:
Once you have converted your data into a pandas dataframe (df),
import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)
The intercept term is included by default.
See this notebook for more examples.
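A runnable sketch, building a small df by hand from a few rows of the question's data, since the original frame is not shown here:
import pandas as pd
import statsmodels.formula.api as smf

# First five rows of the question's y, x1, x2 columns
df = pd.DataFrame({
    'y':  [-6.0, -5.0, -10.0, -5.0, -8.0],
    'x1': [-4.95, -4.55, -10.96, -1.08, -6.52],
    'x2': [-5.87, -4.52, -11.64, -3.36, -7.45],
})

lm = smf.ols(formula='y ~ x1 + x2', data=df).fit()
print(lm.params)  # Intercept, x1, x2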
Solution 7:
I think this may be the easiest way to get the job done:
from random import random
from pandas import DataFrame
from statsmodels.api import OLS

lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4
print(x.head())
x1 x2 x3 b
0 0.433681 0.946723 0.103422 1
1 0.400423 0.527179 0.131674 1
2 0.992441 0.900678 0.360140 1
3 0.413757 0.099319 0.825181 1
4 0.796491 0.862593 0.193554 1
print(y.head())
0 6.637392
1 5.849802
2 7.874218
3 7.087938
4 7.102337
dtype: float64
model = OLS(y, x)
result = model.fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.859e+30
Date: Wed, 09 Dec 2015 Prob (F-statistic): 0.00
Time: 15:17:32 Log-Likelihood: 3224.9
No. Observations: 100 AIC: -6442.
Df Residuals: 96 BIC: -6431.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 1.0000 8.98e-16 1.11e+15 0.000 1.000 1.000
x2 2.0000 8.28e-16 2.41e+15 0.000 2.000 2.000
x3 3.0000 8.34e-16 3.6e+15 0.000 3.000 3.000
b 4.0000 8.51e-16 4.7e+15 0.000 4.000 4.000
==============================================================================
Omnibus: 7.675 Durbin-Watson: 1.614
Prob(Omnibus): 0.022 Jarque-Bera (JB): 3.118
Skew: 0.045 Prob(JB): 0.210
Kurtosis: 2.140 Cond. No. 6.89
==============================================================================
Solution 8:
Multiple linear regression can be handled using the sklearn library, as mentioned above. I am using the Anaconda install of Python 3.6.
Build your model as follows:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
# display coefficients
print(regressor.coef_)
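The constant term c is exposed separately through the fitted model's intercept_ attribute:
# the slopes are in regressor.coef_; the constant term is separate
print(regressor.intercept_)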
Solution 9:
You can use numpy.linalg.lstsq.
Solution 10:
You can use the function below, passing it a DataFrame:
def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
        if None, then use last column of x as y
    @param show: whether to print the regression summary
    """
    import pandas as pd
    import statsmodels.api as sm
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()
    if show:
        print(res.summary())
    return res
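Example usage, a sketch assuming a random frame whose last column is treated as the response:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 4), columns=['x1', 'x2', 'x3', 'y'])
res = linear(df)   # prints the regression summary
print(res.params)  # const plus one coefficient per predictor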
Solution 11:
Scikit-learn is a machine learning library for Python that can do this job for you. Just import the sklearn.linear_model module into your script.
A code template for multiple linear regression with sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder's categorical_features argument was removed from sklearn;
# ColumnTransformer now selects which column(s) to encode
onehotencoder = ColumnTransformer(
    [('onehot', OneHotEncoder(), [<index-value>])], remainder='passthrough')
X = onehotencoder.fit_transform(X)
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
# Splitting the data into test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
That's it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with an example, visit: Linear Regression Example
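A natural follow-up (not part of the original answer) is to score the predictions on the held-out test set:
from sklearn.metrics import mean_squared_error, r2_score

# Compare predictions against the true test targets
print("R^2 on test set:", r2_score(y_test, y_pred))
print("MSE on test set:", mean_squared_error(y_test, y_pred))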
Solution 12:
Here is an alternative, basic approach:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS, you can also use sm.Logit or sm.Probit, etc.
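For completeness, a sketch of the my_data frame the formula string refers to; the column names are assumed from the formula and the values are made up:
import pandas as pd

my_data = pd.DataFrame({
    'y_data': [1.0, 2.0, 3.0, 4.0, 5.0],  # dependent variable
    'x_1':    [0.5, 1.1, 1.9, 3.2, 3.9],  # first predictor
    'x_2':    [2.0, 1.0, 2.5, 3.5, 4.0],  # second predictor
})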
Solution 13:
Linear models of this kind can be handled with OpenTURNS.
In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:
Y = a0 + a1.X1 + ... + an.Xn + epsilon,
where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to compute the regression coefficients ai:
from __future__ import print_function
import pandas as pd
import openturns as ot
# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep=r"\s+")
# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)
# The observation points are in the first column (dimension 1)
Y = sample[:, 0]
# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]
# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()
# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())
Then you can easily obtain the confidence intervals of the coefficients with the following call:
# Get the confidence intervals at 90% of the ai coefficients
print(
"confidence intervals of the coefficients = ",
ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)
You can find a more detailed example in the OpenTURNS examples.
Solution 14:
Try a generalized linear model with a Gaussian family:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
[-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
[-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
[-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
[14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
[4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
[0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
[0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
X = list(zip(*reversed(X)))  # transpose so each row is one observation
df = pd.DataFrame({'X': X, 'y': y})
columns = 7
for i in range(0, columns):
    df['X' + str(i)] = df.apply(lambda row: row['X'][i], axis=1)
df = df.drop('X', axis=1)
print(df)
#model_formula = 'y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula = 'y ~ X0'
model_family = sm.families.Gaussian()
model_fit = glm(formula=model_formula,
                data=df,
                family=model_family).fit()
print(model_fit.summary())
# Extract coefficients from the fitted model
#print(model_fit.params)
intercept, slope = model_fit.params  # two parameters, since the formula is 'y ~ X0'
# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)
# Extract and print confidence intervals
print(model_fit.conf_int())
df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)
df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)
prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)
plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
Solution 15:
Linear regression is a good example for getting started with AI.
Here is a good example of the multiple linear regression machine learning algorithm, using Python:
##### Predicting House Prices Using Multiple Linear Regression - @Y_T_Akademi
#### In this project we are gonna see how machine learning algorithms help us predict house prices. Linear Regression is a model of predicting new future data by using the existing correlation between the old data. Here, machine learning helps us identify this relationship between feature data and output, so we can predict future values.
import pandas as pd
##### we use sklearn library in many machine learning calculations..
from sklearn import linear_model
##### we import our dataset: housepricesdataset.csv
df = pd.read_csv("housepricesdataset.csv",sep = ";")
##### The following is our feature set:
##### The following is the output(result) data:
##### we define a linear regression model here:
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])
# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..
reg.predict([[230,4,10]])
# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])
# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building
reg.predict([[355,3,20]])
# You can make as many prediction as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])
My dataset is as follows: