How to build a multiple regression model (statsmodels) on subsets of a pandas dataframe using a for loop or a condition?

Posted on 2024-10-02 06:32:38



I have a dataframe with a variable state that has 51 unique values, and I want to build a model for each state. For certain reasons I am limited to regression with statsmodels. Assume the target variable V1 is predicted by the variables X1, X2 and X3.

state takes the values 1 to 51 and will be used as the condition for splitting the dataframe.

How can I automate this task with a for loop?
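For a single state I can do something like the sketch below by hand (just to illustrate the intent; df is my full dataframe, and V1, X1, X2, X3 and state are the columns described above), but I don't want to repeat it 51 times:

import statsmodels.formula.api as smf

# manual version for one state only; I want to avoid copying this 51 times
subset = df[df['state'] == 1]
model_1 = smf.ols('V1 ~ X1 + X2 + X3', data=subset).fit()
print(model_1.summary())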


2 Answers

Assuming you only care about the loop and not about splitting the dataframe into 51 sub-parts, here is my attempt at your problem.

Suppose you define your OLS function as:

from statsmodels.api import OLS
from sklearn.metrics import r2_score   # needed for the R2 print at the end

def OLSfunction(y):
    # traindf/testdf hold the target column, x_traindf/x_testdf the predictors;
    # all four must be created before this function is called
    y_train = traindf[y]
    y_test = testdf[y]
    x_train = x_traindf
    x_test = x_testdf

    model = OLS(y_train, x_train)
    result = model.fit()
    print(result.summary())

    pred_OLS = result.predict(x_test)
    print("R2", r2_score(y_test, pred_OLS))



Y_s = ['1', '2', ..., '51']   # the target column names you want to model
for y in Y_s:
    OLSfunction(y)

Note that you have to derive traindf and testdf appropriately for the particular Y you want to model, and they must be passed correctly into OLSfunction. Since I don't know what your data looks like, I am not going to split/create traindf/testdf here...
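For completeness, here is a minimal sketch of one way traindf/testdf and x_traindf/x_testdf could be created before calling OLSfunction (the dataframe name df and the predictor columns X1, X2, X3 are taken from the question, not from this answer):

from sklearn.model_selection import train_test_split

# split the questioner's dataframe into train/test parts and pull out the predictors
traindf, testdf = train_test_split(df, test_size=0.3, random_state=0)
x_traindf = traindf[['X1', 'X2', 'X3']]
x_testdf = testdf[['X1', 'X2', 'X3']]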

import pandas as pd
import os as os
import numpy as np
import statsmodels.api as sm   # sm.Logit(endog, exog) below is the array interface, which lives in statsmodels.api

First, I created a dict to hold the 51 per-state datasets.

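A minimal version of that step, assuming the full dataframe is called df and its state column is named state with values 1 through 51, looks like this:

d = {}
for state in range(1, 52):                  # states 1..51
    d[state] = df[df['state'] == state]     # one sub-dataframe per state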

Check:

d[1].head()

Then I ran the code in a loop, using the dict keys:

results = {}
# 'names' is the list of predictor column names used for every state
for x in range(1, 52):   # note: range(1, 51) would skip state 51
    results[x] = sm.Logit(d[x].fraudRisk, d[x][names]).fit().summary2()
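If you want the fitted coefficients as numbers rather than the printed summary, a small variation (standard statsmodels usage; disp=0 just silences the optimizer output) is to keep the fitted results themselves:

fits = {}
for x in range(1, 52):
    fits[x] = sm.Logit(d[x].fraudRisk, d[x][names]).fit(disp=0)   # disp=0 suppresses the iteration log
    print(x, fits[x].params)                                      # per-state coefficient vector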

But I felt I should also try multiple classifiers from sklearn. First, I need to split the data as mentioned above.

from sklearn.model_selection import train_test_split

# Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.ensemble import RandomForestClassifier 
from sklearn.naive_bayes import GaussianNB

#Model Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score


lr = {}
gnb = {}
svc = {}
rfc = {}
classifier = {}
regr_1 = {}   # defined but not used below
regr_2 = {}   # defined but not used below

import datetime
print(datetime.datetime.now())   # record the start time

for x in range(1, 52):   # range must go to 52 so that state 51 is included
    X_train, X_test, y_train, y_test = train_test_split(d[x][names], d[x].fraudRisk, test_size=0.3)
    print(len(X_train))
    print(len(y_test))

    # Create classifiers

    lr[x] = LogisticRegression().fit(X_train, y_train).predict(X_test)
    gnb[x] = GaussianNB().fit(X_train, y_train).predict(X_test)
    svc[x] = LinearSVC(C=1.0).fit(X_train, y_train).predict(X_test)
    rfc[x] = RandomForestClassifier(n_estimators=1).fit(X_train, y_train).predict(X_test)
    classifier[x] = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)  

    print(datetime.datetime.now())
    print("Accuracy Score for model for  state ",x, 'is  ')

    print('LogisticRegression',accuracy_score(y_test,lr[x]))
    print('GaussianNB',accuracy_score(y_test,gnb[x]))
    print('LinearSVC',accuracy_score(y_test,svc[x]))
    print('RandomForestClassifier',accuracy_score(y_test,rfc[x]))
    print('KNeighborsClassifier',accuracy_score(y_test,classifier[x]))

    print("Classification Report for model for state ",x, 'is  ')

    print('LogisticRegression',classification_report(y_test,lr[x]))
    print('GaussianNB',classification_report(y_test,gnb[x]))
    print('LinearSVC',classification_report(y_test,svc[x]))
    print('RandomForestClassifier',classification_report(y_test,rfc[x]))
    print('KNeighborsClassifier',classification_report(y_test,classifier[x]))

    print("Confusion Matrix Report for model for state ",x, 'is  ')  

    print('LogisticRegression',confusion_matrix(y_test,lr[x]))
    print('GaussianNB',confusion_matrix(y_test,gnb[x]))
    print('LinearSVC',confusion_matrix(y_test,svc[x]))
    print('RandomForestClassifier',confusion_matrix(y_test,rfc[x]))
    print('KNeighborsClassifier',confusion_matrix(y_test,classifier[x]))

    print("Area Under Curve for model for state ",x, 'is  ') 

    print('LogisticRegression',roc_auc_score(y_test,lr[x]))
    print('GaussianNB',roc_auc_score(y_test,gnb[x]))
    print('LinearSVC',roc_auc_score(y_test,svc[x]))
    print('RandomForestClassifier',roc_auc_score(y_test,rfc[x]))
    print('KNeighborsClassifier',roc_auc_score(y_test,classifier[x]))

It took a long time for 5 models × 51 states and multiple metrics, but it was worth it. Let me know if there is a faster or better way to write this more elegantly and with fewer hacks.
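One way to make this less hacky (a sketch only, reusing the d and names objects from above) is to keep the classifiers in a dict and collect the metrics into a single dataframe instead of printing everything:

classifiers = {
    'LogisticRegression': LogisticRegression(),
    'GaussianNB': GaussianNB(),
    'LinearSVC': LinearSVC(C=1.0),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=1),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=3),
}

rows = []
for x in range(1, 52):
    X_train, X_test, y_train, y_test = train_test_split(
        d[x][names], d[x].fraudRisk, test_size=0.3)
    for name, clf in classifiers.items():
        pred = clf.fit(X_train, y_train).predict(X_test)
        rows.append({'state': x, 'model': name,
                     'accuracy': accuracy_score(y_test, pred),
                     'auc': roc_auc_score(y_test, pred)})

metrics = pd.DataFrame(rows)   # one row per state/model combination
print(metrics.head())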
