<pre><code>import pandas as pd
import os
import numpy as np
import statsmodels.api as sm  # sm.Logit(endog, exog) is the array interface used below
</code></pre>
<p>First, I created a dict to hold the 51 datasets:</p>
^{pr2}$
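<p>The code that builds the dict is not shown here. As a rough sketch only, one way to get a dict of per-state DataFrames is to split a single DataFrame with <code>groupby</code>. The column names <code>state</code> and <code>balance</code> and the synthetic data are assumptions made for illustration; only <code>fraudRisk</code> and the name <code>d</code> come from the post.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 51 state codes, 100 rows each.
# "state" and "balance" are made-up column names; "fraudRisk" is from the post.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "state": np.repeat(np.arange(1, 52), 100),
    "balance": rng.normal(size=5100),
    "fraudRisk": rng.integers(0, 2, size=5100),
})

# One DataFrame per state, keyed by the integer state code (1..51).
d = {int(state): group.reset_index(drop=True) for state, group in df.groupby("state")}

print(len(d))
print(d[1].head())
```

<p>Keying the dict by the state code means <code>d[1]</code> is the first state, which matches how the dict is indexed in the rest of this answer.</p>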
<p>A quick check:</p>
<pre><code>d[1].head()
</code></pre>
<p>Then I ran the model in a loop over the dict's keys:</p>
<pre><code>results={}
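# Iterate over every key: range(1, 51) covers only 50 values and would skip one of the 51 datasets.
for x in d:
    # names is the list of predictor columns defined earlier
    results[x] = sm.Logit(d[x].fraudRisk, d[x][names]).fit().summary2()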
</code></pre>
<p>But I felt I should also try several classifiers in sklearn. First, I need to split the data as mentioned above.</p>
<pre><code>from sklearn.model_selection import train_test_split
# Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
#Model Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
lr={}
gnb={}
svc={}
rfc={}
classifier={}
regr_1={}
regr_2={}
import datetime
print(datetime.datetime.now())  # start time, for comparison with the per-state timestamps
# Iterate over every key: range(1, 51) covers only 50 values and would skip one of the 51 datasets.
for x in d:
    # Hold out 30% of each state's rows for testing
    X_train, X_test, y_train, y_test = train_test_split(d[x][names], d[x].fraudRisk, test_size=0.3)
    print(len(X_train))
    print(len(y_test))
    # Fit each classifier and store its predictions on the test set
    lr[x] = LogisticRegression().fit(X_train, y_train).predict(X_test)
    gnb[x] = GaussianNB().fit(X_train, y_train).predict(X_test)
    svc[x] = LinearSVC(C=1.0).fit(X_train, y_train).predict(X_test)
    rfc[x] = RandomForestClassifier(n_estimators=1).fit(X_train, y_train).predict(X_test)
    classifier[x] = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)
    print(datetime.datetime.now())
    print("Accuracy Score for model for state ",x, 'is ')
    print('LogisticRegression',accuracy_score(y_test,lr[x]))
    print('GaussianNB',accuracy_score(y_test,gnb[x]))
    print('LinearSVC',accuracy_score(y_test,svc[x]))
    print('RandomForestClassifier',accuracy_score(y_test,rfc[x]))
    print('KNeighborsClassifier',accuracy_score(y_test,classifier[x]))
    print("Classification Report for model for state ",x, 'is ')
    print('LogisticRegression',classification_report(y_test,lr[x]))
    print('GaussianNB',classification_report(y_test,gnb[x]))
    print('LinearSVC',classification_report(y_test,svc[x]))
    print('RandomForestClassifier',classification_report(y_test,rfc[x]))
    print('KNeighborsClassifier',classification_report(y_test,classifier[x]))
    print("Confusion Matrix Report for model for state ",x, 'is ')
    print('LogisticRegression',confusion_matrix(y_test,lr[x]))
    print('GaussianNB',confusion_matrix(y_test,gnb[x]))
    print('LinearSVC',confusion_matrix(y_test,svc[x]))
    print('RandomForestClassifier',confusion_matrix(y_test,rfc[x]))
    print('KNeighborsClassifier',confusion_matrix(y_test,classifier[x]))
    # Note: the AUC here is computed from hard 0/1 predictions, not from scores or probabilities
    print("Area Under Curve for model for state ",x, 'is ')
    print('LogisticRegression',roc_auc_score(y_test,lr[x]))
    print('GaussianNB',roc_auc_score(y_test,gnb[x]))
    print('LinearSVC',roc_auc_score(y_test,svc[x]))
    print('RandomForestClassifier',roc_auc_score(y_test,rfc[x]))
    print('KNeighborsClassifier',roc_auc_score(y_test,classifier[x]))
</code></pre>
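<p>Running 5 models × 51 states with multiple metrics took a long time, but it was worth it. Let me know if there is a faster or better way to write this that is more elegant and less hacky.</p>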
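<p>One possible way to cut the repetition, offered as a sketch rather than a tested replacement: keep the classifiers in a dict, loop over them, and collect the metrics into a single DataFrame instead of printing each one. The helper names <code>make_models</code> and <code>evaluate_all</code>, the fixed <code>random_state</code>, and the synthetic demo columns <code>f1</code> and <code>f2</code> are assumptions made for illustration. The classifiers and their parameters are the same as above.</p>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC


def make_models():
    """Return fresh, unfitted classifiers keyed by a display name (hypothetical helper)."""
    return {
        "LogisticRegression": LogisticRegression(),
        "GaussianNB": GaussianNB(),
        "LinearSVC": LinearSVC(C=1.0),
        "RandomForestClassifier": RandomForestClassifier(n_estimators=1),
        "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    }


def evaluate_all(d, names, target="fraudRisk", test_size=0.3, random_state=0):
    """Fit every model on every dataset in d; return one row per (state, model)."""
    rows = []
    for state, frame in d.items():
        X_train, X_test, y_train, y_test = train_test_split(
            frame[names], frame[target], test_size=test_size, random_state=random_state
        )
        for model_name, model in make_models().items():
            y_pred = model.fit(X_train, y_train).predict(X_test)
            rows.append({
                "state": state,
                "model": model_name,
                "accuracy": accuracy_score(y_test, y_pred),
                # As in the loop above, AUC is computed from hard 0/1 predictions.
                "roc_auc": roc_auc_score(y_test, y_pred),
            })
    return pd.DataFrame(rows)


# Tiny synthetic demo with two "states"; the real d and names come from earlier in the post.
rng = np.random.default_rng(0)
demo = {}
for s in (1, 2):
    X = rng.normal(size=(200, 2))
    demo[s] = pd.DataFrame({
        "f1": X[:, 0],
        "f2": X[:, 1],
        "fraudRisk": (X[:, 0] + X[:, 1] > 0).astype(int),
    })

scores = evaluate_all(demo, ["f1", "f2"])
print(scores.pivot(index="state", columns="model", values="accuracy"))
```

<p>With the real data this would be called as <code>evaluate_all(d, names)</code>. Having all scores in one table makes it easy to sort or pivot by state and model. It does not make the fitting itself faster, since the same 5 × 51 fits still run.</p>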