<p>To make a prediction, you need to pass the data through the same preprocessing steps that were used to train the model:</p>
<pre><code>single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
normalized_address = normalize_text(single_address)
vectorized_address = vectorizer.transform([normalized_address])
# returns an array holding the predicted label for this address
nb.predict(vectorized_address)
</code></pre>
<p>Note: there are two ways to improve the code:</p>
<ol>
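<li><p>The <code>normalize_text</code> step is not really necessary, because everything it does is already covered by CountVectorizer's tokenizer regex <code>token_pattern='(?u)\\b\\w\\w+\\b'</code> and by <code>lowercase=True</code>, both of which are defaults.</p>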
</li>
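<li><p>Keep all preprocessing inside an sklearn <code>Pipeline</code>. Your code will be cleaner and less error-prone, and you will avoid mistakes like the one you ran into, because the same transformations are applied automatically at both fit and predict time.</p>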
</li>
</ol>
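<p>As a quick check of the first point, here is a minimal sketch that runs the address through a default <code>CountVectorizer</code> analyzer. The definition of <code>normalize_text</code> is not shown here, so whether it is fully redundant depends on what it actually does; this only shows what the vectorizer handles on its own (lowercasing, dropping punctuation and single-character tokens).</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'

# build_analyzer() returns the callable the vectorizer uses internally:
# it lowercases the text and splits it with the default token_pattern
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(address)
print(tokens)
# ['1100', '112th', 'ave', 'ne', '400', 'bellevue', 'wa', '98004', 'united', 'states']
```

<p>If your own normalization does more than this (for example stemming or expanding abbreviations), it is probably better passed to the vectorizer via its <code>preprocessor</code> argument than applied by hand.</p>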
<p>A working (canonical?) template showing how this can be done:</p>
<pre><code>from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
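# toy data: 30 copies of one address with 30 cycling labels, just to make the code run
X = 30*['1100 112th Ave NE #400, Bellevue, WA 98004, United States']
y = 10*['US','France','Germany']
# encode the string labels as integers
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# the pipeline applies vectorization and the classifier in one step
vectorizer = CountVectorizer()
mnb = MultinomialNB()
ppl = Pipeline(steps=[('vectorizer',vectorizer),('mnb',mnb)])
ppl.fit(X_train, y_train)
# raw text goes in; the pipeline vectorizes it before predicting
single_address = '1100 112th Ave NE #400, Bellevue, WA 98004, United States'
ppl.predict([single_address])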
</code></pre>
<p>An added benefit of having a <code>Pipeline</code> is that you can pass it to <code>GridSearchCV</code> to select the best parameters via cross-validation.</p>
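<p>A minimal sketch of that idea follows. The three addresses, the parameter grid and <code>cv=3</code> are made up for illustration and are not tuned recommendations. Parameters of a pipeline step are addressed as <code>stepname__parameter</code>.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (illustrative only): three made-up addresses repeated 10 times each
X = 10 * ['1100 112th Ave NE #400, Bellevue, WA 98004, United States',
          '12 Rue de Rivoli, 75001 Paris, France',
          'Unter den Linden 5, 10117 Berlin, Germany']
y = 10 * ['US', 'France', 'Germany']

ppl = Pipeline(steps=[('vectorizer', CountVectorizer()),
                      ('mnb', MultinomialNB())])

# Keys follow the '<step name>__<parameter>' convention
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'mnb__alpha': [0.1, 1.0],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(ppl, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
# The fitted search object predicts with the best pipeline found
prediction = search.predict(['5 Avenue Anatole France, 75007 Paris, France'])
print(prediction)
```

<p>Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the vocabulary never leaks from the validation fold into training.</p>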