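<p>I would use a combination of FunctionTransformer to select only certain columns, and then FeatureUnion to combine TF-IDF, word counts, and similar features for each column. There may be a slightly cleaner way, but I think you will end up with some kind of FeatureUnion and Pipeline nesting regardless.</p>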
<pre><code>from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def first_column(X):
    # Select the first column of the DataFrame as a Series of documents
    return X.iloc[:, 0]

def second_column(X):
    # Select the second column of the DataFrame as a Series of documents
    return X.iloc[:, 1]

# Pipeline to get both tfidf and word counts for the first column
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('feature-extractors', FeatureUnion([('tfidf', TfidfVectorizer()),
                                         ('counts', CountVectorizer())
                                         ]))
])

# Then a second pipeline to do the same for the second column
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('feature-extractors', FeatureUnion([('tfidf', TfidfVectorizer()),
                                         ('counts', CountVectorizer())
                                         ]))
])

# Then you would again feature union these pipelines
# to get a different feature extraction for each column
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])

# Your dataframe has the target as its first column, so drop it before transforming
y = df['label']
df = df.drop('label', axis=1)
# Now fit transform should work
final_transformer.fit_transform(df)
</code></pre>
<p>If you do not want to apply multiple transformers to each column (using both tfidf and counts is probably not useful), you can cut the nesting down by one level.</p>
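<p>A minimal sketch of that flatter layout, assuming TF-IDF alone is enough for each column. The toy DataFrame, its column names, and the helper <code>build_flat_transformer</code> are made up for illustration and are not part of the original question.</p>

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def first_column(X):
    # Select the first column as a Series of documents
    return X.iloc[:, 0]


def second_column(X):
    # Select the second column as a Series of documents
    return X.iloc[:, 1]


def build_flat_transformer():
    # Hypothetical helper: one vectorizer per column, so the inner
    # FeatureUnion from the nested version is no longer needed.
    return FeatureUnion([
        ('first-column-tfidf', Pipeline([
            ('column_selection', FunctionTransformer(first_column, validate=False)),
            ('tfidf', TfidfVectorizer()),
        ])),
        ('second-column-tfidf', Pipeline([
            ('column_selection', FunctionTransformer(second_column, validate=False)),
            ('tfidf', TfidfVectorizer()),
        ])),
    ])


# Toy data (assumed): two text columns, target column already dropped
toy_df = pd.DataFrame({
    'title': ['red apple', 'green pear', 'red cherry'],
    'body': ['sweet fruit', 'juicy fruit', 'tart fruit'],
})

# Sparse matrix with the TF-IDF columns of both text columns side by side
features = build_flat_transformer().fit_transform(toy_df)
```

<p>Each column still goes through its own selection step, so you can swap in a different vectorizer for one column without touching the other.</p>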