How do I build a pipeline for multiple DataFrame columns?

import pandas as pd

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spiderman man spider', 'url': 'spiderman.com', 'label': 1},
    {'title': 'doctor evil', 'text': 'a super evil doctor', 'url': 'evilempyre.com', 'label': 0},
])

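3 answers

User

Answer 1 · edited 2024-06-15 01:53:48

I would use FunctionTransformer to select only certain columns, and then FeatureUnion to combine TF-IDF, word counts, and similar features for each column. There may be a slightly cleaner way, but I think you will end up with some kind of nesting of FeatureUnion and Pipeline either way.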

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# pipeline to get all tfidf and word count for first column
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('feature-extractors', FeatureUnion([('tfidf', TfidfVectorizer()),
                                        ('counts', CountVectorizer())

    ]))
])

# Then a second pipeline to do the same for the second column
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('feature-extractors', FeatureUnion([('tfidf', TfidfVectorizer()),
                                        ('counts', CountVectorizer())

    ]))
])


# Then you would again feature union these pipelines 
# to get different feature selection for each column
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])

# Your dataframe also contains the target ('label'), so split it off and drop it first
y = df['label']
df = df.drop('label', axis=1)

# Now fit transform should work
final_transformer.fit_transform(df)

If you don't want to apply several transformers to each column (using both tfidf and counts is probably not useful), you can remove one level of nesting.
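A minimal sketch of that flatter layout, assuming a single TfidfVectorizer per column is enough. The name flat_transformer is made up for this illustration; the data and the column-selection functions are the ones from above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spiderman man spider', 'url': 'spiderman.com', 'label': 1},
    {'title': 'doctor evil', 'text': 'a super evil doctor', 'url': 'evilempyre.com', 'label': 0},
])

def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# One vectorizer per column, so the inner FeatureUnion is no longer needed.
# 'flat_transformer' is an illustrative name, not part of the original answer.
flat_transformer = FeatureUnion([
    ('first-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(first_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('second-column-tfidf', Pipeline([
        ('column_selection', FunctionTransformer(second_column, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ])),
])

y = df['label']
X = df.drop('label', axis=1)

# Sparse matrix with one row per document; the columns are the two
# per-column vocabularies stacked side by side.
features = flat_transformer.fit_transform(X)
print(features.shape)
```

Each column still gets its own vocabulary, so the same word appearing in title and text yields two separate features.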

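User

Answer 2 · edited 2024-06-15 01:53:48

@elphz's answer is a good introduction to how you can use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First, I would say you need to define your FunctionTransformer functions so that they handle and return your input data correctly. In this case I assume you just want to pass the DataFrame through, while making sure you return a correctly shaped array for downstream use. So I would suggest passing only the DataFrame and accessing columns by name, like this: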

def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])

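Now, to test variations of transformers and classifiers, I would suggest using a list of transformers and a list of classifiers and simply iterating over them, much like a grid search.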

[code block not preserved in the source]

This is a simple example, but you can see how to plug in any kind of transformer and classifier this way.
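A minimal sketch of the loop described above, not the answer's own code. The particular transformers and classifiers (TfidfVectorizer, CountVectorizer, LogisticRegression, MultinomialNB), the results dictionary, and scoring on the training data are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spiderman man spider', 'url': 'spiderman.com', 'label': 1},
    {'title': 'doctor evil', 'text': 'a super evil doctor', 'url': 'evilempyre.com', 'label': 0},
])

def text(X):
    return X['text'].values

def title(X):
    return X['title'].values

# Candidate lists are assumptions; swap in whatever you want to compare.
transformers = [('tfidf', TfidfVectorizer()), ('counts', CountVectorizer())]
classifiers = [('logreg', LogisticRegression()), ('nb', MultinomialNB())]

y = df['label']
X = df.drop('label', axis=1)

results = {}
for t_name, transformer in transformers:
    for c_name, classifier in classifiers:
        # clone() gives every pipeline its own unfitted copy of the estimator
        features = FeatureUnion([
            ('text', Pipeline([
                ('col_text', FunctionTransformer(text, validate=False)),
                ('vec', clone(transformer)),
            ])),
            ('title', Pipeline([
                ('col_title', FunctionTransformer(title, validate=False)),
                ('vec', clone(transformer)),
            ])),
        ])
        model = Pipeline([('features', features), ('clf', clone(classifier))])
        model.fit(X, y)
        # Training accuracy on three rows only shows the mechanics;
        # use a held-out set or cross-validation on real data.
        results[(t_name, c_name)] = model.score(X, y)

for combo, score in results.items():
    print(combo, score)
```

For real data, scikit-learn's GridSearchCV can run the same kind of comparison with cross-validation instead of a hand-written loop.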

User

Answer 3 · edited 2024-06-15 01:53:48

See the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html

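from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Stateless selector: nothing to learn
        return self

    def transform(self, data_dict):
        # Works for anything indexable by key, e.g. a dict or a DataFrame
        return data_dict[self.key]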

The key accepts a pandas DataFrame column label. When used in a pipeline, it can be applied as:

[code block not preserved in the source]
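One way this could look, as a sketch rather than the answer's own code: the step names, the choice of TfidfVectorizer, and the variable union are assumptions for illustration, applied to the DataFrame from the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat', 'url': 'batman.com', 'label': 1},
    {'title': 'spiderman', 'text': 'spiderman man spider', 'url': 'spiderman.com', 'label': 1},
    {'title': 'doctor evil', 'text': 'a super evil doctor', 'url': 'evilempyre.com', 'label': 0},
])

# Each branch selects one column by label, then vectorizes it.
union = FeatureUnion([
    ('title', Pipeline([
        ('selector', ItemSelector(key='title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

y = df['label']
features = union.fit_transform(df.drop('label', axis=1))
print(features.shape)
```

Selecting by label rather than by position keeps the pipeline working if the column order of the DataFrame changes.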
