Problem preparing a column that seems not to be UTF-8: Found unknown categories ['Fès-Meknès'] during transform

Posted 2024-09-30 20:39:09


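I am trying to prepare the input and output data for a feature selection problem, but I hit an error on a column whose values do not seem to be valid Unicode: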

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-89-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-86-e63e5d5fad63> in prepare_inputs(X_train, X_test)
      3     oe.fit(X_train)
      4     X_train_enc = oe.transform(X_train)
----> 5     X_test_enc = oe.transform(X_test)
      6     return X_train_enc, X_test_enc
      7 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in transform(self, X)
    812 
    813         """
--> 814         X_int, _ = self._transform(X)
    815         return X_int.astype(self.dtype, copy=False)
    816 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _transform(self, X, handle_unknown)
    105                     msg = ("Found unknown categories {0} in column {1}"
    106                            " during transform".format(diff, i))
--> 107                     raise ValueError(msg)
    108                 else:
    109                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['Fès-Meknès'] in column 4 during transform
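For context, the traceback shows the encoder building a `diff` of values before raising, which suggests the message is about values seen in `transform` that were absent during `fit`. A minimal sketch of that check, assuming made-up sample values and a hypothetical helper name (`unknown_categories`), not the library's actual implementation:

```python
def unknown_categories(train_column, test_column):
    """Return the test values that never appeared in the training column."""
    seen = set(train_column)
    return sorted(set(test_column) - seen)


# Made-up sample values, only for illustration.
train_cities = ["Madrid", "Rabat", "Madrid"]
test_cities = ["Madrid", "Fès-Meknès"]

# Any value listed here would trigger "Found unknown categories" on transform.
print(unknown_categories(train_cities, test_cities))
```

If this reading is right, the error can appear whenever a category lands only in the test split, whether or not its text is mis-encoded.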

Here is an excerpt of the columns:

    Do you agree  Gender  Age    City          Urban/Rural  Output
0   Yes           Female  25-34  Madrid        Urban        Will buy
1   No            Male    18-25  Fès-Meknès  Rural        Won't
2   ...           ...     ...    ...           ...          Undecided
....

Fès-Meknès should be Fès-Meknès.
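The garbled form looks like UTF-8 bytes that were decoded as Latin-1 somewhere along the way. A minimal sketch of reversing that, assuming this is indeed what happened to the data (`repair_mojibake` is a hypothetical helper name):

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (assumed cause)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this particular kind, so leave the value untouched.
        return text


print(repair_mojibake("Fès-Meknès"))  # prints Fès-Meknès
print(repair_mojibake("Madrid"))        # plain ASCII passes through unchanged
```

Values that are already correct are returned as they are, so the helper could be applied to a whole column, for example with `df['City'].map(repair_mojibake)`.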

Here is the code I use to fetch the data:

import pandas as pd
import psycopg2

def load_dataset():
    connection = psycopg2.connect(user="user",
                                  password="passwd",
                                  host="host",
                                  port="5432",
                                  database="database")

    sql = "select * from capi limit 10;"
    # load the table into a DataFrame
    df = pd.read_sql_query(sql, connection)
    # retrieve numpy array
    dataset = df.values

    # split into input (X) and output (y) variables:
    # take the columns from index 5 onward, minus the target column
    cols = df.iloc[:, 5:].columns.array
    filtered_cols = ['TL_Segment']
    cols = [col for col in cols if col not in filtered_cols]

    X = df.loc[:, cols]      # independent columns
    X = X.astype(str)
    y = df['TL_Segment']     # target column
    return X.values, y.values

The connection reports the correct encoding when I run: print conn.encoding

I tried adding connection.set_client_encoding('UTF8') before the query, but the same problem persists.

Ignoring the rows with encoding errors

I tried to skip these rows with try/except:

def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    try:
        X_train_enc = oe.transform(X_train)
        try: # imbricated in order not to return nothing in one of the two things returned
            X_test_enc = oe.transform(X_test)
        except ValueError as e:
            print(e)
    except ValueError as e:
        print(e)
    return X_train_enc, X_test_enc

But I still get the following:

Found unknown categories ['Fès-Meknès'] in column 4 during transform

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-126-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-124-2376647ab46e> in prepare_inputs(X_train, X_test)
     10     except ValueError as e:
     11         print(e)
---> 12     return X_train_enc, X_test_enc
     13 

UnboundLocalError: local variable 'X_test_enc' referenced before assignment
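The second error follows from the first: when `oe.transform(X_test)` raises, the `except` branch only prints the message, so `X_test_enc` is never assigned before the `return`. One possible direction is to encode unseen values with a sentinel instead of raising. The sketch below is a pure-Python illustration of that idea with made-up rows and hypothetical helper names (`fit_categories`, `transform`); it is not scikit-learn's code. As far as I know, scikit-learn 0.24 and later offer similar behavior through `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)`, but the version in the traceback above may predate that option.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of each column from the training rows."""
    n_cols = len(rows[0])
    return [sorted({row[i] for row in rows}) for i in range(n_cols)]


def transform(rows, categories, unknown_value=-1):
    """Map each value to its index; unseen values get unknown_value instead of raising."""
    lookups = [{cat: idx for idx, cat in enumerate(col)} for col in categories]
    return [
        [lookups[i].get(value, unknown_value) for i, value in enumerate(row)]
        for row in rows
    ]


# Made-up rows, only for illustration.
X_train = [["Female", "Madrid"], ["Male", "Rabat"]]
X_test = [["Male", "Fès-Meknès"]]

categories = fit_categories(X_train)
X_train_enc = transform(X_train, categories)
X_test_enc = transform(X_test, categories)

print(X_train_enc)  # [[0, 0], [1, 1]]
print(X_test_enc)   # [[1, -1]]
```

Both return values are always assigned here, so the function that wraps these calls cannot hit the `UnboundLocalError`; whether a sentinel such as -1 is acceptable depends on the downstream feature selection step.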
