Fuzzymatcher returns NaN for best match score

Posted 2024-06-01 23:19:05


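I am seeing some strange behavior when running fuzzy_left_join from the fuzzymatcher library. I am trying to join two DataFrames: the left one has 5217 records and the right one has 8734. Only 71 records end up with a best_match_score, which seems far too few. To get better results I even stripped all digits from the join columns, leaving only alphabetic characters. In the merged table, the id column from the right table is NaN, which is also unexpected.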
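For reference, the cleanup step described above can be sketched roughly as follows. Judging by the sample rows below (for example `Limoni500g09700112` becoming `limonig`), it appears to be lowercasing plus removal of every non-letter character; the function name `normalize_name` is hypothetical and not part of the original code.

```python
import re

def normalize_name(raw: str) -> str:
    """Lowercase the name and drop everything that is not an ASCII letter."""
    return re.sub(r"[^a-z]", "", raw.lower())

# Names taken from the sample tables in this question
samples = [
    "Limoni500g09700112",
    "Morkovi1kgbr09700132",
    "AhmadCajLimonIDjindjifil20X2G",
]
cleaned = [normalize_name(s) for s in samples]
print(cleaned)
```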

Left table, joined on the "amazon_s3_name" column. First row: limonig

+------+---------+-------+-----------+------------------------------------+
|  id  | product | price | category  |           amazon_s3_name           |
+------+---------+-------+-----------+------------------------------------+
|    1 | A       |  1.49 | fruits    | limonig                            |
| 8964 | B       |  1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C       |  2.79 | beverages | studencajfuzelimonilimtreval       |
+------+---------+-------+-----------+------------------------------------+

Right table, joined on the "amazon_s3_name" column. Last row: limoni

+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
|  id  |                                                       picture                                                              |                    amazon_s3_name          |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
|  191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg                          | ahmadcajlimonidjindjifilxg                 |
|  192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg                       | ahmadcajlimonidjindjifilxgg                |
|  204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg               | ahmadcajlimonidjindjifilxgg                |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg              | cajstudenfuzetealimonilimonovatrevalpet    |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg                 | lesieursalatensosslimonimaslinovomaslo     |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml  |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg                                                 | limoni                                     |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+

Merged table. As you can see, best_match_score is NaN in the merged table.

+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left  | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
|  0 | NaN              | 0_left    | None       |  1.49 | Fruits   | Limoni500g09700112   | NaN        | limonig             | NaN         | NaN                  |
|  2 | NaN              | 2_left    | None       |  1.69 | Bio      | Morkovi1kgbr09700132 | NaN        | morkovikgbr         | NaN         | NaN                  |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+

1 Answer
User
#1 · Posted 2024-06-01 23:19:05

You can try PolyFuzz. Use one of its example setups, for example TF-IDF or BERT, then run:

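from polyfuzz import PolyFuzz
# "TF-IDF" selects PolyFuzz's built-in TF-IDF matcher; pass a custom matcher object instead to use BERT embeddings
model = PolyFuzz("TF-IDF").match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
df1["To"] = model.get_matches()["To"]  # assumes df1 has a default 0..n-1 index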

Then merge:

df1.merge(df2, left_on='To', right_on='amazon_s3_name')
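If you want a quick sanity check on a few names without installing anything, Python's standard `difflib` can do it. Note that this is a different technique (character-sequence similarity) from what fuzzymatcher or PolyFuzz use, so the scores are not comparable. The helper name `best_match` and the 0.6 cutoff are assumptions for illustration; the names come from the sample rows in the question.

```python
from difflib import get_close_matches

def best_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or None if none reaches `cutoff`."""
    hits = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# A few cleaned names from the right table
right_names = [
    "ahmadcajlimonidjindjifilxg",
    "cajstudenfuzetealimonilimonovatrevalpet",
    "lesieursalatensosslimonimaslinovomaslo",
    "limoni",
]
# Two cleaned names from the left table
left_names = ["limonig", "morkovikgbr"]

# Map each left name to its closest right name (None when nothing is close enough)
matches = {name: best_match(name, right_names) for name in left_names}
print(matches)
```

Names that map to None here have no close counterpart among these sample rows, which helps separate genuinely unmatched rows from rows the join failed to pair.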
