A method for combining similar column values into larger supergroups

1 Answer

User

Answer #1 · Posted 2024-09-29 21:40:58

We need these libraries:

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations
import networkx as nx

Assume Diagnosis is your column, as a pandas Series:

Diagnosis = pd.Series(["headache","headache","headche","UTI",
"cough","cough","cough","UTIs","UTI","coughs","UTI"])

Let's do some string matching:

Diagnosis_unique = Diagnosis.unique()
matches = pd.DataFrame(combinations(Diagnosis_unique,2))
matches['score'] = matches.apply(lambda x: fuzz.WRatio(x[0],x[1]), axis=1)

Here is the matches DataFrame:

|    | 0        | 1       |   score |
|---:|:---------|:--------|--------:|
|  0 | headache | headche |      93 |
|  1 | headache | UTI     |       0 |
|  2 | headache | cough   |      45 |
|  3 | headache | UTIs    |       0 |
|  4 | headache | coughs  |      14 |
|  5 | headche  | UTI     |       0 |
|  6 | headche  | cough   |      17 |
|  7 | headche  | UTIs    |       0 |
|  8 | headche  | coughs  |      15 |
|  9 | UTI      | cough   |      30 |
| 10 | UTI      | UTIs    |      86 |
| 11 | UTI      | coughs  |      30 |
| 12 | cough    | UTIs    |      22 |
| 13 | cough    | coughs  |      91 |
| 14 | UTIs     | coughs  |      45 |

Now let's drop the rows that don't match. I used a threshold of 80; you can use whatever score you prefer:

matches = matches[matches['score']>=80]
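To get a feel for how the threshold changes which pairs survive, here is a minimal sketch. It is an assumption-laden stand-in: it uses the standard library's `difflib.SequenceMatcher` instead of `fuzz.WRatio`, so the scores are on a 0 to 1 scale and will not match the table above exactly. The helper names `similarity` and `pairs_above` are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Unique values from the example Series, in order of first appearance.
names = ["headache", "headche", "UTI", "cough", "UTIs", "coughs"]

def similarity(a, b):
    """Similarity in [0, 1]; a stdlib stand-in for fuzz.WRatio (not the same scorer)."""
    return SequenceMatcher(None, a, b).ratio()

def pairs_above(threshold):
    """Return every pair of names whose similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]

kept_at_80 = pairs_above(0.80)
kept_at_90 = pairs_above(0.90)
print(kept_at_80)
print(kept_at_90)
```

With this scorer, a stricter threshold drops the UTI/UTIs pair, because short strings lose a large share of their score to a single extra character. It is worth checking a few thresholds against your own data before settling on one.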

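Now that we have the matches, we need to link similar names together. In your example each group contains only one kind of typo, but there could be more: A may match B and B may match C even when A and C do not match directly. So we turn to graph theory for help: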

G = nx.from_pandas_edgelist(matches,0,1)

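# DataFrame.append was removed in pandas 2.0, so collect the clusters
# in a list and build the DataFrame in one step (one row per cluster).
clusters = [list(cluster) for cluster in nx.connected_components(G)
            if len(cluster) != 1]
connected_names = pd.DataFrame(clusters)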
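To illustrate why connected components help here, consider a minimal sketch with a made-up third spelling, "hedche", that is assumed to match "headche" but not "headache". The edge list below is hypothetical and written by hand rather than produced by the scoring step.

```python
import networkx as nx

# Hypothetical matched pairs: "hedche" is linked to "headache" only
# indirectly, through "headche".
G = nx.Graph()
G.add_edges_from([
    ("headache", "headche"),
    ("headche", "hedche"),
    ("UTI", "UTIs"),
])

# Each connected component is one group of names that belong together.
# Sorting is only there to make the printed order stable.
clusters = sorted(sorted(component) for component in nx.connected_components(G))
print(clusters)
```

All three headache spellings end up in the same cluster even though only two of the three possible pairs were matched.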

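Now we have a graph whose clusters contain similar names. We need to convert it into a dictionary so that we can replace values in the original data: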

connected_names = (
    connected_names
    .reset_index(drop=True)
    .melt(id_vars=0)
    .drop('variable', axis=1)
    .dropna()
    .reset_index(drop=True)
    .set_index('value')
)

names_dict = connected_names.to_dict()[0]

Here is names_dict:

{'headache': 'headche', 'UTIs': 'UTI', 'cough': 'coughs'}

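The drawback of this approach is that it cannot tell which value in a cluster is the correctly spelled one; the name that ends up as the replacement is arbitrary. However, you can fix that manually afterwards.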
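One possible way to reduce the manual work, offered as a sketch rather than as part of the method above, is to assume that the most frequent spelling in each cluster is the correct one. That assumption will not always hold. The `clusters` list below is hard-coded as a stand-in for the output of `nx.connected_components(G)`.

```python
import pandas as pd

Diagnosis = pd.Series(["headache", "headache", "headche", "UTI",
                       "cough", "cough", "cough", "UTIs", "UTI", "coughs", "UTI"])

# Stand-in for the clusters produced by nx.connected_components(G).
clusters = [{"headache", "headche"}, {"UTI", "UTIs"}, {"cough", "coughs"}]

# How often each spelling occurs in the original column.
counts = Diagnosis.value_counts()

names_dict = {}
for cluster in clusters:
    # Assumption: the most common spelling is the right one.
    # Iterating in sorted order makes ties resolve alphabetically.
    canonical = max(sorted(cluster), key=lambda name: counts[name])
    for name in cluster:
        if name != canonical:
            names_dict[name] = canonical

cleaned = Diagnosis.replace(names_dict)
print(names_dict)
```

When a typo happens to be more common than the correct spelling, this heuristic picks the typo, so a manual review of `names_dict` is still a good idea.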

Now let's replace the values in the original Series:

Diagnosis = Diagnosis.replace(names_dict)

Voilà:

|    | 0       |
|---:|:--------|
|  0 | headche |
|  1 | headche |
|  2 | headche |
|  3 | UTI     |
|  4 | coughs  |
|  5 | coughs  |
|  6 | coughs  |
|  7 | UTI     |
|  8 | UTI     |
|  9 | coughs  |
| 10 | UTI     |

Finally, you can build your own correction dictionary to fix the unified values:

manual_correction = {"headche":"headache"}
Diagnosis = Diagnosis.replace(manual_correction)

Result:

|    | 0        |
|---:|:---------|
|  0 | headache |
|  1 | headache |
|  2 | headache |
|  3 | UTI      |
|  4 | coughs   |
|  5 | coughs   |
|  6 | coughs   |
|  7 | UTI      |
|  8 | UTI      |
|  9 | coughs   |
| 10 | UTI      |
