How to process a DataFrame column of lists containing string values to get the unique words

import re

def dimension_unique_words(dimensions):
    if dimensions != 'None':
        for value in dimensions:
            # Replace punctuation and the units "ft"/"feet" with spaces
            new_value = re.sub(r'[^\w\s]|ft|feet', ' ', value)
            # Drop the digits
            new_value = ''.join([i for i in new_value if not i.isdigit()])
            return new_value

df['new_col'] = df['dimensions'].apply(dimension_unique_words)

3 Answers

User

#1 · edited 2024-10-01 19:27:14

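First, the wrinkle we have to deal with is that your 'dimensions' column is sometimes None and sometimes a list containing a single string. So extract that element whenever the value is non-null: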

df['dimensions2'] = df['dimensions'].apply(lambda col: col[0] if col else None)

Next, get all the alphabetic strings in each row, excluding the measurements:

>>> df['dimensions2'].str.findall(r'\b([a-z]+)')
0                 [long]
1                   None
2    [long, wide, thick]
3     [high, long, wide]

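Note that we use the \b word boundary (which excludes the "ft" in "30ft"), and to keep \b from being interpreted as a backspace character we must write the regex as an r'' raw string.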
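The two points in that note can be checked with the standard `re` module alone. This is a minimal sketch; the sample string `text` is a hypothetical value, not taken from the asker's data.

```python
import re

# In a normal string literal "\b" is the backspace character (0x08);
# in a raw string it stays a backslash followed by "b", which is what
# the regex engine needs to see.
assert "\b" == "\x08"
assert r"\b" == "\\b"

text = "30ft long, 12ft wide"  # hypothetical sample value

# \b demands a word boundary before the letters, so "ft" (glued to the
# digits) is skipped.
with_boundary = re.findall(r"\b([a-z]+)", text)

# Without \b the unit is picked up as well.
without_boundary = re.findall(r"([a-z]+)", text)

print(with_boundary)     # ['long', 'wide']
print(without_boundary)  # ['ft', 'long', 'ft', 'wide']
```

Because digits and letters are both word characters, there is no boundary between "30" and "ft". A value written with a space, such as "30 ft", would still match "ft".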

User

#2 · edited 2024-10-01 19:27:14

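First, the wrinkle we have to deal with is that your 'dimensions' column is sometimes None and sometimes a list containing a single string. So extract that element whenever the value is non-null: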

df['dimensions2'] = df['dimensions'].apply(lambda col: col[0] if col else None)

Next, get all the alphabetic strings in each row, excluding the measurements:

>>> df['dimensions2'].str.findall(r'\b([a-zA-Z]+)')
0                 [long]
1                   None
2    [long, wide, thick]
3     [high, long, wide]

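Note that we use the \b word boundary (which excludes the "ft" in "30ft"), and to keep \b from being interpreted as a backspace character we must write the regex as an r'' raw string.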

That gives you a list. You want a set, so that duplicates cannot occur, hence:

>>> df['dimensions2'].str.findall(r'\b([a-zA-Z]+)').apply(lambda l: set(l) if l else None)
0                 {long}
1                   None
2    {thick, long, wide}
3     {high, long, wide}
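The steps of this answer can be put together as one runnable sketch. The sample rows below are hypothetical, chosen only to resemble the output shown above. The `isinstance` checks are an added assumption: depending on the pandas version, `str.findall` may return NaN rather than None for missing rows, and a plain truthiness test would then fail on `set(l)`.

```python
import pandas as pd

# Hypothetical sample data: each cell is None or a one-element list.
df = pd.DataFrame({
    "dimensions": [
        ["30ft long"],
        None,
        ["6ft long, 2ft wide, 1ft thick"],
        ["3ft high, 5ft long, 5ft wide"],
    ]
})

# Step 1: pull the single string out of each list, keeping None for missing rows.
df["dimensions2"] = df["dimensions"].apply(
    lambda col: col[0] if isinstance(col, list) and col else None
)

# Step 2: collect the alphabetic words; "ft" is skipped because it follows digits.
words = df["dimensions2"].str.findall(r"\b([a-zA-Z]+)")

# Step 3: turn each list into a set. Checking for a list explicitly
# avoids calling set() on a NaN placeholder.
word_sets = words.apply(lambda l: set(l) if isinstance(l, list) and l else None)

print(word_sets)
```

Sets do not preserve order, so the printed order of the words within a row can differ between runs.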

User

#3 · edited 2024-10-01 19:27:14

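Use str.findall to find all the dimension words in each list
Use explode to break each list into separate elements that share the same index
Then use groupby(level=0).unique() to collect the values for each index back together with duplicates removed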

df['new_col'] = (
  df['dimensions'].fillna('').astype(str)
 .str.findall(r'\b[a-zA-Z]+\b')
 .explode().dropna()
 .groupby(level=0).unique()
)

Use df['new_col'].explode().dropna().unique() to get the unique dimension words across the whole column

array(['long', 'wide', 'thick', 'high'], dtype=object)
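To try this approach end to end, here is a small self-contained sketch. The sample rows are hypothetical and only meant to resemble the data discussed in this thread; the names `per_row` and `all_words` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample data: each cell is None or a one-element list.
df = pd.DataFrame({
    "dimensions": [
        ["30ft long"],
        None,
        ["6ft long, 2ft wide, 1ft thick"],
        ["3ft high, 5ft long, 5ft wide"],
    ]
})

per_row = (
    df["dimensions"].fillna("").astype(str)   # lists become their string repr
    .str.findall(r"\b[a-zA-Z]+\b")           # whole alphabetic words only
    .explode().dropna()                       # one word per row, empty rows dropped
    .groupby(level=0).unique()                # unique words per original index
)

# Index alignment leaves NaN in rows that had no words.
df["new_col"] = per_row

# Unique words across the whole column, in order of first appearance.
all_words = df["new_col"].explode().dropna().unique()
print(list(all_words))  # ['long', 'wide', 'thick', 'high']
```

Note that `astype(str)` works on the printed form of each list, which is fine here because brackets and quotes are not letters and therefore never end up in a match.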
