How do I extract data from a text field in a pandas DataFrame?

Posted 2024-10-03 11:13:48


I want to get the distribution of tags from this DataFrame:

df=pd.DataFrame([
    [43,{"tags":["webcom","start","temp","webcomfoto","dance"],"image":["https://image.com/Kqk.jpg"]}],
    [83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
    [76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
    [77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
    [81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])

I need a table showing how many 'id's have each specific number of tags. For example:

^{pr2}$

When "tags" was the only field, I used this approach. In this DataFrame I also have "image", "users", and other text fields with values. How should I process the data in this case?

Thank you.


3 answers

You can use the str accessor to look up the dictionary key, and value_counts to get the counts:

df.tags.str['tags'].str.len().value_counts()\
  .rename('Posts')\
  .rename_axis('Tags')\
  .reset_index()

Output:

^{pr2}$
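
If some rows have no "tags" key at all (only "image", "users", and so on), the same str-accessor chain can still be used, since the lookup yields a missing value for those rows. Below is a minimal sketch of that idea, not a definitive implementation. The sample rows (including `_id` 90 and 91) are made up for illustration, and treating a missing key as zero tags is an assumption you may not want.

```python
import pandas as pd

# Hypothetical sample data, shaped like the frame in the question
df = pd.DataFrame(
    [
        [43, {"tags": ["webcom", "start", "temp"], "image": ["https://image.com/Kqk.jpg"]}],
        [76, {"tags": ["en", "webcom"], "users": ["otole"]}],
        [81, {"tags": ["webcomfotografiya"], "links": ["https://webcom.com/nk"]}],
        [90, {"image": ["https://images.com/test.jpg"]}],  # no "tags" key
        [91, {"tags": ["start", "dance"], "users": ["myself"]}],
    ],
    columns=["_id", "tags"],
)

# .str["tags"] looks up the key in each dict; a missing key gives a missing value
tag_lists = df["tags"].str["tags"]

# Length of each tag list; rows without tags are counted as 0 (an assumption)
n_tags = tag_lists.str.len().fillna(0).astype(int)

dist = (
    n_tags.value_counts()
    .sort_index()          # order by number of tags, not by frequency
    .rename("Posts")
    .rename_axis("Tags")
    .reset_index()
)
print(dist)
```

Dropping the `fillna(0)` step and calling `dropna()` instead would exclude rows without tags rather than counting them as zero.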

Sticking with collections.Counter, here is one way:

from collections import Counter
from operator import itemgetter

c = Counter(map(len, map(itemgetter('tags'), df['tags'])))

res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['Tags', 'Posts']

print(res)

   Tags  Posts
0     5      2
1     3      1
2     2      1
3     1      1
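
Since the question mentions other fields such as "image" and "users", the Counter idea can be wrapped in a small helper that takes the field name. This is only a sketch: `length_distribution` and the column names "Entries" and "Posts" are hypothetical names chosen for illustration, and a missing field is assumed to mean zero entries.

```python
from collections import Counter

import pandas as pd


def length_distribution(column, field):
    """Count how many rows have 0, 1, 2, ... entries under `field`."""
    # dict.get with a default of [] treats a missing field as zero entries
    counts = Counter(len(d.get(field, [])) for d in column)
    return pd.DataFrame(sorted(counts.items()), columns=["Entries", "Posts"])


# A few rows taken from the question's DataFrame
df = pd.DataFrame(
    [
        [76, {"tags": ["en", "webcom"], "links": ["http://webcom.webcomdb.com", "http://webcom.webcomstats.com"], "users": ["otole"]}],
        [81, {"tags": ["webcomfotografiya"], "users": ["myself", "boattva"], "links": ["https://webcom.com/nk"]}],
        [83, {"tags": ["yourself", "start", ""], "image": ["https://images.com/test.jpg"]}],
    ],
    columns=["_id", "tags"],
)

users_dist = length_distribution(df["tags"], "users")
tags_dist = length_distribution(df["tags"], "tags")
print(users_dist)
print(tags_dist)
```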

The problem is that the data in tags are strings, not dictionaries.

So a first step is needed:

import ast

df['tags'] = df['tags'].apply(ast.literal_eval)

Then apply the original answer; it works well even when there are multiple fields.
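
One caveat: ast.literal_eval expects a string, so it raises an error on values that are already dicts. If you are not sure which form the column holds, a small guard can help. The sketch below assumes a mixed column; `to_dict` and the sample Series are hypothetical and only for illustration.

```python
import ast

import pandas as pd


def to_dict(value):
    """Parse a string into a Python literal; pass non-strings through unchanged."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


# Hypothetical mixed column: the first row is a string, the second is already a dict
s = pd.Series(
    [
        '{"tags": ["en", "webcom"], "users": ["otole"]}',
        {"tags": ["start"], "image": ["https://images.com/test.jpg"]},
    ]
)

parsed = s.apply(to_dict)
print(parsed.apply(type))

# After parsing, the usual tag-count chain works on every row
tag_counts = parsed.str["tags"].str.len().tolist()
print(tag_counts)
```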

Verifying:

^{pr2}$
#convert the column to strings to verify the solution
df['tags'] = df['tags'].astype(str)

print (df['tags'].apply(type))
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: tags, dtype: object

#convert back
df['tags'] = df['tags'].apply(ast.literal_eval)

print (df['tags'].apply(type))
0    <class 'dict'>
1    <class 'dict'>
2    <class 'dict'>
3    <class 'dict'>
4    <class 'dict'>
Name: tags, dtype: object

from collections import Counter  # needed if not already imported

c = Counter(len(x['tags']) for x in df['tags'])

df = pd.DataFrame({'Number of posts':list(c.values()), ' Number of tags ': list(c.keys())})
print (df)
   Number of posts   Number of tags 
0                1                 0
1                1                 3
2                1                 2
3                1                 5
4                1                 1
