Is there a way to create columns from a list of phrases?

Posted 2024-09-28 03:24:40


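I have lists of phrases that I want to convert into columns of a DataFrame, to use as input to a machine learning model. The code should find the unique phrases across all rows of data, create a column for each unique phrase, and show a 1 if the phrase is present in a row and a 0 if it is not.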

The phrases look like this:

{"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises",
 "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly",
 "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector",
 "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"
 }

{"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen",
 "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating",
 "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector",
 "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials",
 "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"
 }

Desired output in the DataFrame:

[Image: desired output DataFrame]


3 Answers

You can do it like this:

import pandas as pd

rows = [["TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises",
         "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly",
         "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector",
         "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"
         ],

        ["TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen",
         "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating",
         "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector",
         "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials",
         "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"
         ]
        ]

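# Collect the unique phrases across all rows (works for any number of rows)
header = list(set().union(*rows))

# Map each phrase to a list holding one count per row
words_count = {}
for i in header:
    words_count[i] = []
for row in rows:
    for i in header:
        # row.count(i) is 1 if the phrase is present and 0 if absent,
        # as long as no phrase is repeated within a row
        words_count[i].append(row.count(i))

df = pd.DataFrame(data=words_count, columns=header)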

print(df)

# Output (column order may vary between runs, because sets are unordered)
   Safety Card  Hair Dryer  ...  Carbon Monoxide Detector  Shampoo
0            1           0  ...                         1        0
1            0           1  ...                         1        1

[2 rows x 26 columns]
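
One caveat with `row.count(i)`: if a phrase appears more than once in a row, the cell holds that count rather than a 1. The following is a minimal sketch, assuming strictly binary indicators are wanted, that uses a membership test instead. The sample rows are made up for illustration, and the columns are sorted only to make the order reproducible.

```python
import pandas as pd

# Hypothetical sample rows; "TV" is deliberately repeated in the first row
rows = [["TV", "Internet", "TV"],
        ["Internet", "Kitchen"]]

# Sorted union of all phrases gives a reproducible column order
header = sorted(set().union(*rows))

# One dict per row: 1 if the phrase is present, 0 otherwise
records = [{phrase: int(phrase in row) for phrase in header} for row in rows]

df = pd.DataFrame(records, columns=header)
print(df)
```

Here the repeated "TV" still produces a 1, whereas a count-based version would produce a 2.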

First, create the columns for the DataFrame:


import pandas as pd

set1 = {"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises",
 "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly",
 "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector",
 "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"
 }

set2 = {"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen",
 "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating",
 "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector",
 "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials",
 "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"
 }

# Create a list of iterables for later
list_of_sets = [set1, set2]

# Create a list with the "splat" operator, and then create a set from the list
columns = set([*set1, *set2])

# Optionally remove spaces, commas, etc
columns_optional = set([x.replace(" ", "").replace(",", "").replace("/", "") for x in columns])

Now create the DataFrame rows:


def create_rows(list_of_iterables, columns):
    """Iterate through list of iterables (i.e. sets of words)
    and check if they're in the columns"""

    list_of_df_rows = []
    for iterable in list_of_iterables:
        # 1 if the column's phrase is in this iterable, otherwise 0
        row_dict = {col: int(col in iterable) for col in columns}
        list_of_df_rows.append(row_dict)

    return list_of_df_rows

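# Create DataFrame rows
rows = create_rows(list_of_sets, columns)

# Create a DataFrame with one row per set and one column per phrase.
# The set is converted to a list because recent pandas versions
# reject a set passed as `columns`.
df = pd.DataFrame(rows, columns=list(columns))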

print(df)

# Output
   Air Conditioning  TV  ...  Free Parking on Premises  Washer,Dryer
0                 0   1  ...                         1             1
1                 1   1  ...                         0             0
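
The `columns_optional` set above is built but never applied to the DataFrame. As a rough sketch of one way to apply the same cleanup, `DataFrame.rename` accepts a function that is called on each column label. The small frame and the `clean` helper below are hypothetical stand-ins, not part of the answer.

```python
import pandas as pd

# Hypothetical indicator frame standing in for `df` above
df = pd.DataFrame({"Wireless Internet": [1, 1],
                   "Washer,Dryer": [1, 0],
                   "Family/Kid Friendly": [1, 1]})

def clean(name):
    """Strip spaces, commas and slashes from a column name."""
    return name.replace(" ", "").replace(",", "").replace("/", "")

# rename applies the function to every column label and returns a new frame
df_clean = df.rename(columns=clean)
print(list(df_clean.columns))
```

Renaming after the frame is built keeps each column tied to its original phrase, which avoids having to match cleaned names back to the raw data.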

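Since you aren't asking why your code doesn't work, you must be asking for an algorithm. Create a dictionary whose keys are the phrases and whose values are lists holding a 0 or 1 for each row. A collections.defaultdict(list) should help.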

d = {'phrase1':[row1,row2,...],'phrase2':[row1,row2,...],...}
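  • Iterate over the rows as sets, so that set operations can be used
  • For each row
    • Find the phrases in this row that are not yet in the dictionary. This is the difference between the row and the dictionary keys: row - d.keys()
      • For each phrase in that difference, append zeros to its value, one per earlier row
        • For the first row append no zeros, for the second row append one zero, for the third row append two zeros
    • Find the earlier phrases that are not in this row. This is the difference between the dictionary keys and the row: d.keys() - row
      • For each phrase in that difference, append a zero
    • For each phrase in the row, append a one
  • Feed the dictionary to the DataFrame constructor: df = pandas.DataFrame(d)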
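
The steps above can be sketched roughly as follows. This is one possible reading of the outline, not a definitive implementation, and the three sample rows are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample rows, each a set of phrases
rows = [{"TV", "Internet", "Heating"},
        {"TV", "Kitchen"},
        {"Internet", "Iron"}]

d = defaultdict(list)
for n, row in enumerate(rows):
    # Phrases seen for the first time: back-fill one 0 per earlier row
    for phrase in row - d.keys():
        d[phrase].extend([0] * n)
    # Phrases seen earlier but absent from this row get a 0
    for phrase in d.keys() - row:
        d[phrase].append(0)
    # Every phrase in this row gets a 1
    for phrase in row:
        d[phrase].append(1)

# All lists now have one entry per row, so they form the columns directly
df = pd.DataFrame(d)
print(df)
```

Because each list is padded as soon as its phrase first appears, every list has the same length after each row, which is what the DataFrame constructor requires. The column order depends on set iteration order and may differ between runs.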
