Python: How do I convert raw data with multi-valued columns into a tidy DataFrame?

Posted 2024-09-27 00:11:40


I got this data from a web crawler and want to convert it into a tidy DataFrame.

What I have now:

+-----------+------------------------------------------+-------------+---------------------+
| HotelName | RoomType                                 | RoomFloor   | RoomPrice           |
+-----------+------------------------------------------+-------------+---------------------+
| Hotel1    | Standard,Standard,Standard,Deluxe,Deluxe | 10F,20F     | 100,105,108,200,205 |
| Hotel2    | Standard,Standard,Deluxe,Deluxe,Grande   | 30F,40F,50F | 90,95,250,240,300   |
+-----------+------------------------------------------+-------------+---------------------+

What I ultimately want:

+-----------+----------+-----------+-----------+
| HotelName | RoomType | RoomFloor | RoomPrice |
+-----------+----------+-----------+-----------+
| Hotel1    | Standard | 10F       | 100       |
| Hotel1    | Standard | 10F       | 105       |
| Hotel1    | Standard | 10F       | 108       |
| Hotel1    | Deluxe   | 20F       | 200       |
| Hotel1    | Deluxe   | 20F       | 205       |
| Hotel2    | Standard | 30F       | 90        |
| Hotel2    | Standard | 30F       | 95        |
| Hotel2    | Deluxe   | 40F       | 250       |
| Hotel2    | Deluxe   | 40F       | 240       |
| Hotel2    | Grande   | 50F       | 300       |
+-----------+----------+-----------+-----------+

I am new to Python and cannot work this out. Can anyone help? Thanks!


3 answers

If RoomFloor is defined separately for each room (one floor entry per room, as in the frame printed below), a solution is:

print (df)
  HotelName                                  RoomType            RoomFloor  \
0    Hotel1  Standard,Standard,Standard,Deluxe,Deluxe  10F,10F,10F,20F,20F   
1    Hotel2    Standard,Standard,Deluxe,Deluxe,Grande  30F,30F,40F,40F,50F   

             RoomPrice  
0  100,105,108,200,205  
1    90,95,250,240,300  

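# Columns holding comma-separated values (same number of items per row in each)
cols = ['RoomType','RoomFloor','RoomPrice']
# Split each column on commas, stack the pieces into rows, then drop the inner index level
a = df[cols].apply(lambda x: x.str.split(',', expand=True).stack()).reset_index(level=1, drop=True)
# Join the long-format columns back onto HotelName via the original row index
df = df.drop(cols, axis=1).join(a).reset_index(drop=True)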
print (df)
  HotelName  RoomType RoomFloor RoomPrice
0    Hotel1  Standard       10F       100
1    Hotel1  Standard       10F       105
2    Hotel1  Standard       10F       108
3    Hotel1    Deluxe       20F       200
4    Hotel1    Deluxe       20F       205
5    Hotel2  Standard       30F        90
6    Hotel2  Standard       30F        95
7    Hotel2    Deluxe       40F       250
8    Hotel2    Deluxe       40F       240
9    Hotel2    Grande       50F       300
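
As a possible alternative to the split/stack approach above, newer pandas versions can explode several list columns at once. The sketch below is not part of the original answer; it assumes pandas 1.3 or later (where DataFrame.explode accepts a list of columns) and the same input as above, with RoomFloor already listing one floor per room.

```python
import pandas as pd

# Same input the answer above assumes: one RoomFloor entry per room.
df = pd.DataFrame({
    'HotelName': ['Hotel1', 'Hotel2'],
    'RoomType': ['Standard,Standard,Standard,Deluxe,Deluxe',
                 'Standard,Standard,Deluxe,Deluxe,Grande'],
    'RoomFloor': ['10F,10F,10F,20F,20F', '30F,30F,40F,40F,50F'],
    'RoomPrice': ['100,105,108,200,205', '90,95,250,240,300'],
})

cols = ['RoomType', 'RoomFloor', 'RoomPrice']

# Turn each comma-separated string into a list of strings.
split = df.assign(**{c: df[c].str.split(',') for c in cols})

# Multi-column explode needs lists of equal length within each row.
tidy = split.explode(cols, ignore_index=True)
print(tidy)
```

The values stay strings after the split; converting RoomPrice to numbers would be a separate step.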

I think a loop produces more readable code:

import pandas as pd

data = []
for idx, row in df.iterrows():
    room_types = pd.Series(row['RoomType'].split(','))
    room_floors = row['RoomFloor'].split(',')
    room_prices = row['RoomPrice'].split(',')
    # Pair each distinct room type (in order of first appearance) with one floor
    mapping = dict(zip(room_types.unique(), room_floors))
    # Expand the floors so there is one entry per room
    room_floors = room_types.map(mapping)
    for rm_type, rm_floor, rm_price in zip(room_types, room_floors, room_prices):
        data.append((row['HotelName'], rm_type, rm_floor, rm_price))


pd.DataFrame(data, columns=['HotelName', 'RoomType', 'RoomFloor', 'RoomPrice'])
Out[56]: 
  HotelName  RoomType RoomFloor RoomPrice
0    Hotel1  Standard       10F       100
1    Hotel1  Standard       10F       105
2    Hotel1  Standard       10F       108
3    Hotel1    Deluxe       20F       200
4    Hotel1    Deluxe       20F       205
5    Hotel2  Standard       30F        90
6    Hotel2  Standard       30F        95
7    Hotel2    Deluxe       40F       250
8    Hotel2    Deluxe       40F       240
9    Hotel2    Grande       50F       300

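It iterates over the rows of the DataFrame and, for each hotel, builds lists of room types, room floors, and room prices. mapping = dict(zip(room_types.unique(), room_floors)) creates a mapping between room types and room floors; this relies on the floors being listed in the same order in which the distinct room types first appear. Using this mapping, room_floors = room_types.map(mapping) creates a sequence with the same length as room_types. Now that room_types, room_floors, and room_prices all have the same length, you can iterate over them together and append each record as a tuple. Finally, the last line converts the list of tuples into a tidy DataFrame.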
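
To make the mapping step concrete, here is a small sketch that runs it in isolation on the Hotel1 values from the question. The variable names follow the answer above; expanded_floors is a name introduced here for illustration only.

```python
import pandas as pd

# Hotel1 values from the question, already split on commas.
room_types = pd.Series(['Standard', 'Standard', 'Standard', 'Deluxe', 'Deluxe'])
room_floors = ['10F', '20F']

# unique() keeps order of first appearance, so 'Standard' pairs with '10F'
# and 'Deluxe' pairs with '20F'.
mapping = dict(zip(room_types.unique(), room_floors))

# map() looks up the floor for every room, giving one floor per room.
expanded_floors = room_types.map(mapping)

print(mapping)
print(expanded_floors.tolist())
```

After this step the floor sequence has five entries, matching the five room types and five prices for Hotel1.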

I tried to reproduce a DataFrame that I believe matches the one posted:

import pandas as pd

raw_data = {'HotelName': ['Hotel1', 'Hotel2'],
            'RoomType': ['Standard,Standard,Standard,Deluxe,Deluxe', 'Standard,Standard,Deluxe,Deluxe,Grande'],
            'RoomFloor': ['10F,20F', '30F,40F,50F'],
            'RoomPrice': ['100,105,108,200,205', '90,95,250,240,300']}

data = pd.DataFrame(raw_data)

I think the ordered_set module may help here; hopefully the code below solves your problem:

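from ordered_set import OrderedSet  # third-party package; the module is 'ordered_set', not 'orderedset'

cols_ordered = ['HotelName', 'RoomType', 'RoomFloor', 'RoomPrice']

# Split every cell into a list; HotelName becomes a one-element list.
# (applymap is deprecated in pandas 2.1+, where DataFrame.map replaces it.)
data = data[cols_ordered].applymap(lambda x: x.split(','))
# Number of rooms per hotel: the length of the longest list in each row
dummies = data.applymap(len).max(axis=1)

for i in data.index:
    room_type, room_floor = data.loc[i, ['RoomType', 'RoomFloor']]
    # Map each distinct room type, in order of first appearance, to its floor
    type_floor_dict = dict(zip(OrderedSet(room_type), room_floor))
    # .at writes into the frame directly and avoids chained assignment
    data.at[i, 'RoomFloor'] = [type_floor_dict[t] for t in room_type]
    data.at[i, 'HotelName'] = data.at[i, 'HotelName'] * int(dummies[i])

# Build one small frame per hotel (one column per field), then stack them
new_data = [pd.DataFrame(data.loc[i].tolist(), index=cols_ordered).T for i in data.index]
new_data = pd.concat(new_data, ignore_index=True)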

print(new_data)
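
If installing ordered_set is not an option, the ordered de-duplication can probably be done with the standard library instead, since dict.fromkeys keeps keys in first-seen order (Python 3.7+). The sketch below is not from the answer; expand_floors is a hypothetical helper name, and the length check is an added safeguard because zip would otherwise silently drop unmatched items.

```python
def expand_floors(room_type, room_floor):
    """Hypothetical helper: repeat each floor for every room of the matching type."""
    # Distinct room types in order of first appearance (stands in for OrderedSet).
    distinct_types = list(dict.fromkeys(room_type))
    if len(distinct_types) != len(room_floor):
        raise ValueError('expected exactly one floor per distinct room type')
    type_floor_dict = dict(zip(distinct_types, room_floor))
    return [type_floor_dict[t] for t in room_type]


# Hotel2 values from the question, already split on commas.
hotel2_floors = expand_floors(
    ['Standard', 'Standard', 'Deluxe', 'Deluxe', 'Grande'],
    ['30F', '40F', '50F'],
)
print(hotel2_floors)  # ['30F', '30F', '40F', '40F', '50F']
```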
