When reading a CSV file, how do I get the rows for the most recent day, sorted by time in ascending order?

   label         uId  adId                 operTime  siteId  slotId  contentId  netType
0      0  u147333631  3887  2019-03-30 15:01:55.617      10      30       2137        1
1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1
2      0  u139816523  2084  2019-03-27 08:10:41.769      10      30       2336        1
3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4
4      0  u106642861  2295  2019-03-27 22:58:03.679       3      32       2567        4

# This is not real data, just an example.
   label         uId  adId                 operTime  siteId  slotId  contentId  netType
0      0  u147336431  3887  2019-04-04 00:08:42.315       1      54       2427        2
1      0  u146933269  1462  2019-04-04 01:06:16.417      30      36       1343        6
2      0  u139536523  2084  2019-04-04 02:08:58.079      15      23       1536        7
3      0  u106663472  1460  2019-04-04 03:21:13.050      32      45       1352        2
4      0  u121642861  2295  2019-04-04 04:36:08.653       3      33       3267        4

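3 Answers

User

#1 · Edited 2024-09-29 21:32:43

I'm assuming you can't read the whole file into memory and that the rows are in random order. You can read the file in chunks and iterate over those chunks.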

import pandas as pd

# read the file 500,000 rows at a time, parsing operTime as datetimes
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500_000,
    header=0
)

recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)
df_list = []

for chunk in reader:
    # keep only the rows that fall within the target day
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append the matching rows, if any, to the list
    if not date_rows.empty:
        df_list.append(date_rows)

# note: pd.concat raises ValueError if df_list is empty (no rows matched)
final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
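
The snippet above hard-codes the target date. If the most recent day is not known in advance, one possible variation is to track the latest day seen while scanning the chunks and discard rows from older days as you go. This is only a sketch under assumptions: the helper name `latest_day_rows` is made up for illustration, the file is assumed to be comma-separated, and every `operTime` value is assumed to parse as a timestamp. The sample rows are the ones from the question.

```python
import io
import pandas as pd


def latest_day_rows(source, time_col="operTime", chunksize=100_000):
    """Hypothetical helper: rows from the newest calendar day, sorted by time."""
    latest_day = None  # midnight Timestamp of the newest day seen so far
    kept = []          # per-chunk frames that belong to latest_day
    for chunk in pd.read_csv(source, parse_dates=[time_col], chunksize=chunksize):
        days = chunk[time_col].dt.normalize()  # truncate each timestamp to midnight
        chunk_latest = days.max()
        if pd.isna(chunk_latest):
            continue  # chunk had no usable timestamps
        if latest_day is None or chunk_latest > latest_day:
            latest_day = chunk_latest
            kept = []  # a newer day appeared, so drop rows from older days
        if chunk_latest == latest_day:
            kept.append(chunk[days == latest_day])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept).sort_values(time_col)


# Sample rows from the question, written as comma-separated text (an assumption)
sample = io.StringIO(
    "label,uId,adId,operTime,siteId,slotId,contentId,netType\n"
    "0,u147333631,3887,2019-03-30 15:01:55.617,10,30,2137,1\n"
    "0,u146930169,1462,2019-03-31 09:51:15.275,3,32,1373,1\n"
    "0,u139816523,2084,2019-03-27 08:10:41.769,10,30,2336,1\n"
    "0,u106546472,1460,2019-03-31 08:51:41.085,3,32,1371,4\n"
    "0,u106642861,2295,2019-03-27 22:58:03.679,3,32,2567,4\n"
)

# A tiny chunksize is used here only to exercise the multi-chunk path
result = latest_day_rows(sample, chunksize=2)
print(result)
```

This reads the file once and holds at most one day's worth of rows in memory, so it only helps if a single day fits in memory.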

User

#2 · Edited 2024-09-29 21:32:43

As @anky_91 mentioned, you can use the sort_values function. Here is a simple example:

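import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})
# convert the strings to datetimes so the sort is chronological, not alphabetical
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df.sort_values(by='Date')

Output:

  Symbol       Date
0      A 2015-02-20
2      A 2015-08-21
1      A 2016-01-15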

User

#3 · Edited 2024-09-29 21:32:43

Seconding what anky_91 said: sort_values() would be helpful here.

import pandas as pd

# parse operTime as datetimes so filtering and sorting are chronological
df = pd.read_csv('file.csv', parse_dates=['operTime'])

# >>> df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 0      0  u147333631  3887  2019-03-30 15:01:55.617      10      30       2137        1
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1
# 2      0  u139816523  2084  2019-03-27 08:10:41.769      10      30       2336        1
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4
# 4      0  u106642861  2295  2019-03-27 22:58:03.679       3      32       2567        4

# use >= so that a row stamped exactly at midnight is included
sub_df = df[(df['operTime'] >= '2019-03-31') & (df['operTime'] < '2019-04-01')]

# >>> sub_df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4

final_df = sub_df.sort_values(by=['operTime'])

# >>> final_df
#    label         uId  adId                 operTime  siteId  slotId  contentId  netType
# 3      0  u106546472  1460  2019-03-31 08:51:41.085       3      32       1371        4
# 1      0  u146930169  1462  2019-03-31 09:51:15.275       3      32       1373        1

I think you could also use a DatetimeIndex here; that may be necessary if the file is large enough.
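
For what it's worth, here is a rough sketch of what the DatetimeIndex approach might look like. It assumes a comma-separated file (an in-memory string stands in for it here, using the sample rows from the question) and makes `operTime` the index. Once the index is sorted, the most recent day can be derived from the data instead of being typed in, and selected with a date string in `.loc`.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file; comma-separated layout is an assumption
csv_text = io.StringIO(
    "label,uId,adId,operTime,siteId,slotId,contentId,netType\n"
    "0,u147333631,3887,2019-03-30 15:01:55.617,10,30,2137,1\n"
    "0,u146930169,1462,2019-03-31 09:51:15.275,3,32,1373,1\n"
    "0,u139816523,2084,2019-03-27 08:10:41.769,10,30,2336,1\n"
    "0,u106546472,1460,2019-03-31 08:51:41.085,3,32,1371,4\n"
    "0,u106642861,2295,2019-03-27 22:58:03.679,3,32,2567,4\n"
)

# Parse operTime and use it as the index, then sort ascending by time
df = pd.read_csv(csv_text, parse_dates=["operTime"], index_col="operTime").sort_index()

# The newest timestamp, truncated to its calendar day
latest_day = df.index.max().normalize()

# Selecting with a day-level date string returns every row on that day
last_day_df = df.loc[latest_day.strftime("%Y-%m-%d")]
print(last_day_df)
```

Because the frame was sorted by its index first, the selected rows are already in ascending time order and need no further `sort_values` call.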
