Strange np.issubdtype behavior

Posted 2024-10-01 04:44:52


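I am iterating over a file with 3.3 million rows, checking the data type of each value in a column, and performing an operation depending on whether the value is an integer.

Cell values such as a55950602 and a92300416 are handled without trouble: np.issubdtype returns False for them when tested against np.integer. For ga99266e, however, the check fails.

Code:

import pandas as pd
import numpy as np
import time
import math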

start_time = time.time()
lstNumberCounts = []
lstIllFormed = []

dfClicks = pd.read_csv('Oct3_distinct_Members.csv')
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].str.split('-').str[0]
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].apply(pd.to_numeric,errors='ignore')

for item in dfClicks['UNIV_MBR_ID']:
    if (np.issubdtype(item,np.integer)):
        lstNumberCounts.append(math.floor(math.log10(item))+1)
    else:
        lstIllFormed.append(item)


print("---Processing Time: %s seconds ---" % (time.time() - start_time))

The code runs fine for the first two values above, but for ga99266e the console shows the following error: TypeError: data type "ga99266e" not understood


1 answer

Answer 1 · posted 2024-10-01 04:44:52

pd.to_numeric with errors='ignore' returns either a numeric value or the input unchanged. So for 'ga99266e' it returns 'ga99266e', which is a string. If you pass a string to NumPy's issubdtype, it checks whether the string is the name of a dtype (for example, np.issubdtype('int', int) returns True).
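To see the string handling in isolation, here is a minimal sketch. The helper name is_integer_dtype_name is made up for this illustration and is not part of NumPy; it only wraps np.issubdtype and reports None when NumPy cannot parse the string as a dtype at all.

```python
import numpy as np

def is_integer_dtype_name(value):
    """Hypothetical helper: True/False if NumPy can parse `value` as a dtype,
    None if NumPy rejects it with a TypeError."""
    try:
        return bool(np.issubdtype(value, np.integer))
    except TypeError:
        return None

print(is_integer_dtype_name("int"))       # a dtype name for an integer type
print(is_integer_dtype_name("S8"))        # a valid type string (8 bytes), not an integer
print(is_integer_dtype_name("ga99266e"))  # not a dtype at all, so the TypeError is caught
```

The third call is the situation from the question: the string is neither a dtype name nor a valid type string, so NumPy raises instead of returning False.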

So you need to check first whether the field is still a string; only if it is not can you go on to check whether it is a NumPy integer.

Try:

import pandas as pd 
import numpy as np 
import time 
import math
start_time = time.time()
lstNumberCounts = []
lstIllFormed = []

dfClicks = pd.read_csv('Oct3_distinct_Members.csv')
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].str.split('-').str[0]
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].apply(pd.to_numeric,errors='ignore')

for item in dfClicks['UNIV_MBR_ID']:
    if not (isinstance(item,str)):
        if (np.issubdtype(item,np.integer)):
            lstNumberCounts.append(math.floor(math.log10(item))+1)
    else:
        lstIllFormed.append(item)


print("---Processing Time: %s seconds ---" % (time.time() - start_time))
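
As an alternative to the per-item loop, the same split can be sketched with vectorized pandas operations. This is not the code from the answer: the function name count_digits is hypothetical, and errors='coerce' is swapped in for errors='ignore' so that non-numeric values become NaN instead of staying as strings. It assumes the IDs are positive and small enough to be represented exactly as floats.

```python
import numpy as np
import pandas as pd

def count_digits(ids):
    """Hypothetical helper: return (digit counts of integer IDs, ill-formed IDs)."""
    # Keep only the part before the first '-', as in the original code
    prefix = ids.astype(str).str.split("-").str[0]
    # Non-numeric prefixes become NaN instead of raising or staying strings
    numeric = pd.to_numeric(prefix, errors="coerce")
    # Comparisons with NaN are False, so NaN rows drop out of the mask
    is_pos_int = numeric.notna() & (numeric > 0) & (numeric % 1 == 0)
    digit_counts = (np.floor(np.log10(numeric[is_pos_int])) + 1).astype(int).tolist()
    ill_formed = prefix[~is_pos_int].tolist()
    return digit_counts, ill_formed

sample = pd.Series(["55950602-1", "a92300416-2", "ga99266e-3", "123-4"])
counts, bad = count_digits(sample)
print(counts)
print(bad)
```

Because no string ever reaches np.issubdtype here, the TypeError from the question cannot occur, and on millions of rows the vectorized form is usually faster than a Python loop.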

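'a123456', or any string that starts with 'a' followed by digits, is accepted by np.issubdtype because NumPy reads it as an array-protocol type string: the first character names the kind of data and the digits that follow give the item size. See: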

Array-protocol type strings (see The Array Interface)

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are

'?' boolean

'b' (signed) byte

'B' unsigned byte

'i' (signed) integer

'u' unsigned integer

'f' floating-point

'c' complex-floating point

'm' timedelta

'M' datetime

'O' (Python) objects

'S', 'a' zero-terminated bytes (not recommended)

'U' Unicode string

'V' raw data (void)
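
The kind-plus-size parsing described above can be checked directly with np.dtype. The helper describe is a made-up name for this sketch; 'S8' stands in for the question's 'a'-prefixed strings, since the list marks 'a' as not recommended.

```python
import numpy as np

def describe(code):
    """Hypothetical helper: return (kind, itemsize in bytes) for a type string,
    or None if NumPy cannot interpret it."""
    try:
        dt = np.dtype(code)
    except TypeError:
        return None
    return dt.kind, dt.itemsize

for code in ("i4", "u2", "f8", "S8", "U10", "ga99266e"):
    print(code, describe(code))
```

Note that for 'U10' the number counts characters rather than bytes, so the reported item size is larger than 10.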
