TypeError when using (>= & <=) to manipulate/filter a DataFrame created by a groupby operation

Posted 2024-09-27 04:27:25


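I am trying to remove outliers from my dataset. It is real-estate data, so I used groupby to group the listings by area (the column is called "Zona" in the code) and computed the IQR of the prices within each "Zona". Now that I try to filter out the outliers with ">= & <=", I get a TypeError.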

Here is my code.

First, I created a new DataFrame containing only "Zona" and "Precio USD", and used box plots to check whether there were outliers.

#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    grp.get_group(key).plot.box(title=key)

After noticing many outliers in each "Zona", I computed the IQR and tried to use it to filter out the outliers per "Zona". Here is the code.

Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)
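For reference, the quantile of a grouped column comes back as a Series indexed by the group key, so it can be lined up with the original rows without reshaping. The following is only a minimal sketch on made-up data; the `toy` frame, its values, and the names `q1_by_zona` and `q1_per_row` are assumptions for illustration, not part of the real dataset.

```python
import pandas as pd

# Toy stand-in for gt_venta_precio_zona; the values are invented.
toy = pd.DataFrame({
    "Zona": ["Zona 1", "Zona 1", "Zona 1", "Zona 2", "Zona 2", "Zona 2"],
    "Precio USD": [100.0, 110.0, 120.0, 200.0, 220.0, 240.0],
})

# One first-quartile value per zone, indexed by the "Zona" label.
q1_by_zona = toy.groupby("Zona")["Precio USD"].quantile(0.25)

# Look up each row's zone label in that Series: one quartile value per row,
# carrying the same index as the original frame.
q1_per_row = toy["Zona"].map(q1_by_zona)

print(q1_by_zona)
print(q1_per_row)
```

Because `q1_per_row` shares the row index of `toy`, it can be compared element-wise against `toy["Precio USD"]`.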

Everything went fine until I tried to filter the DataFrame based on these values with the following code:

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
grp.loc[filter]

I get the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-117-09dffe5671dd> in <module>
      9 
     10 #Let's filter the dataset based on the IQR * +- 1.5
---> 11 filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
     12 grp.loc[filter]

TypeError: '<=' not supported between instances of 'float' and 'str'
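Note that `grp['Precio USD']` in the failing line is a grouped object, not a plain Series of prices. One possible way to express the same per-zone 1.5 * IQR rule is to compute row-aligned quartiles with `transform` and build the boolean mask on the original DataFrame. This is a sketch under stated assumptions, not a confirmed fix for the real data: the `toy` frame and its values are invented, and `mask` and `filtered` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for gt_venta_precio_zona; 900.0 is a deliberate outlier.
toy = pd.DataFrame({
    "Zona": ["Zona 1"] * 5 + ["Zona 2"] * 5,
    "Precio USD": [100.0, 105.0, 110.0, 115.0, 900.0,
                   200.0, 210.0, 220.0, 230.0, 240.0],
})

precio_by_zona = toy.groupby("Zona")["Precio USD"]

# transform broadcasts each zone's quartile back to every row of that zone,
# so q1, q3 and iqr all share the index of the original frame.
q1 = precio_by_zona.transform(lambda s: s.quantile(0.25))
q3 = precio_by_zona.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Compare the plain price column (not the grouped object) against the bounds.
mask = (toy["Precio USD"] >= q1 - 1.5 * iqr) & (toy["Precio USD"] <= q3 + 1.5 * iqr)

# Index the original DataFrame with the mask.
filtered = toy.loc[mask]
print(filtered)
```

The mask is named `mask` instead of `filter` only to avoid shadowing Python's built-in `filter` function.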

I checked the data type of all the price values under each "Zona" and they are all floats, as are the quantiles and the IQR.
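A check of this kind could look roughly like the sketch below, which inspects which columns are numeric and shows how a price column stored as text could be coerced. The `toy` frame and the `as_text` Series are invented for illustration; the real CSV may behave differently.

```python
import pandas as pd

# Toy frame: float prices, string zone labels (values are invented).
toy = pd.DataFrame({
    "Zona": ["Zona 1", "Zona 2", "Zona 2"],
    "Precio USD": [100.5, 250.0, 300.0],
})

# Report, per column, whether pandas treats it as numeric.
is_numeric = {col: pd.api.types.is_numeric_dtype(toy[col]) for col in toy.columns}
print(is_numeric)

# If a price column had been read as text, to_numeric with errors="coerce"
# converts what it can and turns unparseable entries into NaN.
as_text = pd.Series(["100.5", "abc", "300"])
as_float = pd.to_numeric(as_text, errors="coerce")
print(as_float.isna().sum())
```

In this toy frame the only non-numeric values are the "Zona" labels, which are also the group keys.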

I also tried converting everything to int, but I could not do that either, because I am working with a groupby object. So I am stuck here.

Any help would be greatly appreciated!

Also, here is my complete code (so far):

# Let's start by loading the dataset

# In[1]:


#Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')

# Read the CSV file into a DataFrame: df
gt_df = pd.read_csv('RE_Data_GT.csv')
gt_df.tail()


# Let's do some simple statistical analysis to understand the variables we have and their behaviour a little better.

# In[2]:


#Fill in the NaNs in the "Banos" column with the mean of that column
gt_df['Banos'] = gt_df['Banos'].fillna(gt_df['Banos'].mean())
gt_df.info()


# In[3]:


gt_df.describe()


# From the table above we can see that a few of the columns have data that is very spread out (high standard deviation). This is not necessarily bad: because we know the dataset, we understand that this could be caused by the two types of listings ('Venta' and 'Alquiler'). It makes sense to see variance if we look at rental and sale prices at the same time.
# 
# Now let's move to one of the most exciting parts, which is some exploratory data analysis (EDA). But before we do that, I think that with the information above it would make sense to have two different dataframes one for rentals and other for home sales. 

# In[4]:


gt_alquiler = gt_df[gt_df['Tipo de listing'] == 'Alquiler']
gt_venta = gt_df[gt_df['Tipo de listing'] == 'Venta']
gt_alquiler.info()
gt_venta.info()


# Excellent, it seems like we have 2128 data points for 'Alquiler'(rental) and 3004 for 'Venta' (sales). Now that we have our 2 dataframes, we can actually start to do some EDA, we'll start by looking at home sales (Tipo de listing =='Venta').

# In[5]:


_ = gt_venta['Precio USD'].plot.hist(title = 'Distribucion de Precios de Venta', colormap='Pastel2')
_ = plt.xlabel('Price in USD')


# In[6]:


#Declare a function to compute the ECDF of an array
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, len(data)+1) / n

    return x, y


# In[7]:


#Create Variable to pass to the ECDF function
gt_venta_precio = gt_venta['Precio USD']

#Compute the ECDF of the sale prices
x, y = ecdf(gt_venta_precio)

# Generate plot
_ = plt.plot(x, y, marker='.', linestyle='none')

# Add title and label the axes
_ = plt.title('ECDF de Precio en USD')
_ = plt.xlabel('Precio en USD')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()


# Apparently there are a few outliers that require our attention. To better understand these points, it's better if we group them by a 'zona' (zona/area) to see which listing has such a high price. 
# 
# Let's start to understand the specific outliers by grouping the listings by "Zona" and then using a box plot for each to review each in more detail.

# In[8]:


#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    grp.get_group(key).plot.box(title=key)


# In[14]:


Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= (Q1 - 1.5 * IQR)) & (grp['Precio USD'] <= (Q3 + 1.5 *IQR))
grp.loc[filter]

The dataset can be downloaded here: https://drive.google.com/file/d/1JXDm9iYem4DlMoIjx4f7yWBuwjaLRThe/view?usp=sharing

