
Published 2024-09-27 04:27:25





#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    pass  # the loop body (the plotting code) is missing from the original post


Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)


#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)


TypeError                                 Traceback (most recent call last)
<ipython-input-117-09dffe5671dd> in <module>
     10 #Let's filter the dataset based on the IQR * +- 1.5
---> 11 filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
     12 grp.loc[filter]

TypeError: '<=' not supported between instances of 'float' and 'str'
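One possible reading of this error, offered as a sketch rather than a definitive diagnosis: `grp['Precio USD']` is a `SeriesGroupBy` object, not a `Series`, so it has no element-wise comparison of its own. When it is compared against a NumPy array, NumPy most likely tries to turn the groupby object into an array by iterating over it, and iterating a groupby yields `(key, group)` pairs whose keys are the "Zona" names (strings). That would explain a float being compared with a `str`. The toy data below is made up for illustration and only stands in for `gt_venta`.

```python
import pandas as pd

# Toy data standing in for gt_venta (values are invented for illustration)
df = pd.DataFrame({
    'Zona': ['Zona 1', 'Zona 1', 'Zona 2', 'Zona 2'],
    'Precio USD': [100.0, 120.0, 10.0, 12.0],
})

grp = df.groupby('Zona')
prices = grp['Precio USD']

# This is a groupby object, not a Series of prices
print(type(prices).__name__)

# Iterating it yields (key, sub-Series) pairs; the key is the "Zona" label
pairs = list(prices)
for key, group in pairs:
    print(repr(key), type(group).__name__, len(group))

# The per-group quantile has one value per "Zona", not one per listing
q1 = prices.quantile(0.25)
print(len(q1), len(df))
```

Note also that `Q1`, `Q3` and `IQR` hold one value per "Zona", while the listings have one row each, so the two sides of the comparison do not line up row by row even before the type problem.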





# Let's start by loading the dataset

# In[1]:

#Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')

# Read the CSV file into a DataFrame: df
gt_df = pd.read_csv('RE_Data_GT.csv')

# Let's do some simple statistical analysis to understand the variables we have and their behaviour a little better.

# In[2]:

#Fill in NaNs in the "Banos" column with the mean of the column
gt_df['Banos'] = gt_df['Banos'].fillna(gt_df['Banos'].mean())

# In[3]:


# From the table above we can see that a few of the columns are very spread out (high standard deviation). This is not necessarily bad: knowing the dataset, we understand that it could be caused by the two types of listings ('Venta' and 'Alquiler'), and it makes sense to see high variance when rental and sale prices are looked at together.
# Now let's move to one of the most exciting parts: exploratory data analysis (EDA). Before we do that, given the information above, it makes sense to have two separate dataframes, one for rentals and one for home sales.

# In[4]:

gt_alquiler = gt_df[gt_df['Tipo de listing'] == 'Alquiler']
gt_venta = gt_df[gt_df['Tipo de listing'] == 'Venta']

# Excellent, it seems like we have 2128 data points for 'Alquiler' (rentals) and 3004 for 'Venta' (sales). Now that we have our two dataframes, we can start doing some EDA, beginning with home sales (Tipo de listing == 'Venta').

# In[5]:

_ = gt_venta['Precio USD'].plot.hist(title = 'Distribucion de Precios de Venta', colormap='Pastel2')
_ = plt.xlabel('Price in USD')

# In[6]:

#Declare a function to compute the ECDF of an array
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, len(data)+1) / n

    return x, y

# In[7]:

#Create Variable to pass to the ECDF function
gt_venta_precio = gt_venta['Precio USD']

#Compute ECDF for the sale prices
x, y = ecdf(gt_venta_precio)

# Generate plot
_ = plt.plot(x, y, marker='.', linestyle='none')

# Add title and label the axes
_ = plt.title('ECDF de Precio en USD')
_ = plt.xlabel('Precio en USD')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

# Apparently there are a few outliers that require our attention. To better understand these points, we can group the listings by "Zona" (area) to see which ones have such a high price.
# Let's start by grouping the listings by "Zona" and then using a box plot for each group to review it in more detail.

# In[8]:

#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    pass  # the loop body (the plotting code) is missing from the original post

# In[14]:

Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= (Q1 - 1.5 * IQR)) & (grp['Precio USD'] <= (Q3 + 1.5 *IQR))
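A minimal sketch of one way to get the per-"Zona" IQR filter the code above is aiming for, assuming the goal is to keep only listings whose price lies within 1.5 * IQR of the quartiles of their own "Zona". It uses `groupby(...).transform(...)`, which returns a Series aligned with the original rows, so the comparison is made on the dataframe column instead of on the groupby object. The helper name `filter_iqr_by_group`, its parameters, and the toy data are hypothetical and not part of the original post.

```python
import pandas as pd

def filter_iqr_by_group(df, value_col, group_col, k=1.5):
    """Keep rows whose value is within [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = df.groupby(group_col)[value_col]
    # transform() broadcasts each group's quartile back to every row of that group
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    # Compare the plain column (a Series) against the row-aligned bounds
    mask = (df[value_col] >= q1 - k * iqr) & (df[value_col] <= q3 + k * iqr)
    return df[mask]

# Toy data (invented for illustration): one obvious outlier in "Zona 1"
toy = pd.DataFrame({
    'Zona': ['Zona 1'] * 5 + ['Zona 2'] * 5,
    'Precio USD': [100.0, 110.0, 120.0, 130.0, 1000.0,
                   10.0, 11.0, 12.0, 13.0, 14.0],
})

filtered = filter_iqr_by_group(toy, 'Precio USD', 'Zona')
print(filtered)
```

With the dataframe from the question, the call would look like `filter_iqr_by_group(gt_venta_precio_zona, 'Precio USD', 'Zona')`. Storing the boolean result under a name such as `mask` also avoids shadowing Python's built-in `filter`.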


