如何像excel(在这种情况下是工作表)那样搜索重复项,然后在数据框中突出显示它们?

2024-09-30 16:37:16 发布

您现在位置:Python中文网/ 问答频道 /正文

我想在数据框中突出显示一些值。我需要比较不同数据帧(df1和df2)上的两列,然后突出显示重复的值,并在第一个数据帧df1中显示它们

为了让您了解这一点,在excel中,您可以通过使用countif公式实现这一点,下面是一个视频:

https://www.youtube.com/watch?v=VhECzNIQTIY

有什么办法可以让熊猫做到这一点?或者一般的python

谢谢


更新

代码如下:

import pandas as pd

#Exporting raw data from a csv file
DataOrigin = pd.read_csv('RAWDATA.csv')
#Sorting raw data per interesting columns
DataOriginSorted = DataOrigin.sort_values(['srcip','attack','dstip'])
#Exporting some columns of historical data and sorting them
Historicaldata2 = pd.read_excel('Historicaldata.xlsx', sheet_name=1, usecols = ['Source_IP','Ticket','Customer_Notification','Hostname','Service_desk_ticket'])
Historicaldata2Sorted = Historicaldata2.sort_values(['Source_IP','Ticket'])
#Creating a multindex variable with sorted raw data
index = pd.MultiIndex.from_frame(DataOriginSorted)
Sorted_DataOrigin = pd.DataFrame(index=index)
#Making a count of events per source IP and exporting them as a csv for the code to work (rename column oepration)
Daily_IncidentsIPS = pd.crosstab(DataOrigin.srcip,DataOrigin.attack).to_csv('ControlFile1.csv')
Daily_IncidentsIPS = pd.read_csv('ControlFile1.csv').rename(columns = {'srcip': 'Source_IP'}, inplace = False )
#Mergin 2 dataframes to find coincident data and exporting them to a csv for the next operations to take place and using only interesting columns
Historical2vsSortedOrigin = Historicaldata2Sorted.merge(Daily_IncidentsIPS,left_on='Source_IP',right_on='Source_IP', how='inner').to_csv('ControlFile2.csv')
Historical2vsSortedOrigin = pd.read_csv('ControlFile2.csv', usecols = ['Ticket','Hostname','Source_IP','Customer_Notification','Service_desk_ticket'])
#Searching for duplicated data between two interesting dataframes
duplicated = Daily_IncidentsIPS['Source_IP'].isin(Historical2vsSortedOrigin['Source_IP'])
#Creating a rule to color the rows where the duplicated values are present
def row_styler(row):
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)

#Creating a multindex variable to show the data as I want it
index2 = pd.MultiIndex.from_frame(Historical2vsSortedOrigin)
IncidentMatching = pd.DataFrame(index=index2)
#Saving 3 interesting dataframes in an excel file, highlighting the results of previous "search for duplicated" operation
writer = pd.ExcelWriter('C:\\Users\myuser\Documents\Spyder\Results_IPS.xlsx', engine='xlsxwriter')
Daily_IncidentsIPS.style.apply(row_styler, axis=1).to_excel(writer, sheet_name='Sheet1')
Sorted_DataOrigin.to_excel(writer, sheet_name='Sheet2')
IncidentMatching.to_excel(writer, sheet_name='Sheet3')
writer.save()


Tags: csvthetonameipsourcedataexcel
1条回答
网友
1楼 · 发布于 2024-09-30 16:37:16

您可以在熊猫中使用样式系统:

# Some mock data
df1 = pd.DataFrame({
    'Name': ['David', 'Sue', 'Mary'],
    'Location': ['San Francisco', 'New York', 'Boston']
})

df2 = pd.DataFrame({
    'Name': ['Sue', 'Mary', 'Joe', 'Jack']
})

# Now determine what Name is duplicated across the frames
duplicated = df1['Name'].isin(df2['Name'])

下一步取决于要突出显示的内容。如果只想突出显示重复值(如Excel中的值):

def col_styler(col):
    if col.name != 'Name':
        return [''] * len(col)

    return duplicated.map({
        True: 'background-color: yellow',
        False: ''
    })

df1.style.apply(col_styler)

输出:

enter image description here

如果要高亮显示整行,请执行以下操作:

def row_styler(row):
    return ['background-color: yellow' if duplicated[row.name] else ''] * len(row)

df1.style.apply(row_styler, axis=1)

输出:

enter image description here

相关问题 更多 >