How to use a regular expression to find a value in a DataFrame column

      0             1        2       3        ...
9     A County      13.789   (+22)   1.566,0
10    My County     16.581   (+45)   3.040,0
11    Their County  7.445    (+15)   2.821,6
...
55    Gesamt        304.950  (+820)  2.747,2

# Open LGA reports for yesterday and the day before
# TO DO: Sometimes the LGA report is named COVID_Lagebericht_LGA_yymmdd.pdf or it ends in _01
# Add in a try/else statement to compensate for this
rptyes = f'Reports_LGA/{yday_yymmdd}_COVID_Tagesbericht_LGA.pdf'
rptdbf = f'Reports_LGA/{daybef_yymmdd}_COVID_Tagesbericht_LGA.pdf'

# Read the LGA reports into dataframes.
dfyes = camelot.read_pdf(rptyes, pages='2', flavor='stream')
dfdbf = camelot.read_pdf(rptdbf, pages='2', flavor='stream')

# Extract the statewide 7-D-I
# TO DO: Sometimes the last line says "Gesamt", sometimes "Gesamtergebnis" or something else.
# Add in some sort of error checking or try/else statement or regular expression to compensate
landindexyes = lambda land: dfyes[0].df.loc[dfyes[0].df[0] == land].index[0]
landindexdbf = lambda land: dfdbf[0].df.loc[dfdbf[0].df[0] == land].index[0]
land = 'Gesamt'
bwname = 'Baden-Württemberg'
bwcases = int(dfyes[0].df.loc[landindexyes(land), 1].replace('.',''))
bwcasesdiff = dfyes[0].df.loc[landindexyes(land), 2]
bwdeaths = int(dfyes[0].df.loc[landindexyes(land), 4].replace('.',''))
bwdeathsdiff = dfyes[0].df.loc[landindexyes(land), 5]
bw7diyes = float(dfyes[0].df.loc[landindexyes(land), 7].replace(',','.'))
bw7didbf = float(dfdbf[0].df.loc[landindexdbf(land), 7].replace(',','.'))
bw7didiff = bw7diyes - bw7didbf
rptrowsbw = [bwname, bwcases, bwcasesdiff, bwdeaths, bwdeathsdiff, bw7diyes, bw7didbf]

1 Answer

User

#1 · Posted 2024-10-04 11:30:00

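Unfortunately, I can't see your DataFrame, so I can't write a line that is guaranteed to be correct. I'd point you to the first answer here: Filtering DataFrame by finding exact word (not combined) in a column of strings

So in your case, something like: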

df[df["column_name"].str.contains(r'(?:\s|^)Gesamt(?:\s|$)')]

or

df[df["column_name"].str.contains(r'(?:\s|^)Gesamtergebnis(?:\s|$)')]
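
Both spellings can also be covered by a single pattern. The following is only a minimal sketch, not the asker's actual data: the frame `df` is a hypothetical stand-in for `dfyes[0].df`, assuming camelot-style string cells and integer column labels, and the pattern assumes the total row is labelled either `Gesamt` or `Gesamtergebnis`.

```python
import pandas as pd

# Hypothetical stand-in for dfyes[0].df (string cells, integer column labels).
df = pd.DataFrame({
    0: ["A County", "My County", "Their County", "Gesamtergebnis"],
    1: ["13.789", "16.581", "7.445", "304.950"],
    2: ["(+22)", "(+45)", "(+15)", "(+820)"],
})

# "Gesamt", optionally followed by "ergebnis", with nothing else in the cell
# apart from surrounding whitespace.
pattern = r"^\s*Gesamt(?:ergebnis)?\s*$"

# na=False treats missing cells as "no match" instead of propagating NaN.
mask = df[0].str.contains(pattern, regex=True, na=False)
total_row = df[mask]

# Index label of the first matching row, usable with .loc as in the question.
total_index = total_row.index[0]
bwcases = int(df.loc[total_index, 1].replace(".", ""))
```

If the label can take other forms, the optional group in the pattern would need to be extended accordingly.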

If you are not sure whether the spelling in your data set is correct, you could try a fuzzy matching algorithm such as Fuzzy-Wuzzy: https://www.datacamp.com/community/tutorials/fuzzy-string-python
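
To illustrate the idea of fuzzy matching, here is a sketch that uses `difflib` from the Python standard library as a stand-in for Fuzzy-Wuzzy (a different tool from the one linked above, chosen here only to avoid an extra dependency). The sample labels, the misspelling `Gesammt`, and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib

import pandas as pd

# Hypothetical labels; the last one is deliberately misspelled.
df = pd.DataFrame({
    0: ["A County", "My County", "Their County", "Gesammt"],
    1: ["13.789", "16.581", "7.445", "304.950"],
})

labels = df[0].tolist()

# Return at most one label whose similarity to "Gesamt" is at least 0.8.
matches = difflib.get_close_matches("Gesamt", labels, n=1, cutoff=0.8)

if matches:
    # Index label of the first row carrying the closest label.
    total_index = df.index[df[0] == matches[0]][0]
else:
    total_index = None  # nothing similar enough was found
```

A lower cutoff accepts more distant spellings but also raises the risk of picking up a county name by mistake, so the threshold is worth checking against real reports.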

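Edit (from the comments): since the regular expression slows the code down considerably, how about changing every "Gesamtergebnis" value in the column to "Gesamt" instead? You could then use the following in the TO DO section: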

df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis', 'Gesamt', regex=False)

Then continue with your code.
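
Combined with the exact-match lookup from the question, the normalization approach might look like the sketch below. The frame `df` is again a hypothetical stand-in for `dfyes[0].df`, and the extra `str.strip()` call is an assumption that stray whitespace may appear in the extracted cells.

```python
import pandas as pd

# Hypothetical stand-in for dfyes[0].df with the longer label in the last row.
df = pd.DataFrame({
    0: ["A County", "My County", "Their County", "Gesamtergebnis"],
    1: ["13.789", "16.581", "7.445", "304.950"],
})

# Normalize the label once; regex=False requests a plain substring replacement.
df[0] = df[0].str.strip().str.replace("Gesamtergebnis", "Gesamt", regex=False)

# The question's exact-match lookup now works for both spellings.
landindex = lambda land: df.loc[df[0] == land].index[0]

land = "Gesamt"
bwcases = int(df.loc[landindex(land), 1].replace(".", ""))
```

Passing `regex=False` makes the intent explicit, because the default value of that parameter has differed between pandas versions.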
