Checking for abbreviations in a DataFrame column

Posted 2024-06-28 10:06:21


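What is the most efficient way to identify and count the abbreviations that follow the words they stand for, and write the count to a new column, but only when the abbreviation is correct?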

Expected output:

|-------Name---------------------------||-Count-|
This is Ante Meridian (AM) not included||   3   |         
This is Ante Meridian (AM)     included||   3   |     
This is Ante Meridian (AM) not included||   3   |     
Extra module with Post Meridian (PM)   ||   1   |     
Post Meridian (PO) is not available    ||   0   |  #Mismatch   

1 answer
User
#1 · Posted 2024-06-28 10:06:21

First, use regular expressions to check whether the letters inside () match the initials of the two words that come before them.

import numpy as np
import pandas as pd

# Get the two words immediately before "("
wordsbefore = df['Name'].str.extract(r'(\w+) (\w+) (?=\()')

# Join the first letter of each word to build the abbreviation expected in ()
check = wordsbefore[0].str[0] + wordsbefore[1].str[0]

# Keep the text inside () when it matches the expected abbreviation, otherwise 0
abbr = df['Name'].str.extract(r'\((.*?)\)', expand=False)
df['count'] = np.where(abbr == check, abbr, 0)

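You now have a df in which the acronym sits in its own column, or 0 where it does not match. All that remains is to replace each acronym with its count.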

df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)

              Name                          count
0   This is Ante Meridian (AM) not included   3.0
1   This is Ante Meridian (AM) included       3.0
2   This is Ante Meridian (AM) not included   3.0
3   Extra module with Post Meridian (PM)      1.0
4   Post Meridian (PO) is not available       0.0

A row that contains no () at all also ends up with 0.
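
For reference, the two steps above can be combined into one self-contained sketch that rebuilds the sample data from the question. The helper name `count_valid_abbreviations` is hypothetical and not part of the original answer, and `Series.where` is swapped in for `np.where` so that mismatches become missing values rather than 0 before counting. It assumes two-letter abbreviations and case-sensitive matching, as in the code above.

```python
import pandas as pd


def count_valid_abbreviations(names: pd.Series) -> pd.Series:
    """Count how often each correctly formed two-letter abbreviation occurs.

    Hypothetical helper: rows whose abbreviation is missing or does not
    match the initials of the two preceding words get a count of 0.
    """
    # Two words immediately before "(" and the text inside "()"
    words = names.str.extract(r'(\w+) (\w+) (?=\()')
    abbr = names.str.extract(r'\((.*?)\)', expand=False)

    # Expected abbreviation: first letter of each of the two words
    expected = words[0].str[0] + words[1].str[0]

    # Keep only abbreviations that match; everything else becomes missing
    valid = abbr.where(abbr == expected)

    # Map each valid abbreviation to its frequency; missing values become 0
    return valid.map(valid.value_counts()).fillna(0).astype(int)


df = pd.DataFrame({'Name': [
    'This is Ante Meridian (AM) not included',
    'This is Ante Meridian (AM) included',
    'This is Ante Meridian (AM) not included',
    'Extra module with Post Meridian (PM)',
    'Post Meridian (PO) is not available',
]})
df['count'] = count_valid_abbreviations(df['Name'])
print(df)
```

Casting with `astype(int)` is a design choice of this sketch; dropping it leaves the counts as floats, as in the output shown above.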


To handle abbreviations of three or more letters, follow the same pattern inside a loop:

import re

import numpy as np
import pandas as pd

acy = re.compile(r'\((.*?)\)')
twoWords = re.compile(r'(\w+) (\w+) (?=\()')
threeWords = re.compile(r'(\w+) (\w+) (\w+) (?=\()')

acyList = []

# Build the expected abbreviation from the words before () for every row
for value in df['Name']:
    # The length of the text in () decides whether to check 2 or 3 words
    getAcy = acy.search(value) if isinstance(value, str) else None
    if getAcy is None:
        # No () in this row (or the value is not a string)
        acyList.append(np.nan)
        continue

    if len(getAcy[1]) == 2:
        words = twoWords.search(value)
    elif len(getAcy[1]) == 3:
        words = threeWords.search(value)
    else:
        words = None

    if words is None:
        # Unsupported length, or too few words before ()
        acyList.append(np.nan)
    else:
        # Join the first letter of each captured word
        acyList.append(''.join(word[0] for word in words.groups()))

# Align the expected abbreviations with df before comparing
expected = pd.Series(acyList, index=df.index)
abbr = df['Name'].str.extract(r'\((.*?)\)', expand=False)
df['count'] = np.where(abbr == expected, abbr, 0)
df['count'] = df['count'].map(dict(df[df['count']!=0]['count'].value_counts())).fillna(0)
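
The loop above needs one pattern per abbreviation length. As a possible generalization, the sketch below takes as many words before "(" as the abbreviation has letters, so any length is handled by the same code. The helper names `valid_abbreviation` and `count_abbreviations_any_length` and the extra sample rows are hypothetical and not from the original answer. It also assumes that any run of word characters before "(" counts as a word, which is slightly looser than the space-separated patterns above.

```python
import re

import pandas as pd


def valid_abbreviation(text):
    """Return the abbreviation in () if it matches the initials of the
    words right before it, otherwise None (hypothetical helper)."""
    if not isinstance(text, str):
        return None
    match = re.search(r'\((\w+)\)', text)
    if match is None:
        return None
    abbr = match[1]
    # Take as many words before "(" as the abbreviation has letters
    words = re.findall(r'\w+', text[:match.start()])[-len(abbr):]
    initials = ''.join(word[0] for word in words)
    # Case-sensitive comparison, as in the code above
    return abbr if initials == abbr else None


def count_abbreviations_any_length(names: pd.Series) -> pd.Series:
    """Count occurrences of each valid abbreviation; invalid rows get 0."""
    valid = names.map(valid_abbreviation)
    return valid.map(valid.value_counts()).fillna(0).astype(int)


df = pd.DataFrame({'Name': [
    'This is Ante Meridian (AM) not included',
    'Extra module with Post Meridian (PM)',
    'Post Meridian (PO) is not available',
    'Sent As Soon As Possible (ASAP) today',
    'Uses Application Programming Interface (API) calls',
    'Another Application Programming Interface (API)',
    'No abbreviation here',
]})
df['count'] = count_abbreviations_any_length(df['Name'])
print(df)
```

If there are fewer words before "(" than letters in the abbreviation, the initials come out shorter than the abbreviation, so the row is treated as a mismatch and gets 0.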
