Create a new column by matching content

Posted 2024-09-28 22:53:34

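Hello, I have a dataset in which I want to match my keywords against a location column. The problem is that in my dataset the locations "Afghanistan", "Kabul", or "Helmand" appear in more than 150 combinations, including misspellings, different capitalization, and names of cities or towns. What I want is a separate column that returns 1 if the location contains any of the substrings "Afg" or "afg", "Kab" or "kab", "Helm" or "helm". I am not sure whether case makes a difference.

For example, there are hundreds of location combinations like these: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan

I tried the code below, which works fine when a location matches a keyword exactly, but there are too many variations to write out every exception.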

keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']


#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# show the first rows flagged with 1 (the column holds integers, so compare with 1, not '1')

taleban_2[taleban_2['keyword_solution'] == 1].head(5)

I just need to replace this logic so that every location matching "Afg" or "afg", "Kab" or "kab", "Kund" or "kund" is flagged in the keyword_solution column.


1 answer
User
#1 · Posted 2024-09-28 22:53:34

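Given the following:

  • Sentences from The New York Times
  • Remove all non-alphanumeric characters
  • Change everything to lowercase, which removes the need for separate capitalization variants of each word
  • Split each sentence into a list or set. I used a set because the sentences are long
  • Add to the keywords list as needed
  • Match the words from the two lists
    • 'afgh' in ['afghanistan'] is False
    • 'afgh' in 'afghanistan' is True
    • Therefore, the list comprehension searches for each keyword in each word of word_list
    • [True if word in y else False for y in x for word in keywords]
    • This allows the keyword list to be shorter (i.e. given afgh, afghanistan is not needed)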
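
The membership distinction in the list above can be checked directly. This is a minimal sketch; the sample `word_set` is made up for illustration and is not part of the original data.

```python
# `in` on a list tests for an equal element; `in` on a string tests for a substring.
keyword = 'afgh'
in_list = keyword in ['afghanistan']   # no list element equals 'afgh'
in_string = keyword in 'afghanistan'   # 'afgh' is a substring of 'afghanistan'
print(in_list, in_string)

# Hypothetical word set, as produced by lowercasing and splitting a sentence.
word_set = {'troops', 'out', 'of', 'afghanistan'}
keywords = ['afgh', 'kab']

# Search for every keyword inside every word of the set.
matches = [word in y for y in word_set for word in keywords]
has_match = any(matches)
print(has_match)
```

Because the test is a substring test on each word, a short stem covers every spelling that contains it, which is why the keyword list can stay short.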
import re
import pandas as pd

# 'mazar' was listed twice; one entry is enough
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']

df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
                                 'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
                                 'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
                                 'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
                                 '“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
                                 'afghan']})

# replace each run of non-alphanumeric characters (and underscores) with a space
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))

# create a new column with the set of lowercased words in each sentence
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))

# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))

# final
print(df.location)

0     True
1    False
2    False
3     True
4     True
5     True
Name: location, dtype: bool
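
The question asked for a 0/1 column on a `location` field rather than a boolean. As an alternative sketch (not the approach used in the answer above), pandas' `Series.str.contains` can do a case-insensitive substring search in one step. The sample rows below are hypothetical stand-ins for the asker's `taleban_2` data, and the stems are the ones mentioned in the question.

```python
import pandas as pd

# Hypothetical sample rows standing in for the asker's `taleban_2` data.
taleban_2 = pd.DataFrame({'location': ['Jegdalak, Afghanistan',
                                       'AFGHANISTAN, Ghazni',
                                       'kabul/afghanistan',
                                       'Helmand province',
                                       'Peshawar, Pakistan',
                                       None]})

# Short stems from the question; extend the list as needed.
stems = ['afg', 'kab', 'helm']
pattern = '|'.join(stems)  # regex alternation: 'afg|kab|helm'

# case=False ignores capitalization; na=False treats missing locations as no match.
taleban_2['keyword_solution'] = (
    taleban_2['location']
    .str.contains(pattern, case=False, na=False)
    .astype(int)
)

print(taleban_2)
```

With `case=False` there is no need to list both "Afg" and "afg". Two caveats: a stem matches anywhere in the string, so 'helm' would also flag a value containing "helmet", and a stem containing regex metacharacters should be passed through `re.escape` before joining.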
