Creating a new column based on matched content

keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar', 'Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan', 'kabul', 'herat', 'jalalabad', 'kandahar']

# How to make a column that flags rows containing a certain keyword
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# The line below returns the rows flagged with 1.
# The column holds integers, so compare against 1 rather than the string '1'.
taleban_2[taleban_2['keyword_solution'] == 1].head(5)

1 answer

User

#1 · Posted on 2024-09-28 22:53:34

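Given the following:

Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, which removes the need to list differently cased variants of each word
Split each sentence into a list or a set. I used a set because the sentences are long
Add to the keywords list as needed
Match the words from the two lists
- 'afgh' in ['afghanistan']: False
- 'afgh' in 'afghanistan': True
- Therefore, the list comprehension searches for each keyword inside each word of word_list
- [True if word in y else False for y in x for word in keywords]
- This allows the keywords list to be shorter (i.e. given afgh, afghanistan is not needed)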
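
The difference between the two membership tests above can be checked in isolation. The following is a minimal sketch; the helper name has_keyword and the small sample set are made up for illustration and are not part of the answer's code.

```python
# Hypothetical sample data, for illustration only
keywords = ['afgh', 'kab']
word_list = {'afghanistan', 'troops', 'kabul'}

# "in" on a list compares whole elements, so a prefix does not match
print('afgh' in ['afghanistan'])

# "in" on a string is a substring test, so a prefix does match
print('afgh' in 'afghanistan')

def has_keyword(words, keywords):
    """Return True if any keyword occurs as a substring of any word."""
    return any(keyword in word for word in words for keyword in keywords)

print(has_keyword(word_list, keywords))
print(has_keyword({'troops', 'syria'}, keywords))
```

Because the test is a substring test, very short keywords can match unrelated words, so the shortened keywords are worth checking against the actual data.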

import re
import pandas as pd

keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']

df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
                                 'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
                                 'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
                                 'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
                                 '“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
                                 'afghan']})

# replace runs of non-alphanumeric characters with a single space
# (raw string so that \W is not treated as an invalid escape sequence)
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))

# create a new column with the set of lowercased words in each sentence
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))

# flag a row if any keyword is a substring of any word in its word_list
df['location'] = df.word_list.apply(lambda x: any(word in y for y in x for word in keywords))

# final
print(df.location)

0     True
1    False
2    False
3     True
4     True
5     True
Name: location, dtype: bool
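
As an alternative that is not part of the answer above, the same substring check can probably be done without building word_list, using pandas' Series.str.contains with the keywords joined into one regular expression. This is a sketch under that assumption; the demo frame and its three sentences are invented for illustration.

```python
import re
import pandas as pd

keywords = ['jalalabad', 'kunduz', 'lashkargah', 'mazar',
            'herat', 'afgh', 'kab', 'kand']

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({'sentences': ['Troops left Afghanistan.',
                                   'Talks continued in Syria.',
                                   'The Kabul administration responded.']})

# Join the keywords into one alternation pattern; re.escape guards against
# keywords that contain regex metacharacters
pattern = '|'.join(re.escape(k) for k in keywords)

# case=False makes the match case-insensitive, so no lowercasing step is needed
demo['location'] = demo['sentences'].str.contains(pattern, case=False, regex=True)

# Convert to 1/0 if an integer flag column is wanted, as in the question
demo['keyword_solution'] = demo['location'].astype(int)

print(demo[['location', 'keyword_solution']])
```

Rows with a flag of 1 can then be selected with demo[demo['keyword_solution'] == 1].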
