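I'm doing this for a project in which I need to do some web scraping, specifically from Wikipedia. Something that used to work has suddenly stopped working. The program needs to tell me the profession of the person the user enters, based on their Wikipedia article, and the method I use is: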
#Finding their profession
#Declaring keywords for each profession
sportspersonKeywords = ['Sportsperson', 'Sportsman', 'Sportsman', 'Sports', 'Sport', 'Coach', 'Game', 'Olympics', 'Paralympics', 'Medal', 'Bronze', 'Silver', 'Gold', 'Player', 'sportsperson', 'sportsman', 'sportsman', 'sports', 'sport', 'coach', 'game', 'olympics', 'paralympics', 'medal', 'bronze', 'silver', 'gold', 'player', 'footballer', 'Footballer']
scientistKeywords = ['Scientist', 'Mathematician', 'Chemistry', 'Biology', 'Physics', 'Nobel Prize', 'Invention', 'Discovery', 'Invented', 'Discovered', 'science', 'scientist', 'mathematician', 'chemistry', 'biology', 'physics', 'nobel prize', 'invention', 'discovery', 'invented', 'discovered', 'science', 'Physicist', 'physicist', 'chemist', 'Chemist', 'Biologist', 'biologist']
politicianKeywords = ['Politician', 'Politics', 'Election', 'President', 'Vice-President', 'Vice President', 'Senate', 'Senator', 'Representative', 'Democracy', 'politician', 'politics', 'election', 'president', 'vice-president', 'vice president', 'senate', 'senator', 'representative', 'democracy']
#Declaring the first sentence (from the summary)
firstSentence = summary.split('.')[0]
profession = ['Scientist', 'Sportsperson', 'Politician']
professionFinal = ''
#Splitting the first sentence of the summary into separate words
firstSentenceList = firstSentence.split()
#Replacing each other character in the first sentence
counter = 0
print(firstSentenceList)
for i in firstSentenceList:
    x = [',', '.']
    if x[0] in i:
        firstSentenceList = firstSentenceList[counter].replace(',', '')
        counter += 1
    elif x[1] in i:
        i = i.replace('.', '')
        counter += 1
    else:
        counter += 1
        continue
print(firstSentenceList)
#Checking each word in the first sentence against the keywords in each profession to try to get a match
for i in firstSentenceList:
    if i in sportspersonKeywords:
        professionFinal = profession[1]
        break
    elif i in scientistKeywords:
        professionFinal = profession[0]
        break
    elif i in politicianKeywords:
        professionFinal = profession[2]
        break
#if a match is found, then that person has that profession, if not, then their profession is not in our parameters
if professionFinal == '':
    print('[PROFESSION]: NOT A SPORTPERSON, SCIENTIST, OR POLITICIAN')
else:
    print('[PROFESSION]: ' + professionFinal)
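This all worked fine for Albert Einstein, Serena Williams, Donald Trump and others, but not when I search for James Watson. To clarify, I only need to find their profession from the parameters above. If they are not a scientist, sportsperson or politician, the program doesn't need to go any further; it just needs to say they are none of those.

Unfortunately, I'm using Repl.it, which doesn't allow breakpoints and many other things, so I have to debug manually by inserting print() statements to check how everything is going. When I printed the firstSentenceList variable, which stores my first sentence (the one I check the keywords against), I found that it should have recognized biologist, but it didn't, because the word biologist is followed by a comma. It is listed as 'biologist,', which messes up the keyword search. This code: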
#Replacing each other character in the first sentence
counter = 0
print(firstSentenceList)
for i in firstSentenceList:
    x = [',', '.']
    if x[0] in i:
        firstSentenceList = firstSentenceList[counter].replace(',', '')
        counter += 1
    elif x[1] in i:
        i = i.replace('.', '')
        counter += 1
    else:
        counter += 1
        continue
print(firstSentenceList)
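is something I just added to try to replace the commas and periods in the list. I tried running it and, voilà, errors. One of them is:

So, in short, I don't know how to replace the characters mentioned above in each string in the list. Can anyone teach me how to do that? Once again, to those who saw my other post and marveled at how long I make them, I apologize.

Link to my Repl.it: Wikipedia Web-scraping Project - Brightbulb123 - Repl.it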
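The problem is: in replace(',','') you need to include a space, like .replace(',',' ').

One suggestion is to convert the first-sentence list into a dictionary. Handling multiple occurrences through key/value pairs would be simpler.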
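One possible reading of the dictionary suggestion, as a rough sketch only: count the cleaned words into a dictionary keyed by word, so repeated words collapse into a single key. The sample sentence is hypothetical, and collections.Counter (a dict subclass from the standard library) is a choice made here, not something the answer specifies.

```python
from collections import Counter

# Hypothetical sample sentence, not taken from the asker's program.
first_sentence = 'He won a gold medal, a silver medal, and a bronze medal'

# Strip leading/trailing commas and periods, lower-case each word,
# then count how often each cleaned word occurs.
word_counts = Counter(w.strip(',.').lower() for w in first_sentence.split())

print(word_counts['medal'])   # how many times 'medal' occurs
print('gold' in word_counts)  # membership test against the dictionary keys
```

Keyword checks then become lookups against the dictionary keys, and the trailing comma no longer gets in the way.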
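I think the problem is in this line: firstSentenceList = firstSentenceList[counter].replace(',', '')

Here you assign the word at position counter to the list variable, effectively replacing the whole list with a single word. Assigning the cleaned word back to firstSentenceList[counter] instead solves the problem. The same applies to the . case (x[1]), where the result is assigned only to the loop variable i and never reaches the list.

A better approach is to strip the , and . from each word as you iterate over the list.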
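The two fixes described above could look roughly like the following. This is a minimal sketch, not the answerer's original snippet: the sample word list, the shortened KEYWORDS table and the find_profession helper are all hypothetical, and lower-casing each word before the lookup is an extra assumption that removes the need for capitalized duplicates in the keyword lists.

```python
import string

# Hypothetical sample standing in for the asker's split first sentence.
firstSentenceList = ['an', 'American', 'molecular', 'biologist,', 'and', 'zoologist.']

# Fix 1: write the cleaned word back into its own slot in the list.
# enumerate supplies the index, so no manual counter is needed.
for counter, word in enumerate(firstSentenceList):
    firstSentenceList[counter] = word.replace(',', '').replace('.', '')

print(firstSentenceList)

# Fix 2: clean each word at the moment it is checked against the keywords.
# Shortened, hypothetical keyword table (lower-case only).
KEYWORDS = {
    'Sportsperson': {'sportsperson', 'sport', 'player', 'footballer', 'olympics'},
    'Scientist': {'scientist', 'physicist', 'chemist', 'biologist'},
    'Politician': {'politician', 'president', 'senator', 'election'},
}

def find_profession(first_sentence):
    """Return the first matching profession name, or None if nothing matches."""
    for raw_word in first_sentence.split():
        # Drop punctuation at either end of the word, then normalise the case.
        word = raw_word.strip(string.punctuation).lower()
        for profession, keywords in KEYWORDS.items():
            if word in keywords:
                return profession
    return None

print(find_profession('James Dewey Watson is an American molecular biologist, geneticist and zoologist'))
```

With the second approach the separate cleanup loop and the counter variable are no longer needed, because the punctuation is removed at the point of comparison.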