How can I find keywords in text based on context?

Posted 2024-09-26 22:51:54


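I have a list of keywords stored in a JSON file named vocations.json, and a database containing more than 50,000 records.

Many of the records have Wikipedia links. By connecting to Wikipedia, I search for all of the keywords for each record and try to find out whether they are mentioned in the first paragraph of that record's biography.

The code below finds the keywords, but I need a smarter algorithm, one in which the program evaluates keywords according to the context of the text.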

import re
import json
import requests

from bs4 import BeautifulSoup as BS
    
    
def get_text(url):
    """Return the first non-empty paragraph of the article, or None."""
    r = requests.get(url, timeout=5)
    div = BS(r.content, "html.parser").select_one(".mw-content-ltr")
    if div is None:
        return None
    # Search the parsed tag directly instead of re-parsing str(div)
    for p in div.find_all("p"):
        if p.text.strip():
            return p.text
    return None


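def find_occupations(url, keywords):
    text = get_text(url=url)
    if not text:
        return url, None
    text = text.lower()
    occupations = []
    for keyword in keywords:
        # Raw f-string avoids the invalid "\s" escape; re.escape makes any
        # regex metacharacters inside a keyword match literally
        pattern = rf"\s{re.escape(keyword.lower())}"
        if re.search(pattern, text) and keyword not in occupations:
            occupations.append(keyword)
    return url, occupations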


with open("vocations.json") as f:
    words = json.load(f)

For some records, the code above finds the keywords correctly. Here is an example of a correct match:

url1 = "https://en.wikipedia.org/wiki/Gerolamo_Cardano"
print(find_occupations(url1, words))

The first paragraph at the URL above reads:

Gerolamo (also Girolamo[3] or Geronimo[4]) Cardano (Italian: [dʒeˈrɔlamo karˈdano]; French: Jérôme Cardan; Latin: Hieronymus Cardanus; 24 September 1501 – 21 September 1576) was an Italian polymath, whose interests and proficiencies ranged from being a mathematician, physician, biologist, physicist, chemist, astrologer, astronomer, philosopher, writer, and gambler.[5] He was one of the most influential mathematicians of the Renaissance, and was one of the key figures in the foundation of probability and the earliest introducer of the binomial coefficients and the binomial theorem in the Western world. He wrote more than 200 works on science.[6]

The output I get is:

('https://en.wikipedia.org/wiki/Gerolamo_Cardano', ['Astrologer', 'Astronomer', 'Biologist', 'Chemist', 'Gambler', 'Mathematician', 'Philosopher', 'Physician', 'Physicist', 'Polymath', 'Writer'])

For some records, however, such as the one below, I get wrong results:

url2 = "http://en.wikipedia.org/wiki/Barbara_Villiers"
print(find_occupations(url2, words))

The first paragraph at the URL above reads:

Barbara Palmer, 1st Duchess of Cleveland (27 November [O.S. 17 November] 1640[1] – 9 October 1709), more often known by her maiden name Barbara Villiers or her title of Countess of Castlemaine, was an English royal mistress of the Villiers family and perhaps the most notorious of the many mistresses of King Charles II of England, by whom she had five children, all of them acknowledged and subsequently ennobled. Barbara was the subject of many portraits, in particular by court painter Sir Peter Lely. In the Gilded Age, it was stylish to adorn an estate with her likeness.

Below is the output I get, which is not entirely correct:

('http://en.wikipedia.org/wiki/Barbara_Villiers', ['King', 'Mistress', 'Painter'])

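I understand why the program finds the keywords "King" and "Painter" even though they do not describe Barbara Villiers: these keywords are also stored in the JSON file, and they also appear in the first paragraph of the Wikipedia page.

My first question: is there a way to find keywords correctly by evaluating the context of the text? If so, what would you suggest?

My second question: if we can use a method that searches for words and evaluates them according to the context of the text, is it ultimately still necessary to check all 50,000 records to see whether the algorithm produces accurate results?

Edit: here are some items from the vocations.json file: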

[
    "Accessory designer",
    "Acoustical engineer",
    "Acrobat",
    "Actor",
    "Actress",
    "Advertising designer",
    "Aeronautical engineer",
    "Aerospace engineer",
    "Agricultural engineer",
    "Anesthesiologist",
    "Anesthesiologist Assistant",
    "Animator",
    "Anthropologist",
    "Applied engineer",
    "Arborist",
    "Archaeologist",
    "Archimime",
    "Architect",
    "Army officer",
    "Art administrator",
    "Artisan",
    [...]
]

1 answer
User
#1 · Posted 2024-09-26 22:51:54

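Question 1: Is there a way to find keywords correctly by evaluating the context of the text? If so, what would you suggest?

Keyword detection (also known as keyword extraction) falls under natural language processing (NLP).

Some techniques for keyword extraction include:

  • Word collocation and co-occurrence
  • TF-IDF (short for term frequency-inverse document frequency)
  • RAKE (Rapid Automatic Keyword Extraction)
  • Support vector machines (SVM)
  • Deep learning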
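
To make one of these concrete, here is a minimal TF-IDF sketch using only the standard library. The three-sentence corpus, the `tokenize` helper and the `tf_idf` function are all made up for illustration; a real run would use the first paragraphs fetched by `get_text`. It only shows how the scores are computed and is not a complete answer to the context problem.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into ASCII word tokens.

    Assumption: a deliberately simple tokenizer; accented letters are dropped.
    """
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical mini-corpus standing in for first paragraphs of biographies
docs = [
    "Cardano was an Italian mathematician and physician",
    "Lely was a Dutch painter at the English court",
    "Newton was an English mathematician and physicist",
]
scores = tf_idf(docs)
# Three highest-scoring terms of the second document
top_terms = sorted(scores[1], key=scores[1].get, reverse=True)[:3]
print(top_terms)
```

Terms that occur in every document (such as "was" here) get a score of zero, while terms specific to one document score highest. You could restrict the scored terms to the entries in vocations.json, but whether that alone separates "Mistress" from "King" and "Painter" would need to be checked on your data.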

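Question 2: If we can use a method that searches for words and evaluates them according to the context of the text, is it ultimately still necessary to check all 50,000 records to see whether the algorithm produces accurate results?

Developing a statistical model may not require training data, whereas building a deep learning model may require a large amount of data. So it depends entirely on which method is used.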
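
Whichever method is used, one common way to estimate accuracy without reviewing every record is to hand-label a random sample and compute precision and recall on it. The sketch below assumes that approach; `sample_records`, `precision_recall` and the hand-made label for Barbara Villiers are hypothetical and only illustrate the idea.

```python
import random


def sample_records(record_ids, sample_size, seed=0):
    """Draw a reproducible random sample of record ids for manual review."""
    rng = random.Random(seed)
    ids = list(record_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def precision_recall(predicted, expected):
    """Compare predicted and hand-labelled keyword sets per record.

    Both arguments map a record id to a collection of keywords.
    """
    true_pos = false_pos = false_neg = 0
    for record_id, gold in expected.items():
        pred = set(predicted.get(record_id, ()))
        gold = set(gold)
        true_pos += len(pred & gold)   # keywords found and correct
        false_pos += len(pred - gold)  # keywords found but wrong
        false_neg += len(gold - pred)  # correct keywords that were missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Program output from the question vs. a hypothetical hand-made label
predicted = {"Barbara_Villiers": ["King", "Mistress", "Painter"]}
expected = {"Barbara_Villiers": ["Mistress"]}
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")

# Pick 100 of the 50,000 records to label by hand
to_review = sample_records(range(50000), 100)
```

For the Barbara Villiers record, one of the three predicted keywords is correct and no correct keyword is missed, so precision is 1/3 and recall is 1. How large the sample needs to be depends on how precise an estimate you want.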
