How can I automatically recognize citations that refer to the same paper?

    cite1 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model (2003), in: Journal of Machine Learning Research, 3(1137--1155)"
    cite2 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. (2003) A Neural Probabilistic Language Model"
    cite3 = "Bengio Y, Ducharme R, Vincent P, Jauvin C. (2003) A Neural Probabilistic Language Model"

    from difflib import SequenceMatcher as smatch

    def similar(x, y):
        return smatch(None, x.strip(), y.strip()).ratio()

    similar(cite1, cite2)  # 0.721
    similar(cite1, cite3)  # 0.553
    similar(cite2, cite3)  # 0.802

2 answers

User

#1 · edited 2024-10-02 16:24:59

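Leaving aside neural networks and natural language processing, which would be a rather... complex approach, I would tackle this by preprocessing the data.

There are a few things you can do: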

- Create short names: Yoshua Bengio => Bengio Y
- Normalize the names: Réjean Ducharme -> rejean ducharme
- Extract the author part of the string, the title part, and the "leftovers". Calculate the similarity for each part and average the results.
- Extract the year of publication, making it a three-variable problem (authors, title, year).
- Use additional metadata if available (paper field, citation index, etc.).
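The steps above can be sketched roughly as follows. This is only an illustration: the helper names (`normalize`, `short_name`, `extract_year`, `averaged_similarity`) are made up for this example, the citations are assumed to be already split into `authors`, `title` and `year` fields, and names are assumed to be written either as "Given Family" or as "Family Initials".

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip diacritics: 'Réjean Ducharme' -> 'rejean ducharme'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def short_name(name):
    """'Yoshua Bengio' -> 'bengio y'; 'Bengio Y' is kept as 'bengio y'."""
    parts = name.split()
    if len(parts) < 2:
        return normalize(name)
    # Assumption: a trailing token of one or two capitals means the name
    # is already abbreviated ('Bengio Y'); otherwise the family name is last.
    if re.fullmatch(r"[A-Z]{1,2}\.?", parts[-1]):
        family, initials = " ".join(parts[:-1]), parts[-1].rstrip(".")
    else:
        family, initials = parts[-1], "".join(p[0] for p in parts[:-1])
    return normalize(f"{family} {initials}")


def extract_year(citation):
    """Return the first four-digit year found in parentheses, or None."""
    match = re.search(r"\((\d{4})\)", citation)
    return match.group(1) if match else None


def ratio(x, y):
    return SequenceMatcher(None, x, y).ratio()


def averaged_similarity(a, b):
    """Average the author, title and year scores of two pre-split citations.

    `a` and `b` are hypothetical dicts with 'authors' (list of names),
    'title' and 'year' keys; splitting the raw string is left to the caller.
    """
    authors_a = ", ".join(short_name(n) for n in a["authors"])
    authors_b = ", ".join(short_name(n) for n in b["authors"])
    scores = [
        ratio(authors_a, authors_b),
        ratio(normalize(a["title"]), normalize(b["title"])),
        1.0 if a["year"] == b["year"] else 0.0,
    ]
    return sum(scores) / len(scores)
```

Splitting a raw citation into its parts is the format-specific step and is deliberately left out here; each bibliographic style will likely need its own rule.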

If your problem is limited to these three bibliographic styles, the approach above works.

If your bibliographies vary a lot (for example, if you apply this to the entire Springer/IEEE database), you should look into machine learning approaches.

I can't name a suitable model off the top of my head, but I remember this paper being close to your problem.

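Alternatively, if you have a large bibliographic dataset, you could try unsupervised or semi-supervised methods such as word2vec/node2vec or k-means and check whether the resulting similarity scores are accurate enough.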

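A word of caution.

- In some cases, paper titles from the same research team are very similar, or authors' short names coincide although their full names differ: Wang Xu and Wei Xu are both abbreviated to Xu W.
- In other cases, the same author appears under different spellings: Réjean Ducharme and Rejean Ducharme.
- Titles can vary too: Conference of awesome discoveries and Awesome discoveries, conference of.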
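A small sketch of two of these pitfalls, using only the standard library. The function names are hypothetical, and the word-sorting trick is one possible way to absorb reordered titles, not something prescribed above.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(title):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    words = re.findall(r"\w+", title.lower())
    return " ".join(sorted(words))


def title_similarity(x, y):
    """Compare titles on their sorted words so that word order is ignored."""
    return SequenceMatcher(None, token_sort_key(x), token_sort_key(y)).ratio()


def initials_form(full_name):
    """'Wang Xu' -> 'Xu W' (assumes the family name comes last)."""
    *given, family = full_name.split()
    return (family + " " + "".join(g[0] for g in given)).strip()


# Reordered titles compare as identical once word order is ignored.
reordered = title_similarity(
    "Conference of awesome discoveries",
    "Awesome discoveries, conference of",
)

# Two different people collapse to the same short name.
collision = initials_form("Wang Xu") == initials_form("Wei Xu")
```

Ignoring word order removes one source of false negatives but can also make genuinely different titles look alike, so it is probably best combined with the year and author checks rather than used alone.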

User

#2 · edited 2024-10-02 16:24:59

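The important thing to consider is: what makes a citation unique?

Judging by your examples, the combination of authors, article title and publication year identifies a citation uniquely.

That means you can parse out the names and compare how close they are (since the third example lists the names differently). Parse the title, which should match 100%. Parse the year, which should also match 100%.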
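As a rough sketch of that rule, assuming the citations have already been parsed into `authors`, `title` and `year` strings: title and year must agree exactly after light cleanup, while authors only need to be close. The names `canon` and `same_paper` and the 0.5 threshold are illustrative choices, not tuned values.

```python
import unicodedata
from difflib import SequenceMatcher


def canon(text):
    """Lowercase, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())


def same_paper(a, b, author_threshold=0.5):
    """Decide whether two parsed citations denote the same paper.

    `a` and `b` are hypothetical dicts with 'authors', 'title' and 'year'
    strings. Title and year must match exactly; authors are compared fuzzily.
    """
    if canon(a["title"]) != canon(b["title"]):
        return False
    if a["year"] != b["year"]:
        return False
    closeness = SequenceMatcher(
        None, canon(a["authors"]), canon(b["authors"])
    ).ratio()
    return closeness >= author_threshold


full = {
    "authors": "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin",
    "title": "A Neural Probabilistic Language Model",
    "year": "2003",
}
abbreviated = {
    "authors": "Bengio Y, Ducharme R, Vincent P, Jauvin C",
    "title": "A Neural Probabilistic Language Model",
    "year": "2003",
}
print(same_paper(full, abbreviated))
```

The author threshold would need checking against real data, since abbreviated and full author lists can differ quite a lot as plain strings.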
