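<p>Actually, you are not getting key-value pairs, and <code>pdfminer</code> simply cannot provide them. It only extracts text from the PDF (plus possibly some additional information).</p>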
<p>To get good logical label-value pairs, you need to apply <a href="https://www.ontotext.com/knowledgehub/fundamentals/information-extraction/#:%7E:text=Information%20extraction%20is%20the%20process,storing%20them%20in%20a%20database.&text=Information%20extraction%20is%20the%20process%20of%20extracting%20specific%20(pre,specified)%20information%20from%20textual%20sources." rel="nofollow noreferrer">Information Extraction</a> methods and/or <a href="https://en.wikipedia.org/wiki/Named-entity_recognition" rel="nofollow noreferrer">Named Entity Recognition</a> to the extracted text. There are many options here. You may want to start by looking at <a href="https://spacy.io/" rel="nofollow noreferrer">SpaCy</a> or <a href="https://www.nltk.org/" rel="nofollow noreferrer">NLTK</a>.</p>
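<p>As a rough illustration of the simplest kind of extraction step, here is a minimal sketch that pulls label-value pairs out of text that has already been extracted. It assumes each pair sits on its own line in a <code>Label: value</code> layout, which real PDFs often do not follow. The names <code>PAIR_RE</code>, <code>extract_pairs</code> and the sample text are made up for this example. It uses only the standard library, so it is a rule-based stand-in and not a replacement for a proper NER pipeline.</p>

```python
import re

# Hypothetical pattern (an assumption for this sketch): a short label,
# a colon, then a non-empty value on the same line.
PAIR_RE = re.compile(r"^\s*(?P<label>[^:\n]{1,40}?)\s*:\s*(?P<value>\S.*?)\s*$")


def extract_pairs(text):
    """Collect 'Label: value' lines from already-extracted text into a dict."""
    pairs = {}
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs[match.group("label")] = match.group("value")
    return pairs


# Made-up sample standing in for the text a PDF extractor would return.
sample_text = """Invoice
Invoice Number: 12345
Date: 2020-01-15
Total: 99.50 USD
Thank you for your business
"""

pairs = extract_pairs(sample_text)
print(pairs)
```

<p>Lines without a colon are skipped, so headings and free text are ignored. Once the layout gets less regular (values on the next line, tables, multi-column pages), rules like this tend to break down, which is where the NER tools mentioned above become more useful.</p>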
<p>In general, extracting meaningful data and its relationships from documents now goes by a sexy new name: <a href="https://sites.google.com/view/di2019" rel="nofollow noreferrer">Document Intelligence</a>.</p>