How to find exact matches in a paragraph using a list of strings in Python

Published 2024-10-01 09:29:12


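I have a list of strings that contain version numbers, and I want to find the exact entries from that list that occur in a paragraph. For example:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use a filter like the following: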

products = ["productA v4.1", "productA v4.1.5", "product A v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

The output of the code above is ['productA v4.1', 'productA v4.1.5'].

How can I find only "productA v4.1.5" in the paragraph and get its index?


3 Answers

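I solved my use case by sorting the product list in reverse order of length and removing each matched product from the paragraph as soon as it is found. The code below shows how I did it. It may or may not be the right approach, but it solved my problem. It works even when the product list has n products and the paragraph contains many matching strings from the list. Thanks for the research and help.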

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort by length in descending order so that longer strings come first
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

#filter all matched strings
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this stage the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings, longest first
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence of this match
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})
    
    # Once the index is recorded, remove every occurrence of the matched string so that
    # shorter strings cannot match inside it, e.g. removing "productA v4.1.5 ver" prevents
    # "productA v4.1.5" and "productA v4.1" from matching at the same place
    paragraph = paragraph.replace(prd,"")
    
print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# Note: index 56 refers to the paragraph after the earlier removal; in the original paragraph that match starts at 75
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents", the result is [{'index': '26', 'value': 'productA v4.1.5'}]
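
If the indices need to refer to the original paragraph, one possible variation is to overwrite each match with a filler of the same length instead of deleting it, so later offsets do not shift. The sketch below is only an illustration: the helper name find_longest_products is hypothetical, and it assumes that no product name contains the NUL character used as filler. It is still plain substring matching, so a product name can match inside a longer token that is not in the list.

```python
def find_longest_products(paragraph, products):
    """Return sorted (index, product) pairs, preferring longer product names."""
    masked = paragraph
    found = []
    for prod in sorted(products, key=len, reverse=True):
        start = masked.find(prod)
        while start != -1:
            found.append((start, prod))
            # Overwrite the match with a same-length filler (assumed absent from
            # all product names) so that offsets of later matches stay valid
            masked = masked[:start] + "\0" * len(prod) + masked[start + len(prod):]
            start = masked.find(prod, start + len(prod))
    return sorted(found)


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "

matches = find_longest_products(paragraph, products)
print(matches)
# [(26, 'productA v4.1.5 ver'), (75, 'productA v4.1')]
```

Unlike the code above, this keeps the indices as integers and reports every occurrence of each product, not only the first.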

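It sounds like you basically want the start and end of the match to be either the edge of the paragraph or a transition to a whitespace character (the end of a "word", though sadly the regex definition of a word excludes things like ., so you can't use a \b-based test).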
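
To see why a \b-based test falls short here, the following minimal sketch compares it with a whitespace-based check. The lookarounds (?<!\S) and (?!\S) are just one way to express "bounded by whitespace or the edge of the paragraph"; they are used for illustration and the variable names are hypothetical.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"
short = "productA v4.1"

# \b matches between "1" and ".", so the shorter name is still found inside "v4.1.5"
word_boundary = re.search(rf"\b{re.escape(short)}\b", paragraph)

# Requiring that neither neighbour is a non-whitespace character rejects it
whitespace_bound = re.search(rf"(?<!\S){re.escape(short)}(?!\S)", paragraph)

print(word_boundary is not None)  # True: unwanted partial match
print(whitespace_bound is None)   # True: partial match rejected
```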

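The simplest approach here is to split the line on whitespace and then check whether your string appears in the resulting list (using some variation on finding a sublist in a list):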

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())

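If you also need the index, or need exact whitespace matching, it gets more complicated: .split() doesn't preserve runs of whitespace, so you can't reconstruct the index, and if you search the whole string for the index and the substring occurs twice but only the second occurrence satisfies your requirements, you could get the wrong index. At that point, I'd probably use a regular expression: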

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

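Note that, as written, this does not work with filter: it returns 0, which is falsy, when the paragraph begins with the substring, and -1, which is truthy, on failure. You could instead have it return None on failure and a tuple of indices on success, so that it works both as a boolean test and when the index is needed, e.g. (using the walrus operator, available in Python 3.8+, for fun):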

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
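
As a usage illustration, here is a small self-contained sketch of the same idea with the paragraph passed in as a parameter. The name find_product_span and the dictionary of spans are hypothetical, introduced only for this example.

```python
import re


def find_product_span(paragraph, product):
    """Return (start, end) of a whitespace-delimited match, or None if absent."""
    m = re.search(rf"(?:^|\s)({re.escape(product)})(?=\s|$)", paragraph)
    return m.span(1) if m else None


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"

# Map every product to its span (or None), then keep only the real matches
spans = {p: find_product_span(paragraph, p) for p in products}
matched = [p for p in products if spans[p] is not None]

print(matched)                   # ['productA v4.1.5']
print(spans["productA v4.1.5"])  # (26, 41)
```

Testing against None explicitly, rather than relying on truthiness, keeps the check valid regardless of what the success value looks like.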

You want to find the longest match, so you should start matching with the longest string first:

products = ["productA v4.1", "productA v4.1.5", "product A v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Output:

productA v4.1.5
productA v4.1.5
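
If every match and its index are needed, the longest-first ordering can also be expressed as a single regular expression alternation, since the regex engine tries alternatives from left to right. This is only a sketch under assumptions: the whitespace lookarounds and the names alternatives, pattern, and hits are not part of the answer above.

```python
import re

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "

# Longest names first, so the alternation prefers the longest candidate at each position
alternatives = "|".join(re.escape(p) for p in sorted(products, key=len, reverse=True))

# Require whitespace (or the paragraph edge) on both sides of the match
pattern = re.compile(rf"(?<!\S)(?:{alternatives})(?!\S)")

hits = [(m.start(), m.group()) for m in pattern.finditer(paragraph)]
print(hits)
# [(26, 'productA v4.1.5 ver'), (75, 'productA v4.1')]
```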
