How to do an exact match in a paragraph using a list of strings in Python

3 Answers

User

Answer 1 · edited 2024-10-01 09:29:12

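I solved my use case by sorting the product list longest first and removing each matched product from the paragraph before checking the shorter ones. The code below shows how I did it. It may or may not be the right way to do this, but it solved my problem. It works even when the product list contains n products and the paragraph contains many strings that match entries in the list. Thanks for your research and help.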

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort longest first so that longer product names are tried before their shorter prefixes
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  # True if the product name occurs anywhere in the paragraph
  return x in paragraph

# Keep only the products that occur somewhere in the paragraph
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this stage the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings, longest first
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence.
    # Note: the index is relative to the paragraph as modified by earlier removals.
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})
    
    # Once the index is recorded, remove every occurrence of the matched string so that
    # shorter strings cannot match inside it, i.e. removing "productA v4.1.5 ver"
    # leaves no chance for "productA v4.1.5" or "productA v4.1" to match at that position.
    paragraph = paragraph.replace(prd,"")
    
print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents", the result is [{'index': '26', 'value': 'productA v4.1.5'}]
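
One caveat with the code above: because each match is deleted from paragraph, the index recorded for a later match refers to the shortened text (56 above, whereas that product starts at 75 in the original paragraph). A minimal sketch of one way around this is to overwrite matches with same-length filler instead of deleting them. The function name find_products_masked and the filler parameter are hypothetical, and the sketch assumes no product name contains the NUL character used as filler. Unlike the code above, it records every occurrence and stores the index as an int.

```python
def find_products_masked(paragraph, products, filler="\0"):
    """Longest-first matching that keeps indices valid for the original text."""
    masked = paragraph
    found = []
    for prod in sorted(products, key=len, reverse=True):
        if not prod:
            continue  # an empty name would match at every position
        start = masked.find(prod)
        while start != -1:
            found.append({"index": start, "value": prod})
            end = start + len(prod)
            # Overwrite rather than delete, so later offsets do not shift
            masked = masked[:start] + filler * len(prod) + masked[end:]
            start = masked.find(prod, end)
    return sorted(found, key=lambda m: m["index"])


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "
print(find_products_masked(paragraph, products))
# [{'index': 26, 'value': 'productA v4.1.5 ver'}, {'index': 75, 'value': 'productA v4.1'}]
```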

User

Answer 2 · edited 2024-10-01 09:29:12

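It sounds like you basically want the start and end of each match to be either the edge of the paragraph or a transition to a whitespace character (the end of a "word", though sadly the regex definition of a word excludes things like ., so you can't use \b-based tests).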
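
A small sketch of why \b falls short here, using a sample paragraph from this thread. The (?<!\S) and (?!\S) lookarounds are one way to spell "whitespace or edge of the string" and are chosen for this illustration only.

```python
import re

paragraph = "Troubleshooting steps for productA v4.1.5 documents"
needle = "productA v4.1"

# \b only marks a switch between word characters (letters, digits, _) and
# anything else. "1" followed by "." qualifies, so this matches inside "v4.1.5".
word_boundary = re.search(r"\b" + re.escape(needle) + r"\b", paragraph)

# Requiring whitespace or the string edge on both sides rejects that match.
space_boundary = re.search(r"(?<!\S)" + re.escape(needle) + r"(?!\S)", paragraph)

print(word_boundary is not None)   # True
print(space_boundary is not None)  # False
```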

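The simplest approach here is to split the line on whitespace and then check whether your string occurs in the resulting list (using some variant on finding a sublist in a list):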

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    # Counting from 1 makes i the index of the element right after x
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
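
To illustrate the word-level check, here is a small self-contained sketch. find_word_sublist is a hypothetical variant that returns the word position instead of a boolean, and the sample paragraph is taken from elsewhere in this thread.

```python
def find_word_sublist(haystack, needle):
    """Return the word position where needle starts in haystack, or -1."""
    n = len(needle)
    if n == 0:
        return -1  # treat an empty needle as "not found"
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

print(find_word_sublist(para_words, "productA v4.1.5".split()))  # 3
print(find_word_sublist(para_words, "productA v4.1".split()))    # -1
```

The second call fails because "v4.1" is compared against the whole word "v4.1.5", which is the exact-match behaviour the question asks for. The position is a word index, not a character index.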

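If you need the index as well, or need exact whitespace matching, it gets more complicated (.split() doesn't preserve runs of whitespace, so you can't reconstruct the index, and if you index into the whole string and the substring occurs twice but only the second occurrence meets your requirements, you may get the wrong index). At that point, I'd probably just use a regex: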

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

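Note that, as written, this won't work with filter (it returns 0, which is falsy, when the paragraph begins with the substring, and -1, which is truthy, on failure). You could have it return None on failure and a tuple of indices on success, so it works for both boolean and index-demanding cases, e.g. (demonstrating the walrus operator, Python 3.8+, for fun):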

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
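
Note that a whitespace-bounded search on its own still lets "productA v4.1.5" match inside "productA v4.1.5 ver", because the shorter name ends at a space. A hedged sketch of one way to combine the boundary test with longest-first matching: build a single alternation with the longest names first and scan with re.finditer. The function name find_products and the (?<!\S)/(?!\S) lookarounds are choices made for this sketch, not part of the answer above.

```python
import re


def find_products(paragraph, products):
    """Return (start, end, name) for each whitespace-bounded product match."""
    names = sorted((p for p in products if p), key=len, reverse=True)
    if not names:
        return []
    # Longest names first, so the regex prefers them at any given position
    alternation = "|".join(re.escape(p) for p in names)
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(paragraph)]


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "
print(find_products(paragraph, products))
# [(26, 45, 'productA v4.1.5 ver'), (75, 88, 'productA v4.1')]
```

Because finditer yields non-overlapping matches, a position consumed by a longer name is not reported again for a shorter one, and the offsets refer to the unmodified paragraph.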

User

Answer 3 · edited 2024-10-01 09:29:12

You want to find the longest match, so you should start matching with the longest string first:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    # True if the product name occurs anywhere in the paragraph
    return x in paragraph


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Output:

productA v4.1.5
productA v4.1.5
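
The same "longest match wins" idea can also be sketched without sorting, by asking max for the longest candidate. This is only an illustration: longest_match is a hypothetical helper, it scans every product rather than stopping early, and it returns None instead of raising when nothing matches.

```python
def longest_match(paragraph, products):
    """Return the longest product name that occurs in paragraph, or None."""
    return max((p for p in products if p in paragraph), key=len, default=None)


products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"

print(longest_match(paragraph, products))           # productA v4.1.5
print(longest_match("no products here", products))  # None
```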
