Keyword search in only one column of a file, keeping 2 words before and after the keyword

s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
    if not line.strip():
        continue
    fields = line.split(None, 2)
    joined = '|'.join(fields)
    print(joined)

2 Answers

User

#1 · Edited 2024-09-28 17:23:15

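First, a warning: running this code on a million records is risky. You are working with regular expressions, and this approach works well only as long as the text follows a regular pattern. Otherwise you may end up writing tons of cases to extract the data you want without also extracting the data you don't want.

For a million cases you need pandas, because a plain loop is too slow.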

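import re
import pandas as pd

df = pd.DataFrame({
    "C1": [12088, 12089],
    "C2": ["CITA", "CITA"],
    "C3": ["Hello very nice lists, better to keep those",
           "This is great theme for lists keep it"],
})
# The pattern is hard-coded to the sample rows: text between "Hello" and
# "keep", or everything after "great". Raw string avoids escape warnings.
pattern = re.compile(r"(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*")
df["C3"] = df["C3"].map(lambda x: pattern.findall(str(x)))
# Keep the first match of each row (raises IndexError if a row has no match).
df["C3"] = df["C3"].map(lambda x: x[0].strip())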

This gives:

df
      C1    C2                          C3
0  12088  CITA  very nice lists, better to
1  12089  CITA     theme for lists keep it
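
The pattern in this answer is tied to the words "Hello" and "great" in the two sample rows. A more general sketch, assuming words are separated by whitespace and the keyword may carry trailing punctuation, captures up to two words on each side of the keyword with `Series.str.extract`. The names `KEYWORD`, `pattern`, and `context` are illustrative and not part of the original answer.

```python
import re
import pandas as pd

KEYWORD = "lists"  # assumed keyword, taken from the sample data

df = pd.DataFrame({
    "C1": [12088, 12089],
    "C2": ["CITA", "CITA"],
    "C3": ["Hello very nice lists, better to keep those",
           "This is great theme for lists keep it"],
})

# Up to two words before the keyword, the keyword itself (whole word,
# optionally followed by punctuation), then up to two words after it.
pattern = (r"((?:\S+\s+){0,2}\b"
           + re.escape(KEYWORD)
           + r"\b\S*(?:\s+\S+){0,2})")

# expand=False returns a Series; rows without the keyword become NaN,
# so fall back to the original text for those rows.
context = df["C3"].str.extract(pattern, expand=False)
df["C3"] = context.fillna(df["C3"])
print(df)
```

Rows that do not contain the keyword are left unchanged here; that is one possible choice, not a requirement stated in the question.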

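User

#2 · Edited 2024-09-28 17:23:15

There are still some open questions about exactly how you want to perform the keyword search. Your example already contains one obstacle: how should characters such as commas be handled? It is also unclear what to do with lines that do not contain the keyword. And what should happen if there are fewer than two words before or after the keyword? My guess is that you are not quite sure of the exact requirements yourself and have not considered all the edge cases.

Nevertheless, I made some "blind decisions" on these questions. Below is a simple sample implementation that assumes a very simple keyword-matching rule. I have created the function findword(), which you can adapt as needed. So perhaps this example helps you work out your own requirements.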

KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """


def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None


for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Keep up to two words on each side of the keyword. The start is
    # guarded against going negative; a slice end beyond the list is
    # clamped by Python, so it needs no special case.
    start = max(idx - 2, 0)
    end = idx + 3
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))

Test:

$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
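
findword() as written only recognises the bare keyword and the keyword with a trailing comma. One possible generalisation, sketched here with a hypothetical helper `findword_loose`, strips leading and trailing ASCII punctuation from each word before comparing, so that forms such as "lists." or "(lists)" also match. Whether that matching rule fits the real data is an assumption.

```python
import string


def findword_loose(words, keyword):
    """Return the index of the first word equal to `keyword` once
    surrounding ASCII punctuation is stripped, otherwise None.

    Hypothetical variant of findword(); not part of the original answer.
    """
    for i, word in enumerate(words):
        if word.strip(string.punctuation) == keyword:
            return i
    return None


print(findword_loose("very nice lists, better to".split(), "lists"))  # 2
print(findword_loose("a (lists) b".split(), "lists"))                 # 1
print(findword_loose("no keyword here".split(), "lists"))             # None
```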

PS: I hope my indices are right for the slicing. You should check that, though.
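
The index arithmetic can be checked in isolation. The sketch below uses a hypothetical helper `context_window` and relies on the fact that Python clamps a slice end that runs past the end of a list, so only the start has to be guarded against negative values.

```python
def context_window(words, idx, n=2):
    """Return words[idx] with up to n neighbours on each side."""
    start = max(idx - n, 0)  # a negative start would count from the end
    end = idx + n + 1        # an end beyond len(words) is clamped
    return words[start:end]


words = "This is great theme for lists".split()
print(context_window(words, 5))  # keyword is the last word
print(context_window(words, 0))  # keyword is the first word
print(context_window(words, 3))  # keyword in the middle
```

With the keyword at either end of the line, the window simply shrinks on that side and the keyword itself is always kept.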
