Python: extracting words inside quotes from a paragraph without regex

3 answers

User

Answer 1 · edited 2024-09-28 01:23:24

Here are two possible approaches:

desired = [
    'ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit.',
    'turpi\'', 'in', 'fermentum', 'diam', 'auctor', 'aliquam!', 'tristique'
    ]

text = """
Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem
urna. Pellentesque elementum "turpi'" est, "in fermentum diam auctor aliquam!".
Morbi rhoncus erat ipsum, eu "tristique"
"""

def extract_quoted(text):
    words = []
    next_pos = -1
    while True:
        try:
            pos = text.index('"', next_pos + 1)
        except ValueError:
            break
        try:
            next_pos = text.index('"', pos + 1)
        except ValueError as e:
            raise ValueError("mismatched quotes") from e
        quoted_segment = text[pos + 1:next_pos]
        words.extend(quoted_segment.split())
    return words

def split_only(text):
    return [word for chunk in text.split('"')[1::2] for word in chunk.split()]

if __name__ == "__main__":
    print(extract_quoted(text) == desired)
    print(split_only(text) == desired)

The first is a bit more explicit about how the text is parsed, while the second is probably the slicker one-line, split-based approach you were looking for.
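
One difference between the two is worth noting: extract_quoted raises ValueError on mismatched quotes, whereas split_only silently treats everything after an unmatched quote as quoted text. Below is a minimal sketch of how the split-based approach could perform the same check. The name split_checked is hypothetical and not part of the original answer.

```python
def split_checked(text):
    """Split-based extraction that also rejects unbalanced quotes."""
    pieces = text.split('"')
    # n quote characters produce n + 1 pieces, so an even number of
    # quotes (balanced) always yields an odd number of pieces.
    if len(pieces) % 2 == 0:
        raise ValueError("mismatched quotes")
    # Odd-indexed pieces are the quoted segments; split each on whitespace.
    return [word for chunk in pieces[1::2] for word in chunk.split()]


print(split_checked('say "hello world" and "bye"'))
# ['hello', 'world', 'bye']
```

Calling split_checked('a "b') raises ValueError, because a single quote character splits the text into two pieces.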

User

Answer 2 · edited 2024-09-28 01:23:24

I tried this:

a = """Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem urna. Pellentesque elementum "turpi'" est, "in fermentum diam auctor aliquam!". Morbi rhoncus erat ipsum, eu "tristique" """
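in_quote = False
res = []
word = ''

for ch in a:
    if ch == '"':
        # A quote toggles the state and ends any word in progress.
        in_quote = not in_quote
        if word:
            res.append(word)
            word = ''
    elif in_quote:
        if ch.isspace():
            # Only store non-empty words, so repeated whitespace
            # inside quotes does not add '' to the result.
            if word:
                res.append(word)
                word = ''
        else:
            word += ch
print(res)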

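User

Answer 3 · edited 2024-09-28 01:23:24

Copied from my comment:

Once you split the text using " as the delimiter, you can simply take every odd-indexed element of the resulting list. Then split each of those elements normally (on whitespace) and concatenate the resulting lists.

Example: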

text = """Lorem "ipsum dolor sit amet, consectetur adipiscing elit.". Praesent non sem urna. Pellentesque elementum "turpi'" est, "in fermentum diam auctor aliquam!". Morbi rhoncus erat ipsum, eu "tristique" """

text_split_by_quotes = text.split('"')
# get the odd-indexed elements (here's one way to do it):
text_in_quotes = text_split_by_quotes[1::2]
# split each normally (by whitespace) and flatten the list (here's one way to do it):
ans = []
for chunk in text_in_quotes:  # renamed so the loop does not overwrite `text`
    ans.extend(chunk.split())
# print answer
print(ans)

>>> ['ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit.', "turpi'", 'in', 'fermentum', 'diam', 'auctor', 'aliquam!', 'tristique']
