How do I find a phrase in a large text file in Python?

3 Answers

User

#1 · edited 2024-10-01 02:34:37

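You can read the file one character at a time and treat newlines as spaces. After that, it is just a matter of stepping through the characters of the phrase you are looking for.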

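def find_words(text, fileobj):
    if not text:
        return False
    # Failure table (as in Knuth-Morris-Pratt): fail[j] is the length of the
    # longest proper prefix of text[:j + 1] that is also a suffix of it.
    # Resetting straight to 0 on a mismatch would miss "hi how are you"
    # in a file containing "hi hi how are you".
    fail = [0] * len(text)
    k = 0
    for j in range(1, len(text)):
        while k and text[j] != text[k]:
            k = fail[k - 1]
        if text[j] == text[k]:
            k += 1
        fail[j] = k
    i = 0  # number of characters of text matched so far
    while True:
        c = fileobj.read(1)
        if not c:
            break
        if c == "\n":  # in text mode Python already translates \r\n to \n
            c = " "
        while i and c != text[i]:
            i = fail[i - 1]  # fall back to the longest partial match that still fits
        if c == text[i]:
            i += 1
            if i == len(text):
                return True
    return False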

If you want to be more lenient about whitespace and case, you can strip all whitespace and lower-case everything before comparing:

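def find_words(text, fileobj):
    # Compare with all whitespace removed and everything lower-cased.
    chars = "".join(text.lower().split())
    if not chars:
        return False
    # Failure table: fail[j] is the length of the longest proper prefix of
    # chars[:j + 1] that is also a suffix of it, so a mismatch does not
    # throw away a partial match that could still succeed.
    fail = [0] * len(chars)
    k = 0
    for j in range(1, len(chars)):
        while k and chars[j] != chars[k]:
            k = fail[k - 1]
        if chars[j] == chars[k]:
            k += 1
        fail[j] = k
    i = 0  # number of characters of chars matched so far
    while True:
        c = fileobj.read(1)
        if not c:
            break
        c = c.lower()
        if c.isspace():  # skip spaces, tabs and newlines in the file
            continue
        while i and c != chars[i]:
            i = fail[i - 1]
        if c == chars[i]:
            i += 1
            if i == len(chars):
                return True
    return False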
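
Calling read(1) for every character is slow on a large file. As a rough sketch of the same idea done in blocks (not part of the answer above): read fixed-size chunks, lower-case them, collapse every run of whitespace, newlines included, to a single space, and carry the last len(needle) - 1 characters over to the next chunk so that a match straddling a chunk boundary is still seen. The name find_phrase_chunked and the chunk_size parameter are made up for illustration. Unlike the function above, this one keeps a single space between words instead of dropping all whitespace.

```python
import io
import re

_WS = re.compile(r"\s+")

def find_phrase_chunked(phrase, fileobj, chunk_size=64 * 1024):
    """Return True if phrase occurs in fileobj, ignoring case and treating
    any run of whitespace (including newlines) as a single space."""
    needle = " ".join(phrase.lower().split())
    if not needle:
        return False
    keep = len(needle) - 1  # characters carried over between chunks
    tail = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return False
        # Normalise the carried-over tail together with the new chunk so that
        # whitespace split across the boundary still collapses to one space.
        buf = _WS.sub(" ", tail + chunk.lower())
        if needle in buf:
            return True
        tail = buf[-keep:] if keep else ""

# Tiny chunk size to force matches across chunk boundaries.
sample = io.StringIO("well HI\n   how\tare\nyou today")
print(find_phrase_chunked("hi how are you", sample, chunk_size=4))  # True
```

Memory use stays bounded by chunk_size plus the length of the phrase, whatever the size of the file.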

User

#2 · edited 2024-10-01 02:34:37

Here is one way to solve the problem:

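import re

def find_phrase():
    phrase = "hi how are you"
    # Tracks whether each word of the phrase has been seen anywhere in the file.
    # Note: the words are checked individually, not as one contiguous phrase.
    words = dict.fromkeys(phrase.split(), False)
    with open("data.txt", "r") as f:
        for line in f:
            for word in words:
                # re.escape keeps any regex metacharacters in the word literal
                if re.search(r"\b" + re.escape(word) + r"\b", line):
                    words[word] = True
            # stop reading as soon as every word has been seen
            if all(words.values()):
                return True
    return False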

Edit:

def find_phrase():
    phrase = "hi how are you"
    with open("data.txt", "r") as f:
        for line in f:
            if phrase in line:
                return True
    return False

User

#3 · edited 2024-10-01 02:34:37

If the file is "fairly large", go through the lines sequentially instead of reading the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
            break
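
Another way to avoid building the whole file as a Python string is to memory-map it and let mmap.find scan the bytes. This is only a sketch under a few assumptions: the file is UTF-8 (or whatever encoding is passed in), the phrase appears byte for byte as written (so it must not be broken across lines), and the function name contains_phrase_mmap is made up for this example.

```python
import mmap
import os

def contains_phrase_mmap(path, phrase, encoding="utf-8"):
    """Return True if phrase occurs in the file at path (exact byte match)."""
    needle = phrase.encode(encoding)
    if not needle or os.path.getsize(path) == 0:
        return False  # mmap cannot map an empty file
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages it in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle) != -1
```

A call such as contains_phrase_mmap('largeFile', 'myString') would play the same role as the loop above.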

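Edit:

Since the words of the phrase can sit on consecutive lines, you need a counter to keep track of how many words have been matched so far. For example,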

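counter = 0  # how many words of words_list have matched on consecutive lines
words_list = ["hi", "hello", "how"]
with open('largeFile', 'r') as inF:
    for line in inF:
        word = line.strip()
        # require the line to be exactly the expected word, not just contain it
        if word == words_list[counter]:
            counter += 1
        else:
            # start over, but let the current line begin a new match
            counter = 1 if word == words_list[0] else 0
        if counter == len(words_list):
            print("here")
            break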

The text file:

fkerghiohgeoihhgergerig ooetbjoptj enbotobjeob
hi
hello
how
goegjepogjejgpgrg] ekrngeigoieghetghehtigehtgiethg ieoge

This prints here, because the words were found on consecutive lines.
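
A counter-based matcher can lose track when words repeat, for instance the list hi, hi, hello against the lines hi, hi, hi, hello. One way to sidestep that is to compare the most recent lines as a whole. Below is a small sketch using collections.deque as a sliding window; the function name find_consecutive_lines is an assumption made for this example, and each line is compared after stripping surrounding whitespace.

```python
import io
from collections import deque

def find_consecutive_lines(words, lines):
    """Return True if the stripped lines contain words as a consecutive run."""
    words = list(words)
    if not words:
        return False
    recent = deque(maxlen=len(words))  # sliding window over the latest lines
    for line in lines:
        recent.append(line.strip())
        if list(recent) == words:
            return True
    return False

sample = io.StringIO("noise here\nhi\nhi\nhi\nhello\nmore noise\n")
print(find_consecutive_lines(["hi", "hi", "hello"], sample))  # True
```

Because lines can be any iterable, an open file object can be passed in directly and is still read one line at a time.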
