How to read a txt file from a website with Python

Published 2024-09-28 03:18:00

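I'm learning NLTK and need to load a large file that I don't want to save on my desktop. How can I read a file hosted on a website in Python?

I tried the code below, but it doesn't work. I assume "with open" is the reason, but I need to use "with open" because in this case I need the file available as the file object myfile.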

import nltk

with open('http://www.sls.hawaii.edu/bley-vroman/brown.txt', 'r')as myfile:
    data=myfile.read().replace('\n', 'r')

data2 = data.replace("/", "")

for i, line in enumerate(data2.split('\n')):
    if i>10:
        break
    print(str(i) + ':\t' + line)

This is the error:

Traceback (most recent call last):
  File "tut1.py", line 3, in <module>
    with open('http://www.sls.hawaii.edu/bley-vroman/brown.txt', 'r')as myfile:
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.sls.hawaii.edu/bley-vroman/brown.txt'

How can I use the file in my script without downloading the whole file?

I changed the code to use requests:

import nltk
import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt')

data=myfile.read().replace('\n', 'r')

But now, when I run it, I get the following error:

Traceback (most recent call last):
  File "tut1.py", line 6, in <module>
    data=myfile.read().replace('\n', 'r')
AttributeError: 'Response' object has no attribute 'read'

3 answers

Response.iter_lines() lets you consume streamed content line by line:

resp = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt', stream=True)
for i, l in enumerate(resp.iter_lines()):
    if i < 10:
        print(l)  # use l.decode() to get string
    else:
        break
resp.close()  # to not hang connection anymore

Or, more simply:

for _, l in zip(range(10), resp.iter_lines()):
    print(l)  # use l.decode() to get string

Or, best of all:

from itertools import islice

print(*islice(resp.iter_lines(), 10), sep="\n")
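The variants above can be combined into one helper that also decodes the bytes and closes the connection automatically. This is only a sketch: the names decode_head and fetch_head are made up for illustration, UTF-8 is an assumed encoding, and the 30-second timeout is an arbitrary choice.

```python
from itertools import islice


def decode_head(byte_lines, n=10, encoding="utf-8"):
    """Decode at most n items taken from an iterable of bytes lines."""
    # islice stops pulling from the iterable after n items
    return [raw.decode(encoding) for raw in islice(byte_lines, n)]


def fetch_head(url, n=10):
    """Stream url and return its first n lines as strings."""
    # Third-party package; imported lazily so decode_head works without it
    import requests

    # The with-block closes the connection even if an exception is raised
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()  # raise on 404 and similar HTTP errors
        return decode_head(resp.iter_lines(), n)
```

Calling fetch_head('http://www.sls.hawaii.edu/bley-vroman/brown.txt') should return the first ten lines as a list of strings, provided the server is reachable and the file really is UTF-8 compatible.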

You can access the contents of that .txt file without the error like this:

import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt')

data = myfile.text
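Since the question asked for a "with ... as myfile" form, here is a sketch of an alternative that swaps requests for the standard library: urllib.request.urlopen() returns a file-like response that works in a with-statement and can be iterated line by line. The helper name head_of_url and the UTF-8 default are assumptions for illustration.

```python
from itertools import islice
from urllib.request import urlopen


def head_of_url(url, n=10, encoding="utf-8"):
    """Return the first n lines of the resource at url as decoded strings."""
    # urlopen() gives a file-like object, so the with-statement closes it for us
    with urlopen(url) as myfile:
        # Iterating the response yields bytes lines; islice stops after n of them
        return [raw.decode(encoding).rstrip("\r\n") for raw in islice(myfile, n)]
```

For example, head_of_url('http://www.sls.hawaii.edu/bley-vroman/brown.txt') would give the first ten lines without the rest of the body being read by your code.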

If you want to process the first N lines of the file (10 here) without reading the whole response into memory, here is how to do it:

import nltk
import requests

myfile = requests.get('http://www.sls.hawaii.edu/bley-vroman/brown.txt', stream=True).raw

for i in range(0, 10):
    line = myfile.readline()
    data = line.decode().replace('\\n', 'r')
    print(data, end="")

Result:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place. The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.

The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by

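The three problems I fixed are:

  1. requests.get() does not return a file-like object. Add .raw to get one, and pass stream=True to the request so that it behaves correctly.
  2. You were calling read(), which works once #1 is addressed, but it reads the whole file. That is not what you want. I assume you want to read line by line by calling readline().
  3. Incoming bytes must be decoded to text before you can use string methods on them. That is what decode() does.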
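The three points can be tried offline with any binary file-like object. The sketch below uses io.BytesIO as a stand-in for the .raw stream (an assumption made so the example runs without a network); read_first_lines is a hypothetical helper name. It also stops at end of file, which matters when the source has fewer than n lines.

```python
import io


def read_first_lines(stream, n=10, encoding="utf-8"):
    """Read up to n lines from a binary file-like object and decode them."""
    lines = []
    for _ in range(n):
        raw = stream.readline()  # returns bytes, not str
        if not raw:  # b"" signals end of stream
            break
        lines.append(raw.decode(encoding))  # decode before using str methods
    return lines


# Stand-in for requests.get(url, stream=True).raw
fake_raw = io.BytesIO(b"one\ntwo\nthree\n")
first_two = read_first_lines(fake_raw, n=2)
```

One caveat to keep in mind: as far as I know, .raw hands back the bytes as they arrive on the wire, so a gzip-compressed response would need extra handling before decode() gives readable text.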

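Of course, to process 10 lines rather than 1, you need a loop and a way to stop after 10 lines. I added that as well. I also added a print() call so we can all see the result.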

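I assume the replace() in your code is not what you really want. I'm guessing you meant replace('\\n', '\\r'), but since I'm not sure (I don't know what that would get you), I left it for you to sort out. I did change it so that it no longer wipes out the line completely, by adding a second backslash to the search term (not sure why it was doing that).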
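To see what the different replace() arguments actually do, here is a small offline sketch on a made-up two-line string (the sample text is an assumption, not the Brown corpus):

```python
text = "first line\nsecond line\n"

# '\n' is one newline character; replacing it with the letter 'r'
# glues all lines into a single line
glued = text.replace('\n', 'r')

# '\\n' is two characters (a backslash followed by n); the text contains
# no such pair, so nothing is replaced
unchanged = text.replace('\\n', 'r')

# If the goal is a list of lines, splitlines() does that without replace()
lines = text.splitlines()
```

This suggests that the original replace('\n', 'r') is why a later split('\n') finds nothing to split on, although only the asker knows what the replacement was meant to achieve.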
