Python: how do I give a variable the content of each index of a list in a for loop?

Posted 2024-05-17 02:52:40


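Hi everyone,

Apologies in advance for my English. I have a Python web scraper that should write the text of a whole website into a word list, and should also do the same for each subdomain of the site (in my code, every page linked from the main page). I managed to read out all the subdomains and the text of the main page, but I cannot read the text of the subdomains.

I put all the subdomains into a list called domains and then wanted to use a for loop to change url, so that each pass uses a different subdomain. But it does not work that way.

(You can ignore the lower half of the code, it only formats the text!)

I hope you understand my problem :)

My code: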

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://test-domain.com/"
html = urlopen(url).read()
main_html = BeautifulSoup(html, features="html.parser")
subdomains = []
domains = [url]

for link in main_html.find_all("a"):
    subdomains.append(link.get("href"))

domains.extend(subdomains)

for x in domains:

    url = x
    print(url)

    # kill all script and style elements
    for script in main_html(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = main_html.get_text()


    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    sonderzeichen = [",","...","!","?",".","[","]","{","}","|","#","&","*","/",":",";","+","-","_","=","<",">"]

    word_list = text.split()
    for elem in list(word_list):
        for zeichen in sonderzeichen:  # separate name so the outer loop's x is not overwritten
            if elem == zeichen:
                word_list.remove(elem)

    word_list = [
        word[:-1] if word[-1] in sonderzeichen else word
        for word in word_list
    ]

    with open("word_list.txt", "w") as f:
        for elem in list(word_list):
            f.write("%s\n" % elem)

    print(word_list)
    print("\nWordlist successfully generated!")
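
Judging from the code above, one likely cause is that main_html is parsed once from the start page before the loop, and the loop only rebinds url without downloading anything, so every pass formats the same text. Below is a minimal sketch of one way around this, not a definitive fix. The names fetch, collect_links, page_words and scrape and the fetch_page parameter are made up for illustration. urljoin and urldefrag are used on the assumption that some href values are relative or fragment-only. Punctuation is stripped from both ends of each word with str.strip, which is slightly more aggressive than the original last-character check.

```python
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Same characters as the question's sonderzeichen list, as one string for str.strip().
PUNCTUATION = ",.!?[]{}|#&*/:;+-_=<>"


def fetch(url):
    """Download one page and return its raw HTML."""
    with urlopen(url) as response:
        return response.read()


def collect_links(html, base_url):
    """Return base_url plus every linked page as an absolute URL, without duplicates."""
    soup = BeautifulSoup(html, features="html.parser")
    urls = [base_url]
    for link in soup.find_all("a", href=True):  # skip <a> tags that have no href
        # Resolve relative links such as "/about" and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, link["href"]))
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls


def page_words(html):
    """Extract the visible text of one page as a list of cleaned words."""
    soup = BeautifulSoup(html, features="html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    words = (word.strip(PUNCTUATION) for word in soup.get_text().split())
    return [word for word in words if word]  # drop tokens that were only punctuation


def scrape(start_url, fetch_page=fetch):
    """Build one word list from the start page and every page it links to."""
    word_list = []
    for url in collect_links(fetch_page(start_url), start_url):
        try:
            html = fetch_page(url)  # the missing step: download *this* page
        except OSError as error:  # urllib's URLError and HTTPError are OSError subclasses
            print(f"skipping {url}: {error}")
            continue
        word_list.extend(page_words(html))
    return word_list
```

Under these assumptions, scrape("https://test-domain.com/") would return one combined list that can be written to word_list.txt a single time after the loop. The original opens the file in "w" mode on every pass, so each page would overwrite the previous one. Note that the sketch follows every http(s) link it finds, including links to other sites.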
