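Hi everyone,
I apologize in advance for my English. I have a Python web scraper that should write the text of an entire website into a word list, and also do the same for every subdomain of the site. I managed to read out all the subdomains and the text of the main page, but I cannot read out the text of the subdomains.
I collected all the subdomains in a list called domains,
then wanted to use a for loop to change url
so that each pass works on a different subdomain. But it does not work that way.
(You don't have to pay attention to the lower half of the code, it is only used to format the text!)
I hope you understand my problem :)
My code: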
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://test-domain.com/"
html = urlopen(url).read()
main_html = BeautifulSoup(html, features="html.parser")
subdomains = []
domains = [url]
for link in main_html.find_all("a"):
    subdomains.append(link.get("href"))
domains.extend(subdomains)
for x in domains:
    url = x
    print(url)
# kill all script and style elements
for script in main_html(["script", "style"]):
    script.extract()  # rip it out
# get text
text = main_html.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
sonderzeichen = [",","...","!","?",".","[","]","{","}","|","#","&","*","/",":",";","+","-","_","=","<",">"]
word_list = text.split()
for elem in list(word_list):
    for x in sonderzeichen:
        if elem == x:
            word_list.remove(elem)
word_list = [
    word[:-1] if word[-1] in sonderzeichen else word
    for word in word_list
]
with open("word_list.txt", "w") as f:
    for elem in list(word_list):
        f.write("%s\n" % elem)
print(word_list)
print("\nWordlist successfully generated!")
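A possible explanation, offered as a reading of the code above rather than a confirmed diagnosis: the loop assigns a new value to url on every pass, but urlopen() and BeautifulSoup() are only called once, before the loop, so main_html always holds the start page. The sketch below moves the download and parsing into the loop. The names PUNCTUATION, collect_links, extract_words, fetch and build_word_list are hypothetical helpers introduced for illustration. It also assumes that the href values are often relative paths (such as "/about") that need to be resolved with urljoin from the standard library, and it simplifies the cleanup step by stripping punctuation from both ends of every word, which is slightly more aggressive than the original code.

```python
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: roughly the characters listed in `sonderzeichen`.
PUNCTUATION = ",.!?[]{}|#&*/:;+-_=<>"


def collect_links(base_url, html):
    """Return base_url followed by every distinct absolute http(s) link in html."""
    soup = BeautifulSoup(html, features="html.parser")
    links = [base_url]
    for a in soup.find_all("a"):
        href = a.get("href")
        if not href:
            continue  # <a> tags without an href yield None
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def extract_words(html):
    """Drop script/style elements and return the words of one page."""
    soup = BeautifulSoup(html, features="html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text(separator=" ")
    words = (word.strip(PUNCTUATION) for word in text.split())
    return [word for word in words if word]  # discard tokens that were only punctuation


def fetch(url):
    """Download one page; this has to happen once per URL, not once in total."""
    return urlopen(url).read()


def build_word_list(start_url, fetch=fetch):
    """Collect the words of the start page and of every page it links to."""
    start_html = fetch(start_url)
    word_list = []
    for url in collect_links(start_url, start_html):
        try:
            html = start_html if url == start_url else fetch(url)
        except OSError as err:  # URLError and HTTPError are subclasses of OSError
            print(f"skipping {url}: {err}")
            continue
        word_list.extend(extract_words(html))
    return word_list
```

It could be used along these lines: call word_list = build_word_list("https://test-domain.com/") and then write the result to word_list.txt exactly as in the question. Passing fetch as a parameter is only a convenience so the loop can be tried out with canned HTML instead of real requests.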