I'm trying to scrape two websites that both contain a table. However, the first URL has a table with the class .table-translations, while the other one doesn't, so it doesn't get scraped. But if I leave the selector out, nothing gets scraped either.
How can I use BeautifulSoup to scrape the data whether the table has that class or not?
Below is my code:
import requests
from bs4 import BeautifulSoup

urls = [
    'http://www.mongols.eu/mongolian-language/mongolian-tale-six-silver-stars',
    'http://www.mongols.eu/mongolian-language/mongolian-tale-yanzin-jaal',
]

for url in urls:
    print(url)
    # Derive the output file names from the last segment of the URL
    out_fileName = url.rsplit('/', 1)[-1]
    out_mn = out_fileName + "_mn.txt"
    out_en = out_fileName + "_en.txt"

    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    all_data = []
    # Skip the header row; drop the first <td> of each row (a row number)
    for row in soup.select('.table-translations tr')[1:]:
        mongolian, english = map(lambda t: t.get_text(strip=True), row.select('td')[1:])
        all_data.append((mongolian, english))

    for row in all_data:
        with open(out_mn, "a") as text_file:
            text_file.write(row[0] + "\n")
        with open(out_en, "a") as text_file:
            text_file.write(row[1] + "\n")
This script will fetch all the translations from both URLs. If other pages have a different structure, though, the selector would need to be adjusted.
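One way to handle both kinds of pages, assuming the page without the class uses the same three-column layout (row number, Mongolian, English), is to try the .table-translations selector first and fall back to any table on the page when it matches nothing. This is only a sketch; extract_pairs is a helper name introduced here, not part of the original code:

```python
from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return (mongolian, english) pairs from a translation table.

    Prefers rows inside a .table-translations element; if that selector
    matches nothing, falls back to any <table> on the page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.select('.table-translations tr') or soup.select('table tr')
    pairs = []
    for row in rows[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.select('td')]
        if len(cells) >= 3:  # first cell is assumed to be a row number
            pairs.append((cells[1], cells[2]))
    return pairs
```

In the original loop you would then replace the soup.select(...) block with a call like extract_pairs(requests.get(url).content), keeping the file-writing code unchanged.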