I'm new to web scraping, and I'm using Python and bash scripts to get the information I need. I run everything under WSL (Windows Subsystem for Linux), and for some reason the scripts are run with Git Bash.
I'm trying to create a bash script that downloads a web page's HTML and passes it to a Python script, which returns two txt files containing links to other web pages. The bash script then loops over the links in one of the txt files and downloads each page's HTML into a file named after a specific part of the link. But this last loop doesn't work.
If I write the curl command with a link by hand, it works. But when I try to run it from the script, it fails.
Here is the bash script:
#!/bin/bash
curl http://mythicspoiler.com/sets.html |
cat >> mainpage.txt
python creatingAListOfAllExpansions.py # returns two txt files containing the expansion links and the commander decks' links
rm mainpage.txt
# get the pages from the links
cat commanderDeckLinks.txt |
while read a; do
    curl $a | ## THIS DOESN'T WORK
    cat >> $(echo $a | cut --delimiter="/" -f4).txt
done
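For context, the file name in the loop comes from the fourth "/"-separated field of each URL. That part of the pipeline can be checked in isolation with one of the links from the file:

```shell
#!/bin/bash
# Splitting on "/", the fields of the URL are:
#   1: "http:"  2: ""  3: "mythicspoiler.com"  4: "cmd"  5: "index.html"
# so field 4 is the set code used as the output file name.
url="http://mythicspoiler.com/cmd/index.html"
echo "$url" | cut --delimiter="/" -f4   # prints: cmd
```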
I've tried several different approaches and run into similar problems each time, but I haven't been able to solve this on my own. It always shows the same error:
curl: (3) URL using bad/illegal format or missing URL
These are the contents of commanderDeckLinks.txt:
http://mythicspoiler.com/cmd/index.html
http://mythicspoiler.com/c13/index.html
http://mythicspoiler.com/c14/index.html
http://mythicspoiler.com/c15/index.html
http://mythicspoiler.com/c16/index.html
http://mythicspoiler.com/c17/index.html
http://mythicspoiler.com/c18/index.html
http://mythicspoiler.com/c19/index.html
http://mythicspoiler.com/c20/index.html
And here is the Python script:
#reads the main page of the website
with open("mainpage.txt") as datafile:
    data = datafile.read()

#gets the content after the first appearance of the introduced string
def getContent(data, x):
    j=0
    content=[]
    for i in range(len(data)):
        if(data[i].strip().startswith(x) and j == 0):
            j=i
        if(i>j and j != 0):
            content.append(data[i])
    return content

#gets the content of the website that is inside the body tag
mainNav = getContent(data.splitlines(), "<!--MAIN NAVIGATION-->")
#gets the content of the website that is inside of the outside center tags
content = getContent(mainNav, "<!--CONTENT-->")

#removes extra content from list
def restrictNoise(data, string):
    content=[]
    for i in data:
        if(i.startswith(string)):
            break
        content.append(i)
    return content

#return only lines which are links
def onlyLinks(data):
    content=[]
    for i in data:
        if(i.startswith("<a")):
            content.append(i)
    return content

#creates a list of the ending of the links to later fetch
def links(data):
    link=[]
    for i in data:
        link.append(i.split('"')[1])
    return link

#adds the rest of the link
def completLinks(data):
    completeLinks=[]
    for i in data:
        completeLinks.append("http://mythicspoiler.com/"+i)
    return completeLinks

#getting the commander decks
commanderDecksAndNoise = getContent(content,"<!---->")
commanderDeck = restrictNoise(commanderDecksAndNoise, "<!---->")
commanderDeckLinks = onlyLinks(commanderDeck)
commanderDecksCleanedLinks = links(commanderDeckLinks)

#creates a txt file and writes in it
def writeInTxt(nameOfFile, restrictions, usedList):
    file = open(nameOfFile, restrictions)
    for i in usedList:
        file.write(i+"\n")
    file.close()

#creating the commander deck text file
writeInTxt("commanderDeckLinks.txt", "w+", completLinks(commanderDecksCleanedLinks))

#getting the expansions
expansionsWithNoise = getContent(commanderDecksAndNoise, "<!---->")
expansionsWithoutNoise = restrictNoise(expansionsWithNoise, "</table>")
expansionsLinksWNoise = onlyLinks(expansionsWithoutNoise)
expansionsCleanedLinks = links(expansionsLinksWNoise)
#creating the expansions text file
writeInTxt("expansionLinks.txt", "w+", completLinks(expansionsCleanedLinks))
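For context, the whole script hinges on getContent, which skips lines until the first one that starts with the given marker and then collects everything after it. A small self-contained check of that behavior with toy input (the sample lines below are made up for illustration, not taken from mythicspoiler.com):

```python
# Same getContent as in the script above: j stays 0 until the marker line is
# found, then every later line is appended to the result.
def getContent(data, x):
    j = 0
    content = []
    for i in range(len(data)):
        if data[i].strip().startswith(x) and j == 0:
            j = i
        if i > j and j != 0:
            content.append(data[i])
    return content

lines = ["header", "<!--CONTENT-->", '<a href="cmd/index.html">CMD</a>', "footer"]
print(getContent(lines, "<!--CONTENT-->"))
# → ['<a href="cmd/index.html">CMD</a>', 'footer']
```

Note that if the marker happens to be the very first line, j stays 0 and nothing is collected, which works here only because the markers never appear at index 0.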
If more information is needed to solve my problem, please let me know. Thanks to everyone trying to help.
The problem here is that bash (Linux) and Windows use different line endings, LF and CRLF respectively (I'm not completely sure, since all of this is new to me). So when I created a file in Python with one item per line, the bash script couldn't read it properly: the file was written with CRLF endings, while the bash script only expects LF, which made the URLs useless because each one ended in a CR that shouldn't be there.

I didn't know how to fix this from the bash side, so what I did instead was have Python create a file in which the items are separated by underscores "_", with a "\n" added only after the last item, so I never had to deal with line endings at all. Then in bash I ran a for loop over the underscore-separated items, skipping the last one. That solved the problem.
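For reference, the stray CR can also be stripped in bash itself, without changing the file format. A minimal sketch, assuming commanderDeckLinks.txt has CRLF line endings (the strip_cr helper is hypothetical, not part of the original script):

```shell
#!/bin/bash
# strip_cr removes a trailing carriage return from its argument, so a URL
# read from a CRLF-terminated file no longer ends with an invisible CR.
strip_cr() {
    printf '%s' "${1%$'\r'}"
}

# Used in the original loop it would look like this (downloads not run here):
# while IFS= read -r a; do
#     a=$(strip_cr "$a")
#     curl "$a" > "$(echo "$a" | cut --delimiter="/" -f4).txt"
# done < commanderDeckLinks.txt
```

Another common option is to normalize the file once, e.g. with `tr -d '\r' < commanderDeckLinks.txt > fixed.txt` or the `dos2unix` tool, before the loop runs.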