How do I iterate URLs into curl commands?

Posted 2024-10-03 21:28:28


I'm new to web scraping, and I'm using Python and bash scripts to get the information I need. I'm running under WSL (Windows Subsystem for Linux), and for some reason the scripts run using Git Bash.
I'm trying to create a bash script that downloads a web page's HTML and passes it to a Python script, which returns two txt files containing links to other web pages. The original bash script then iterates over the links in one of the txt files and downloads each linked page's HTML into a file named after a specific part of its link. But this last loop doesn't work.
If I write the curl commands with the links by hand, everything works. But when I run them from the script, they fail.
This is the bash script:

#!/bin/bash

curl http://mythicspoiler.com/sets.html |
cat >>mainpage.txt
python creatingAListOfAllExpansions.py #returns two txt files containing the expansion links and the commander decks' links
rm mainpage.txt

#get the pages from the links
cat commanderDeckLinks.txt |
while read a ; do
    curl $a |          ##THIS DOESN'T WORK
    cat >>$(echo $a | cut --delimiter="/" -f4).txt
done

I've tried several different approaches and run into similar problems each time, but I can't solve this one on my own. It always shows the same error:

curl: (3) URL using bad/illegal format or missing URL
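
A quick way to check whether the URLs in the file carry invisible characters (a debugging sketch, not part of the original scripts) is to dump the file and the loop variable with control characters made visible:

# Debugging sketch: with CRLF line endings, GNU `cat -A` marks each
# line end as "^M$" instead of just "$".
cat -A commanderDeckLinks.txt

# Inspect exactly what the loop variable receives, byte by byte:
while read a ; do
    printf '%s\n' "$a" | od -c
done <commanderDeckLinks.txt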

This is the content of commanderDeckLinks.txt:

http://mythicspoiler.com/cmd/index.html
http://mythicspoiler.com/c13/index.html
http://mythicspoiler.com/c14/index.html
http://mythicspoiler.com/c15/index.html
http://mythicspoiler.com/c16/index.html
http://mythicspoiler.com/c17/index.html
http://mythicspoiler.com/c18/index.html
http://mythicspoiler.com/c19/index.html
http://mythicspoiler.com/c20/index.html

This is the Python script:

#reads the main page of the website
with open("mainpage.txt") as datafile:
    data = datafile.read()

#gets the content after the first appearance of the introduced string
def getContent(data, x):
    j=0
    content=[]
    for i in range(len(data)):
        if(data[i].strip().startswith(x) and j == 0):
            j=i
        if(i>j and j != 0):
            content.append(data[i])
    return content

#gets the content of the website that is inside the body tag
mainNav = getContent(data.splitlines(), "<!--MAIN NAVIGATION-->")

#gets the content of the website that is inside of the outside center tags
content = getContent(mainNav, "<!--CONTENT-->")

#removes extra content from list
def restrictNoise(data, string):
    content=[]
    for i in data:
        if(i.startswith(string)):
            break
        content.append(i)
    return content

#return only lines which are links
def onlyLinks(data):
    content=[]
    for i in data:
        if(i.startswith("<a")):
            content.append(i)
    return content


#creates a list of the ending of the links to later fetch
def links(data):
    link=[]
    for i in data:
        link.append(i.split('"')[1])
    return link

#adds the rest of the link
def completLinks(data):
    completeLinks=[]
    for i in data:
        completeLinks.append("http://mythicspoiler.com/"+i)
    return completeLinks

#getting the commander decks
commanderDecksAndNoise = getContent(content,"<!---->")
commanderDeck = restrictNoise(commanderDecksAndNoise, "<!---->")
commanderDeckLinks = onlyLinks(commanderDeck)
commanderDecksCleanedLinks = links(commanderDeckLinks)

#creates a txt file and writes in it
def writeInTxt(nameOfFile, restrictions, usedList):
    file = open(nameOfFile,restrictions)
    for i in usedList:
        file.write(i+"\n")
    file.close()

#creating the commander deck text file
writeInTxt("commanderDeckLinks.txt", "w+", completLinks(commanderDecksCleanedLinks))

#getting the expansions
expansionsWithNoise = getContent(commanderDecksAndNoise, "<!---->")
expansionsWithoutNoise = restrictNoise(expansionsWithNoise, "</table>")
expansionsLinksWNoise = onlyLinks(expansionsWithoutNoise)
expansionsCleanedLinks = links(expansionsLinksWNoise)

#creating the expansions text file
writeInTxt("expansionLinks.txt", "w+", completLinks(expansionsCleanedLinks))

Let me know if any more information is needed to solve my problem. Thanks to anyone trying to help.


1 Answer

Posted 2024-10-03 21:28:28

The problem here is that bash (Linux) and Windows use different line endings, LF and CRLF respectively (I'm not completely sure about the details, since all of this is new to me). So when I created a file in Python with the items separated by newlines, the bash script couldn't read it properly: the file was written with CRLF endings while the bash script only expects LF, which made the URLs unusable because each one ended in a CR that shouldn't be there.

I didn't know how to solve this with bash code, but what I did was create the file (with Python) with every item separated by an underscore "_" and a single "\n" appended after the last item, so I wouldn't have to deal with line endings at all. Then I ran a for loop in bash over the items, every one separated from the next by an underscore except the last. That solved the problem.
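
For reference, here is a minimal sketch of the bash side of that workaround, together with the more direct alternative of stripping the CRs before the loop (the tr version is a standard fix, not what the answer used; the file name and the cut field are taken from the original script, and the underscore approach assumes the links themselves contain no underscores):

#!/bin/bash

# Workaround described in the answer: the Python side writes all links
# on one line, separated by "_", with a single "\n" at the very end.
# Bash splits that line on underscores, so no URL ever carries a CR.
IFS='_' read -r -a links <commanderDeckLinks.txt
for a in "${links[@]}"; do
    curl "$a" >"$(echo "$a" | cut --delimiter="/" -f4).txt"
done

# Standard alternative, keeping one link per line: delete the carriage
# returns before the while loop ever sees them.
tr -d '\r' <commanderDeckLinks.txt |
while read a ; do
    curl "$a" >"$(echo "$a" | cut --delimiter="/" -f4).txt"
done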
