How to find img tag URLs in a CSV column containing many links, and compare each URL with the same link in another CSV file

Posted 2024-10-04 09:24:27


import csv
# Read input Topic or Reply file
csvfile = open('rad.csv', newline='')
reader = csv.reader(csvfile)
csvfile1 = open('new.csv', newline='')
reader1 = csv.reader(csvfile1)
# Extract image sources
for row in reader:
    content = row[8]
    imageExists = "<img" in content and "src=\"" in content
    #print(imageExists)
    imageNum = 1
    while (imageExists):
        startPos = content.find("src=\"") + 5
        endPos = content.find("\"", startPos)
        imageSrc = content[startPos:endPos]
        print(imageSrc)
        content = content[endPos + 1:]
        imageExists = "<img" in content and "src=\"" in content
        #print(imageExists)
        for row1 in reader1:
            #print("In For")
            content1 = row1[1]
            content2 = row1[7]
            print(content1)
            #print(imageSrc)
            if content1 == imageSrc:
                row = imageSrc.replace(imageSrc,row1[7])
                print("Done Match Found")
                print(content2)
                break
            else:
                print("No Match")
        #imageExists = "<img" in content and "src=\"" in content
        #print(imageExists)
        imageNum += 1

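How can I find the img tag URLs in a CSV column that contains many links, compare each URL with the same link in another CSV file, and then replace the link with the matching id?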


1 answer
User
#1 · Posted 2024-10-04 09:24:27

I would suggest using BeautifulSoup rather than trying to parse the HTML as a string. In the example below, I assume that the HTML entries contain no commas.

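from bs4 import BeautifulSoup

# Read the file and split it into lines
with open('example.txt', 'r') as file_handle:
    example_file_content = file_handle.read().split("\n")

list_of_image_sources = []
for line in example_file_content:
    # Naive split: only valid while no HTML entry contains a comma
    line_as_list = line.split(",")
    for entry in line_as_list:
        soup = BeautifulSoup(entry.strip(), 'html.parser')
        # find_all is the current name for the older findAll alias
        images = soup.find_all('img')
        for image in images:
            # Skip <img> tags that have no src attribute
            if image.has_attr('src'):
                list_of_image_sources.append(image['src'])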
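
If the HTML entries might contain commas after all, splitting each line on "," would cut them apart. A possible alternative is to let the csv module parse the quoted fields. The sketch below is only an illustration under a few assumptions: the HTML cell is properly quoted in the CSV, the sample data is made up, and the names ImgSrcCollector and image_sources_in_column are hypothetical helpers, not part of any library. It also swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages; the same loop should work with BeautifulSoup in place of the collector class.

```python
import csv
import io
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources_in_column(csv_file, column):
    """Return all <img> src values found in one column of a CSV file."""
    sources = []
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue  # skip short or empty rows
        collector = ImgSrcCollector()
        collector.feed(row[column])
        collector.close()
        sources.extend(collector.sources)
    return sources


# Made-up sample: the HTML cell is quoted, so its commas are not separators
sample = io.StringIO(
    'id,content\n'
    '1,"<p>Hi, there <img src=""http://example.com/a.png""></p>"\n'
    '2,"<p>no image, just text</p>"\n'
)
print(image_sources_in_column(sample, column=1))
```

For the files in the question, the same function could be called with open('rad.csv', newline='') and column=8, assuming the HTML really is in that column.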

Once you have a list of image sources for each file, you can compare the lists from the two CSV files.
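
One way the comparison step could look is a dictionary lookup rather than a nested loop over the second file. This is a minimal sketch, not the answer's own code: it assumes the second CSV holds the URL in column 1 and the replacement id in column 7 (the indices used as row1[1] and row1[7] in the question), the sample rows are invented, and load_url_to_id and match_sources are hypothetical names.

```python
import csv
import io


def load_url_to_id(csv_file, url_col=1, id_col=7):
    """Read the lookup CSV once and map each URL to its replacement id."""
    mapping = {}
    for row in csv.reader(csv_file):
        if len(row) > max(url_col, id_col):
            mapping[row[url_col]] = row[id_col]
    return mapping


def match_sources(image_sources, url_to_id):
    """Split the sources into matched {src: id} and a list of unmatched srcs."""
    matched = {}
    unmatched = []
    for src in image_sources:
        if src in url_to_id:
            matched[src] = url_to_id[src]
        else:
            unmatched.append(src)
    return matched, unmatched


# Made-up stand-in for the second CSV file (URL in column 1, id in column 7)
lookup_csv = io.StringIO(
    'a,http://example.com/1.png,c,d,e,f,g,101\n'
    'a,http://example.com/2.png,c,d,e,f,g,102\n'
)
url_to_id = load_url_to_id(lookup_csv)

# Made-up stand-in for the image sources extracted from the first file
list_of_image_sources = ["http://example.com/1.png", "http://example.com/3.png"]

matched, unmatched = match_sources(list_of_image_sources, url_to_id)
print(matched)
print(unmatched)
```

Loading the second file into a dictionary first also avoids a pitfall in the question's code: a csv.reader can only be iterated once, so an inner loop over reader1 finds nothing after the first pass. Each matched source can then be replaced in the original cell, for example with content.replace(src, matched[src]).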
