How do I select specific words and put them into a list of tuples?


Using BeautifulSoup, I end up with a long string as a result. It looks like this:

<a href="link1"><span>title1</span></a>
<a href="link2"><span>title2</span></a>
<a href="link3"><span>title3</span></a>
<a href="link4"><span>title4</span></a>

I specifically want to pick out the "link#" and "title#" values and put them into a list of tuples, like this:

[(link1,title1),(link2,title2),(link3,title3),(link4,title4)]

Because of my limited knowledge of Python, I don't even know what to search for. I've been trying for 6 hours and still can't find a way.

The BeautifulSoup code I'm using:

def extract(self):
    self.url = "http://aetoys.tumblr.com"
    self.source = requests.get(self.url)
    self.text = self.source.text
    self.soup = BeautifulSoup(self.text)

    # prints every <a> tag inside each <li class="has-sub"> menu entry
    for self.div in self.soup.findAll('li', {'class': 'has-sub'}):
        for self.li in self.div.find_all('a'):
            print(self.li)

1 Answer

You can simply pull out the href values:

out = []  # a list of lists, one per <li class="has-sub"> block
for self.div in self.soup.findAll('li', {'class': 'has-sub'}):
    hrefs = [x["href"] for x in self.div.find_all('a', href=True)]
    out.append(hrefs)
    print(hrefs)



['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando']
['#', '/onepiece_book', '/onepiece']
['#', '/naruto_book', '/naruto']
['#', '/bleach_book', '/bleach']
['#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal']
['#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']

To get a single flat list:

url ="http://aetoys.tumblr.com"
source = requests.get(url)
text = source.text
soup = BeautifulSoup(text)

print [ x["href"]  for div in soup.findAll('li',{'class':'has-sub'}) for x in div.find_all('a',href=True)]


['#', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '/kingdom', '/tera', '/torico', '/titan', '/seven', '/fairytail', '/soma', '/amsal', '/berserk', '/ghoul', '/kaizi', '/piando', '#', '/onepiece_book', '/onepiece', '#', '/naruto_book', '/naruto', '#', '/bleach_book', '/bleach', '#', '/conan', '/silver', '/hai', '/nise', '/hunterbyhunter', '/baku', '/unhon', '/souleater', '/liargame', '/kenichi', '/dglayman', '/magi', '/suicide', '/pedal', '#', '/dobaku', '/gisei', '/dragonball', '/hagaren', '/gantz', '/doctor', '/dunk', '/susi', '/reborn', '/airgear', '/island', '/crows', '/beelzebub', '/zzang', '/akira', '/tennis', '/kuroco', '/claymore', '/deathnote']
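
Note that the output above still contains the '#' placeholder entries from the menu. If you only want real paths, one possible tweak (a sketch, not part of the original answer) is to filter them out:

# keep only hrefs that point somewhere other than the '#' placeholder
links = [x["href"]
         for div in soup.findAll('li', {'class': 'has-sub'})
         for x in div.find_all('a', href=True)
         if x["href"] != '#']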

If you really want tuples:

out = []
for div in soup.findAll('li', {'class': 'has-sub'}):
    out.append(tuple(x["href"] for x in div.find_all('a', href=True)))
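
Since the question actually asks for (link, title) pairs rather than hrefs alone, here is a minimal sketch along the same lines that pairs each anchor's href with its <span> text. It assumes, as in the sample HTML, that every <a> of interest wraps a single <span>:

import requests
from bs4 import BeautifulSoup

url = "http://aetoys.tumblr.com"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# build (href, title) tuples; anchors without a <span> child are skipped
pairs = [(a["href"], a.span.get_text(strip=True))
         for li in soup.findAll('li', {'class': 'has-sub'})
         for a in li.find_all('a', href=True)
         if a.span is not None]

print(pairs)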
