Multiple extract() calls do not work with BeautifulSoup

<figure class="floatRight" style="margin-left: 30px"> <a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a> <figcaption></figcaption> </figure> <p> <a name="N65743"></a> </p><h3>Abstract</h3> <p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p> <div class="articleKeywords"> <a name="N65760"></a> <h3>Key words</h3> 5-iodoindolizines - Sonogashira reaction - 5-ethynylindolizine - X-ray </div> <a name="N67312"></a> <h3>Supporting Information</h3> <ul class="linkList">Supporting information for this article is available online at http://dx.doi.org/10.1055/s-0034-1378861.<li> <a class="gotolink" href="https://www.thieme-connect.de/media/synthesis/EFirst/supmat/sup_ss-2015-c0259-st_10-1055_s-0034-1378861.pdf">Supporting Information</a> </li> </ul>

from bs4 import BeautifulSoup

with open("test.xml", 'r') as file:
    soup = BeautifulSoup(file.read(), "lxml")

abstract = soup

[tag.extract() for tag in abstract("a", attrs={"name": True})]
[tag.extract() for tag in abstract("h3")]
[tag.extract() for tag in abstract("ul", attrs={"class": "linkList"})]
[tag.extract() for tag in abstract("a", attrs={"class": "gotolink"})]

print(abstract)

2 Answers

User

Answer 1 · edited 2024-09-27 23:26:49

The trick is to create a new BeautifulSoup object after each extraction and run the next extraction on that new object.

It may look a bit ugly, but it works:

clean.py

from bs4 import BeautifulSoup

with open("test.xml", "r") as file:
    soup = BeautifulSoup(file.read(), "lxml")

abstract = soup

# Remove one group of tags, then re-parse the serialized tree so the
# next search runs on a freshly built BeautifulSoup object.
for tag in abstract("a", attrs={"name": True}):
    tag.extract()
abstract = BeautifulSoup(str(abstract), "lxml")

for tag in abstract("h3"):
    tag.extract()
abstract = BeautifulSoup(str(abstract), "lxml")

for tag in abstract("ul", attrs={"class": "linkList"}):
    tag.extract()
abstract = BeautifulSoup(str(abstract), "lxml")

for tag in abstract("a", attrs={"class": "gotolink"}):
    tag.extract()

print(abstract)

Output

Before cleaning

(bs4extract)macbook:bs4extract joeyoung$ cat test.xml 
<figure class="floatRight" style="margin-left: 30px">
<a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a>
<figcaption></figcaption>
</figure>
<p>
<a name="N65743"></a>
</p><h3>Abstract</h3>
<p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p>
<div class="articleKeywords">
<a name="N65760"></a>
<h3>Key words</h3>
5-iodoindolizines - 
        Sonogashira reaction - 
        5-ethynylindolizine - 
        X-ray
      </div>
<a name="N67312"></a>
<h3>Supporting Information</h3>
<ul class="linkList">Supporting information for this article is available online at http://dx.doi.org/10.1055/s-0034-1378861.<li>
<a class="gotolink" href="https://www.thieme-connect.de/media/synthesis/EFirst/supmat/sup_ss-2015-c0259-st_10-1055_s-0034-1378861.pdf">Supporting Information</a>
</li>
</ul>

After cleaning

(bs4extract)macbook:bs4extract joeyoung$ python clean.py 
<html><body><figure class="floatRight" style="margin-left: 30px">
<a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a>
<figcaption></figcaption>
</figure>
<p>
</p>
<p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p>
<div class="articleKeywords">

5-iodoindolizines - 
        Sonogashira reaction - 
        5-ethynylindolizine - 
        X-ray
      </div>
</body></html>
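
The re-parse-after-each-extraction pattern used in clean.py can also be written as a loop. The following is only a sketch under stated assumptions: `SAMPLE`, `FILTERS`, and `strip_tags` are hypothetical names that do not appear in the original script, the sample markup is a shortened stand-in for test.xml, and it uses Python's built-in `html.parser` rather than lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the contents of test.xml.
SAMPLE = (
    '<p><a name="N65743"></a></p><h3>Abstract</h3>'
    "<p>Abstract text.</p>"
    '<a name="N67312"></a><h3>Supporting Information</h3>'
    '<ul class="linkList">Info<li>'
    '<a class="gotolink" href="#">Supporting Information</a>'
    "</li></ul>"
)

# (tag name, attribute filter) pairs, in the same order as the question's script.
FILTERS = [
    ("a", {"name": True}),
    ("h3", {}),
    ("ul", {"class": "linkList"}),
    ("a", {"class": "gotolink"}),
]


def strip_tags(markup, filters=FILTERS, parser="html.parser"):
    """Extract every tag matching each filter, re-parsing between filters."""
    soup = BeautifulSoup(markup, parser)
    for name, attrs in filters:
        # find_all returns a list, so extracting while iterating is safe.
        for tag in soup.find_all(name, attrs=attrs):
            tag.extract()
        # Rebuild the tree from its serialized form before the next search.
        soup = BeautifulSoup(str(soup), parser)
    return soup


cleaned = strip_tags(SAMPLE)
print(cleaned)
```

Passing `parser="lxml"` instead should behave like clean.py, including the `<html><body>` wrapper visible in the output above.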

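User
Answer 2 · edited 2024-09-27 23:26:49

OK, sorry guys, the bug is actually in BeautifulSoup itself. After downgrading to 4.3.2-3, the exact same code works perfectly. I will report it. Sorry for not checking before posting.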
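
Since the behaviour depends on the installed release, a short check like the one below may help when comparing versions or filing a report. It is a sketch, not the asker's code: `MARKUP` and `leftover_tags` are hypothetical names, the markup is a made-up minimal case, and `html.parser` is used only to avoid needing lxml, so it may not trigger the same bug as the original script.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical minimal markup with the two kinds of tags removed first in the question.
MARKUP = '<p><a name="x"></a></p><h3>Abstract</h3><p>Text</p><h3>More</h3>'


def leftover_tags(markup, parser="html.parser"):
    """Run two consecutive extractions on one soup and list any survivors."""
    soup = BeautifulSoup(markup, parser)
    for tag in soup.find_all("a", attrs={"name": True}):
        tag.extract()
    for tag in soup.find_all("h3"):
        tag.extract()
    # Names of targeted tags that are still in the tree after both passes.
    return [t.name for t in soup.find_all(["a", "h3"])]


print("bs4 version:", bs4.__version__)
print("leftover tags:", leftover_tags(MARKUP))
```

An empty list would indicate that consecutive extractions work on the installed release without re-parsing.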
