Using BeautifulSoup to pick out text separated by ":"

<TR>
  <TD width="40%">Company No. (CO.) : 056</TD>
  <TD width="40%">Country Code. (CC.) : 3532 </TD>
</TR>
<TR>
  <TD>Register (Reg.) : FD522</TD>
  <TD>Credit(CD.) : YES</TD>
</TR>
<TR>
  <TD>Type (TP.) : PRIVATE</TD>
</TR>


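With the help of the answers below, there are two approaches. I am leaving them here for reference:

This one gets the content in bold, but in some sentences the last letter is lost: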

for bb in aa:
    cc = bb.get_text()
    dd = cc[cc.find("") + 1 : cc.find("")]
    print(dd)

With this one, ee and ff hold the "title" and the content, i.e. the text before and after the ":".

for bb in aa:
    cc = bb.get_text()
    dd = cc.split(' :')
    ee = dd[0]   # title
    ff = dd[-1]  # content

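3 Answers

User

#1 · edited 2024-10-01 05:05:05

You don't have to force yourself to use BeautifulSoup functions to separate them, because each piece of data has its own delimiter token to split on, i.e.: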

<TD width="40%">Company No. <I>(CO.)</I> : <B>056</B></TD>

Company No, delimited by "."
(CO.), delimited by ":"
056, inside the <B> tag

I suggest using substring methods to get the data out of each td.
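A minimal sketch of the substring idea, assuming every cell's text has the shape "Label (ABBR.) : value" as in the question's table. The helper name split_cell is hypothetical; it relies only on str.find and slicing, applied to the string that td.get_text() would return.

```python
def split_cell(text):
    """Split 'Label (ABBR.) : value' into (label, abbr, value) by substring search."""
    colon = text.find(":")                     # the value follows the first ':'
    open_paren = text.find("(")                # the abbreviation starts after '('
    close_paren = text.find(")", open_paren)   # ... and ends before the next ')'
    if colon == -1 or open_paren == -1 or close_paren == -1:
        raise ValueError("unexpected cell format: %r" % text)
    label = text[:open_paren].strip()
    abbr = text[open_paren + 1:close_paren].strip()
    value = text[colon + 1:].strip()
    return label, abbr, value


print(split_cell("Company No. (CO.) : 056"))  # ('Company No.', 'CO.', '056')
```

Raising on an unexpected shape is one possible choice here; returning None instead would also work if some cells are known to be free text.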

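User

#2 · edited 2024-10-01 05:05:05

This is just simple string manipulation, not really a BS4 problem. You could do something like the following. Note that the approach below may not be the best one; I wrote it this way to be verbose.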

from bs4 import BeautifulSoup as bsoup

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning.
with open("test.html") as ofile:
    soup = bsoup(ofile, "html.parser")

tds = soup.find_all("td")
templist = [td.get_text() for td in tds]

newlist = []
for temp in templist:
    whole = temp.split(":")  # Separate by ":" first.
    half = whole[0].split("(")  # Split the first half on the open paren.
    first = half[0].strip()  # First of three elements.
    second = half[1].replace(")", "").strip()  # Second of three elements.
    third = whole[1].strip()  # Text after the ":" is the third element.
    newlist.append([first, second, third])

for lst in newlist:
    print(lst)  # Just print it out.


Let us know if this helps.
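The same three-way split can also be sketched with a single regular expression. This is a variation rather than the answer's own method: the loop above indexes half[1] and whole[1], so it raises IndexError on a cell with no "(" or no ":", while the pattern below treats the parenthesised part as optional. The names CELL_RE and parse_cell, and the "Status : OPEN" cell, are hypothetical.

```python
import re

# label, optional "(abbr)", then ":" and the value; surrounding whitespace is dropped.
CELL_RE = re.compile(
    r"^\s*(?P<label>[^(:]+?)\s*(?:\((?P<abbr>[^)]*)\))?\s*:\s*(?P<value>.*?)\s*$"
)


def parse_cell(text):
    """Return [label, abbr, value]; abbr is None when the cell has no parentheses."""
    match = CELL_RE.match(text)
    if match is None:
        return None  # the cell contains no ':' at all
    return [match.group("label"), match.group("abbr"), match.group("value")]


# "Status : OPEN" is a made-up cell without parentheses, for illustration only.
for cell in ["Company No. (CO.) : 056", "Credit(CD.) : YES", "Status : OPEN"]:
    print(parse_cell(cell))
```

In practice each string passed to parse_cell would come from td.get_text(), as in the list comprehension above.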

User

#3 · edited 2024-10-01 05:05:05

Use findAll to get the right part of the full HTML document, then use:

text = soup.get_text()
print(text)

Then split it into a list with ".split()".
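A rough sketch of that last step, assuming the extracted text ends up with one cell per line. The sample string below is an assumption standing in for the real get_text() output, whose exact whitespace depends on the markup. It uses str.partition rather than a bare split so that a value containing its own ":" stays intact.

```python
# Assumed sample: stand-in for the text extracted from the table, one cell per line.
text = """Company No. (CO.) : 056
Country Code. (CC.) : 3532
Register (Reg.) : FD522
Credit(CD.) : YES
Type (TP.) : PRIVATE"""

pairs = {}
for line in text.splitlines():
    title, sep, content = line.partition(":")  # split on the first ':' only
    if sep:  # skip lines that contain no ':'
        pairs[title.strip()] = content.strip()

print(pairs["Register (Reg.)"])  # FD522
```

A dict fits when the titles are unique; a list of (title, content) tuples would preserve duplicates.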
