Extracting a string from HTML markup in Python

*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']

# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint

# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"

# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")

# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []

# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))

print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"

# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row

2 Answers

User
Answer #1 · edited 2024-10-06 11:35:51

This can also be done by stripping the unwanted characters from the string, as follows:
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a
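The two-argument form `translate(None, ...)` used above only exists in Python 2. A minimal sketch of the same idea for Python 3, assuming the sub-list is first turned into text with `str()` (the sample value is copied from the output in the question):

```python
# Python 3 sketch: str.translate takes a single table argument, so the
# characters to delete are passed as the third argument of str.maketrans.
row = ['311062900']  # one entry of listOfCallSigns, as printed in the question

delete_table = str.maketrans('', '', "[]'")
cleaned = str(row).translate(delete_table)

print(cleaned)  # 311062900
```

This only cleans up the printed text; the underlying value is still a one-element list.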

User
Answer #2 · edited 2024-10-06 11:35:51

Change the last line to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0]  # < added a [0] here
Alternatively, you can add the [0] here instead:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0])  # < added a [0] here
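The explanation is that re.findall(...) returns a list (in your case, one containing a single element). As a result, listOfCallSigns ends up as a "list of sub-lists, each containing one string":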
>>> listOfCallSigns
[['311062900'], ['235056239'], ['311063300'], ['236111791'], ['245639000'], ['305500000'], ['235077805'], ['235011590']]
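To see this behaviour in isolation, here is a small Python 3 sketch. The `cell` string is a hypothetical stand-in for `str(row.find_all('td')[4])`, since the real page markup is not shown here. It also shows `re.search`, which returns a single match object rather than a list and may suit the case where only the first hit is wanted:

```python
import re

# Hypothetical cell markup, standing in for str(row.find_all('td')[4])
cell = '<td>311062900 (MMSI)</td>'

# re.findall always returns a list, even when there is only one hit
matches = re.findall(r"\d{9}", cell)
print(matches)    # ['311062900']

# re.search returns one match object (or None if nothing matched)
match = re.search(r"\d{9}", cell)
call_sign = match.group(0) if match else None
print(call_sign)  # 311062900
```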
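When you enumerate your listOfCallSigns, the row variable is essentially the re.findall(...) result that you appended earlier in the code (which is why you can add [0] after either of them).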
So row and re.findall(...) are both of type "list of strings", like this:
>>> row
['311062900']
To get the string inside the list, you need to access its first element, i.e.:
>>> row[0]
'311062900'
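One caveat: `row[0]` raises `IndexError` if a sub-list is empty, which would happen for any table row where the regular expression finds nothing. A hedged Python 3 sketch of flattening an existing nested list while skipping such rows; the values come from the output in the question, and the empty sub-list is a made-up case added for illustration:

```python
# Values copied from the output shown in the question; the empty
# sub-list is an invented example of a row where the regex found nothing.
listOfCallSigns = [['311062900'], ['235056239'], [], ['305500000']]

# Keep the first string of every non-empty sub-list
flat = [row[0] for row in listOfCallSigns if row]

for i, call_sign in enumerate(flat):
    print(i, call_sign)
```

Skipping empty rows silently is a design choice; if a missing call sign should be visible in the output, a placeholder value could be appended instead.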
Hope this helps!
