How can I scrape data elegantly with Python?

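import urllib.request as ur

url = "http://example.com/"  # placeholder; the original post does not define url

def getPageData(url):
    # Decode each line (assuming UTF-8) so it can be compared against str prefixes.
    with ur.urlopen(url) as resp:
        return [line.decode("utf-8") for line in resp]

checkList = ['a', 'b', 'c']

if __name__ == '__main__':
    textList = getPageData(url)
    for i in textList:
        for y in checkList:
            if y in i:
                print(i)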

2 answers

User

#1 · edited 2024-09-30 02:16:57

How about this:

text = """a: text1
b: text2
c: text3
blah blah not necessary text
a: text4
b: text5
c: text6
etc."""

import re
from collections import defaultdict

d = defaultdict(list)
for line in text.splitlines():
    m = re.match(r"([^:]+):\s*(.*)", line)
    if m:
        d[m.group(1)].append(m.group(2))

This gives you

>>> d
defaultdict(<class 'list'>, {'a': ['text1', 'text4'], 'b': ['text2', 'text5'], 'c': ['text3', 'text6']})

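The regex matches lines that begin with an identifier of at least one non-colon character ([^:]+), followed by a colon, and captures the identifier and the text after the colon (.*) in two groups. Each result is then appended to a defaultdict, which creates the list for a key the first time that key is used.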
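To make the two capture groups concrete, here is a small sketch that applies the same pattern to individual lines. The helper name parse_line and the third sample line are illustrative only and do not come from the answer.

```python
import re

# Same pattern as in the answer: identifier (no colons), a colon,
# optional whitespace, then the rest of the line.
PATTERN = re.compile(r"([^:]+):\s*(.*)")

def parse_line(line):
    """Return (identifier, value) for a 'key: value' line, or None if there is no colon."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_line("a: text1"))                      # ('a', 'text1')
print(parse_line("blah blah not necessary text"))  # None
print(parse_line("url: http://example.com"))       # ('url', 'http://example.com')
```

Because [^:]+ stops at the first colon, any later colons in the line end up in the second group.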

If you know the identifiers in advance, you can use

m = re.match(r"(a|b|c|otherid|diff_id|etc)\s*:\s*(.*)", line)

instead.
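If the identifiers are held in a list, the alternation can be built from it rather than typed by hand. This is a minimal sketch under that assumption; known_ids, collect and the sample lines are hypothetical names and data, not part of the answer.

```python
import re
from collections import defaultdict

known_ids = ["a", "b", "c"]  # assumed list of allowed identifiers

# re.escape guards against identifiers that contain regex metacharacters.
pattern = re.compile(
    r"(%s)\s*:\s*(.*)" % "|".join(re.escape(i) for i in known_ids)
)

def collect(lines):
    """Group the text after each known identifier into lists keyed by identifier."""
    d = defaultdict(list)
    for line in lines:
        m = pattern.match(line)
        if m:
            d[m.group(1)].append(m.group(2))
    return d

sample = ["a: text1", "b: text2", "x: ignored", "a: text4"]
print(dict(collect(sample)))  # {'a': ['text1', 'text4'], 'b': ['text2']}
```

Lines whose prefix is not in known_ids, such as "x: ignored", simply fail to match and are skipped.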

User

#2 · edited 2024-09-30 02:16:57

I would split on the colon and test whether the first part is in the set of allowed prefixes:

checkList = set(['a', 'b', 'c'])

for i in textList:
    # partition never raises; sep is '' when the line contains no colon
    check, sep, rest = i.partition(':')
    if not sep or check.strip() not in checkList:
        continue
    data = rest.strip()
    # insert data into database; check.strip() is your column name.
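As a rough end-to-end sketch of this approach, the loop can be wrapped in a generator that yields one dict per group of lines, ready to be handed to whatever database layer is in use. It uses str.partition so that lines without a colon are skipped rather than raising an error. The name iter_records and the rule that a record is complete once every allowed key has been seen are assumptions made for illustration; the post does not say how rows are delimited.

```python
def iter_records(lines, allowed=("a", "b", "c")):
    """Yield one dict per group of 'key: value' lines.

    Assumption: a record is complete once every allowed key has been seen.
    """
    allowed = set(allowed)
    record = {}
    for line in lines:
        check, sep, rest = line.partition(":")
        key = check.strip()
        if not sep or key not in allowed:
            continue  # no colon, or a prefix we do not care about
        record[key] = rest.strip()
        if len(record) == len(allowed):
            yield record
            record = {}

text = """a: text1
b: text2
c: text3
blah blah not necessary text
a: text4
b: text5
c: text6
etc."""

for row in iter_records(text.splitlines()):
    print(row)
# {'a': 'text1', 'b': 'text2', 'c': 'text3'}
# {'a': 'text4', 'b': 'text5', 'c': 'text6'}
```

Each yielded dict maps column names to values, so it could be passed to a parameterized INSERT statement.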
