How can I extract information from single-line text that has no HTML classes?

1 answer

User

#1 · Posted on 2024-10-03 13:17:39

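In a case like this, I would look for any pattern that helps extract the data. For example, I can see that the phrases "is hiring|is looking for|is looking to hire|hiring" occur frequently, that the company name comes first, and that the location comes after "in":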

This is only a small attempt; you can extend it to get what you need.

import re
text = """ZeroCater (YC W11) Is Hiring a Principal Engineer in SF: Must Love Food (zerocater.com)
OneSignal Is Hiring Full Stack Engineers in San Mateo (onesignal.com)
Faire (YC W17) Is Looking to Hire Business Operations Leads (greenhouse.io)
InsideSherpa (YC W19) Is Hiring Software Engineers in Sydney (workable.com)
Jerry (YC S17) Is Hiring Senior Software Dev, Data Engineer (Toronto/Remote) (getjerry.com)
Iris Automation Is Hiring an Account Executive for B2B Flying Vehicle Software (irisonboard.com)"""

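# lowercase everything so the keyword patterns match regardless of case
data = text.lower().splitlines()

for i, line in enumerate(data):
    # company name: everything before the first hiring phrase
    # (maxsplit=1 keeps a later "hiring" in the title from splitting again)
    data[i] = re.split(r'is hiring|is looking for|is looking to hire|hiring', line, maxsplit=1)

    # job title, plus the location if " in " is present
    data[i][1] = re.split(r' in ', data[i][1], maxsplit=1)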

print('company  - Job Title  - Location')
for c in data:
    print(f'{c[0]}  - {c[1][0]}  - {c[1][1] if len(c[1])>1 else ""}')

Output:

company  - Job Title  - Location
zerocater (yc w11)   -  a principal engineer  - sf: must love food (zerocater.com)
onesignal   -  full stack engineers  - san mateo (onesignal.com)
faire (yc w17)   -  business operations leads (greenhouse.io)  - 
insidesherpa (yc w19)   -  software engineers  - sydney (workable.com)
jerry (yc s17)   -  senior software dev, data engineer (toronto/remote) (getjerry.com)  - 
iris automation   -  an account executive for b2b flying vehicle software (irisonboard.com)  -

Of course, this code needs a lot of changes before it gives reliable results, but it is at least a starting point.
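One possible direction for those changes, as a rough sketch rather than a definitive solution: strip the trailing "(site)" first, split on the hiring phrase only once and case-insensitively so the original capitalization survives, and trim whitespace. The names `parse_job_line`, `HIRING`, `SITE` and `PAREN` are made up for this illustration, and treating a trailing parenthesized chunk such as "(Toronto/Remote)" as the location is my own assumption about the data, not something the titles guarantee.

```python
import re

# Hypothetical names for illustration only; the heuristics are assumptions.
HIRING = re.compile(
    r"\s+(?:is hiring|is looking for|is looking to hire|hiring)\s+",
    re.IGNORECASE,
)
# Trailing "(domain.tld)": no spaces inside and at least one dot.
SITE = re.compile(r"\s*\(([^()\s]+\.[^()\s]+)\)\s*$")
# Any trailing parenthesized chunk, used as a fallback location.
PAREN = re.compile(r"\s*\(([^()]+)\)\s*$")


def parse_job_line(line):
    """Split one listing into company, title, location and site.

    Returns None when no hiring phrase is found.
    """
    line = line.strip()

    # Peel the site off the end so it does not pollute title or location.
    site = ""
    m = SITE.search(line)
    if m:
        site = m.group(1)
        line = line[:m.start()]

    # Split once: company on the left, everything else on the right.
    parts = HIRING.split(line, maxsplit=1)
    if len(parts) != 2:
        return None
    company, rest = parts

    # Location follows the first " in ", if there is one.
    title, _, location = rest.partition(" in ")

    # Assumed fallback: a trailing "(...)" in the title is the location.
    if not location:
        m = PAREN.search(title)
        if m:
            location = m.group(1)
            title = title[:m.start()]

    return {
        "company": company.strip(),
        "title": title.strip(),
        "location": location.strip(),
        "site": site,
    }


if __name__ == "__main__":
    sample = "Jerry (YC S17) Is Hiring Senior Software Dev, Data Engineer (Toronto/Remote) (getjerry.com)"
    print(parse_job_line(sample))
```

This still splits on the first " in " it finds, so a title that contains " in " before the real location will be cut in the wrong place; check it against your actual data before relying on it.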
