Regex to capture a specific percentage/decimal range

Published 2024-09-30 16:37:56


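I'm trying to scrape interest rates from several websites. The data is very unstructured, but the format is close enough from site to site. What I want to capture is:

x.xx% to xx.xx%

Examples of what the data looks like:

All loans made by WebBank, Member FDIC. Your actual rate depends upon credit score, loan amount, loan term, and credit usage & history. The APR ranges from 5.98% to 35.89%. For example, you could receive a loan of $6,000 with an interest rate of 7.99% and a 5.00% origination fee of $300 for an APR of 11.51%. In this example, you will receive $5,700 and will make 36 monthly payments of $187.99. The total amount repayable will be $6,767.64. Your APR will be determined based on your credit at time of application. The origination fee ranges from 1% to 6% and the average origination fee is 5.49% as of Q1 2017. There is no down payment and there is never a prepayment penalty. Closing of your loan is contingent upon your agreement of all the required agreements and disclosures on the www.lendingclub.com website. All loans via LendingClub have a minimum repayment term of 36 months or longer.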

3.09%–14.24%*

Fixed rates: 6.99% to 24.99% APR Lock in your rate. Your monthly payment will never change.

What I want to capture is the rate range in each sample (for example, 5.98% to 35.89%). My current regex looks like this:

(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)

The actual code is as follows:

[code block not preserved]

New output:

['5.98% to 35.89%', '2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['6.99% to 24.99%', '2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']

2 Answers

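Edit: Based on your comments. Run the following in Python 3, where string literals are Unicode text by default, so the en dash in the sample string is a single character.

Input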

import re

input = '''All loans made by WebBank, Member FDIC. Your actual rate depends upon credit score, loan amount, loan term, and credit usage & history. The APR ranges from 5.98% to 35.89%. For example, you could receive a loan of $6,000 with an interest rate of 7.99% and a 5.00% origination fee of $300 for an APR of 11.51%. In this example, you will receive $5,700 and will make 36 monthly payments of $187.99. The total amount repayable will be $6,767.64. Your APR will be determined based on your credit at time of application. The origination fee ranges from 1% to 6% and the average origination fee is 5.49% as of Q1 2017. There is no down payment and there is never a prepayment penalty. Closing of your loan is contingent upon your agreement of all the required agreements and disclosures on the www.lendingclub.com website. All loans via LendingClub have a minimum repayment term of 36 months or longer.

3.09% – 14.24%*

Fixed rates: 6.99% to 24.99% APR Lock in your rate. Your monthly payment will never change.'''
# Non-specific regex (I'm cheating): allow up to 5 arbitrary characters
# between the two percentages.
output = re.findall(r'\d{1,3}\.\d+%[\S\s]{0,5}\d{1,3}\.\d+%', input)
print('output:')
print(output)

# More specific; you can edit this in several ways.
# Note: [to\-\s]+ is a character class, so it matches any run of
# 't', 'o', '-' or whitespace rather than the word "to".
output_1 = re.findall(r'\d{1,3}\.\d+%[to\-\s]+\d{1,3}\.\d+%', input)
print('\noutput_1:')
print(output_1)

# What you need if you copy and paste from Stack Overflow into Python 2.7.x,
# where the en dash arrives as the three UTF-8 bytes \xe2\x80\x93.
output_2 = re.findall(r'\d{1,3}\.\d+%\s*[to|\-|\xe2\x80\x93]+\s*\d{1,3}\.\d+%', input)
print('\noutput_2 (Python2.X):')
print(output_2)

Output

[output not preserved]
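For Python 3 only, one possible variant of the patterns above spells out the separator as an alternation instead of a character class, so that the word "to", a hyphen, and an en dash are each matched explicitly. This is a sketch rather than the answer's original code: the names RANGE_RE and find_ranges and the shortened sample text are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of the original answer.
# Separator: the word "to", a hyphen, or an en dash (U+2013),
# with optional whitespace on either side.
RANGE_RE = re.compile(
    r'\d{1,3}\.\d+%\s*(?:to|-|\u2013)\s*\d{1,3}\.\d+%',
    re.IGNORECASE,
)


def find_ranges(text):
    """Return every 'x.xx% <separator> xx.xx%' range found in text."""
    return RANGE_RE.findall(text)


# Shortened version of the sample text from the question.
sample = (
    "The APR ranges from 5.98% to 35.89%. For example, you could receive a loan "
    "with an interest rate of 7.99% and a 5.00% origination fee. "
    "3.09% \u2013 14.24%* "
    "Fixed rates: 6.99% to 24.99% APR"
)

print(find_ranges(sample))
```

Because the separator must be one of the three listed forms, "7.99% and a 5.00%" is not picked up as a range.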

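The line paragraph = soup.find_all(text=re.compile('(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)')) fetches all nodes whose text matches the pattern. What you actually need is to extract the matches from those paragraphs.

Use something like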

matches=[]
for n in paragraph:
    matches.extend(re.findall(pattern, n.string))
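The loop above can be tried without a live page. In the sketch below, a plain list of strings stands in for the text of the nodes returned by soup.find_all (an assumption, so that the example runs without BeautifulSoup), and pattern is the asker's original regex.

```python
import re

# The asker's original pattern: one digit before the first decimal point,
# two before the second, and exactly one space around "to" or "-".
pattern = r'(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)'

# Hypothetical stand-ins for n.string of each matching node.
node_texts = [
    "The APR ranges from 5.98% to 35.89%.",
    "Fixed rates: 6.99% to 24.99% APR Lock in your rate.",
]

matches = []
for text in node_texts:
    # findall returns only the matched substrings, not the whole node text.
    matches.extend(re.findall(pattern, text))

print(matches)
```

The result is a flat list of matched ranges rather than a list of whole paragraphs.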

As for the pattern itself, you can use

(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%

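Details:

  • (?i) - case-insensitive matching is turned on
  • \d+(?:\.\d+)? - 1+ digits, optionally followed by . and 1+ digits
  • % - a % sign
  • \s* - 0+ whitespace characters
  • (?:to|-) - to or -
  • \s*\d+(?:\.\d+)?% - see above (in short: whitespace, then an int or float value followed by %).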
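The breakdown above can also be written as a commented pattern using re.VERBOSE. The sketch below is an illustration, not the answer's own code: the two capture groups and the names RATE_RANGE and parse_ranges are additions assumed here so that each range comes back as a pair of floats, and the sample text is made up from fragments of the question.

```python
import re

# Same pieces as the bullet list, with capture groups added (an assumption)
# around the lower and upper bound.
RATE_RANGE = re.compile(r'''
    (\d+(?:\.\d+)?)   # group 1: 1+ digits, optionally followed by . and 1+ digits
    %                 # a percent sign
    \s*               # 0+ whitespace characters
    (?:to|-)          # the word "to" or a hyphen
    \s*               # 0+ whitespace characters
    (\d+(?:\.\d+)?)   # group 2: the upper bound, same shape as group 1
    %                 # a percent sign
''', re.IGNORECASE | re.VERBOSE)


def parse_ranges(text):
    """Return each matched range as a (low, high) tuple of floats."""
    return [(float(low), float(high)) for low, high in RATE_RANGE.findall(text)]


sample = (
    "Fixed rates: 6.99% to 24.99% APR. "
    "The origination fee ranges from 1% to 6%. "
    "Promo: 3.09%-14.24%*"
)

print(parse_ranges(sample))
```

Since the decimal part is optional in this pattern, it also matches integer ranges such as "1% to 6%", so fee ranges may need to be filtered out afterwards if only APR ranges are wanted.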
