These are all the kinds of text messages I might receive:
4 bedrooms 2 bathrooms 3 carparks
3 bedroom house
Bedrooms 2,
beds 5,
Bedrooms 1,
2 bedrooms, 1 bathroom,
Four bedrooms home, double garage
Four bedrooms home
Three double bedrooms home, garage
Three bedrooms home,
2 bedroom home unit with single carport.
Garage car spaces: 2, Bathrooms: 4, Bedrooms: 7,
I want to extract the number of bedrooms from these texts. I managed to write the following:
import re

def get_bedroom_num(s):
    # Pick a pattern based on the punctuation found in the text.
    if ':' in s:
        out = re.search(r'(?:Bedrooms:|Bedroom:)(.*)', s, re.I).group(1)
    elif ',' in s:
        out = re.search(r'(?:bedrooms|bedroom|beds)(.*)', s, re.I).group(1)
    else:
        out = re.search(r'(.*)(?:bedrooms|bedroom).*', s, re.I).group(1)
    # Keep only the digit characters of the captured text.
    out = ''.join(filter(str.isdigit, out))
    return out
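But it does not capture all the possible cases. The key here is the word 'bedroom': the text will always contain the word 'bedroom' either before or after the number. Is there a better way to handle this? If not through regex, could named entity recognition from NLP work?
Thanks.
Edit:
For cases 7 through 10, I use the function below to convert number words into integers.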
# Convert word to number
def text2int(textnum, numwords={}):
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion"]
        numwords["and"] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)
    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]
    textnum = textnum.replace('-', ' ')
    current = result = 0
    curstring = ""
    onnumber = False
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)
            if word not in numwords:
                if onnumber:
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
            else:
                scale, increment = numwords[word]
                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0
                onnumber = True
    if onnumber:
        curstring += repr(result + current)
    return curstring
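So, before running any regex to get the number, this function can be used to convert "Four bedrooms home, double garage" into "4 bedrooms home, double garage".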
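One thing to watch with the function above: its lookup table has lowercase keys only, so a capitalized word such as "Four" is not converted unless the text is lowercased first, and the ordinal-ending step strips "th" from ordinary words such as "with". A minimal sketch of a narrower preprocessing step, assuming bedroom counts never exceed ten, is shown here. The names `NUMBER_WORDS` and `words_to_digits` are hypothetical and not part of the original post.

```python
import re

# Assumed lookup table: counts above ten are not handled in this sketch.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Match any listed number word as a whole word, ignoring case.
_WORD_RE = re.compile(r'\b(' + '|'.join(NUMBER_WORDS) + r')\b', re.IGNORECASE)

def words_to_digits(text):
    """Replace number words with digits and leave every other word untouched."""
    return _WORD_RE.sub(lambda m: str(NUMBER_WORDS[m.group(1).lower()]), text)

print(words_to_digits("Four bedrooms home, double garage"))
print(words_to_digits("2 bedroom home unit with single carport."))
```

Because only whole-word matches from the table are replaced, the rest of the message, punctuation included, passes through unchanged.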
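You can use the regex below to find all the different combinations and extract the bedroom count.
To be precise and match only numbers written as digits or as number words, you can change
(\w+)
to \b(?:one|two|Three|Four|five|six|seven|eight|nine|ten|\d+)\b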
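To illustrate why the stricter number token matters, here is a small sketch comparing the two. The answer's full pattern is not reproduced in this text, so `LOOSE` and `STRICT` are hypothetical stand-ins, and the optional `double` between the number and the keyword is an assumption based on the sample "Three double bedrooms home".

```python
import re

# Number token: digits or a number word from one to ten.
NUM = r'(?:one|two|three|four|five|six|seven|eight|nine|ten|\d+)'

# Loose version: captures whatever word sits directly before the keyword.
LOOSE = re.compile(r'(\w+)\s+bedrooms?', re.IGNORECASE)

# Stricter version: only a number token is captured, and an optional
# "double" may sit between the number and the keyword (assumption).
STRICT = re.compile(r'\b(' + NUM + r')\b\s+(?:double\s+)?bedrooms?', re.IGNORECASE)

s = "Three double bedrooms home, garage"
loose_hit = LOOSE.search(s).group(1)
strict_hit = STRICT.search(s).group(1)
print(loose_hit)   # double
print(strict_hit)  # Three
```

With `(\w+)` the adjective is captured instead of the count, which is the failure the restricted alternation avoids.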
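Since the regex uses alternation, the captured value may end up in either group 1 or group 2, so this Python code shows how to extract the data from whichever group actually matched.
It prints the following: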
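The answer's code and its output are not reproduced in this text. As a rough sketch of the idea, assuming a two-branch pattern (number before the keyword, or keyword before the number), the function below reads whichever group matched and converts it to an integer. `PATTERN` and `WORD_VALUES` are hypothetical names and the pattern is an assumption, not the answerer's exact regex.

```python
import re

NUM = r'(?:one|two|three|four|five|six|seven|eight|nine|ten|\d+)'
KEY = r'(?:bedrooms?|beds?)'

PATTERN = re.compile(
    r'\b(' + NUM + r')\b\s+(?:double\s+)?' + KEY + r'\b'   # group 1: "4 bedrooms"
    r'|\b' + KEY + r'\b\s*:?\s*\b(' + NUM + r')\b',         # group 2: "Bedrooms: 7"
    re.IGNORECASE,
)

WORD_VALUES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def get_bedroom_num(s):
    """Return the bedroom count as an int, or None if nothing matched."""
    m = PATTERN.search(s)
    if m is None:
        return None
    # Only one of the two groups is set, depending on which branch matched.
    token = m.group(1) or m.group(2)
    return int(token) if token.isdigit() else WORD_VALUES[token.lower()]

samples = [
    "4 bedrooms 2 bathrooms 3 carparks",
    "Bedrooms 2,",
    "beds 5,",
    "Three double bedrooms home, garage",
    "Garage car spaces: 2, Bathrooms: 4, Bedrooms: 7,",
]
for s in samples:
    print(s, "->", get_bedroom_num(s))
```

Returning `None` for a non-match, instead of calling `.group()` on a failed search, keeps unexpected messages from raising an exception.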
Edit 2: Another version of the Python code, which captures all the matches in a string and returns the results as a list.
Here, it prints every match found in the string.
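The list-returning version is also missing from this text. A minimal sketch of that approach, under the same assumed two-branch pattern, uses `re.finditer` to walk through every match. `get_all_bedroom_nums` is a hypothetical name.

```python
import re

NUM = r'(?:one|two|three|four|five|six|seven|eight|nine|ten|\d+)'
KEY = r'(?:bedrooms?|beds?)'

# Assumed pattern: number before the keyword, or keyword before the number.
PATTERN = re.compile(
    r'\b(' + NUM + r')\b\s+(?:double\s+)?' + KEY + r'\b'
    r'|\b' + KEY + r'\b\s*:?\s*\b(' + NUM + r')\b',
    re.IGNORECASE,
)

WORD_VALUES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def get_all_bedroom_nums(s):
    """Return every bedroom count found in the string, in order of appearance."""
    results = []
    for m in PATTERN.finditer(s):
        token = m.group(1) or m.group(2)  # whichever branch matched
        results.append(int(token) if token.isdigit() else WORD_VALUES[token.lower()])
    return results

text = "Four bedrooms home, double garage. Bedrooms: 7, Bathrooms: 4. 2 bedroom home unit"
print(get_all_bedroom_nums(text))  # [4, 7, 2]
```

`finditer` is used rather than `findall` because, with two capture groups, `findall` would return tuples containing an empty string for the group that did not take part in the match.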