Parsing free text into a dict with regex

3 answers

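User

Answer 1 · edited 2024-06-13 14:10:27

Here you go, my friend! I use a regular expression to find each entry, then split each one on its last (. It handles every special case in the string.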

#!/usr/bin/env python3

import re

text = "alamine A (12 000 UI/kg), thiamine D3 (1 200 UI/kg), niacine E (70 mg/kg), zinc [sous forme d'oxyde de zinc] (70 mg/kg), zinc [sous forme de chélate de zinc d'acides aminés, hydraté] (45 mg/kg), copper [sous forme de sulfate de cuivre (II), pentahydraté] (10 mg/kg), iode [sous forme d'iodate de calcium, anhydre] (2 mg/kg), sélénium [sous forme de sélénite de sodium] (0.2 mg/kg), cyaobactin12 (0.2%)"
# Either "name [details] (value)" or "name (value)"; commas may appear inside [...]
my_regex = re.compile(r"([^,]*\[[^\]]*\]\s\([^\)]*\)|[^,]*\([^\)]*\))")
matches = my_regex.findall(text)
clean_result = []
for match in matches:
    # Split on the last "(" so that parentheses inside [...] stay in the key
    res = match.rsplit('(', 1)
    clean_result.append((res[0].strip(), res[1][:-1]))

for key, value in clean_result:
    print("key : " + key)
    print("value : " + value)
    print()

Output

key : alamine A
value : 12 000 UI/kg

key : thiamine D3
value : 1 200 UI/kg

key : niacine E
value : 70 mg/kg

key : zinc [sous forme d'oxyde de zinc]
value : 70 mg/kg

key : zinc [sous forme de chélate de zinc d'acides aminés, hydraté]
value : 45 mg/kg

key : copper [sous forme de sulfate de cuivre (II), pentahydraté]
value : 10 mg/kg

key : iode [sous forme d'iodate de calcium, anhydre]
value : 2 mg/kg

key : sélénium [sous forme de sélénite de sodium]
value : 0.2 mg/kg

key : cyaobactin12
value : 0.2%
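
Since the question asks for a dict, the (key, value) pairs can be collected into one. The following is a minimal Python 3 sketch that assumes the same regex as above and a shortened sample string; the names ENTRY_RE and parse_entries are hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer above: an entry with an optional [...] part,
# followed by a parenthesised value.
ENTRY_RE = re.compile(r"([^,]*\[[^\]]*\]\s\([^\)]*\)|[^,]*\([^\)]*\))")


def parse_entries(text):
    """Return a dict mapping each name to its quantity (hypothetical helper)."""
    result = {}
    for entry in ENTRY_RE.findall(text):
        # Split on the last "(" so parentheses inside [...] stay in the key.
        key, value = entry.rsplit("(", 1)
        # Drop the closing ")" from the value, as the answer does with [:-1].
        result[key.strip()] = value[:-1]
    return result


sample = (
    "niacine E (70 mg/kg), "
    "copper [sous forme de sulfate de cuivre (II), pentahydraté] (10 mg/kg), "
    "cyaobactin12 (0.2%)"
)
parsed = parse_entries(sample)
for name, amount in parsed.items():
    print(f"{name} -> {amount}")
```

Keep in mind that a dict keeps only the last value for a repeated key. In the full string the two zinc entries differ in their bracketed text, so they stay distinct.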

User

Answer 2 · edited 2024-06-13 14:10:27

You can try the following:

(?:^|,)(.*?)(\((?:\d*\.*\s*\d*\s*)(?:UI\/kg|mg\/kg|%)\))

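Broken down: each key/value "part" must start either at the beginning of the string or at the comma that ends the previous part, which the non-capturing group (?:^|,) enforces.

Next, the non-greedy quantifier in (.*?)\( captures everything up to the opening parenthesis that starts a value. That is your "key".

Finally, it captures your value using your existing pattern, slightly modified: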

(\((?:\d*\.*\s*\d*\s*)(?:UI\/kg|mg\/kg|%)\))

If you want to trim the extra whitespace from the captures, add \s* on either side of the key group:

(?:^|,)\s*(.*?)\s*(\((?:\d*\.*\s*\d*\s*)(?:UI\/kg|mg\/kg|%)\))
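
As a rough sketch of how this pattern could be applied from Python (the shortened sample text and the names PATTERN and parsed are assumptions for illustration, not part of the answer):

```python
import re

# Pattern from the answer, with \s* around the key group to trim whitespace.
PATTERN = re.compile(
    r"(?:^|,)\s*(.*?)\s*(\((?:\d*\.*\s*\d*\s*)(?:UI\/kg|mg\/kg|%)\))"
)

text = (
    "alamine A (12 000 UI/kg), "
    "zinc [sous forme de chélate de zinc d'acides aminés, hydraté] (45 mg/kg), "
    "copper [sous forme de sulfate de cuivre (II), pentahydraté] (10 mg/kg), "
    "cyaobactin12 (0.2%)"
)

# Group 1 is the key; group 2 is the value including its parentheses,
# which are sliced off here.
parsed = {m.group(1): m.group(2)[1:-1] for m in PATTERN.finditer(text)}
print(parsed)
```

Because the value group only accepts digits followed by one of the listed units, the "(II)" inside the brackets is not mistaken for a value. The trade-off is that any new unit has to be added to the alternation.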


User

Answer 3 · edited 2024-06-13 14:10:27

Let's start with the easier part: the value. It is enclosed in parentheses: (?P<value>\([^)]+\))

(?P<value> # Capturing "value" group
  \(       # Matches an opening parenthesis
  [^)]+    # Matches one or more non ")" characters
  \)       # Matches a closing parenthesis
)
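
A quick check of the value group on its own, as a minimal sketch (the sample entry comes from the first answer; value_re is an illustrative name). In isolation the group also matches the "(II)" inside the brackets, which is why the rest of the pattern is needed to anchor it:

```python
import re

# The value group by itself, without the named-group wrapper.
value_re = re.compile(r"\([^)]+\)")

sample = "copper [sous forme de sulfate de cuivre (II), pentahydraté] (10 mg/kg)"
found = value_re.findall(sample)
print(found)  # ['(II)', '(10 mg/kg)']
```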

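That's done; now let's handle the key.
The catch is that the key may contain some text enclosed in square brackets.
So the key is any run of characters other than ( or [, optionally followed by any characters enclosed in square brackets: (?P<key>[^[(]+(?:\[[^]]+\])?)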

(?P<key>  # Capturing "key" group
  [^[(]+  # One or more non "(" or "[" characters
  (?:     # Non-capturing group
    \[    # An opening bracket
    [^]]+ # One or more non "]" characters
    \]    # A closing bracket
  )?      # Non-capturing group made optional
)
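
To see what the key group matches by itself, here is a small sketch (key_re and the two sample entries are illustrative; the brackets inside the character classes are escaped here only for readability). Matched alone, the greedy name part keeps the trailing space; in a full pattern a following \s separator makes the engine give that space back:

```python
import re

# The key group on its own; \[ and \] are escaped inside the classes
# for readability, which does not change what they match.
key_re = re.compile(r"(?P<key>[^\[(]+(?:\[[^\]]+\])?)")

entries = [
    "niacine E (70 mg/kg)",
    "zinc [sous forme d'oxyde de zinc] (70 mg/kg)",
]
keys = [key_re.match(entry).group("key") for entry in entries]
print(keys)  # the first key still ends with a space
```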

We're almost done.
We add a \s between the two groups as a separator.
Finally, let's handle the sequence separator: (?:(?<=,\s)|^)

(?:        # Non-capturing group
  (?<=,\s) # Either preceded by a comma and a space
  |^       # Or alternatively beginning the string
)

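Now put it all together. Joining the pieces above in order (sequence separator, key, \s, value) gives:

(?:(?<=,\s)|^)(?P<key>[^[(]+(?:\[[^]]+\])?)\s(?P<value>\([^)]+\))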
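
As a sketch of how the pieces from this answer (sequence separator, key, \s, value) might be used from Python, the pattern can be written with re.VERBOSE and fed to finditer. The names ENTRY_RE and parsed and the shortened sample text are assumptions for illustration, and the brackets inside character classes are escaped for readability:

```python
import re

ENTRY_RE = re.compile(
    r"""
    (?:(?<=,\s)|^)          # start of the string, or just after a comma and a space
    (?P<key>
        [^\[(]+             # name part: stops at the first bracket or parenthesis
        (?:\[[^\]]+\])?     # optional qualifier in square brackets
    )
    \s                      # separator between key and value
    (?P<value>\([^)]+\))    # value enclosed in parentheses
    """,
    re.VERBOSE,
)

text = (
    "niacine E (70 mg/kg), "
    "zinc [sous forme d'oxyde de zinc] (70 mg/kg), "
    "copper [sous forme de sulfate de cuivre (II), pentahydraté] (10 mg/kg)"
)

# The value group includes its parentheses, so slice them off.
parsed = {
    m.group("key"): m.group("value")[1:-1] for m in ENTRY_RE.finditer(text)
}
print(parsed)
```

Unlike the unit-based pattern in the second answer, this one accepts any parenthesised value, so it does not need updating when a new unit appears. It does rely on entries being separated by a comma followed by a single whitespace character.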

