Get everything before a decimal number from a list of strings in Python

Posted 2024-09-30 22:10:09

I have a list of strings that comes from the variable newresult in the code below. For example:

['Naproxen  500  Active ingredient  Ph Eur',
 'Croscarmellose sodium  22.0  Disintegrant  Ph Eur',
 'Povidone K90  11.0 mg  Binder  Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous  2.62 %  Glidant  Ph Eur',
 'Magnesium stearate  1.38  Lubricant  Ph Eur',
 'Hypromellose 3 mPas  5.10  Film former  Ph Eur 20%']

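From each string I want to get everything before the decimal number (the medicine name). If the string contains no decimal number, I want everything before the first number (again the medicine name). If there is no number at all, I want to take the medicine name from the regex group. If the string contains a percentage, I want it as a separate entity. So I want four items from each string: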

Medicine name - the name of the medicine.
Dosage - the decimal value in the string. If there is no decimal value, the dosage is the integer value in the string.
Activity - the remaining part of the string, without the medicine name and dosage.
Dosage percentage - any integer or decimal with a percent sign attached to it.

Expected output:

Medicine name                Dosage    All Role                   Dosage Percentage

Naproxen                     500       Active ingredient  Ph Eur
Croscarmellose sodium        22.0      Disintegrant  Ph Eur
Povidone K90                 11.0 mg   Binder  Ph Eur
Water, purifieda
Silica, colloidal anhydrous            Glidant  Ph Eur            2.62%
Magnesium stearate           1.38      Lubricant  Ph Eur
Hypromellose 3 mPas          5.10      Film former  Ph Eur        20%
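
For reference, the four-way split described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a drop-in fix: the names `split_row`, `UNITS`, `PERCENT_RE`, `DECIMAL_RE` and `INTEGER_RE` are hypothetical, the unit list is copied from the patterns used later in this post, a string without any number is returned whole as the medicine name, and runs of whitespace in the activity column are collapsed to single spaces.

```python
import re

# Hypothetical helper names; the unit list is an assumption taken from the post's own regex.
UNITS = r"(?:mg|kg|ml|gm|µg)"
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")
# A decimal that is not glued to a word or a longer number and is not a percentage,
# optionally followed by a unit.
DECIMAL_RE = re.compile(
    r"(?<![\w.])\d+\.\d+(?![\d.])(?!\s*%)(?:\s*" + UNITS + r"\b)?")
# Fallback when no decimal exists: a standalone integer under the same constraints.
INTEGER_RE = re.compile(
    r"(?<![\w.])\d+(?![\d.])(?!\s*%)(?:\s*" + UNITS + r"\b)?")


def split_row(text):
    """Split one string into (name, dosage, activity, percentage)."""
    dose = DECIMAL_RE.search(text) or INTEGER_RE.search(text)
    pct = PERCENT_RE.search(text)
    spans = sorted(m.span() for m in (dose, pct) if m)
    if not spans:
        # No number at all: assume the whole string is the medicine name.
        return text.strip(), "", "", ""
    # The name is everything before the first dosage or percentage.
    name = text[:spans[0][0]].strip()
    # The activity is what remains after the name once both numbers are cut out.
    pieces, pos = [], spans[0][0]
    for start, end in spans:
        pieces.append(text[pos:start])
        pos = end
    pieces.append(text[pos:])
    activity = " ".join(" ".join(pieces).split())
    dosage = dose.group(0) if dose else ""
    percentage = pct.group(0).replace(" ", "") if pct else ""
    return name, dosage, activity, percentage


rows = [
    'Naproxen  500  Active ingredient  Ph Eur',
    'Povidone K90  11.0 mg  Binder  Ph Eur',
    'Water, purifieda',
    'Silica, colloidal anhydrous  2.62 %  Glidant  Ph Eur',
    'Hypromellose 3 mPas  5.10  Film former  Ph Eur 20%',
]
for row in rows:
    print(split_row(row))
```

Because the decimal is searched for before the integer fallback, the "3" in "Hypromellose 3 mPas" stays in the name while 5.10 becomes the dosage. A name that contains a standalone integer and has no decimal dosage would still be split at that integer, which follows the rule stated above but may not always be what is wanted.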

Code so far:

import csv
import re

# Read the known medicine names (every cell of every row).
with open(r'C:\Users\lat.csv', 'r', newline='') as file:
    medicines = [item for row in csv.reader(file) for item in row]

# Collect the distinct roles from the third column, excluding 'Active'.
with open(r'C:\Users\1060099.csv', 'r', newline='') as files:
    allrole = list({line[2] for line in csv.reader(files)})

allrole.remove('Active')



def tableextract(filename):
    with open(filename, encoding='utf8') as fh:
        text = fh.read()

    # One alternation of all known medicine names; the capturing group lets
    # group(1) return just the matched name, group(0) the whole row.
    med_reg = r"({})".format("|".join(map(re.escape, medicines)))
    pattern_med = re.compile(r"^\s*" + med_reg + r".*(?:\n[^\w\n]*\d*\.?\d+(?:\s*[dkm]g|kg|ml|q\.s\.|gm|µg)?[^\w\n]*(?:\n.*){2})?", re.M|re.IGNORECASE)

    newresult = []      # whole matched rows with tabs, NBSPs and newlines turned into spaces
    medicine_only = []  # only the medicine name of each matched row
    for matcher in pattern_med.finditer(text):
        row = matcher.group(0)
        for ch in ("\t", "\xa0", "\n"):
            row = row.replace(ch, " ")
        newresult.append(row.strip())
        medicine_only.append(matcher.group(1))

The problem is in the following part of the code:

    Rx = r"(?i)(?=.*?((?:\d+(?:\.\d*)?|\.\d+)\s*(?:mg|kg|ml|q\.s\.|gm|µg)))?(?=.*?(\d+(?:\.\d+)?\s*%))?(?=.*?((?:\d+(?:\.\d*)?|\.\d+))(?![\d.])(?!\s*(?:%|mg|kg|ml|gm|q\.s\.|µg)))?.+"
    final = []
    for s in medicine_only:
        for e in newresult:
            match = re.search( Rx, e )
            if e.upper().startswith(s.upper()):
                if match.group(1) and match.group(2):
                    final.append([s,match.group(1), match.group(2), e])

                elif match.group(2) and match.group(3):
                    final.append([s,match.group(3), match.group(2), e])
                elif match.group(1):
                    final.append([s,match.group(1),'', e])
                elif match.group(2):
                    final.append([s,'',match.group(2), e])
                else:
                    if match.group(3):
                        #ee = match.group(3)
                        if isinstance(match.group(3), float):                            
                            final.append([s,match.group(3),'', e])


    res = []
    for sub in final:
        new_sub = sub
        agent_found = False
        for ag in allrole:
            if agent_found:
                break
            for item in sub:
                if ag.lower() in item.lower():
                    new_sub = [ag] + new_sub
                    agent_found = True
                    break
        if not agent_found:
            new_sub = [" "] + new_sub
        res.append(new_sub)

    return res

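It either returns nothing or returns only a few values, instead of all the values it should return from newresult, so I can't get the expected output from it. Can anyone help me fix this part of the code, or show how to get the expected output from the input list of strings without fixing it?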
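
One likely cause, judging only from the code shown: the final `else` branch tests `isinstance(match.group(3), float)`, but a regex group is always a `str` (or `None`), never a `float`. That test is therefore always false, so rows that have neither a unit nor a percentage (for example the Naproxen row) appear never to be appended to `final`. The short sketch below illustrates the point; `looks_numeric` is a hypothetical helper and not part of the original code.

```python
import re

m = re.search(r"(\d+(?:\.\d+)?)", "Naproxen  500  Active ingredient  Ph Eur")
value = m.group(1)

# Groups come back as text, so an isinstance check against float can never succeed.
is_float_instance = isinstance(value, float)


def looks_numeric(text):
    """Return True when text can be parsed as a number (hypothetical helper)."""
    try:
        float(text)
    except (TypeError, ValueError):
        return False
    return True


print(type(value).__name__, is_float_instance, looks_numeric(value))
```

Replacing the `isinstance` test with a check along these lines, or simply dropping it since `match.group(3)` has already been tested for truthiness, should let those rows through.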

