Append a specific word to other words on separate lines

Posted 2024-09-30 10:35:25


I have a list of scientific names (genus, species, infraspecific names) that is split across several lines:

Synonyms are shown in italics
Solanaceae
Solenomelus Miers
biflorus (Thunb.) Baker
Spirodela Schleiden
punctata (C. A. Meyer) C.
Thompson
Suaeda Forskal ex Scop.
argentinensis Soriano
fruticosa auct., non Forskal
patagonica Speg.
var. crassiuscula Soriano
Symphyostemon Miers ex Lindley
biflorus (Thunb.) Dusén
...

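I want each species, its infraspecific name (where applicable), and the author names on a single line, prefixed with the respective genus.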

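Notes:
- Genera start with an uppercase letter and are followed by the author name, which also starts with an uppercase letter or "(".
- Species epithets are written in lowercase.
- Infraspecific names start with var. or ssp.
- A single word on its own line that does not end in "eae" is an author name.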
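
The notes above can be read as a small line classifier. The following is only a sketch using the standard `re` module. The pattern names, the `classify` function and its return labels are illustrative assumptions, not part of the original code, and the patterns are only as reliable as the notes they are derived from.

```python
import re

# Hypothetical patterns derived from the notes; names are illustrative only.
INFRA_RE = re.compile(r'^(?:var|ssp)\.\s')        # "var. ..." or "ssp. ..."
SPECIES_RE = re.compile(r'^[a-z]')                # lowercase epithet first
FAMILY_RE = re.compile(r'^[A-Z][a-z]+eae$')       # single word ending in "eae"
GENUS_RE = re.compile(r'^[A-Z][a-z]+\s+[(A-Z]')   # Genus followed by an author
AUTHOR_RE = re.compile(r'^[A-Z]\S*$')             # lone capitalized word

def classify(line):
    """Return a label describing what kind of index line this is."""
    line = line.strip()
    if not line:
        return 'blank'
    if INFRA_RE.match(line):      # must be tested before SPECIES_RE
        return 'infraspecific'
    if SPECIES_RE.match(line):
        return 'species'
    if FAMILY_RE.match(line):     # must be tested before AUTHOR_RE
        return 'family'
    if GENUS_RE.match(line):
        return 'genus'
    if AUTHOR_RE.match(line):
        return 'author'
    return 'other'                # e.g. "Synonyms are shown in italics"
```

The order of the checks matters: "var." starts with a lowercase letter, so it has to be tested before the species rule, and a family name is also a lone capitalized word, so it has to be tested before the author rule.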

This is my code so far:

from regex import search
genus_re = r'^[A-Z][a-z]+\s*[(A-Z]'
species_re = r'^[a-z]+\s*(?:[(A-Z]|(?:auct|var|ssp)\.)'
infsp_re = r'^(?:var|ssp)\..+'
author_nl_re = r'^[A-Z][a-z]+(?<!eae)$'

species_ls = []
flag = 0
with open('species_index.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        find_genus = search(genus_re, line)
        if find_genus:
            tmp_genus = []
            genus = search(r'^[A-Z][^A-Z\s]+', line)[0]
            tmp_genus.append(genus)
        if search(species_re, line):
            sp = search('.+', line)[0]
            species_ls.append(tmp_genus[0] + ' ' + sp)

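I managed to attach the species names to their respective genera, but I feel I am overcomplicating things, and I am struggling to attach the standalone author names and the infraspecific names.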

The expected output is:

Solenomelus biflorus (Thunb.) Baker
Spirodela punctata (C. A. Meyer) C. Thompson
Suaeda argentinensis Soriano
Suaeda fruticosa auct., non Forskal
Suaeda patagonica var. crassiuscula Soriano
Symphyostemon biflorus (Thunb.) Dusén

1 Answer

Answer #1 · Posted 2024-09-30 10:35:25

Here is the script I wrote to solve your problem. It is a bit messy, but I hope it helps.

# ----------------------------------
#  Globals
# ----------------------------------
processed_lines = []

# ----------------------------------
#  Helper Classes
# ----------------------------------
class State:
    GENUS = 0
    SPECIES = 1
    INFSP = 2

class ProcessedLine:
    def __init__(self):
        self.genus = ""
        self.species = ""
        self.infsp = ""
        self.author = ""

    def __repr__(self):
        return "{}{}{}{}".format(self.genus, self.species, self.infsp, self.author)

    def set_genus(self, value):
        self.genus = value

    def set_species(self, value):
        self.species += " " + value

    def set_infsp(self, value):
        self.infsp += " " + value

    def set_author(self, value):
        self.author += " " + value

# ----------------------------------
#  Functions
# ----------------------------------
def process_line(state, split_line):
    return_state = state
    if state == State.GENUS:
        return_state = process_genus(split_line)
    elif state == State.SPECIES:
        return_state = process_species(split_line)
    elif state == State.INFSP:
        return_state = process_infsp(split_line)
    else:
        print("Error: Invalid state")
    return return_state

def process_genus(split_line):
    if processed_lines[-1].genus != "":
        # Need to create new ProcessedLine
        processed_lines.append(ProcessedLine())

    if len(split_line) == 1:
        # Check if Author name
        if split_line[0][-3:] != "eae":
            # Part of Author, append to previous line author
            processed_lines[-2].set_author(split_line[0])
        # Still looking for Genus next
        return State.GENUS
    else:
        if not split_line[0][0].isupper():
            # This is another species, use Genus from previous
            processed_lines[-1].set_genus(processed_lines[-2].genus)
            return process_species(split_line)
        else:
            processed_lines[-1].genus = split_line[0]
            return State.SPECIES

def process_species(split_line):
    # Check if words are species or author
    for word in split_line:
        if word[0].islower():
            processed_lines[-1].set_species(word)
        else:
            processed_lines[-1].set_author(word)
    return State.INFSP

def process_infsp(split_line):
    if split_line[0] == "var." or split_line[0] == "ssp.":
        # Author value needs to be replaced so we'll clear it
        processed_lines[-1].author = ""

        # Check if words are infraspecific or author
        for word in split_line:
            if word[0].islower():
                processed_lines[-1].set_infsp(word)
            else:
                processed_lines[-1].set_author(word)
        return State.GENUS
    else:
        # No infraspecific names, let process_genus handle this
        return process_genus(split_line)

# ----------------------------------
#  Main
# ----------------------------------
if __name__ == "__main__":
    state = State.GENUS
    processed_lines.append(ProcessedLine())
    with open('species_index.txt', 'r') as f:
        lines = f.readlines()

        for line in lines:
            line = line.strip()
            if not line:
                # Skip blank lines so they are not treated as author fragments
                continue
            state = process_line(state, line.split())

    print("Finished! Checking results.")
    for line in processed_lines:
        print(line)

Input:

Solanaceae
Solenomelus Miers
biflorus (Thunb.) Baker
Spirodela Schleiden
punctata (C. A. Meyer) C.
Thompson
Suaeda Forskal ex Scop.
argentinensis Soriano
fruticosa auct., non Forskal
patagonica Speg.
var. crassiuscula Soriano
Symphyostemon Miers ex Lindley
biflorus (Thunb.) Dusén

Output:

Finished! Checking results.
Solenomelus biflorus (Thunb.) Baker
Spirodela punctata (C. A. Meyer) C. Thompson
Suaeda argentinensis Soriano
Suaeda fruticosa auct., non Forskal
Suaeda patagonica var. crassiuscula Soriano
Symphyostemon biflorus (Thunb.) Dusén
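
For comparison, the same line-by-line idea can be written as one function that keeps only a few pieces of state. This is a minimal sketch, not the script above: `merge_names` and its local variables are hypothetical names, it assumes a species epithet is the first word of its line, and it assumes a var./ssp. line replaces the plain species entry directly before it, as in the expected output.

```python
def merge_names(lines):
    """Join genus, species, infraspecific name and authors into one line each."""
    results = []
    genus = None          # most recent genus name
    base = None           # "Genus epithet" of the most recent species
    replace_last = False  # True while results[-1] is a species without a variety
    open_entry = False    # True while results[-1] may still receive a wrapped author
    for raw in lines:
        words = raw.split()
        if not words:
            continue                      # blank line
        first = words[0]
        if first in ("var.", "ssp."):
            if base is None:
                continue                  # no species to attach to
            entry = " ".join([base] + words)
            if replace_last:
                results[-1] = entry       # drop the species-level author
            else:
                results.append(entry)
            replace_last = False
            open_entry = True
        elif first[0].islower():
            if genus is None:
                continue                  # species before any genus
            base = f"{genus} {first}"
            results.append(" ".join([genus] + words))
            replace_last = True
            open_entry = True
        elif len(words) == 1:
            if open_entry and first[0].isupper() and not first.endswith("eae"):
                results[-1] += " " + first    # wrapped author name
            else:
                open_entry = False            # family heading or stray token
        elif words[1][0].isupper() or words[1][0] == "(":
            genus, base = first, None         # "Genus Author ..." line
            replace_last = open_entry = False
        else:
            open_entry = False                # e.g. a heading sentence
    return results
```

Called with the lines of the index, for example `merge_names(open('species_index.txt', encoding='utf-8'))`, it returns a list of merged names. Unlike the script above, it ignores a heading sentence such as "Synonyms are shown in italics" because the second word of that line is lowercase, but it shares the limitation that a genus or family line that happens to break these conventions will be misread.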
