Is there a way to convert CSV columns into a hierarchy?

Posted 2024-09-29 02:21:16


I have a CSV of 7 million biodiversity records in which the taxonomic ranks are columns. For example:

RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris

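I want to create a visualization in D3, but the data must be formatted as a network in which each distinct column value is a child of some value in the previous column. I need to go from the CSV to something like this: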

^{pr2}$

I haven't figured out how to do this without a thousand for loops. Does anyone have a suggestion for building this network in Python or JavaScript?


3 Answers

You can do this easily with Python and the python-benedict library (it is open source on GitHub).

Install it with: pip install python-benedict

from benedict import benedict as bdict

# data source can be a filepath or an url
data_source = """
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""
data_input = bdict.from_csv(data_source)
data_output = bdict()

ancestors_hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
for value in data_input['values']:
    data_output['.'.join([value[ancestor] for ancestor in ancestors_hierarchy])] = bdict()

print(data_output.dump())
# if this output is ok for your needs, you don't need the following code

keypaths = sorted(data_output.keypaths(), key=lambda item: len(item.split('.')), reverse=True)

data_output['children'] = []
def transform_data(d, key, value):
    if isinstance(value, dict):
        value.update({ 'name':key, 'children':[] })
data_output.traverse(transform_data)

for keypath in keypaths:
    target_keypath = '.'.join(keypath.split('.')[:-1] + ['children'])
    data_output[target_keypath].append(data_output.pop(keypath))

print(data_output.dump())

The first print outputs:

^{pr2}$

The second print outputs:

{
    "children": [
        {
            "name": "Animalia",
            "children": [
                {
                    "name": "Chordata",
                    "children": [
                        {
                            "name": "Mammalia",
                            "children": [
                                {
                                    "name": "Carnivora",
                                    "children": [
                                        {
                                            "name": "Canidae",
                                            "children": [
                                                {
                                                    "name": "Canis",
                                                    "children": [
                                                        {
                                                            "name": "Canis",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "name": "Primates",
                                    "children": [
                                        {
                                            "name": "Hominidae",
                                            "children": [
                                                {
                                                    "name": "Homo",
                                                    "children": [
                                                        {
                                                            "name": "Homo sapiens",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        },
        {
            "name": "Plantae",
            "children": [
                {
                    "name": "nan",
                    "children": [
                        {
                            "name": "Magnoliopsida",
                            "children": [
                                {
                                    "name": "Brassicales",
                                    "children": [
                                        {
                                            "name": "Brassicaceae",
                                            "children": [
                                                {
                                                    "name": "Arabidopsis",
                                                    "children": [
                                                        {
                                                            "name": "Arabidopsis thaliana",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "name": "Fabales",
                                    "children": [
                                        {
                                            "name": "Fabaceae",
                                            "children": [
                                                {
                                                    "name": "Phaseoulus",
                                                    "children": [
                                                        {
                                                            "name": "Phaseolus vulgaris",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
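For comparison, the same name/children shape can be sketched with only the standard library. This is a minimal sketch, not the python-benedict approach: the names `build_tree` and `RANKS` and the synthetic "root" node are assumptions made for illustration. Nodes are indexed by their full path, so a genus and a species that share a name (such as Canis above) stay separate.

```python
import csv
import io

# Column order from the question's CSV header (RecordID is ignored).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_tree(rows, ranks=RANKS):
    """Build a nested {'name', 'children'} tree in one pass over the rows."""
    root = {"name": "root", "children": []}  # hypothetical synthetic root
    index = {(): root}  # maps a full path tuple to its node
    for row in rows:
        path = ()
        for rank in ranks:
            parent = index[path]
            path += (row[rank],)
            if path not in index:
                node = {"name": row[rank], "children": []}
                index[path] = node
                parent["children"].append(node)
    return root


sample = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""

# csv.DictReader yields one dict per row, so a large file can be streamed.
tree = build_tree(csv.DictReader(io.StringIO(sample)))
```

Each row costs one dictionary lookup per rank, so the work grows linearly with the number of rows. Whether the index fits in memory for 7 million records depends on how many distinct taxa they contain.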

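To create the exact nested object you want, we'll use a mix of plain JavaScript and the D3 method d3.stratify. Keep in mind, however, that 7 million rows (see the post scriptum below) is a lot to compute.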

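It is worth mentioning that, for this proposed solution, you will have to separate the kingdoms into different data arrays (for example, using Array.prototype.filter). This restriction exists because we need a single root node, and in Linnaean taxonomy there is no relationship between kingdoms (unless you create a top-level "Domain", which would be the root for all eukaryotes, but then you would have the same problem for Archaea and Bacteria).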
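The split by kingdom can also be done in a single pass instead of one filter per kingdom. A minimal Python sketch, assuming each row is a dict keyed by column name; `split_by_kingdom` is a hypothetical helper name:

```python
from collections import defaultdict


def split_by_kingdom(rows):
    """Group row dicts by their 'kingdom' value in one pass."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["kingdom"]].append(row)
    return dict(groups)


# Illustrative rows, trimmed to two columns.
rows = [
    {"kingdom": "Animalia", "species": "Homo sapiens"},
    {"kingdom": "Plantae", "species": "Arabidopsis thaliana"},
    {"kingdom": "Animalia", "species": "Canis latrans"},
]
by_kingdom = split_by_kingdom(rows)
```

Each value in `by_kingdom` is then a dataset with a single root, which is what the rest of this answer requires.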

So, suppose you have this CSV (I added a few more rows) with a single kingdom:

RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus

Based on that CSV, we'll create an array named tableOfRelationships which, as the name implies, holds the relationships between the ranks:

^{pr2}$

For the data above, this is tableOfRelationships:

+---------+----------------------+---------------+
| (Index) |         name         |    parent     |
+---------+----------------------+---------------+
|       0 | "Animalia"           | null          |
|       1 | "Chordata"           | "Animalia"    |
|       2 | "Mammalia"           | "Chordata"    |
|       3 | "Primates"           | "Mammalia"    |
|       4 | "Hominidae"          | "Primates"    |
|       5 | "Homo"               | "Hominidae"   |
|       6 | "Homo sapiens"       | "Homo"        |
|       7 | "Carnivora"          | "Mammalia"    |
|       8 | "Canidae"            | "Carnivora"   |
|       9 | "Canis"              | "Canidae"     |
|      10 | "Canis latrans"      | "Canis"       |
|      11 | "Cetacea"            | "Mammalia"    |
|      12 | "Delphinidae"        | "Cetacea"     |
|      13 | "Tursiops"           | "Delphinidae" |
|      14 | "Tursiops truncatus" | "Tursiops"    |
|      15 | "Pan"                | "Hominidae"   |
|      16 | "Pan paniscus"       | "Pan"         |
+---------+----------------------+---------------+

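Note the null as the parent of Animalia: that is why I said you need to split your dataset by kingdom, since there can be only one null value in the whole table.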
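The same name/parent table can be sketched in Python. This is an illustration under assumptions rather than the answer's own code: `relationship_table` is a hypothetical name, and a set replaces the linear search so that each membership check stays constant-time on a large file.

```python
import csv
import io


def relationship_table(rows, ranks):
    """Return [{'name', 'parent'}, ...], one entry per distinct name."""
    seen = set()  # names already in the table
    table = []
    for row in rows:
        parent = None  # the first rank (kingdom) has no parent
        for rank in ranks:
            name = row[rank]
            if name not in seen:
                seen.add(name)
                table.append({"name": name, "parent": parent})
            parent = name
    return table


sample = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus
"""

reader = csv.DictReader(io.StringIO(sample))
ranks = [c for c in reader.fieldnames if c != "RecordID"]
table = relationship_table(reader, ranks)
```

Because entries are deduplicated by name alone, a name that appears under two parents or at two ranks (for example a "nan" placeholder, or a species recorded with only its genus name) keeps only its first parent. Such values would need to be cleaned or made unique beforehand.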

Finally, based on that table, we create the hierarchy using d3.stratify():

const stratify = d3.stratify()
    .id(function(d) { return d.name; })
    .parentId(function(d) { return d.parent; });

const hierarchicalData = stratify(tableOfRelationships);
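To make the stratify step concrete, here is a rough Python analogue that links each entry to its parent and insists on exactly one root. It is a simplified sketch, not a port of d3.stratify: the function name and the plain name/children output shape are assumptions, and D3's node objects carry more fields than this.

```python
def stratify(table):
    """Link name/parent records into one nested tree and return its root."""
    nodes = {r["name"]: {"name": r["name"], "children": []} for r in table}
    root = None
    for record in table:
        node = nodes[record["name"]]
        if record["parent"] is None:
            if root is not None:
                raise ValueError("multiple roots")
            root = node
        else:
            # A parent missing from the table surfaces here as a KeyError.
            nodes[record["parent"]]["children"].append(node)
    if root is None:
        raise ValueError("no root")
    return root


# Small illustrative table in the same shape as tableOfRelationships.
table = [
    {"name": "Animalia", "parent": None},
    {"name": "Chordata", "parent": "Animalia"},
    {"name": "Mammalia", "parent": "Chordata"},
    {"name": "Primates", "parent": "Mammalia"},
    {"name": "Carnivora", "parent": "Mammalia"},
]
hierarchy = stratify(table)
```

The explicit check for a second root shows why the dataset has to be split by kingdom first: two entries with a null parent cannot be joined into one tree.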

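Here is a demo. Open your browser's console (the snippet's console is not good for this task) and inspect the several levels (children) of the object: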

const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;

const data = d3.csvParse(csv);

const taxonomicRanks = data.columns.filter(d => d !== "RecordID");

const tableOfRelationships = [];

data.forEach(row => {
  taxonomicRanks.forEach((d, i) => {
    if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
      name: row[d],
      parent: row[taxonomicRanks[i - 1]] || null
    })
  })
});

const stratify = d3.stratify()
  .id(function(d) { return d.name; })
  .parentId(function(d) { return d.parent; });

const hierarchicalData = stratify(tableOfRelationships);

console.log(hierarchicalData);

<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>

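PS: I don't know what kind of dataviz you'll create, but you really should avoid taxonomic ranks. The whole Linnaean taxonomy is outdated and we no longer use ranks: since phylogenetic systematics was developed in the mid-60s, we use only taxa, without any taxonomic rank (evolutionary biology teacher here). Also, I'm curious about these 7 million rows, since we have described just over 1 million species!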

var log = console.log;
var data = `
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris`;

// make an array of rows, each an array of values
data = data.split("\n").map(v => v.split(","));

// init tree
var tree = {};

data.forEach(row => {
    // set current = root of tree for every row
    var cur = tree;
    var id = false;
    row.forEach((value, i) => {
        if (i == 0) {
            // set id and skip value
            id = value;
            return;
        }
        // If the branch does not exist, create it.
        // If this is the last value, write the id.
        if (!cur[value]) cur[value] = (i == row.length - 1) ? id : {};
        // Move the link down the hierarchy
        cur = cur[value];
    });
});

log("Tree:");
log(JSON.stringify(tree, null, " "));

// Now you have the hierarchy in tree and can do anything with it.
var toStruct = function(obj) {
    let ret = [];
    for (let key in obj) {
        let child = obj[key];
        let rec = {};
        rec.name = key;
        if (typeof child == "object") rec.children = toStruct(child);
        ret.push(rec);
    }
    return ret;
}
var struct = toStruct(tree);
console.log("Struct:");
console.log(struct);

