Filtering the lines of a large file based on values from another file in PySpark

Posted 2024-09-30 22:12:07


I have to read and filter lines from a huge file (1.5 TB). The contents of the file look like this:

<http://www.wikidata.org/entity/Q31> <schema#label> "Beligium"@en .
<http://www.wikidata.org/entity/Q31> <schema#label> "Bilkiya"@ay .
<http://www.wikidata.org/entity/Q31> <schema#label> "Belgique"@fr .
<http://www.wikidata.org/entity/Q31> <schema#label> "Beriyum"@na .
<http://www.wikidata.org/entity/Q54> <schema#label> "Japan"@en .
<http://www.wikidata.org/entity/Q112> <schema#label> "asasa"@en .
<http://www.wikidata.org/entity/Q112> <schema#label> "ssdd"@fr .
<https://fr.wikipedia.org/wiki/Label_discographique> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Article> .
<https://fr.wikipedia.org/wiki/Label_discographique> <http://schema.org/about> <http://www.wikidata.org/entity/Q18127> .

We only want to extract the lines that match the following pattern:

<http://www.wikidata.org/entity/(ID_LIST)> <schema#label> "(.+)"@(en|fr) .

For example, with ID_LIST = [Q31, Q54], we would extract:

    <http://www.wikidata.org/entity/Q31> <schema#label> "Beligium"@en .
    <http://www.wikidata.org/entity/Q31> <schema#label> "Belgique"@fr .
    <http://www.wikidata.org/entity/Q54> <schema#label> "Japan"@en .
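
For a short ID list, the pattern can be assembled from the list itself. The following is a minimal sketch in plain Python (no Spark) that checks it against some of the sample lines above; `build_pattern` and `SAMPLE_LINES` are names made up for this illustration.

```python
import re

# Sample lines copied from the question (hypothetical variable name).
SAMPLE_LINES = [
    '<http://www.wikidata.org/entity/Q31> <schema#label> "Beligium"@en .',
    '<http://www.wikidata.org/entity/Q31> <schema#label> "Bilkiya"@ay .',
    '<http://www.wikidata.org/entity/Q31> <schema#label> "Belgique"@fr .',
    '<http://www.wikidata.org/entity/Q54> <schema#label> "Japan"@en .',
    '<http://www.wikidata.org/entity/Q112> <schema#label> "asasa"@en .',
    '<https://fr.wikipedia.org/wiki/Label_discographique> '
    '<http://schema.org/about> <http://www.wikidata.org/entity/Q18127> .',
]

def build_pattern(id_list):
    """Compile the label pattern for the given IDs (illustrative helper)."""
    # re.escape guards against IDs that contain regex metacharacters.
    ids = "|".join(re.escape(i) for i in id_list)
    return re.compile(
        r'<http://www\.wikidata\.org/entity/(' + ids + r')> '
        r'<schema#label> "(.+)"@(en|fr) \.'
    )

pattern = build_pattern(["Q31", "Q54"])
matched = [line for line in SAMPLE_LINES if pattern.match(line)]
for line in matched:
    print(line)
```

This only illustrates the matching rule; with millions of IDs the alternation itself becomes the problem.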

using this code:

import re
# Raw string with escaped dots so that "." only matches a literal period
rg = re.compile(r'<http://www\.wikidata\.org/entity/(Q31|Q54)> <schema#label> "(.+)"@(en|fr) \.')
rdd = sc.textFile(file_name).filter(lambda x: rg.match(x))

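My question is: what if ID_LIST comes from another file containing 2 million IDs? Should we put all two million of them into the regular expression? Would that be efficient?

Of course, the simple solution is to extract all the lines that match the general pattern

<http://www.wikidata.org/entity/(.+)> <schema#label> "(.+)"@(en|fr)

and then convert ID_LIST into another RDD or DataFrame and use a join to keep only the lines whose ID is in ID_LIST.

Is there a better way?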
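
One possible alternative, sketched here under the assumption that the 2 million IDs fit comfortably in memory on each executor, is to capture the ID with the general pattern and look it up in a set that is broadcast to the workers. That would avoid both a giant alternation in the regex and a shuffle of the large file. `LABEL_RE`, `keep_line`, and `filter_with_broadcast` are illustrative names, and `sc` is assumed to be an existing SparkContext.

```python
import re

# General pattern: capture the entity ID, the label, and the language tag.
LABEL_RE = re.compile(
    r'<http://www\.wikidata\.org/entity/([^>]+)> '
    r'<schema#label> "(.+)"@(en|fr) \.'
)

def keep_line(line, wanted_ids):
    """Return True if the line is an en/fr label whose ID is in wanted_ids."""
    m = LABEL_RE.match(line)
    return m is not None and m.group(1) in wanted_ids

def filter_with_broadcast(sc, data_file_name, id_file_name):
    """Filter the large file on the executors using a broadcast set of IDs.

    Assumes id_file_name holds one ID per line and is small enough to
    collect on the driver.
    """
    wanted = set(sc.textFile(id_file_name).map(lambda s: s.strip()).collect())
    wanted_bc = sc.broadcast(wanted)
    return sc.textFile(data_file_name).filter(
        lambda line: keep_line(line, wanted_bc.value)
    )

# Local check of the predicate, without Spark:
print(keep_line(
    '<http://www.wikidata.org/entity/Q54> <schema#label> "Japan"@en .',
    {"Q31", "Q54"},
))
```

Set membership is a constant-time lookup on average, so the per-line cost does not grow with the number of IDs. Whether the broadcast is practical depends on executor memory, which is not stated here.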


1 answer

Community user

#1 · Posted 2024-09-30 22:12:07

Use a join.

(Sketch; may not work as written.)

# Define extract_id to return the entity ID of a line, e.g. "Q31"

idwanted_rdd = sc.textFile(id_file_name).keyBy(lambda x: x)
data_rdd = sc.textFile(data_file_name).keyBy(extract_id)
# join yields (id, (id, line)); keep only (id, line)
result = idwanted_rdd.join(data_rdd).map(lambda kv: (kv[0], kv[1][1]))

The result will contain (id, data) pairs.
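
The snippet above leaves `extract_id` undefined. A minimal sketch of one possible definition follows, assuming every line of interest starts with a `<http://www.wikidata.org/entity/...>` subject as in the sample data; `ENTITY_RE` is a made-up name. Lines with a different subject return `None`.

```python
import re

# Matches the subject at the start of a line and captures the entity ID.
ENTITY_RE = re.compile(r'<http://www\.wikidata\.org/entity/([^>]+)>')

def extract_id(line):
    """Return the entity ID of a line (e.g. "Q31"), or None if there is none."""
    m = ENTITY_RE.match(line)
    return m.group(1) if m else None

print(extract_id(
    '<http://www.wikidata.org/entity/Q31> <schema#label> "Belgique"@fr .'
))
print(extract_id(
    '<https://fr.wikipedia.org/wiki/Label_discographique> '
    '<http://schema.org/about> <http://www.wikidata.org/entity/Q18127> .'
))
```

With this definition it may be worth dropping the `None` keys before the join, for example with `.filter(lambda kv: kv[0] is not None)` after `keyBy`. Note that this helper does not check the predicate or the language tag, so the en/fr restriction from the question would still need to be applied separately.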
