The ntcir10-math-converter package converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format.
NTCIR-10 Math Converter: converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 and NTCIR-12 format
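In the NTCIR-10 Math task, the unit of retrieval in the dataset is an arXiv document, whereas the unit of judgement in the relevance judgements is an XML element. In the NTCIR-11 Math-2 and NTCIR-12 MathIR tasks, on the other hand, the unit of both retrieval and judgement is a paragraph of an arXiv document. This mismatch makes it difficult to use the datasets together in a single evaluation.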
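NTCIR-10 Math Converter is a Python 3 command-line utility that converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format. It splits the dataset into paragraphs and redirects each relevance judgement from an element to the paragraph that is its ancestor. As a result, the NTCIR-10 Math dataset and relevance judgements can be used together with the NTCIR-11 Math-2 and NTCIR-12 MathIR datasets and relevance judgements in a single workflow.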
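The redirection of judgements from elements to their ancestor paragraphs can be illustrated with a minimal sketch. It is not the converter's own code: the function name, the sample document, and the assumption that a paragraph is a `div` element with the class `ltx_para` are all hypothetical and serve only to show the idea.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample document; the real dataset layout may differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ltx_para" id="p1"><p>Let <math id="m1"/> be given.</p></div>
<div class="ltx_para" id="p2"><p>Then <math id="m2"/> holds.</p></div>
<math id="m3"/>
</body></html>"""


def build_element_to_paragraph_map(xhtml_text):
    """Map every element id to the id of its enclosing paragraph, if any."""
    root = ET.fromstring(xhtml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    mapping = {}
    for element in root.iter():
        element_id = element.get("id")
        if element_id is None:
            continue
        # Walk up the tree until a paragraph is found or the root is passed.
        node = element
        while node is not None:
            # Assumption: paragraphs are <div class="ltx_para"> elements.
            if node.tag == XHTML + "div" and node.get("class") == "ltx_para":
                mapping[element_id] = node.get("id")
                break
            node = parent_of.get(node)
    return mapping


mapping = build_element_to_paragraph_map(SAMPLE)
print(mapping.get("m1"))  # paragraph containing m1
print(mapping.get("m3"))  # None: m3 lies outside any paragraph
```

Elements such as `m3` that have no paragraph ancestor get no entry in the mapping, so a judgement pointing at them has nothing to be redirected to.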
Usage
Installing
The package can be installed by executing the following command:
$ pip install ntcir10-math-converter
Displaying the usage
Usage information for the package can be displayed by executing the following command:
$ ntcir10-math-converter --help
usage: ntcir10-math-converter [-h] --dataset DATASET [DATASET ...]
                              [--judgements JUDGEMENTS [JUDGEMENTS ...]]
                              [--num-workers NUM_WORKERS]

Convert NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11
Math-2, and NTCIR-12 MathIR XHTML5 format.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET [DATASET ...]
                        A path to a directory containing the NTCIR-10 Math
                        XHTML5 dataset, and a path to a non-existent directory
                        that will contain resulting dataset in the NTCIR-11
                        Math-2, and NTCIR-12 MathIR XHTML5 format. If only the
                        path to the NTCIR-10 Math dataset is specified, the
                        dataset will be read to find out the mapping between
                        element identifiers, and paragraph identifiers. This
                        is required for converting the relevance judgements.
  --judgements JUDGEMENTS [JUDGEMENTS ...]
                        Paths to the files containing NTCIR-10 Math relevance
                        judgements (odd arguments), followed by paths to the
                        files that will contain resulting relevance judgements
                        in the NTCIR-11 Math-2, and NTCIR-12 MathIR format
                        (even arguments).
  --num-workers NUM_WORKERS
                        The number of processes that will be used for
                        processing the NTCIR-10 Math dataset. Defaults to 1.
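The --judgements option takes input and output paths in alternation: odd arguments are inputs, even arguments are outputs. A minimal sketch of how such a flat list could be paired up follows; the helper name `pair_judgement_paths` is hypothetical, and the code is an illustration rather than the converter's implementation.

```python
def pair_judgement_paths(paths):
    """Pair a flat [input, output, input, output, ...] list of paths."""
    if len(paths) % 2 != 0:
        raise ValueError("expected an even number of paths (input, output, ...)")
    # Odd positions (1st, 3rd, ...) are inputs; even positions are outputs.
    return list(zip(paths[0::2], paths[1::2]))


paths = [
    "NTCIR_10_Math-qrels_ft.dat", "NTCIR_10_Math-qrels_ft-converted.dat",
    "NTCIR_10_Math-qrels_fs.dat", "NTCIR_10_Math-qrels_fs-converted.dat",
]
pairs = pair_judgement_paths(paths)
for input_path, output_path in pairs:
    print(f"{input_path} -> {output_path}")
```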
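Converting both the dataset and the relevance judgements
The following command converts both the dataset and the relevance judgements using 64 worker processes: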
$ ntcir10-math-converter --num-workers 64 \
> --dataset ntcir-10 ntcir-10-converted \
> --judgements \
> NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
> NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9634.03it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9671.33it/s]
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████| 100000/100000 [06:45<00:00, 246.50it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements
The following command converts only the dataset using 64 worker processes:
$ ntcir10-math-converter --num-workers 64 \
> --dataset ntcir-10 ntcir-10-converted
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
100%|████████████████████████████████████████████████████| 100000/100000 [07:34<00:00, 220.10it/s]
The following command converts only the relevance judgements using 64 worker processes:
$ ntcir10-math-converter --num-workers 64 \
> --dataset ntcir-10 \
> --judgements \
> NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
> NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9539.55it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9332.81it/s]
Processing dataset ntcir-10
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████████| 2405/2405 [00:16<00:00, 144.41it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements
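The log lines above indicate that a judgement is skipped when its element appears outside a paragraph. A minimal sketch of such a filtering step follows. It assumes that judgements are held as (topic, identifier, relevance) tuples and that a mapping from element identifiers to paragraph identifiers is already available; the tuple layout, the sample identifiers, and the function name are hypothetical and do not reflect the converter's actual file format.

```python
def convert_judgements(judgements, element_to_paragraph):
    """Redirect judgements to paragraphs, skipping unmapped elements."""
    converted = []
    for topic, identifier, relevance in judgements:
        paragraph = element_to_paragraph.get(identifier)
        if paragraph is None:
            # No ancestor paragraph is known for this element.
            print(f"Skipping identifier {identifier}, "
                  "as it appears outside a paragraph")
            continue
        converted.append((topic, paragraph, relevance))
    return converted


# Hypothetical sample data for illustration only.
judgements = [
    ("topic-1", "doc1#m1", 2),
    ("topic-1", "doc1#m3", 1),
    ("topic-2", "doc2#m7", 0),
]
element_to_paragraph = {"doc1#m1": "doc1_p1", "doc2#m7": "doc2_p4"}

converted = convert_judgements(judgements, element_to_paragraph)
print(f"{len(judgements)} / {len(converted)} input / output relevance judgements")
```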