The NTCIR-10 Math Converter package converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format.

NTCIR-10 Math Converter: converts the NTCIR-10 Math dataset and relevance judgements to the NTCIR-11 and NTCIR-12 format

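In the NTCIR-10 Math task, the unit of retrieval in the dataset is an arXiv document, whereas the unit of judgement in the relevance judgements is an XML element. In the NTCIR-11 Math-2 and NTCIR-12 MathIR tasks, on the other hand, the unit of both retrieval and judgement in the dataset and the relevance judgements is a paragraph in an arXiv document. This makes it difficult to use the datasets together in a single evaluation.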

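NTCIR Math Converter is a Python 3 command-line utility that converts the NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11 Math-2 and NTCIR-12 MathIR XHTML5 format. It splits the dataset into paragraphs and redirects the relevance judgements from elements to their ancestor paragraphs. As a result, the NTCIR-10 Math dataset and relevance judgements can be used together with the NTCIR-11 Math-2 and NTCIR-12 MathIR datasets and relevance judgements in a single workflow.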
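
The idea of redirecting a judgement from an element to its ancestor paragraph can be sketched as follows. This is a minimal illustration, not the converter's actual code: the sample document, the `class="para"` paragraph marker, and the function names are all assumptions made for the example, and the markup of the real dataset may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real dataset's markup may differ.
DOCUMENT = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1 id="id1">Title</h1>
    <div class="para" id="p1">
      <span id="id2">Let <math id="id3"><mi>x</mi></math> be given.</span>
    </div>
    <div class="para" id="p2">
      <math id="id4"><mi>y</mi></math>
    </div>
  </body>
</html>
"""


def is_paragraph(element):
    # Assumption: paragraphs carry class="para".
    return element.get("class") == "para"


def map_elements_to_paragraphs(root):
    """Map every element id to the id of its nearest enclosing paragraph.

    A paragraph maps to itself. Elements that lie outside any paragraph
    are left out of the mapping.
    """
    mapping = {}

    def visit(element, paragraph_id):
        if is_paragraph(element):
            paragraph_id = element.get("id")
        element_id = element.get("id")
        if element_id is not None and paragraph_id is not None:
            mapping[element_id] = paragraph_id
        for child in element:
            visit(child, paragraph_id)

    visit(root, None)
    return mapping


mapping = map_elements_to_paragraphs(ET.fromstring(DOCUMENT))
print(mapping)
```

In this sketch, the heading `id1` has no enclosing paragraph and is therefore absent from the mapping, which mirrors the situation the converter reports as an identifier that "appears outside a paragraph".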

Usage

Installing

The package can be installed by executing the following command:

$ pip install ntcir10-math-converter

Displaying the usage

Usage information for the package can be displayed by executing the following command:

$ ntcir10-math-converter --help
usage: ntcir10-math-converter [-h] --dataset DATASET [DATASET ...]
                              [--judgements JUDGEMENTS [JUDGEMENTS ...]]
                              [--num-workers NUM_WORKERS]

Convert NTCIR-10 Math XHTML5 dataset and relevance judgements to the NTCIR-11
Math-2, and NTCIR-12 MathIR XHTML5 format.

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET [DATASET ...]
                        A path to a directory containing the NTCIR-10 Math
                        XHTML5 dataset, and a path to a non-existent directory
                        that will contain resulting dataset in the NTCIR-11
                        Math-2, and NTCIR-12 MathIR XHTML5 format. If only the
                        path to the NTCIR-10 Math dataset is specified, the
                        dataset will be read to find out the mapping between
                        element identifiers, and paragraph identifiers. This
                        is required for converting the relevance judgements.
  --judgements JUDGEMENTS [JUDGEMENTS ...]
                        Paths to the files containing NTCIR-10 Math relevance
                        judgements (odd arguments), followed by paths to the
                        files that will contain resulting relevance judgements
                        in the NTCIR-11 Math-2, and NTCIR-12 MathIR format
                        (even arguments).
  --num-workers NUM_WORKERS
                        The number of processes that will be used for
                        processing the NTCIR-10 Math dataset. Defaults to 1.
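
The way the help text describes the arguments (one or two `--dataset` paths, and `--judgements` paths alternating between input and output files) can be sketched with `argparse`. This is only an illustration of the documented interface, not the package's source: the `interpret` function and its validation rules are assumptions.

```python
import argparse


def build_parser():
    # Mirrors the options listed in the help text above.
    parser = argparse.ArgumentParser(prog="ntcir10-math-converter")
    parser.add_argument("--dataset", nargs="+", required=True)
    parser.add_argument("--judgements", nargs="+", default=[])
    parser.add_argument("--num-workers", type=int, default=1)
    return parser


def interpret(argv):
    """Return (input_dir, output_dir, judgement_pairs, num_workers).

    output_dir is None when only the input dataset path is given, in
    which case the dataset is only read to build the identifier mapping.
    """
    args = build_parser().parse_args(argv)
    if len(args.dataset) > 2:
        # Assumed check: at most an input and an output directory.
        raise ValueError("expected at most two --dataset paths")
    if len(args.judgements) % 2 != 0:
        # Assumed check: every input file needs a matching output file.
        raise ValueError("--judgements paths must come in input/output pairs")
    input_dir = args.dataset[0]
    output_dir = args.dataset[1] if len(args.dataset) == 2 else None
    # Odd arguments (1st, 3rd, ...) are inputs, even arguments are outputs.
    pairs = list(zip(args.judgements[0::2], args.judgements[1::2]))
    return input_dir, output_dir, pairs, args.num_workers


print(interpret([
    "--dataset", "ntcir-10",
    "--judgements", "a.dat", "a-converted.dat", "b.dat", "b-converted.dat",
]))
```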

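Converting both the dataset and the relevance judgements

The following command converts both the dataset and the relevance judgements using 64 worker processes: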

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9634.03it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9671.33it/s]
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████| 100000/100000 [06:45<00:00, 246.50it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements

The following command converts only the dataset using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 ntcir-10-converted
Processing dataset ntcir-10
Converting dataset ntcir-10 -> ntcir-10-converted/xhtml5
100%|████████████████████████████████████████████████████| 100000/100000 [07:34<00:00, 220.10it/s]

The following command converts only the relevance judgements using 64 worker processes:

$ ntcir10-math-converter --num-workers 64 \
>     --dataset ntcir-10 \
>     --judgements \
>         NTCIR_10_Math-qrels_ft.dat NTCIR_10_Math-qrels_ft-converted.dat \
>         NTCIR_10_Math-qrels_fs.dat NTCIR_10_Math-qrels_fs-converted.dat
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_ft.dat
100%|███████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 9539.55it/s]
Retrieving judged document names, and element identifiers from NTCIR_10_Math-qrels_fs.dat
100%|███████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 9332.81it/s]
Processing dataset ntcir-10
Building a mapping between element identifiers, and paragraph identifiers
100%|████████████████████████████████████████████████████████| 2405/2405 [00:16<00:00, 144.41it/s]
Converting relevance judgements NTCIR_10_Math-qrels_ft.dat -> NTCIR_10_Math-qrels_ft-converted.dat
Skipping identifier f080935#idp57072, as it appears outside a paragraph
Skipping identifier f039264#id60072, as it appears outside a paragraph
Skipping identifier f059698#id58538, as it appears outside a paragraph
...
Skipping identifier f023353#idp65840, as it appears outside a paragraph
Skipping identifier f048268#id53551, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 1425/1425 [00:00<00:00, 252199.81it/s]
1425 / 1394 input / output relevance judgements
Converting relevance judgements NTCIR_10_Math-qrels_fs.dat -> NTCIR_10_Math-qrels_fs-converted.dat
Skipping identifier f095981#id72919, as it appears outside a paragraph
Skipping identifier f061190#id56357, as it appears outside a paragraph
Skipping identifier f033738#id116089, as it appears outside a paragraph
...
Skipping identifier f019052#id54515, as it appears outside a paragraph
Skipping identifier f021845#id53581, as it appears outside a paragraph
100%|█████████████████████████████████████████████████████| 2129/2129 [00:00<00:00, 291048.96it/s]
2129 / 2076 input / output relevance judgements
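
The "Skipping identifier" messages and the final input / output counts suggest a lookup step along the following lines. This is a hedged sketch only: the `document#element` identifier shape is taken from the log above, but the paragraph identifiers, the mapping contents, and the `redirect_identifiers` function are hypothetical, and the real converter may also treat several judgements that fall in the same paragraph differently.

```python
def redirect_identifiers(identifiers, mapping):
    """Redirect 'document#element' identifiers to paragraph identifiers.

    mapping is keyed by (document, element). Identifiers without an
    entry are treated as lying outside any paragraph and are skipped.
    """
    converted, skipped = [], []
    for identifier in identifiers:
        document, element = identifier.split("#", 1)
        paragraph = mapping.get((document, element))
        if paragraph is None:
            print(f"Skipping identifier {identifier}, "
                  "as it appears outside a paragraph")
            skipped.append(identifier)
        else:
            converted.append(paragraph)
    return converted, skipped


# Hypothetical mapping and paragraph identifiers, for illustration only.
example_mapping = {
    ("f000001", "id100"): "f000001_1_4",
    ("f000002", "id200"): "f000002_1_7",
}
example_identifiers = ["f000001#id100", "f000002#id200", "f000003#id300"]

converted, skipped = redirect_identifiers(example_identifiers, example_mapping)
print(f"{len(example_identifiers)} / {len(converted)} "
      "input / output relevance judgements")
```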
