Using WikiExtractor as a parser for Wikipedia data dump files

1 answer

User

#1 · Posted 2024-06-25 22:37:45

Please take a look at this; it should help.

Error using the 'find' command to generate a collection file on opencv

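The commands given on the WikiExtractor page are written for Unix/Linux systems and do not work on Windows.

The find command on Windows behaves differently from the one on Unix/Linux.

The extraction step itself works on both Windows and Linux, as long as you run the script with the python prefix: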

python WikiExtractor.py -cb 250K -o extracted your_bz2_file

You will see an extracted folder created in the same directory as the script.

After that, the find command is supposed to work like this (Linux only):

find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml

That is: find everything in the extracted folder whose name matches bz2, run the bzip2 command on each of those files, and redirect the combined result into text.xml.
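The matching half of that command can be sketched in Python, which behaves the same on Windows and Linux. This is only an illustration: find_bz2 is a hypothetical helper name, and unlike find -name it skips directories and returns regular files only.

```python
from pathlib import Path


def find_bz2(root):
    """Return every regular file under root whose name ends in 'bz2', sorted."""
    # rglob walks all subfolders, like find does by default.
    return sorted(p for p in Path(root).rglob("*bz2") if p.is_file())


# Example usage (assumes an 'extracted' folder exists in the current directory):
# for path in find_bz2("extracted"):
#     print(path)
```

Sorting is a choice made for this sketch so that the files are visited in a predictable order.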

Also, if you run bzip2 -help (bzip2 is the program that the find command above invokes), you will see that it does not work on Windows. On Linux you get the following output:

gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

   usage: bzip2 [flags and input files in any order]

   -h --help           print this message
   -d --decompress     force decompression
   -z --compress       force compression
   -k --keep           keep (don't delete) input files
   -f --force          overwrite existing output files
   -t --test           test compressed file integrity
   -c --stdout         output to standard out
   -q --quiet          suppress noncritical error messages
   -v --verbose        be verbose (a 2nd -v gives more)
   -L --license        display software version & license
   -V --version        display software version & license
   -s --small          use less memory (at most 2500k)
   -1 .. -9            set block size to 100k .. 900k
   --fast              alias for -1
   --best              alias for -9

   If invoked as `bzip2', default action is to compress.
              as `bunzip2',  default action is to decompress.
              as `bzcat', default action is to decompress to stdout.

   If no file names are given, bzip2 compresses or decompresses
   from standard input to standard output.  You can combine
   short flags, so `-v -4' means the same as -v4 or -4v, &c.

As the help text says, the default action of bzip2 is to compress, so use bzcat to decompress instead.
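The same compress/decompress distinction can be seen with Python's standard bz2 module. This is a small illustrative sketch using made-up sample text, not something taken from WikiExtractor:

```python
import bz2

# Made-up sample text standing in for one extracted article.
sample = '<doc id="1">Example article text</doc>\n'.encode("utf8")

compressed = bz2.compress(sample)      # what bzip2 does by default
restored = bz2.decompress(compressed)  # what bzcat (or bzip2 -d) does

# A bz2 stream starts with the magic bytes "BZh".
print(compressed[:3])
print(restored == sample)
```

Running the compressor on the extracted files would therefore not give readable text; only the decompress direction recovers the articles.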

The modified command, which works only on Linux, looks like this:

find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml

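It works on my Ubuntu system.

Edit:

For Windows:

Read the instructions through before trying anything.

1. Create a separate folder and put the files in it. Files: WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since that was a small file I could find).

2. Open a command prompt in the current directory and run the following command to extract all the files: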

python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2

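This takes a while depending on the file size. When it finishes, the directory will contain an extracted folder alongside the two files.

Note: if you already have the extracted folder, move it into the current directory so that it matches this layout; you do not need to extract again.

3. Copy the code below and save it as bz2_Extractor.py: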

import argparse
import bz2
import logging

from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    """Return the paths of all files under root, descending into subfolders."""
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        # Paths returned by the recursive call are already joined with their parent.
        files.extend(get_all_files_recursively(join(root, d)))
    return files


def bzip_decompress(list_of_files, output_file):
    """Decompress each bz2 file in turn and write its text to output_file."""
    start_time = datetime.now()
    # Use a separate name for the handle so the path argument is not shadowed.
    with open(output_file, 'w', encoding="utf8") as out:
        for file in list_of_files:
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file  -> {file}")
                # write() emits the text in one call; writelines() would iterate it character by character.
                out.write(bz2_file.read())
                out.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = {(stop_time - start_time).total_seconds()}")


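def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True, help="root directory containing the bz2 files")
    # type=int with default=None lets -n be omitted without int(None) raising a TypeError.
    parser.add_argument("-n", required=False, type=int, default=None, help="number of files to write (default: all)")
    parser.add_argument("-o", required=True, help="output file name")
    args = parser.parse_args()

    all_files = get_all_files_recursively(args.r)
    # Slicing with None keeps every file, so all files are written when -n is not given.
    bzip_decompress(all_files[:args.n], args.o)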


if __name__ == "__main__":
    main()

Run the script as shown below, and read what each input in the command does:

python bz2_Extractor.py -r extracted -o output.txt -n 10

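-r: the root directory containing the bz2 files

-o: the output file name

-n: the number of files to write. [If not provided, all files under the root directory are written out.]

Note: I can see that your file is several GB in size and contains more than 500,000 articles. If you try to put all of that into a single file with the command above, I am not sure what will happen or whether your system will cope. Even if it does, the output will be very large, since it is extracted from a 2.8 GB file, and I do not think Windows will be able to open it directly.

So my suggestion is to process 10000 files at a time.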
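One way to process the files in batches is sketched below. It is not part of the script above: write_in_batches, out_prefix and batch_size are hypothetical names chosen for this example, and it assumes the extracted files end in .bz2. Each batch goes to its own numbered output file, so no single output grows too large.

```python
import bz2
from pathlib import Path


def write_in_batches(root, out_prefix, batch_size=10000):
    """Decompress the .bz2 files under root into numbered output files.

    out_prefix and batch_size are hypothetical parameters for this sketch.
    Returns the list of output paths that were written.
    """
    files = sorted(p for p in Path(root).rglob("*.bz2") if p.is_file())
    outputs = []
    for start in range(0, len(files), batch_size):
        # e.g. output_0000.txt, output_0001.txt, ...
        out_path = Path(f"{out_prefix}_{start // batch_size:04d}.txt")
        with open(out_path, "w", encoding="utf8") as out:
            for path in files[start:start + batch_size]:
                with bz2.open(path, "rt", encoding="utf8") as src:
                    out.write(src.read())
                    out.write("\n")
        outputs.append(out_path)
    return outputs
```

For example, write_in_batches("extracted", "output") would write output_0000.txt, output_0001.txt and so on, each holding at most 10000 decompressed files.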

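Let me know if this works for you.

PS: When run, the command above logs one "Reading/Writing file" line per input file and finishes by printing the total time taken.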
