PDF-to-Markdown conversion tool: magic-pdf

1. Installing the magic-pdf environment

conda create -n MinerU python=3.10
conda activate MinerU
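# Quote the requirement specifiers: an unquoted ">" is treated by the shell as output redirection,
# and unquoted square brackets are expanded as a glob pattern by some shells (e.g. zsh).
pip install "boto3>=1.28.43" -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install "magic-pdf[full]==0.7.0b1" --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple/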
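
After installation, a quick check can confirm that the package is visible in the active conda environment. The sketch below uses only the standard library; `installed_version` is an illustrative helper name, and it assumes the distribution is registered as `magic-pdf`, as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "magic-pdf" is the distribution name used in the pip command above.
    print("magic-pdf:", installed_version("magic-pdf") or "not installed")
```

If this reports "not installed", the most likely cause is that the MinerU environment is not the active one.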

2. Downloading the model weights

sudo apt-get install git-lfs
git clone https://github.com/opendatalab/MinerU.git
cd MinerU/
git lfs install
git lfs clone https://huggingface.co/wanderkid/PDF-Extract-Kit

Alternatively, download the models from ModelScope:

pip install modelscope

# Use the following Python code to download the model using the ModelScope SDK:
from modelscope import snapshot_download
model_dir = snapshot_download('wanderkid/PDF-Extract-Kit')
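
Whichever download route is used, step 3 needs the path of the `models` folder inside the downloaded repository. Here is a minimal sketch for locating it, assuming the layout seen in the log in section 5 (`.../PDF-Extract-Kit/models/Layout/model_final.pth`). `find_models_dir` is a hypothetical helper, not part of magic-pdf or ModelScope.

```python
from pathlib import Path
from typing import Optional


def find_models_dir(repo_dir: str) -> Optional[Path]:
    """Return <repo_dir>/models if the layout weights are present, else None."""
    models = Path(repo_dir).expanduser() / "models"
    # Assumed marker file, taken from the log output in section 5.
    marker = models / "Layout" / "model_final.pth"
    return models.resolve() if marker.is_file() else None
```

For the ModelScope route, the value returned by `snapshot_download` can be passed as `repo_dir`. For the git route, pass the cloned `PDF-Extract-Kit` directory.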

3. Editing the configuration

In magic-pdf.template.json, set models-dir to the path where the models were downloaded:

{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "models-dir": "/home/adam/work/MinerU/PDF-Extract-Kit/models",
    "device-mode": "cpu",
    "table-config": {
        "is_table_recog_enable": false,
        "max_time": 400
    }
}
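
The same edit can be scripted. This is a minimal sketch that loads the template, overwrites only the `models-dir` key, and writes the result to a destination of your choice. `write_config` is an illustrative helper, not a magic-pdf API.

```python
import json
from pathlib import Path


def write_config(template_path: str, models_dir: str, dest_path: str) -> dict:
    """Copy the template to dest_path with "models-dir" set to models_dir."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    # Only this key is changed; every other setting is kept as in the template.
    config["models-dir"] = str(Path(models_dir).expanduser())
    Path(dest_path).write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

A typical call would be `write_config("magic-pdf.template.json", "/home/adam/work/MinerU/PDF-Extract-Kit/models", str(Path.home() / "magic-pdf.json"))`.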

Rename magic-pdf.template.json to magic-pdf.json and place it in your user home directory. The default location of that directory depends on the operating system:

Windows: C:\Users\YourUsername

Linux: /home/YourUsername

macOS: /Users/YourUsername
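
In Python, `Path.home()` resolves to the directories listed above on typical setups, so a short pre-flight check can catch a misplaced file or a wrong `models-dir` before a long run. The helper name `check_config` is hypothetical, and it validates only the `models-dir` key.

```python
import json
from pathlib import Path
from typing import List, Optional


def check_config(home: Optional[Path] = None) -> List[str]:
    """Return the problems found with <home>/magic-pdf.json (empty list if none)."""
    base = home if home is not None else Path.home()
    path = base / "magic-pdf.json"
    if not path.is_file():
        return [f"config file not found: {path}"]
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON in {path}: {exc}"]
    if not isinstance(config, dict):
        return [f"top-level JSON value in {path} is not an object"]
    problems = []
    models_dir = config.get("models-dir")
    if not models_dir or not Path(models_dir).is_dir():
        problems.append(f"models-dir does not exist: {models_dir!r}")
    return problems
```

Calling `check_config()` with no argument inspects the current user's home directory.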

4. Command-line options

magic-pdf --help
Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                                  from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.

## show version
magic-pdf -v

## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
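
When the tool is called from a script, the command line can be assembled and validated before launching it. The sketch below uses only the flags shown in the help text (`-p`, `-o`, `-m`). `build_command` and `run` are illustrative wrappers, and `run` assumes `magic-pdf` is on the PATH.

```python
import subprocess
from typing import List

METHODS = ("ocr", "txt", "auto")  # choices listed by `magic-pdf --help`


def build_command(pdf_path: str, output_dir: str, method: str = "auto") -> List[str]:
    """Build the argument list for one magic-pdf run."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    return ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", method]


def run(pdf_path: str, output_dir: str, method: str = "auto") -> int:
    """Run magic-pdf and return its exit code (requires magic-pdf on PATH)."""
    return subprocess.run(build_command(pdf_path, output_dir, method)).returncode
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.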

{some_pdf} can be a single PDF file or a directory containing multiple PDFs. The results are saved in the {some_output_dir} directory. The output files are listed below:

├── some_pdf.md                 # markdown file
├── images                      # directory for storing images
├── layout.pdf                  # layout diagram
├── middle.json                 # MinerU intermediate processing result
├── model.json                  # model inference result
├── origin.pdf                  # original PDF file
└── spans.pdf                   # smallest granularity bbox position information diagram
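
A run can be checked for completeness by comparing a result directory against this listing. In the sketch below, the file names are taken verbatim from the listing above, and other magic-pdf versions may name them differently. `missing_outputs` is a hypothetical helper, and `result_dir` is the directory that actually holds the files (the log in section 5 reports it as "local output dir").

```python
from pathlib import Path
from typing import List


def missing_outputs(result_dir: str, pdf_stem: str) -> List[str]:
    """List the expected output entries that are absent from result_dir."""
    expected = [
        f"{pdf_stem}.md",  # markdown file
        "images",          # directory for storing images
        "layout.pdf",      # layout diagram
        "middle.json",     # intermediate processing result
        "model.json",      # model inference result
        "origin.pdf",      # original PDF file
        "spans.pdf",       # bbox position diagram
    ]
    root = Path(result_dir)
    return [name for name in expected if not (root / name).exists()]
```

An empty list means every entry from the listing was found.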

5. Testing

magic-pdf -p GenZ-LLM.pdf -o ./res/ -m auto

Results:

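The test ran on CPU with 16 GB of RAM. Parsing a 3-page PDF took about 2 minutes, and the process crashes when the PDF has too many pages. Some formulas appear to be parsed incorrectly, but overall the tool is usable.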
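
Since runs with many pages crashed in this test, one possible workaround (an assumption, not something verified here) is to split a long PDF into smaller page ranges and parse each part separately. The sketch below only computes the ranges. `page_chunks` is a hypothetical helper, and the actual splitting would need a PDF library of your choice.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split pages 0..total_pages-1 into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]
```

For example, `page_chunks(7, 3)` yields `[(0, 2), (3, 5), (6, 6)]`, i.e. two full chunks and one final single-page chunk.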

Detailed log:

2024-08-13 15:53:44.149 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 14962, cid_chars_radio: 0.0
INFO:datasets:PyTorch version 2.3.1 available.
2024-08-13 15:53:53.048 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:111 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: False, apply_table: False
2024-08-13 15:53:53.048 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:119 - using device: cpu
2024-08-13 15:53:53.048 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:121 - using models_dir: /home/long/work/MinerU/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[08/13 15:54:06 detectron2]: Rank of current process: 0. World size: 1
[08/13 15:54:07 detectron2]: Environment info:
-------------------------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sys.platform                     linux
Python                           3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
numpy                            1.26.4
detectron2                       0.6 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/detectron2
detectron2._C                    not built correctly: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/detectron2/_C.cpython-310-x86_64-linux-gnu.so)
Compiler ($CXX)                  c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          2.3.1+cu121 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    No: torch.cuda.is_available() == False
Pillow                           10.4.0
torchvision                      0.18.1+cu121 @/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/torchvision
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.6.0
-------------------------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

[08/13 15:54:07 detectron2]: Command line arguments: {'config_file': '/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth']}
[08/13 15:54:07 detectron2]: Contents of args.config_file=/home/long/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - scihub_train
  TRAIN:
  - scihub_train
GLOBAL:
  HACK: 1.0
ICDAR_DATA_DIR_TEST: ''
ICDAR_DATA_DIR_TRAIN: ''
INPUT:
  CROP:
    ENABLED: true
    SIZE:
    - 384
    - 600
    TYPE: absolute_range
  FORMAT: RGB
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN:
  - 480
  - 512
  - 544
  - 576
  - 608
  - 640
  - 672
  - 704
  - 736
  - 768
  - 800
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
    - - 64
    - - 128
    - - 256
    - - 512
  BACKBONE:
    FREEZE_AT: 2
    NAME: build_vit_fpn_backbone
  CONFIG_PATH: ''
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES:
    - layer3
    - layer5
    - layer7
    - layer11
    NORM: ''
    OUT_CHANNELS: 256
  IMAGE_ONLY: true
  KEYPOINT_ON: false
  LOAD_PROPOSALS: false
  MASK_ON: true
  META_ARCHITECTURE: VLGeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 127.5
  - 127.5
  - 127.5
  PIXEL_STD:
  - 127.5
  - 127.5
  - 127.5
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res4
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: true
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS:
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NORM: ''
    NUM_CLASSES: 10
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS:
    - 10.0
    - 10.0
    - 5.0
    - 5.0
    CLS_AGNOSTIC_BBOX_REG: true
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: FastRCNNConvFCHead
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 2
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: CascadeROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 10
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS:
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    BOUNDARY_THRESH: -1
    CONV_DIMS:
    - -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    - p6
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 1000
    PRE_NMS_TOPK_TRAIN: 2000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 10
  VIT:
    DROP_PATH: 0.1
    IMG_SIZE:
    - 224
    - 224
    NAME: layoutlmv3_base
    OUT_FEATURES:
    - layer3
    - layer5
    - layer7
    - layer11
    POS_TYPE: abs
  WEIGHTS:
OUTPUT_DIR:
SCIHUB_DATA_DIR_TRAIN: ~/publaynet/layout_scihub/train
SEED: 42
SOLVER:
  AMP:
    ENABLED: true
  BACKBONE_MULTIPLIER: 1.0
  BASE_LR: 0.0002
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 2000
  CLIP_GRADIENTS:
    CLIP_TYPE: full_model
    CLIP_VALUE: 1.0
    ENABLED: true
    NORM_TYPE: 2.0
  GAMMA: 0.1
  GRADIENT_ACCUMULATION_STEPS: 1
  IMS_PER_BATCH: 32
  LR_SCHEDULER_NAME: WarmupCosineLR
  MAX_ITER: 20000
  MOMENTUM: 0.9
  NESTEROV: false
  OPTIMIZER: longW
  REFERENCE_WORLD_SIZE: 0
  STEPS:
  - 10000
  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 333
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.05
  WEIGHT_DECAY_BIAS: null
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 1000
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: false
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0

[08/13 15:54:08 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth ...
[08/13 15:54:08 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/long/work/MinerU/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-08-13 15:54:09.334 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:148 - DocAnalysis init done!
2024-08-13 15:54:09.336 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 25.18623661994934
2024-08-13 15:54:18.411 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 8.96
0: 1888x1472 2 embeddings, 3839.2ms
Speed: 28.6ms preprocess, 3839.2ms inference, 0.9ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:54:25.349 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 2, mfr time: 1.24
2024-08-13 15:54:34.577 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 9.22
0: 1888x1472 25 embeddings, 4120.5ms
Speed: 15.3ms preprocess, 4120.5ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:54:49.462 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 25, mfr time: 10.67
2024-08-13 15:54:59.903 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:159 - layout detection cost: 10.44
0: 1888x1472 18 embeddings, 4241.8ms
Speed: 20.1ms preprocess, 4241.8ms inference, 0.9ms postprocess per image at shape (1, 3, 1888, 1472)
2024-08-13 15:55:12.180 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:189 - formula nums: 18, mfr time: 7.93
2024-08-13 15:55:12.184 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:124 - doc analyze cost: 62.73242211341858
2024-08-13 15:55:12.233 | INFO     | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 0, last_page_cost_time: 0.0
2024-08-13 15:55:12.305 | INFO     | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 1, last_page_cost_time: 0.07
2024-08-13 15:55:12.364 | INFO     | magic_pdf.pdf_parse_union_core:pdf_parse_union:221 - page_id: 2, last_page_cost_time: 0.06
2024-08-13 15:55:12.743 | INFO     | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表，列表行数：[(8, 9)]， [[8, 9]]
2024-08-13 15:55:12.744 | INFO     | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第8到第9行是列表
2024-08-13 15:55:12.750 | INFO     | magic_pdf.para.para_split_v2:__detect_list_lines:140 - 发现了列表，列表行数：[(19, 20)]， [[19]]
2024-08-13 15:55:12.750 | INFO     | magic_pdf.para.para_split_v2:__detect_list_lines:153 - 列表行的第19到第20行是列表
2024-08-13 15:55:12.755 | INFO     | magic_pdf.para.para_split_v2:para_split:764 - 连接了第0页和第1页的段落
2024-08-13 15:55:13.239 | INFO     | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished
2024-08-13 15:55:13.278 | INFO     | magic_pdf.pipe.UNIPipe:pipe_mk_uni_format:43 - uni_pipe mk content list finished
2024-08-13 15:55:13.278 | INFO     | magic_pdf.tools.common:do_parse:119 - local output dir is ./res/GenZ-LLM-Analyzer/auto
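
To see where the time goes, the per-page timings can be pulled out of a log like the one above. This is a minimal sketch that matches only the two message patterns visible in this log ("layout detection cost" and "mfr time"). `stage_times` is an illustrative helper, and the patterns may not match other magic-pdf versions.

```python
import re
from typing import Dict, List

LAYOUT_RE = re.compile(r"layout detection cost: (\d+(?:\.\d+)?)")
MFR_RE = re.compile(r"mfr time: (\d+(?:\.\d+)?)")


def stage_times(log_text: str) -> Dict[str, List[float]]:
    """Collect per-page layout-detection and formula-recognition times in seconds."""
    return {
        "layout": [float(x) for x in LAYOUT_RE.findall(log_text)],
        "mfr": [float(x) for x in MFR_RE.findall(log_text)],
    }
```

For the three pages in this run, the layout detection times add up to about 28.6 s and the formula recognition (mfr) times to about 19.8 s, out of the 62.7 s reported as "doc analyze cost". Model initialization adds another 25.2 s.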
