aleph.ingest.poppler

aleph.ingest.poppler.element_size(el)

aleph.ingest.poppler.extract_page(path, temp_dir, page, languages)

Extract the contents of a single PDF page, using OCR if need be.
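The reference does not say how the function decides that OCR is needed. As a rough sketch only, one plausible heuristic is to OCR a page when it contains images and either has almost no extractable text or has an image covering a large share of the page area. The helper `page_needs_ocr`, its thresholds, and the assumption that `element_size` multiplies the `width` and `height` attributes are all hypothetical, not taken from the aleph source.

```python
import xml.etree.ElementTree as ET


def element_size(el):
    """Assumed behaviour: area of an element from its width/height attributes."""
    width = float(el.attrib.get("width", 1))
    height = float(el.attrib.get("height", 1))
    return width * height


def page_needs_ocr(page, min_chars=2, max_image_ratio=0.3):
    """Hypothetical heuristic: decide whether a <page> element should be OCR'd."""
    # Collect the non-empty text fragments found on the page.
    fragments = ("".join(t.itertext()).strip() for t in page.findall("./text"))
    text = "\n".join(f for f in fragments if f)
    page_area = element_size(page)
    for image in page.findall("./image"):
        ratio = element_size(image) / page_area
        # OCR if there is (almost) no text, or an image dominates the page.
        if len(text) < min_chars or ratio > max_image_ratio:
            return True
    return False


# Two hand-written pages in the shape pdftohtml's XML output uses.
SCANNED = ET.fromstring(
    '<page number="1" width="100" height="100">'
    '<image width="90" height="90"/></page>'
)
TEXTUAL = ET.fromstring(
    '<page number="2" width="100" height="100">'
    '<text>Hello world</text><image width="10" height="10"/></page>'
)

if __name__ == "__main__":
    print(page_needs_ocr(SCANNED), page_needs_ocr(TEXTUAL))
```

The thresholds are exposed as keyword arguments because the right cut-off depends on the document collection; the defaults here are placeholders.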

aleph.ingest.poppler.extract_pdf(path)

Extract content from a PDF file.

This converts the whole file to XML using pdftohtml, then runs OCR on individual images within the file.
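To illustrate the conversion step, the sketch below builds a `pdftohtml` command line and walks XML of the kind that tool emits. The exact flags aleph passes are not documented here, so the flag selection is an assumption; `pdftohtml_command`, `summarize_pages` and the sample XML are hypothetical names and data used only for illustration.

```python
import xml.etree.ElementTree as ET


def pdftohtml_command(pdf_path, out_base):
    """Build a pdftohtml invocation that writes XML (flag choice is an assumption)."""
    # -xml: XML output, -hidden: include hidden text, -q: quiet, -nodrm: ignore DRM.
    return ["pdftohtml", "-xml", "-hidden", "-q", "-nodrm", pdf_path, out_base]


def summarize_pages(xml_text):
    """Return (page number, text element count, image sources) for each page."""
    root = ET.fromstring(xml_text)
    summary = []
    for page in root.iter("page"):
        texts = page.findall("./text")
        images = [img.get("src") for img in page.findall("./image")]
        summary.append((int(page.get("number")), len(texts), images))
    return summary


# Hand-written sample in the shape of pdftohtml's XML output.
SAMPLE_XML = """
<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="80" width="300" height="16" font="0">Annual report</text>
    <image top="200" left="80" width="600" height="400" src="out-1_1.png"/>
  </page>
  <page number="2" top="0" left="0" height="1188" width="918">
    <text top="100" left="80" width="300" height="16" font="0">Second page</text>
  </page>
</pdf2xml>
"""

if __name__ == "__main__":
    print(pdftohtml_command("in.pdf", "out"))
    print(summarize_pages(SAMPLE_XML))
```

The command is returned as an argument list so that it can be handed to `subprocess.run` without shell quoting.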

aleph.ingest.poppler.ocr_page(path, temp_dir, page_no, languages)

Extract a page as an image and perform OCR.
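One way such a step could be implemented is to render the page with Poppler's `pdftoppm` and pass the resulting image to the `tesseract` command-line tool. This is a sketch under that assumption; the reference does not state which renderer or OCR engine aleph actually uses, and `pdftoppm_command`, `tesseract_command` and `ocr_page_sketch` are hypothetical helpers.

```python
import os
import subprocess


def pdftoppm_command(pdf_path, out_base, page_no, dpi=300):
    """Render a single page to <out_base>.png (the dpi default is an assumption)."""
    return [
        "pdftoppm", "-f", str(page_no), "-l", str(page_no),
        "-r", str(dpi), "-png", "-singlefile", pdf_path, out_base,
    ]


def tesseract_command(image_path, languages):
    """Run tesseract on an image, writing recognised text to stdout."""
    cmd = ["tesseract", image_path, "stdout"]
    if languages:
        # tesseract accepts several languages joined with '+', e.g. "eng+deu".
        cmd += ["-l", "+".join(languages)]
    return cmd


def ocr_page_sketch(path, temp_dir, page_no, languages):
    """Hypothetical stand-in for ocr_page; requires pdftoppm and tesseract on PATH."""
    out_base = os.path.join(temp_dir, "page-%s" % page_no)
    subprocess.run(pdftoppm_command(path, out_base, page_no), check=True)
    result = subprocess.run(
        tesseract_command(out_base + ".png", languages),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A call such as `ocr_page_sketch("report.pdf", "/tmp", 3, ["eng"])` would return the recognised text of page 3, provided both tools and the requested language data are installed.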