Save Images to PDF

Hi,

I have a set of separate files, let’s say scans of receipts, mostly in PDF and image formats. I need to group them, let’s say by month, and save each group into a single PDF file. To do that I need to extract either pages or images from all the source PDFs and accumulate them into a single target PDF, along with the other images.

First of all, I couldn’t find any documentation for my use case. Secondly, none of the keywords from the RPA.PDF library that I tried worked as expected. For example:

“Extract Pages From Pdf” can’t append to an existing document; the target is overwritten by the last call of this keyword.
“Get All Figures” extracts images as objects like <RPA.PDF.keywords.model.Figure object at 0x0000028ACFB41788>, but I don’t know how to use them afterwards, and the documentation doesn’t show that either.

Currently looking at the pypdf library to understand some of the keywords, but it still feels like a dead end :frowning:

Welcome to the forum @aut0mat0r !

Sorry to hear about your problems. I started investigating the issue and got some things sorted out. I think we can release a new rpaframework version next week, which will help a bit with these kinds of issues.

I don’t want anybody stuck at deadends.

Thanks @mika, I appreciate it, looking forward to it!

Are you able to share some of the files you have for my test purposes, or do they contain sensitive data?

Yes, can’t share due to compliance, but I can share code I came up with.

import PyPDF2
from fpdf import FPDF
from PIL import Image
from typing import Tuple
from RPA.PDF.keywords import keyword


class PdfGenerator:
    def __init__(self):
        self.writer = PyPDF2.PdfFileWriter()
        # Source PDFs must stay open until the writer has produced its output.
        self._open_files = []

    @keyword
    def append_file_to_pdf(self, filepath: str, tempfile: str = None):
        if filepath.lower().endswith('.pdf'):
            fileobj = open(filepath, "rb")
            self._open_files.append(fileobj)
            reader = PyPDF2.PdfFileReader(fileobj, strict=False)
        else:
            # Maximum image size on the page, in FPDF units (mm by default).
            max_width = 188
            max_height = 244
            reader = self.convert_img_to_pdf(filepath, max_width, max_height, tempfile)
        self.writer.appendPagesFromReader(reader)

    @keyword
    def output_resulting_pdf(self, filepath: str):
        with open(str(filepath), "wb") as f:
            self.writer.write(f)
        # Everything is written, so the source files can be closed now.
        for fileobj in self._open_files:
            fileobj.close()
        self._open_files = []
        self.writer = PyPDF2.PdfFileWriter()

    def convert_img_to_pdf(self, image_path: str, max_width: int, max_height: int, tempfile: str) -> PyPDF2.PdfFileReader:
        im = Image.open(image_path)
        w, h = im.size
        im.close()

        if h < w:
            # Landscape image: landscape page, so the box is swapped too.
            pdf = FPDF(orientation="L")
            width, height = self.fit_dimensions_to_box(w, h, max_height, max_width)
        else:
            pdf = FPDF()
            width, height = self.fit_dimensions_to_box(w, h, max_width, max_height)

        pdf.add_page()
        pdf.image(name=image_path, x=10, y=10, w=width, h=height)
        pdf.output(name=tempfile)
        return PyPDF2.PdfFileReader(tempfile, strict=False)

    def fit_dimensions_to_box(self, width: int, height: int, max_width: int, max_height: int) -> Tuple[int, int]:
        """
        Fit dimensions of width and height to a given box.
        """
        ratio = width / height
        if width > max_width:
            width = max_width
            height = int(width / ratio)
        if height > max_height:
            height = max_height
            width = int(ratio * height)

        if width == 0 or height == 0:
            raise ValueError("Image has invalid dimensions.")

        return width, height

I used some code from PDF library as you can see.
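
For anyone following along, here is a small standalone check of how the fit_dimensions_to_box logic above behaves. It is the same arithmetic copied into a plain function, so it runs without PyPDF2 or FPDF; the sample image sizes are made up purely for illustration.

```python
from typing import Tuple


def fit_dimensions_to_box(width: int, height: int, max_width: int, max_height: int) -> Tuple[int, int]:
    """Shrink (width, height) to fit inside the box, keeping the aspect ratio."""
    ratio = width / height
    if width > max_width:
        width = max_width
        height = int(width / ratio)
    if height > max_height:
        height = max_height
        width = int(ratio * height)
    if width == 0 or height == 0:
        raise ValueError("Image has invalid dimensions.")
    return width, height


# Portrait image on a portrait page: the box is 188 x 244.
portrait = fit_dimensions_to_box(1000, 2000, 188, 244)
# Landscape image: the caller swaps the box to 244 x 188.
landscape = fit_dimensions_to_box(3000, 2000, 244, 188)
# An image that already fits inside the box is returned unchanged.
small = fit_dimensions_to_box(100, 50, 188, 244)

print(portrait, landscape, small)
```

One thing this makes visible: images that already fit are never enlarged, and the pixel counts are handed to FPDF as page units, so very small scans may come out smaller than expected.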

Thank you for sharing that :+1: I will probably use some of that as it is in the next release. I worked mainly on getting images out of PDF (if possible) and saving them as images.

@aut0mat0r would these new keywords, Add Files To PDF and Save Figure As Image, work for you? I have used them in the two tasks below. Any comments?

I added the possibility to pass parameters with each entry in the Add Files To PDF file list.

For a PDF, the entry can state which pages to add to the new PDF.
For an image, it can state either ‘center’ to center the image, or ‘x’ and ‘y’ coordinates for the image.

*** Tasks ***
Adding files to new PDF
    ${files}=    Create List
    ...    ${TESTDATA_DIR}${/}invoice.pdf
    ...    ${TESTDATA_DIR}${/}robot.pdf:1
    ...    ${TESTDATA_DIR}${/}approved.png:y=0,x=0
    ...    ${TESTDATA_DIR}${/}robot.pdf:2-10,15
    ...    ${TESTDATA_DIR}${/}robot.pdf
    ...    ${TESTDATA_DIR}${/}approved.png
    ...    ${TESTDATA_DIR}${/}approved.png:center
    Add Files To PDF    ${files}    newdoc.pdf

*** Keywords ***
Create Image Prefix from PDF filename
    [Arguments]    ${filename}
    ${prefix}=    Replace String    ${filename}    _    ${EMPTY}
    ${prefix}=    Replace String    ${prefix}    .    ${EMPTY}
    ${prefix}=    Get Substring    ${prefix}    0    10
    [Return]    ${prefix}

*** Tasks ***
Save PDF figures into image files
    ${pdfs}=    List Files In Directory    ${TESTDATA_DIR}    *.pdf
    FOR    ${pdf}    IN    @{pdfs}
        ${pages}    Get Number Of Pages    ${TESTDATA_DIR}${/}${pdf}
        ${figures}    Get All Figures    ${TESTDATA_DIR}${/}${pdf}
        ${prefix}=    Create Image Prefix from PDF filename    ${pdf}
        FOR    ${pagenum}    IN RANGE    1    ${pages+1}
            FOR    ${key}    ${val}    IN    &{figures[${pagenum}]}
                Save Figure As Image    ${val}    ${prefix}_page${pagenum}    ${CURDIR}${/}images
            END
        END
    END

Those keywords are not yet part of the release. I will probably release them tomorrow.

@aut0mat0r now there is rpaframework release 9.3.0 with new keywords

https://rpaframework.org/releasenotes.html

Sorry was busy lately. It looks great man! Let me try those on my data and I’ll get back to you.

It worked, @mika, great job, thanks!
The only thing I’ve noticed is that landscape images were added to the PDF in portrait mode without prior rotation, which resulted in smaller images. Is it possible to rotate the image or the PDF page in the config list? Thanks!

@aut0mat0r there is now rpaframework 9.3.4. I added some new properties to images.

It is possible to set the orientation for the images to portrait or landscape, although the output PDF will look a bit odd if some pages have a different orientation than the rest.

I added format because I had some input PDFs which were actually in Letter format while the default format for added images was different, which resulted in different page sizes.

This is from the keyword documentation:

Add images and/or pdfs to new PDF document

Image formats supported are JPEG, PNG and GIF.

The file can be added with extra properties by appending : and the properties to the end of the filename. Each property should be separated by a comma.

Supported extra properties for PDFs are:

  • page and/or page ranges
  • no extras means that all source PDF pages are added into new PDF

Supported extra properties for images are:

  • format, the PDF page format, for example Letter or A4
  • rotate, how many degrees the image is rotated counter-clockwise
  • align, only possible value at the moment is center
  • orientation, the PDF page orientation for the image, possible values P (portrait) or L (landscape)
  • x/y, coordinates for adjusting image position on the page
***Settings***
Library    RPA.PDF

***Tasks***
Add files to pdf
    ${files}=    Create List
    ...    ${TESTDATA_DIR}${/}invoice.pdf
    ...    ${TESTDATA_DIR}${/}approved.png:align=center
    ...    ${TESTDATA_DIR}${/}robot.pdf:1
    ...    ${TESTDATA_DIR}${/}approved.png:x=0,y=0
    ...    ${TESTDATA_DIR}${/}robot.pdf:2-10,15
    ...    ${TESTDATA_DIR}${/}approved.png
    ...    ${TESTDATA_DIR}${/}landscape_image.png:rotate=-90,orientation=L
    ...    ${TESTDATA_DIR}${/}landscape_image.png:format=Letter
    Add Files To PDF    ${files}    newdoc.pdf
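
To make the property syntax a bit more concrete, here is a rough Python sketch of how the part after the colon could be interpreted. This is only an illustration and not how RPA.PDF parses it internally; the function names expand_pages and parse_image_props are made up for this example.

```python
from typing import Dict, List


def expand_pages(spec: str) -> List[int]:
    """Expand a page spec such as "2-10,15" into a list of page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "2-10" includes both ends.
            start, end = (int(p) for p in part.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


def parse_image_props(spec: str) -> Dict[str, str]:
    """Split "rotate=-90,orientation=L" into a dict of property strings."""
    props: Dict[str, str] = {}
    for part in spec.split(","):
        key, _, value = part.strip().partition("=")
        if key:
            props[key] = value
    return props


pdf_pages = expand_pages("2-10,15")
image_props = parse_image_props("rotate=-90,orientation=L")
print(pdf_pages)
print(image_props)
```

The values are kept as strings on purpose; a real implementation would still need to validate them (for example that rotate is a number and orientation is P or L).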