PDF Extraction Issue

Hi, I am not able to extract some data from the PDF; it reports that it found 2 values.
[Screenshots: code, log, PDF]

Hi, @shine.nazeer!

only_closest: return all possible values or only the closest.

Try setting only_closest to False and see if you get multiple values.

Since there are two instances of Sub Total found and the first one does not have text to the right, it might be the reason for the empty list.

Also, you do not need to set pagenum and direction if the default values are ok (those are the defaults).

I tried those, but it shows the same error: Found 2 Matches for location.

@shine.nazeer: I have the same end result. With a text locator, no matter what I try, if there is more than one match, I only get an empty list. :confused:

I used this test PDF (probably the same you tried):

Here’s the source code if someone wants to dig in deeper:

Yes, I tried with the same PDF. Also, if I look for From & To, it returns an empty list.

I did some debugging, and it seems that in this case the anchor element will not be set (or will be set to None):

self.set_anchor_to_element(locator)

The rest of the logic is wrapped in this if condition that now evaluates to False:

if self.anchor_element:

Apparently, the anchor element cannot be set if there is more than one match (the match is not unique):
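
To make the failure mode concrete, here is a small Python model of the behavior described above. It is not the RPA.PDF source: `find_matches`, `set_anchor`, and `find_text` are hypothetical names, text boxes are reduced to plain strings, and "the next box" stands in for whatever neighbor the real library would pick. The only thing carried over from the debugging is the assumption that a non-unique match leaves the anchor unset, so the guarded logic is skipped.

```python
from typing import List, Optional


def find_matches(boxes: List[str], locator: str) -> List[int]:
    """Return the indexes of all boxes whose text equals the locator."""
    return [i for i, text in enumerate(boxes) if text == locator]


def set_anchor(boxes: List[str], locator: str) -> Optional[int]:
    """Assumed behavior: the anchor is set only when the match is unique."""
    matches = find_matches(boxes, locator)
    return matches[0] if len(matches) == 1 else None


def find_text(boxes: List[str], locator: str) -> List[str]:
    """Return the neighbor of the anchor, or an empty list without an anchor."""
    anchor = set_anchor(boxes, locator)
    results: List[str] = []
    # Everything is guarded by the anchor check, as in the snippet above.
    if anchor is not None and anchor + 1 < len(boxes):
        results.append(boxes[anchor + 1])
    return results
```

In this model, two boxes with the text Sub Total give an empty list no matter which other arguments are passed, which matches what we both see.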

For the time being I am extracting the data using coords, but it is very difficult to find the coords. Is there an easy way to find the coords in a PDF?

Prepare for some pretty messy, unpolished, but working code! :sweat_smile:

(would be better to implement the logic in a Python library - I just wanted to hack something together)

  • Get text coordinates takes a text and returns the bounding box coordinates for the first match.
  • Use the coordinates with the Find text keyword to find the text.
*** Settings ***
Library           RPA.PDF
Library           Collections
Library           String

*** Tasks ***
Get text from PDF
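    # Look up the bounding box of the first "Sub Total" text box.
    ${coordinates}=
    ...    Get text coordinates
    ...    ${CURDIR}${/}example-invoice.pdf
    ...    Sub Total
    # Turn the list into a comma-separated string, e.g. [1, 2, 3, 4] -> 1,2,3,4.
    ${coords}=    Convert To String    ${coordinates}
    ${coords}=    Remove String    ${coords}    [    ]    ${SPACE}
    # Use the coordinates as the locator instead of the non-unique text.
    ${text}=    Find Text    coords:${coords}    direction=down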

*** Keywords ***
Get text coordinates
    [Arguments]    ${pdf_path}    ${text}
    ${textboxes}=    Get Text From Pdf    ${pdf_path}    details=True
    ${coordinates}=    Set Variable    ${EMPTY}
    FOR    ${item}    IN    @{textboxes[1]}
        IF    """${item.text}""" == """${text}"""
            ${coordinates}=    Set Variable    @{item.bbox}
            Exit For Loop
        END
    END
    [Return]    ${coordinates}
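
Since this would be nicer as Python, here is a rough sketch of the same two steps. It assumes only what the robot code above relies on, namely that each item has a `text` and a `bbox` attribute. `TextBox` is a hypothetical stand-in so the snippet runs on its own, and `get_text_coordinates` and `to_coords_locator` are names I made up, not part of RPA.PDF.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

BBox = Tuple[float, float, float, float]


@dataclass
class TextBox:
    """Hypothetical stand-in for an item exposing `text` and `bbox`."""
    text: str
    bbox: BBox


def get_text_coordinates(textboxes: Iterable[TextBox], text: str) -> Optional[BBox]:
    """Return the bounding box of the first box whose text matches exactly."""
    for item in textboxes:
        if item.text == text:
            return item.bbox
    return None


def to_coords_locator(bbox: BBox) -> str:
    """Build the comma-separated `coords:` locator string used above."""
    return "coords:" + ",".join(str(value) for value in bbox)
```

Returning `None` when nothing matches makes the "not found" case explicit, where the robot version returns an empty string.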

This is nice, but how did you know that bbox is the one that holds the coordinates?

I am also not able to extract the From and To addresses.

That is because To: and From: are part of the texts and not their own boxes:

From: DEMO - Sliced Invoices Suite 5A-1204 123 Somewhere Street Your City AZ 12345 admin@slicedinvoices.com

To: Test Business 123 Somewhere St Melbourne, VIC 3000 test@test.com

For those you might need to implement something like “starts with”.
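
A minimal Python sketch of that "starts with" idea, assuming you already have the box texts as plain strings (for example, collected from the detailed output used earlier). `find_text_starting_with` is a hypothetical helper, not an RPA.PDF keyword.

```python
from typing import Iterable, Optional


def find_text_starting_with(texts: Iterable[str], prefix: str) -> Optional[str]:
    """Return the first text that starts with prefix, with the prefix removed."""
    for text in texts:
        stripped = text.strip()
        if stripped.startswith(prefix):
            # Drop the label itself and any whitespace that follows it.
            return stripped[len(prefix):].strip()
    return None
```

Called with the prefix `To:` on the texts above, this would return the address part of the second box without the label.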

I navigated to the Python source file, added some breakpoints, ran the robot in debug mode, and inspected the variables and their values. The debugger is very useful!
