Extracting Tables from PDF

Is there some way of extracting tables from a PDF document?

So here’s what I tried. I gathered that I can use the camelot package in Python to read tables from PDF documents. I have written a custom library inside my libraries folder, and it looks like this:

import camelot

class PDFTables:
    def extract_table(self, pdf):
        tables = camelot.read_pdf(pdf, pages = '1-end')
        return tables[1].df

I saved this file as PDFTables.py, and I am trying to use the library like this:

*** Settings ***
Documentation   Testing stuff on pdf documents
Library         RPA.PDF
Library         PDFTables

*** Tasks ***
Extract PDF Table
    ${activities}=     Extract table      falcon.pdf
    [Return]     ${activities}

When I try to run this, I don’t get an error, but it runs forever and I have to restart the kernel. I am guessing that I need an instance of the class and then have to call the extract_table() method on it. How do I create an instance of this class in my robot file?

hello @dhirajkhanna!

I think the issue is not in the library but in the task: you are using [Return] to return a value from the task itself, and [Return] is meant for user keywords.
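On the question of creating an instance: as far as I know, you don't have to. When the class has the same name as the module (PDFTables in PDFTables.py), Robot Framework creates the instance itself and matches the keyword name to a method while ignoring case, spaces and underscores. Here is a rough plain-Python sketch of that matching. The helpers normalize and find_keyword are made up for illustration and are not Robot Framework's actual code, and the method body is a stand-in so the sketch runs without camelot:

```python
def normalize(name):
    """Lowercase a name and drop spaces and underscores."""
    return name.lower().replace(" ", "").replace("_", "")


class PDFTables:
    def extract_table(self, pdf, table_index=0):
        # Stand-in body: the real method would call camelot here.
        return f"table {table_index} of {pdf}"


def find_keyword(library, keyword_name):
    """Return the public method of `library` that matches `keyword_name`."""
    wanted = normalize(keyword_name)
    for attribute in dir(library):
        if attribute.startswith("_"):
            continue  # skip private and dunder attributes
        candidate = getattr(library, attribute)
        if callable(candidate) and normalize(attribute) == wanted:
            return candidate
    raise AttributeError(f"No keyword named {keyword_name!r}")


# Roughly what happens for `Library    PDFTables` followed by `Extract table`.
library = PDFTables()
keyword = find_keyword(library, "Extract table")
result = keyword("falcon.pdf")
```

So `Extract table    falcon.pdf` in the task ends up calling extract_table("falcon.pdf") on an instance that was created for you.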

I tried your code out, edited it slightly, and was able to extract the table data from the example PDF (foo.pdf) that the camelot library provides.

I am attaching the Robocode Lab activity zip here: example-tables-pdf.zip (97.2 KB) so you can try running it :)

Here’s what I did:

  1. I added the camelot library (camelot-py) to ./config/conda.yaml:

channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.7.5
  - pip=20.1
  - camelot-py
  - pip:
    - rpaframework==2.5.1

  2. I’m using a Mac, so I had to install Ghostscript as recommended in the installation instructions: https://camelot-py.readthedocs.io/en/master/user/install-deps.html#install-deps

  3. I downloaded the foo.pdf file that is used in the examples for the library, and put it in the ./tasks folder (so we don’t have to worry about paths in this example).

  4. I modified your library code like this, so that you can pass the index of the table. Your library had index 1 hardcoded, which fetches the second table in the document (you are probably aware :) ).

import camelot

class PDFTables:
    def extract_table(self, pdf, table_index=0):
        # Read every page, then return the requested table as a pandas DataFrame
        tables = camelot.read_pdf(pdf, pages='1-end')
        return tables[table_index].df

  5. I modified the ./tasks/robot.robot like this:
*** Settings ***
Documentation     Testing stuff on pdf documents
Library           RPA.PDF
Library           PDFTables

*** Tasks ***
Extract PDF Table
    ${activities}=    Extract table    foo.pdf    table_index=0
    Log    ${activities}

(note that we are passing the table_index)
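If you want the keyword to fail with a clear message when the PDF has fewer tables than expected, here is a minimal sketch of a slightly more defensive version. It keeps the same camelot.read_pdf(pdf, pages='1-end') call as above. The reader argument is my own hypothetical addition so the class can be tried without camelot or a real PDF, and the int() call is only a precaution in case Robot Framework hands the index over as a string:

```python
class PDFTables:
    """Robot Framework keyword library for reading tables out of a PDF."""

    def __init__(self, reader=None):
        # `reader` is a hypothetical hook (not in the original library):
        # any callable with the same shape as camelot.read_pdf.
        self._reader = reader

    def extract_table(self, pdf, table_index=0):
        reader = self._reader
        if reader is None:
            import camelot  # imported lazily, only when no reader was given
            reader = camelot.read_pdf
        tables = reader(pdf, pages="1-end")
        index = int(table_index)  # accept "1" as well as 1
        if not 0 <= index < len(tables):
            raise IndexError(
                f"{pdf} has {len(tables)} table(s); index {index} is out of range"
            )
        return tables[index].df


# Trying it out with a fake reader instead of a real PDF.
class FakeTable:
    def __init__(self, df):
        self.df = df


def fake_reader(pdf, pages):
    return [FakeTable("first table"), FakeTable("second table")]


library = PDFTables(reader=fake_reader)
second = library.extract_table("falcon.pdf", table_index="1")
```

In the robot file nothing changes: `Library    PDFTables` still works because reader has a default, and in that case camelot is used.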

Running the robot writes ${activities}, the extracted table, to the log, which means it’s working! :)

I hope this helps!


@Mario, thank you so much for this! I’m still new here, so I’m figuring out the difference between [Return] and Log. The table index is a nice addition, although the PDFs I am using are quite standard and I always need the second table, hence the hardcoded index. Unfortunately, the table headers are not being read; I guess that’s a camelot issue. But thanks again for this, I appreciate it!
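On the headers: as far as I can tell, camelot does not treat the first row as a header, so the column labels come back as 0, 1, 2, ... and the header text sits in the first data row. If that is what is happening here, a minimal pandas sketch for promoting that row could look like this. The function name and the sample data are made up for illustration; the raw frame only stands in for what tables[1].df might contain:

```python
import pandas as pd


def promote_first_row_to_header(df):
    """Return a new DataFrame whose first row has become the column labels."""
    header = [str(cell).strip() for cell in df.iloc[0]]
    body = df.iloc[1:].reset_index(drop=True)  # new frame, original untouched
    body.columns = header
    return body


# Stand-in for an extracted table: integer column labels, header in row 0.
raw = pd.DataFrame([["Activity", "Hours"], ["Testing", "3"], ["Review", "2"]])
table = promote_first_row_to_header(raw)
```

This could live in the same custom library as another keyword, or be applied inside extract_table before returning.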

So if I now want to carry out some aggregation on the extracted table using pandas and spaCy, will I have to write another library?

I did find this pandas Robot Framework library: https://pypi.org/project/robotframework-pandaslibrary/. But if your needs are quite specific, and now that you know how to do it, making your own small library seems like a good way to go.
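To give an idea of how small such a library can be, here is a sketch of one aggregation keyword built on pandas. Everything in it is hypothetical: the class name TableAggregations, the keyword sum_by_group, and the column names in the sample data. It assumes the table already has proper column labels, and it converts the values with pd.to_numeric because extracted cells usually arrive as text:

```python
import pandas as pd


class TableAggregations:
    """Hypothetical Robot Framework library with one aggregation keyword."""

    def sum_by_group(self, table, group_column, value_column):
        # Cells that cannot be parsed as numbers become NaN and are
        # skipped by sum().
        values = pd.to_numeric(table[value_column], errors="coerce")
        totals = values.groupby(table[group_column]).sum()
        # Return a plain dict of floats so the result is easy to log.
        return {group: float(total) for group, total in totals.items()}


# Sample data standing in for an extracted table.
table = pd.DataFrame(
    {
        "Activity": ["Testing", "Review", "Testing", "Review"],
        "Hours": ["3", "2", "4", "n/a"],
    }
)
totals = TableAggregations().sum_by_group(table, "Activity", "Hours")
```

Saved as TableAggregations.py next to PDFTables.py, it would be imported with `Library    TableAggregations` and called as `Sum By Group` with the table and the two column names.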

Great, thanks! Will experiment and see what works best.