Extracting text from PDF

Hi,

I am using “Get Text From Pdf” from the RPA.PDF library to extract text from my PDF.

The name in my PDF is Rati Suri, but the extracted text displays the name as “Ra? Suri”.

Some other words are garbled in the same way: star0ng instead of starting, Addi&onal SoAware instead of Additional Software, InstrucDons instead of Instructions, etc.

Please help on how to resolve this. Thanks

Hi! Is the file something you could share for testing purposes, or does it contain confidential information?

@jani - Attached the same for your reference
ezyzip.zip (79.5 KB)

Please let me know the issue. Thanks

Your PDF is formatted in a strange way, and that affects more than your name.
If I extract all the content of your PDF, I notice that all the missing strings appear from line 90 onward.
So your strings are not missing, they are just not in the right places.
See the screenshot

I doubt that we can do anything about that.
The best way would be to fix the PDF.

If you cannot fix the PDF, I would suggest using OCR to extract the text from your PDF.
Which means:

  • Convert the PDF to an image format (you have a robot for that here)
  • Send the image to AWS to get the strings out (or use a Python OCR library such as easyOCR)

Here is the code you can use to extract the text using easyOCR:

conda.xml:

channels:
  # Define conda channels here.
  - conda-forge

dependencies:
  # Define conda packages here.
  # If available, always prefer the conda version of a package, installation will be faster and more efficient.
  # https://anaconda.org/search
  - python=3.9
  - poppler=22.01.0
  - pdf2image=1.16.0

  - pip=20.1
  - pip:
      # Define pip packages here.
      # https://pypi.org/
      - rpaframework==13.1.0 # https://rpaframework.org/releasenotes.html
      - EasyOCR==1.4.1
      - opencv-python-headless==4.5.4.60

extractText.py:

from pdf2image import convert_from_path
import easyocr


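def convert_pdf_to_images(pdf_path):
    '''
    Convert the first page of a PDF to a PNG image.

    Params:
        pdf_path = full path of your pdf
    Returns:
        list of the image paths that were written
    '''
    images = convert_from_path(pdf_path, first_page=1, last_page=1)
    image_paths = []
    for index, image in enumerate(images):
        image_path = f'{pdf_path}-{index}.png'
        image.save(image_path)
        image_paths.append(image_path)
    return image_paths


def extract_text_from_image(image_path):
    '''
    Run easyOCR on an image and return a list of [bounding_box, text] entries.
    '''
    # 'fr' is the language used for the run shown below; for an English
    # document ['en'] may be a better fit.
    reader = easyocr.Reader(['fr'])
    # paragraph expects a boolean. The string "False" is truthy, so paragraph
    # mode was effectively on; True keeps the [box, text] output shown below.
    result = reader.readtext(image_path, paragraph=True)
    return result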

And here is the output using your PDF:

[
 [[[155, 154], [1401, 154], [1401, 215], [155, 215]], '75b IT Form Non-employee (contractor, freelancer; intern)'],
 [[[220, 255], [1480, 255], [1480, 290], [220, 290]], 'To be completed prior to starting by manager and sent to your People Coordinator fermin maneru@pret com'],
 [[[222, 346], [294, 346], [294, 372], [222, 372]], 'Name'],
 [[[456, 344], [556, 344], [556, 374], [456, 374]], 'Rati Suri'],
 [[[807, 343], [1043, 343], [1043, 379], [807, 379]], 'Form Completed By'],
 [[[1123, 343], [1275, 343], [1275, 379], [1123, 379]], 'Andy Haines'],
 [[[222, 400], [368, 400], [368, 430], [222, 430]], 'Department'],
 [[[808, 398], [932, 398], [932, 428], [808, 428]], 'Start Date'],
 [[[1124, 398], [1324, 398], [1324, 428], [1124, 428]], '28th March 2022'],
 [[[221, 450], [330, 450], [330, 483], [221, 483]], 'Manager'],
 [[[454, 449], [606, 449], [606, 485], [454, 485]], 'Emma Payne'],
 [[[810, 452], [918, 452], [918, 480], [810, 480]], 'End Date'],
 [[[1124, 452], [1324, 452], [1324, 480], [1124, 480]], '28th March 2023'],
 [[[1189, 578], [1529, 578], [1529, 649], [1189, 649]], 'Completed? (To be signed by IT on handover)'],
 [[[168, 600], [276, 600], [276, 630], [168, 630]], 'Facilities'],
 [[[698, 600], [928, 600], [928, 629], [698, 629]], 'Please Answer here'],
 [[[166, 667], [883, 667], [883, 774], [166, 774]], '75b access pass required? No (please state if access to restricted areas are required) Pass Start Date:'],
 [[[946, 740], [1142, 740], [1142, 770], [946, 770]], 'Pass Expiry Date:'],
 [[[168, 831], [448, 831], [448, 863], [168, 863]], 'Lunch account required?'],
 [[[697, 831], [894, 831], [894, 864], [697, 864]], 'Department: n/a'],
 [[[166, 886], [508, 886], [508, 916], [166, 916]], 'IT Systems & Software access'],
 [[[168, 939], [636, 939], [636, 1001], [168, 1001]], 'Please tell us the requested email display name'],
 [[[698, 958], [924, 958], [924, 988], [698, 988]], 'Rati Suri@pret.com'],
 [[[166, 1028], [676, 1028], [676, 1092], [166, 1092]], 'Access to any additional mailboxes should be listed here_'],
 [[[693, 1115], [1159, 1115], [1159, 1370], [693, 1370]], 'Jira, Confluence; MS Teams, AD account; Office 365, Okta, ServiceNow, GitHub organisation (added to "Okta- Global_GitHub" group) Gsuite (added to"Okta-Global_GSuite" and the "Okta- Global_GCP" group) Disco "Okta-Global- BackStage'],
 [[[163, 1190], [649, 1190], [649, 1335], [163, 1335]], 'Additional Software: We supply Microsoft Office applications as standard only: If any other software is required it needs to be listed here'],
 [[[167, 1422], [653, 1422], [653, 1596], [167, 1596]], 'Please list any other drives required if not mentioned below: Department DRV Please state name HO drive Personal Drive'],
 [[[167, 1622], [642, 1622], [642, 1724], [167, 1724]], 'Email distribution lists? Please list all required as if not listed they will not be added.'],
 [[[700, 1660], [748, 1660], [748, 1688], [700, 1688]], '75b'],
 [[[167, 1745], [455, 1745], [455, 1836], [167, 1836]], 'VDI Access required? Remote access required?'],
 [[[168, 1856], [426, 1856], [426, 1888], [168, 1888]], 'To Be Completed by IT'],
 [[[698, 1858], [824, 1858], [824, 1884], [698, 1884]], 'Comments'],
 [[[1192, 1856], [1336, 1856], [1336, 1888], [1192, 1888]], 'Completed?'],
 [[[168, 1910], [448, 1910], [448, 1940], [168, 1940]], 'Follow You Printer set up'],
 [[[700, 1910], [750, 1910], [750, 1936], [700, 1936]], 'PIN:'],
 [[[167, 1958], [498, 1958], [498, 1997], [167, 1997]], 'FollowYou Printing explained']
]

Your name is extracted successfully.
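
If you need the name as a value rather than reading it from the raw list, one possible approach is to look for the text box that sits on the same row as a label and immediately to its right. The sketch below is only an illustration: value_right_of and its row_tolerance parameter are hypothetical helpers of mine, not part of easyOCR, and the sample entries are copied from the output above. It assumes each entry starts with a bounding box followed by the text.

```python
def _center_y(box):
    """Vertical center of a bounding box given as four [x, y] corners."""
    ys = [point[1] for point in box]
    return (min(ys) + max(ys)) / 2


def _left_x(box):
    """Leftmost x coordinate of a bounding box."""
    return min(point[0] for point in box)


def value_right_of(results, label, row_tolerance=15):
    """Return the text of the nearest box to the right of `label` on the same row.

    Hypothetical helper: `row_tolerance` (in pixels) is an assumed threshold
    for deciding that two boxes belong to the same row.
    """
    label_entry = next((e for e in results if e[1].strip() == label), None)
    if label_entry is None:
        return None
    label_box = label_entry[0]
    label_y = _center_y(label_box)
    label_right = max(point[0] for point in label_box)
    candidates = [
        (_left_x(entry[0]), entry[1])
        for entry in results
        if entry is not label_entry
        and abs(_center_y(entry[0]) - label_y) <= row_tolerance
        and _left_x(entry[0]) >= label_right
    ]
    if not candidates:
        return None
    # Pick the candidate whose left edge is closest to the label.
    return min(candidates, key=lambda c: c[0])[1]


# A few entries copied from the OCR output above.
sample = [
    [[[222, 346], [294, 346], [294, 372], [222, 372]], 'Name'],
    [[[456, 344], [556, 344], [556, 374], [456, 374]], 'Rati Suri'],
    [[[807, 343], [1043, 343], [1043, 379], [807, 379]], 'Form Completed By'],
    [[[1123, 343], [1275, 343], [1275, 379], [1123, 379]], 'Andy Haines'],
    [[[221, 450], [330, 450], [330, 483], [221, 483]], 'Manager'],
    [[[454, 449], [606, 449], [606, 485], [454, 485]], 'Emma Payne'],
]

print(value_right_of(sample, 'Name'))
```

The tolerance may need tuning for other scans, since the pixel coordinates depend on the resolution used when converting the PDF to an image.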

NOTE:
To use easyOCR you’ll need a CPU that supports AVX2 instructions or a graphics card that supports CUDA.
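
For the CUDA part, a minimal sketch of how you might check for a usable GPU before creating the reader is shown below. cuda_available and make_reader are hypothetical helper names of mine; the sketch relies on PyTorch (which easyOCR is built on) and on the gpu argument of easyocr.Reader. It does not check for AVX2.

```python
def cuda_available():
    """Return True if PyTorch reports a usable CUDA device, else False."""
    try:
        import torch  # normally installed together with easyOCR
    except ImportError:
        return False
    return torch.cuda.is_available()


def make_reader(languages=('fr',)):
    """Create an easyOCR reader that uses the GPU only when CUDA is available.

    Hypothetical helper; the default language mirrors the script above.
    """
    import easyocr
    return easyocr.Reader(list(languages), gpu=cuda_available())


print("CUDA available:", cuda_available())
```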

Indeed, the text content itself is formatted very strangely. I made a test robot that looks up the name with Find Text:

${pdf} =     Get Work Item File    cv.pdf
Open Pdf    ${pdf}
${matches} =    Find Text    Name
Log List    ${matches}

and got in the logs:

which is the same bad string as reflected when using Get Text From Pdf.


Encouraging @Gael's solution!