How to handle the Word library on Linux

Hi,
I am scheduling my robot in a GitLab pipeline.
I configured the pipeline using the RCC tool, but when it runs, it throws the error below:
“NameError: name ‘win32com’ is not defined”. By default, my GitLab account uses an Ubuntu image.

I have the following code in my script:
RPA.Word.Application.Open Application
RPA.Word.Application.Open File    ${attach1}
${text}=    Get All Texts
RPA.Word.Application.Quit Application

Since my GitLab pipeline runs on Linux, the above script fails because the RPA.Word.Application library works only on Windows.

How can I handle this on Linux? I need to open a Word document and get all the text from it. Kindly clarify.

As the documentation says, RPA.Word.Application works only on Windows. So currently you cannot use it on Linux; you need a Windows machine for that.

See: RPA.Word.Application library

RPA Framework doesn’t yet have a file-based library for manipulating Word documents, but you could implement one with Python.

Here is a super simple example using Python with the Evaluate keyword, but it is probably better to implement it as a proper Python keyword:

${text}=    Evaluate    "\\n".join([p.text for p in docx.Document("example.docx").paragraphs])    modules=docx


@Teppo - Thank you. This worked for me, but I could get only the paragraphs from the Word document.
Hyperlink text and table contents are all missing. Could you please let me know how I can get all the text from the Word document?

I think those are all available with the python-docx library (see the python-docx 0.8.11 documentation), but it may require implementing a Python function to extract the data. It is hard to define a generic function because it depends on how you want to use the extracted text.

To help you get started, here is another Python example that extracts text from tables:

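import docx  # provided by the python-docx package


def extract_tables(filename):
    """Return the text of every table cell in the document, one cell per line."""
    doc = docx.Document(filename)

    # Walk every table row by row and collect each cell's text
    cells = []
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                cells.append(cell.text)
    return "\n".join(cells)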
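
If the goal is simply "all the text, in document order", one possible alternative is to skip python-docx and read the XML inside the .docx directly with the standard library. This is only a sketch under some assumptions: the function name extract_all_text is made up, it reads only the main document part (word/document.xml), so headers, footers and footnotes are not included, and tabs or line breaks inside a paragraph are dropped. Text boxes nested inside a paragraph may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_all_text(filename):
    """Return the text of each paragraph in the main document part, one per line."""
    # A .docx file is a zip archive; the body lives in word/document.xml
    with zipfile.ZipFile(filename) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # iter() walks the whole tree, so paragraphs inside table cells are included
    for paragraph in root.iter(W + "p"):
        # w:t elements hold the visible text, including runs wrapped in w:hyperlink
        text = "".join(node.text or "" for node in paragraph.iter(W + "t"))
        if text:
            lines.append(text)
    return "\n".join(lines)
```

Because it walks the raw XML, paragraphs, hyperlink text and table cells come out in the order they appear in the document, which may or may not be the layout you want for further processing.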