How to download a PDF file from websites

Hello everyone, I need help!

I’m automating a simple task:
1 - access a website with credentials
2 - search for some cases
3 - download some pdfs

The first 2 steps work perfectly, but the last one doesn't.
I'm coding in pure Python and I've tried to use requests, but the downloaded PDF comes out broken.
The first figure below is the link to the PDF and the other is the result from the script.

The final result I need is to extract some data from the PDF. If someone knows how to download it, or how to get the text of the PDF without downloading it, that will help me a lot.
Thanks, and sorry for the bad English.

Hi, most probably the data you retrieve through the request is not the actual PDF content but HTML text telling you that you're not authorized to access that URL, so you end up writing invalid content into the file, which then can't be displayed.
You can check by opening the downloaded file in a text editor: a real PDF starts with something like %PDF-1.4, while an error page starts with HTML markup. If it is HTML, that's because the request you sent neither sets the cookie nor uses the session with which you are authenticated on that website.
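The check described above can also be done in code instead of an editor. This is only a minimal sketch: the function names are made up for illustration, and the HTML test is a rough guess based on how such pages usually begin.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    # A real PDF starts with "%PDF-" followed by a version, e.g. "%PDF-1.4".
    return data.startswith(b"%PDF-")


def describe_download(data: bytes) -> str:
    """Classify downloaded bytes as 'pdf', 'html' or 'unknown'."""
    if looks_like_pdf(data):
        return "pdf"
    # Login and error pages usually begin with a doctype or an <html> tag.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Calling describe_download(response.content) right after the request shows whether the server sent the file or a login page, before anything is written to disk.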

To solve it, either do the auth and all the requests under a requests.Session() object, or set the Cookie header as part of your headers= dictionary when making the requests. Sometimes a URL crafted in the form of https://<usr>:<pwd>@logistica.belgo.com.br/uploaded_file?... might work as well, give it a try.
Or even easier, use browser automation to navigate to that page, then click the button/element that triggers the download (don't forget to call Set Download Directory first).
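The requests.Session() option could look roughly like the sketch below. It assumes the site uses a form-based login that answers with cookies. The function name, the login URL and the form field names are placeholders, not taken from the real site, so check the actual login request in the browser's network tab.

```python
from pathlib import Path


def download_pdf(login_url, form_data, pdf_url, target, session=None):
    """Log in, then fetch pdf_url with the same session so its cookies are sent."""
    if session is None:
        import requests  # third-party package: pip install requests
        session = requests.Session()

    # Step 1: authenticate. The session keeps any cookies the server sets.
    login = session.post(login_url, data=form_data, timeout=30)
    login.raise_for_status()

    # Step 2: request the file through the same session (same cookies).
    response = session.get(pdf_url, timeout=30)
    response.raise_for_status()

    # Step 3: refuse to save a login or error page under a .pdf name.
    if not response.content.startswith(b"%PDF-"):
        raise ValueError("server did not return a PDF; the login probably failed")

    path = Path(target)
    path.write_bytes(response.content)
    return path


# Hypothetical usage (URL and field names are assumptions):
# download_pdf(
#     "https://example.com/login",
#     {"username": "me@email.com", "password": "myPassword1234"},
#     "https://example.com/uploaded_file?path=/some/file.pdf",
#     "file.pdf",
# )
```

The session parameter is optional and only exists so that an already authenticated session, for example the one used for the search step, can be reused.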

If you still can't fix it, I'll need some form of access to that platform in order to reproduce the problem and find a fix myself. Good luck!

Hello cosmin, you saved me again, haha.
The data I retrieved through the request was HTML, and using an online HTML viewer I see this:

image

As you said, the content of the response was not the PDF.
As for the click from RPA.Browser, I'll study it more.

https://<lucas.leite@ergondata.com.br>:"<"Ft7zx13>@logistica.belgo.com.br/uploaded_file?path=/uploads/autonomo_documento/documentos/96191/Viagem2676710027885012493.pdf
This way doesn't work :confused:

I hope you already changed the password embedded above or used a fake one, as this is a public forum and you should never expose sensitive info here.

Meanwhile please try with the other options and let me know if you hit any blocker.

Yes, it's a fake password.
Is the other way clicking on the link? It didn't work: clicking on the link just opens the website with the PDF.

All good then. Keep in mind that the <usr>:<pwd> I gave above as an example was meant to be replaced entirely, without keeping the brackets, as follows: https://me@email.com:myPassword1234@site.com/page/to/resource. I'm not really sure how the first @ should be escaped, though.
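On the escaping question: reserved characters in the user or password part of a URL are normally percent-encoded, so @ becomes %40. A small sketch with the standard library follows. The helper name is made up, and this only helps if the site accepts HTTP Basic auth, which is an assumption; a form-based login will ignore credentials placed in the URL.

```python
from urllib.parse import quote


def url_with_credentials(user, password, host_and_path):
    """Build an https URL with percent-encoded user and password."""
    # safe="" makes quote() escape every reserved character, so "@" becomes "%40".
    return f"https://{quote(user, safe='')}:{quote(password, safe='')}@{host_and_path}"


print(url_with_credentials("me@email.com", "myPassword1234", "site.com/page/to/resource"))
# https://me%40email.com:myPassword1234@site.com/page/to/resource
```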

If browser automation doesn’t work here, then try in Python with requests.Session() (after doing the auth, so the cookies are stored) or use our RPA.HTTP library starting with the Create Session keyword and using the auth parameter.

That doesn't work either, but thank you very much!