I am trying to scrape a web page, but I can only get as far as opening the browser. In the second step, when I try to open a link, I receive the message:
You don’t have permission to access on this server.
I am using RPA.Browser.Selenium, and I have searched the forum and the documentation without success.
That message comes from the server side, and there could be a few reasons for it, such as:
- You might need to authenticate with that server.
- The server might detect “automation” and block automated scraping.
- The server might refuse all access if it is having trouble.
Can you tell us what site it is?
Thank you very much for answering.
It is a site that provides a list of companies. It is not illegal, but I understand that they want to limit RPA because they charge for access to their database.
According to their robots.txt, they have a long list of disallowed links.
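For anyone who wants to check this from code rather than by reading the file, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `my-robot` user agent, and the example.com URLs are made up for illustration; they are not taken from the site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /companies/
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/about",
        "https://example.com/companies/acme",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-robot", url))
```

Against a live site you would normally call `set_url(...)` and `read()` on the parser instead of passing the text in, but that request can itself be blocked if the server is rejecting the robot.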
Thanks, I thought the problem came from there.
Any recommendations?
If browsing the pages you want to extract normally doesn’t cause you problems, then the robot must look and act “human”, so first try setting a User-Agent header (Chrome example) that browsers commonly send. (I think Create Webdriver will let you pass the user-agent option. If Robot Framework code isn’t enough, write a Python module that configures the driver as you wish, then use Switch Browser with the index of the initialized and returned browser.)
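As a rough sketch of the Python-module route, something like the following could configure Chrome with a custom User-Agent using plain Selenium. The User-Agent string and the function names are only examples (copy a current string from your own browser), and this does not cover registering the driver with RPA.Browser.Selenium for Switch Browser.

```python
from typing import List

# Example User-Agent string (an assumption, not from the thread);
# replace it with the one your own browser sends.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_chrome_arguments(user_agent: str) -> List[str]:
    """Build the Chrome command-line arguments that override the User-Agent."""
    return [f"--user-agent={user_agent}"]


def create_chrome_driver(user_agent: str = EXAMPLE_USER_AGENT):
    """Start Chrome with the custom User-Agent and return the driver."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `create_chrome_driver()` needs Selenium and a local Chrome installation; the returned driver can then be used as usual, for example `driver.get(url)`.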
Other tricks to try:
- Random sleeps before clicking links/buttons.
- Not browsing in a predictable, systematic order.
- Using proxies to look like you’re surfing from a specific country/location. (This applies post-deployment, if the Control Room server’s region is banned.)
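The first two tricks could be sketched in Python roughly like this. `humanized_visit` and its parameters are hypothetical names, and the delay range is an arbitrary guess rather than a tested value. The `visit` callback stands in for whatever opens a link in your robot.

```python
import random
import time
from typing import Callable, Iterable, List, Optional


def humanized_visit(
    links: Iterable[str],
    visit: Callable[[str], None],
    min_delay: float = 2.0,
    max_delay: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Visit links in shuffled order, pausing a random time before each one.

    Returns the order in which the links were visited.
    """
    rng = rng or random.Random()
    shuffled = list(links)
    rng.shuffle(shuffled)  # avoid a fixed, systematic order
    for link in shuffled:
        # Random pause between min_delay and max_delay seconds.
        sleep(rng.uniform(min_delay, max_delay))
        visit(link)
    return shuffled
```

With Selenium this might be called as `humanized_visit(urls, driver.get)`. The `sleep` and `rng` parameters exist only so the function can be tried out without actually waiting.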