Scrape Search Results

Hi, hope you can help!

I'm looking to search Google and grab the results, including the link and text. Ideally I could then grab a screenshot (or similar) of the resulting links too.

I have had good success iterating through results captured using:

${searchresults}=    Get WebElements    class:g

Then working through a loop of these elements. I tried:

Get Element Attribute    ${searchresult}    href

but I get no data back.

Any idea what I am doing wrong/a better way of doing this?

Thanks!

Hi,

That would loop through the result elements, but each one is not a simple link with an href attribute. If you take a look at the DOM, an element with class g is something like this:
[image: screenshot of the DOM for one result element with class g]

It’s not even as simple as taking the only link element from inside the result, since there are multiple links such as the drop-down menu.

I’m not sure if it’s the most robust solution, but here’s my proposal for finding all result link elements with an XPath: //*[@class="g"]//a[@data-ved and @ping].

An explanation:

  1. Select all elements (//*)
  2. Filter for elements with class g ([@class="g"])
  3. Select all links inside the element (//a)
  4. Filter for attributes @data-ved and @ping, which seem to be unique for the page link

The last part is the one I’m a bit unsure of but it seems to work for now.
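
To see what each step of that XPath filters out, here is a minimal offline sketch in Python. It does not touch Google at all: SAMPLE is a made-up, heavily simplified stand-in for a results page, and the hrefs and attribute values are hypothetical. Python's built-in ElementTree only supports a subset of XPath and has no "and" operator, so the two attribute tests are written as chained predicates, which select the same elements here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far more complex.
SAMPLE = """
<html><body>
  <div class="g">
    <a href="https://example.com/one" data-ved="x1" ping="/url?1"><h3><span>First result</span></h3></a>
    <a href="#menu" data-ved="x2">menu</a>
  </div>
  <div class="g">
    <a href="https://example.com/two" data-ved="x3" ping="/url?2"><h3><span>Second result</span></h3></a>
    <a href="#menu" data-ved="x4">menu</a>
  </div>
  <div class="other">
    <a href="https://example.com/ad" data-ved="x5" ping="/url?3">not a result</a>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Steps 1-2: any element with class "g"; step 3: links inside it;
# step 4: keep only links carrying both data-ved and ping.
links = root.findall(".//*[@class='g']//a[@data-ved][@ping]")
hrefs = [a.get("href") for a in links]
print(hrefs)
```

The drop-down "menu" links are skipped because they lack ping, and the link in the "other" block is skipped because its container does not have class g.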


This is great, thanks. I will work on this basis.

If I wanted to capture, say, the text “find online test…” what is the best mechanism for this? Can I loop this together so I can export to, for example, an excel file with links in one column and corresponding text in another?

Hi, @cdlxeco!

Here’s one possibility. Modify for your needs!

There’s one .robot file and one custom Python library.

The Python library is there because doing try/catch in Robot Framework syntax is not that nice. That’s why I decided to pass the WebElement to the library for further scraping.

The h3 with the text does not always exist, so I wanted to return an empty string in those cases. You might want to modify the get_result_text to better fit your use case.

The robot

*** Settings ***
Documentation     Google search results scraper.
Library           Collections
Library           RPA.Browser.Selenium
Library           RPA.Tables
Library           ResultScraper
Task Teardown     Close All Browsers

*** Tasks ***
Scrape Google search results
    Open Available Browser    https://www.google.com/search?q=dinosaur
    Accept cookies
    ${urls_and_texts}=    Scrape results
    ${table}=    Create Table    ${urls_and_texts}
    Write Table To Csv    ${table}    ${CURDIR}${/}output${/}search-results.csv

*** Keywords ***
Accept cookies
    Wait Until Element Is Visible    //iframe
    Select Frame    //iframe[contains(@src, "https://consent.google.com")]
    Click Element When Visible    //div[@id="introAgreeButton"]
    Unselect Frame

Scrape results
    ${results}=    Get WebElements    //*[@class="g"]//a[@data-ved and @ping]
    ${urls_and_texts}=    Create List
    FOR    ${result}    IN    @{results}
        ${url}=    Get result url    ${result}
        ${text}=    Get result text    ${result}
        ${url_and_text}=    Create List    ${url}    ${text}
        Append To List    ${urls_and_texts}    ${url_and_text}
    END
    [Return]    ${urls_and_texts}

The ResultScraper.py library:

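from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def get_result_text(result) -> str:
    # Return the text of the <h3> inside the link, or "" when there is none.
    try:
        return result.find_element(By.TAG_NAME, "h3").text
    except NoSuchElementException:
        return ""


def get_result_url(result) -> str:
    # The WebElement passed in is the <a> itself, so read its href directly.
    return result.get_attribute("href")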

Example output CSV

0,1
https://en.wikipedia.org/wiki/Dinosaur,Dinosaur - Wikipedia
https://www.amnh.org/explore/videos/dinosaurs-and-fossils/dinosaurs-today,
https://www.nhm.ac.uk/discover/dino-directory/country/australia/gallery.html,
https://www.amnh.org/exhibitions/dinosaurs-ancient-fossils/extinction/dinosaurs-survive,
https://www.nhm.ac.uk/discover/how-an-asteroid-caused-extinction-of-dinosaurs.html,
https://en.wikipedia.org/wiki/Dinosaur_(film),Dinosaur (film) - Wikipedia
https://www.nationalgeographic.com/science/prehistoric-world/flesh-bone/,"Dinosaurs Article, Dinosaur Modeling Information, Facts ..."
https://www.nationalgeographic.com/science/article/dinosaur-extinction,Dinosaur extinction facts and information | National Geographic
https://www.youtube.com/watch?v=zXNTijY_ELw,SCARY Dinosaur Roars! - YouTube
https://www.youtube.com/watch?v=zXNTijY_ELw,
https://www.youtube.com/watch?v=tFNPvUDluNQ,Life of DINOSAURS - YouTube
https://www.youtube.com/watch?v=tFNPvUDluNQ,
https://www.britannica.com/animal/dinosaur,"dinosaur | Definition, Types, Pictures, Videos, & Facts ..."
https://www.nhm.ac.uk/discover/dino-directory/name/a/gallery.html,Dino Directory Name A-Z - Dinosaurs beginning with the letter ...
https://www.mydinosaurs.com/blog/10-dinosaur-cartoon-movies-kids/,
https://www.livescience.com/38596-mesozoic-era.html,
http://www.techwithkids.com/List_45_top-dinosaur-apps,
https://www.amnh.org/explore/videos/dinosaurs-and-fossils/in-what-kind-of-environment-did-dinosaurs-live,

If you have any further questions, feel free to ask!

// Jani

Jani made a nice example already, but I happened to make one with pure RF earlier so might as well share:

*** Settings ***
Library    RPA.Browser.Selenium
Library    RPA.Excel.Files
Library    RPA.Tables

*** Tasks ***
Find all results
    Open Available Browser   https://google.com
    Run Keyword And Ignore Error    Accept Cookies
    Search For Text    Robot Framework
    ${links}=    Scrape links
    Write To Worksheet    ${links}
    [Teardown]     Close all browsers
    
*** Keywords ***
Accept Cookies
    Select Frame     //iframe[contains(@src, "https://consent.google.com")]
    Click Element    id:introAgreeButton
    [Teardown]       Unselect frame

Search For Text
    [Arguments]    ${text}
    Input Text    name:q    ${text}
    Press Keys    name:q    ENTER
    Wait Until Page Contains Element    search

Scrape Links
    ${header}=      Create List        Name   Link
    ${results}=     Create Table       columns=${header}

    ${xpath}=       Set variable       //*[@class="g"]//a[@data-ved]/h3/span
    ${links}=       Get WebElements    ${xpath}/../..
    ${titles}=      Get WebElements    ${xpath}
    FOR  ${link_element}  ${title_element}  IN ZIP  ${links}  ${titles}
        ${link}=    Get Element Attribute   ${link_element}   href
        ${title}=   Get Text    ${title_element}
        ${row}=     Create list    ${title}   ${link}
        Add table row    ${results}    ${row}
    END
    [Return]    ${results}

Write To Worksheet
    [Arguments]    ${results}
    Create workbook    results.xlsx
    Append Rows to Worksheet  ${results}
    Save Workbook

It shows how to create a Table for manipulating this sort of data and then store it in an Excel workbook. I got around the issue of matching titles and links by searching for titles first and then going up from that xpath. Jani’s solution is a bit more robust though, I think.
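
The "find the title first, then walk up" idea can be tried offline with a small Python sketch. This is only an illustration under assumptions: SAMPLE is invented markup, not Google's real DOM, and ElementTree is used instead of a browser. Because the link list is derived from the title path, a link without a title never appears in either list, so the two lists stay aligned.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the second block has a link without a title.
SAMPLE = """
<html><body>
  <div class="g">
    <a href="https://example.com/one" data-ved="x1"><h3><span>First result</span></h3></a>
  </div>
  <div class="g">
    <a href="https://example.com/untitled" data-ved="x2">no title here</a>
    <a href="https://example.com/two" data-ved="x3"><h3><span>Second result</span></h3></a>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

title_path = ".//*[@class='g']//a[@data-ved]/h3/span"
titles = root.findall(title_path)
# Go up two levels (span -> h3 -> a) to reach the link that owns each title.
links = root.findall(title_path + "/../..")

rows = [(title.text, link.get("href")) for title, link in zip(titles, links)]
print(rows)
```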


This is really helpful, thanks both.

I am completely new to programming, so I have not yet ventured further into Python. That being said, the second approach seems more familiar to me, since I have more experience with RF at the moment.

I am getting a "TypeError: Not a valid input format" error on the column creation in Scrape Links. What format should this input be in? Is there a way to look this up myself?

Is it advisable to move towards python?

Thanks for all your help this far!

Oh, sorry, I missed this follow-up!

Can you share the robot script you’re trying to run? The one I posted should work directly. The library documentation is hosted here: Tables — RPA Framework documentation, but looking at it the Tables library requires some more verbose explanations.

Python should be optional for many use-cases, but it does open up some avenues for more powerful automation. And as you can see from Jani’s example, a Python library doesn’t need to be complex.

Hi, I’m not sure what I am doing wrong, but the code you had written doesn’t seem to work for me.

Below is my code as it is at the moment. Would appreciate any feedback/efficiency changes you may have. I am working on trying to also extract the text under the results, and have attached a screenshot of what I mean.

I am ultimately trying to iterate a number of company names through a search of a number of search terms. I have got a fair bit working, but I am sure I have done things the wrong way round!

Your thoughts, as ever, are much appreciated!

*** Keywords ***

Setup
    Open Available Browser   https://google.com
    Run Keyword And Ignore Error    Accept Cookies

Accept Cookies
    Select Frame     //iframe[contains(@src, "https://consent.google.com")]
    Click Element    id:introAgreeButton
    [Teardown]       Unselect frame

Create Folder For Search
    ${date}=    Get Current Date    result_format=%d.%m.%Y
    Log     ${date}
    Set Global Variable   ${date}  
    #create folder using date
    ${root}=     Convert To String    ${CURDIR}${/}output${/}${date}
    Create Directory     ${root}
    Log     ${root}
    Set Global Variable     ${root}

Get Companies
    #get list
    Open Workbook     ${CURDIR}${/}companies${/}companies.xlsx
    ${companys}=     Read Worksheet As Table     header=True     start=2     trim=True
    Close Workbook
    #counter to track loop
    ${compcount}=   Set Variable    -1
    #cycle
    FOR     ${company}   IN  @{companys}
        #increment loop counter
        ${compcount}=   Evaluate    ${compcount} + 1
        #get name
        ${name}=    RPA.Tables.Get Table Cell  ${companys}    ${compcount}  Company
        Set Global Variable     ${name}
        #create folder
        ${dir}=     Convert To String    ${root}${/}${name}
        Create Directory    ${dir}
        Set Global Variable     ${dir}
        #create excel file
        #file path
        ${excel}=   Convert To String   ${dir}${/}${name}.xlsx
        Create Workbook     path=${excel}  fmt=xlsx
        Set Global Variable     ${excel}
        Save Workbook
        Search All Searchterms
    END
    
Search All Searchterms
    #get searchterms
    Open Workbook     ${CURDIR}${/}searchterms.xlsx
    ${searchterms}=     Read Worksheet As Table     header=True     start=2     trim=True
    Close Workbook
    #counter to track loop
    ${searchcount}=   Set Variable    -1
    #cycle
    FOR     ${searchterm}   IN  @{searchterms}
        #increment search loop counter
        ${searchcount}=   Evaluate    ${searchcount} + 1
        #get searchterm
        ${searchname}=    RPA.Tables.Get Table Cell  ${searchterms}    ${searchcount}  Search Terms
        Set Global Variable     ${searchname}
        #get end date
        ${enddate}=    Get Current Date
        #get start date
        ${startdate}=    Subtract Time From Date     ${enddate}      1825d      result_format=%Y-%m-%d  exclude_millis=True date_format=%y.%m.%d
        #format end date
        ${enddate}=    Get Current Date     result_format=%Y-%m-%d
        #construct search term
        ${search}=  Catenate    ${name}  ${searchname}  after:${startdate}  before:${enddate}
        Search For Text     ${search}
        #set excel locations
        Open Workbook   ${excel}
        Create Worksheet    ${searchname}    exist_ok=True
        Scrape
    END
    

Search For Text
    [Arguments]    ${search}
    Input Text    name:q    ${search}
    Press Keys    name:q    ENTER
    Wait Until Page Contains Element    search
    


Scrape
    #counter to track loop
    ${count}=       Set Variable        0
    #webelements
    ${xpath}=       Set Variable        //*[@class="g"]//a[@data-ved]/h3/span
    ${links}=       Get WebElements     ${xpath}/../..
    ${titles}=      Get WebElements     ${xpath}
    
    #testing
    ${xpathx}=      Set Variable        //div[@class="g"]//h3/a
    ${texts}=       Get WebElements     ${xpathx}/..
    
    #loop
    FOR  ${link_element}  ${title_element}  ${text_element}  IN ZIP  ${links}  ${titles}  ${texts}
        #increment loop counter
        ${count}=   Evaluate    ${count} + 1
        Get Active Worksheet
        Set Active Worksheet    ${searchname}
        #get content
        ${link}=    Get Element Attribute   ${link_element}   href
        ${title}=   Get Text    ${title_element}
        #write content
        Set Worksheet Value     row=${count}    column=1    value=${title}
        Set Worksheet Value     row=${count}    column=2    value=${link}
        Capture Element Screenshot      ${text_element}
        Save Workbook
    END

Do you not get any results with my version or is there an error from somewhere?

In the one you shared you added a selector for the text element, but that doesn’t seem to match anything for me. A loop created with IN ZIP walks through all of the lists in parallel and stops at the end of the shortest one, so with an empty list of text elements it doesn’t loop at all.
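
Python's built-in zip stops at the shortest input in the same way, which makes the effect easy to reproduce. A minimal sketch with made-up values (the list contents are hypothetical):

```python
from itertools import zip_longest

links = ["https://example.com/one", "https://example.com/two"]
titles = ["First result", "Second result"]
texts = []  # the extra selector matched nothing

# zip stops at the shortest input, so one empty list means zero iterations.
rows = list(zip(links, titles, texts))
print(len(rows))

# One possible workaround: pad the missing values instead of dropping rows.
padded = list(zip_longest(links, titles, texts, fillvalue=""))
print(padded)
```

Checking the length of each list before looping is a cheap way to notice a selector that silently matched nothing.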

Google in general tries to make these hard to parse automatically. Here’s my attempt with some xpath trickery, but it becomes even more finicky:

Scrape
    ${xpath}=       Set Variable        //*[@class="g"]//a[@data-ved]/h3/span
    ${texts}=       Get WebElements     ${xpath}/../../../following-sibling::div//span
    ${links}=       Get WebElements     ${xpath}/../..
    ${titles}=      Get WebElements     ${xpath}
    
    FOR  ${link_element}  ${title_element}  ${text_element}  IN ZIP  ${links}  ${titles}  ${texts}
        ${link}=    Get Element Attribute    ${link_element}    href
        ${title}=   Get Text    ${title_element}
        ${text}=    Get Text    ${text_element}
        Log    ${link} - ${title} - ${text}
    END

I do have some general tips based on the rest of the shared script:

  • I see a lot of Log lines scattered around, which is fine, but if you change the log level of robot to TRACE you can see all arguments and return values of keywords directly in the generated log. In robot.yaml for instance you would add options --loglevel and TRACE.
  • You also use global variables for passing things around, which is also OK in something small like this, but for larger things I would recommend using the [Arguments] and [Return] settings for keywords. Otherwise you might end up with a bit of spaghetti.
  • There is no need to manually keep an index when iterating loops. RF has the built-in loop operator IN ENUMERATE, which will generate one for you.
  • You can also loop Tables directly without an index, like this:
FOR    ${row}    IN     @{table}
    Log    ${row}[Company]
END
  • In Robot everything you enter as an argument is by default a string. These for instance are the same thing:
${var}=    Convert to string    directory${/}${filename}
Some keyword    ${var}
---
Some keyword     directory${/}${filename}
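
If you later move some of this logic into a Python library, the same tips carry over. The following is only a rough sketch: build_queries, the company names, and the search terms are all hypothetical, and rows are modelled as plain dictionaries rather than RPA.Tables rows. It passes inputs as arguments, returns the result instead of setting globals, and lets enumerate produce the index.

```python
def build_queries(companies, search_terms):
    """Return (row_number, query) pairs built only from the arguments."""
    queries = []
    # enumerate supplies the counter, so no manual index bookkeeping is needed.
    for row_number, row in enumerate(companies, start=1):
        for term in search_terms:
            queries.append((row_number, f"{row['Company']} {term}"))
    return queries


companies = [{"Company": "Acme"}, {"Company": "Globex"}]
queries = build_queries(companies, ["lawsuit", "recall"])
for row_number, query in queries:
    print(row_number, query)
```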