Rselenium Web Scraping



Check out my latest version of the scraping helper (which replaces Selenium with a more robust and faster solution, i.e. an API) at github.com/npranav10/acciotables
If you are planning to make use of the following procedure to scrape data from a Sports Reference page, please go through the Disclaimer first.

Pre-Match Conference:

  • The idea of writing this article had been in my mind ever since I managed to scrape FBref data for myself.
  • It was only when I shared this with friends working in the same field that I realised such a helper was not available online, and that they were waiting for someone to take up this daunting task.
  • This procedure is applicable not only to FBref, or any Sports Reference site for that matter, but to data on any website that is rendered dynamically.
  • If you would like to skip straight to the procedure, you can jump to Getting Started.
    Or if you just need the code, you can travel to Final Code.

What we normally used to do?


The traditional way of scraping data through rvest is as follows:
  1. Grab the page URL and table id
    You can get table ids by inspecting the element or by using SelectorGadget.
    By Inspecting Element:
    1. Right-click on a table and select Inspect Element in the ensuing context menu.
    2. The data on which you right-clicked gets highlighted in the developer tab.
    3. Scroll upwards until you come across the <table> tag with either a class name (say .wikitable) or an id (say #stats_shooting) associated with it.
    4. The corresponding table id here is '#stats_passing_squads'.
    By using SelectorGadget:
    Drag this SelectorGadgetJS link to your bookmarks and you are good to continue.
    1. Click on SelectorGadget from your bookmarks.
    2. Hover over the table's boundary until you find the 'div table' label appearing on the orange marker.
    3. At that instant, if you left-click, the marker turns red with the id of the table appearing on the SelectorGadget toolbar on your screen.
    4. The corresponding table id here is '#stats_keeper_squads'.
  2. And then (har)rvest it
    require('dplyr')
    stats_passing_squads = xml2::read_html('https://fbref.com/en/comps/22/passing/Major-League-Soccer-Stats') %>%
      rvest::html_nodes('#stats_passing_squads') %>%
      rvest::html_table()
    stats_passing_squads = stats_passing_squads[[1]]

What's special about FBref tables, and why won't rvest alone help?

  • The site is structured in such a way that all tables except the first (and sometimes even the first) are rendered dynamically. I managed to capture the way the tables load.
  • Notice how the first table loads quickly and how the remainder take their time to load.
  • I have annotated the table ids on the tables for the sake of clarity.

So when we try to scrape such tables, this happens:

What Next? :(


The issue is that we can see the tables and their ids with the naked eye, but we can't use them in RStudio. So we are going to use RSelenium, which helps us in this matter.

How does RSelenium help us?
  • Selenium is an automation tool. RSelenium is a package which helps us achieve that automation in R.
  • By using RSelenium, we make the website believe that a browser has walked through it completely and collected every bit of info on its way. The server will think the Selenium driver is just another human wandering through the page.
  • In other words, it's similar to copying the contents of the whole page and saving it in a Word document for future use. Got it? It's as simple as that.
Let's get started
But before getting started, there is a catch.
We can't just enter the URL in a browser and ask RSelenium to do its job. We need to install Selenium on our computer so that RSelenium can communicate with the real Selenium. And for that we will use containers (Docker) rather than executables, which are tiresome and have different procedures for different OSes.
Hold your breath. All of these are steep learning curves, but they are one-time investments.

Getting Started:

  • Install Docker for Desktop
  • Install RSelenium in RStudio as follows: install.packages('RSelenium')
  • Execute the following code in RStudio's Console unless stated otherwise.
Now, with all dependencies installed, let's get into the action.
  1. Start Docker for Desktop in your OS. (It really takes time to load. Watch Haaland's goals meanwhile.)
  2. Run the following code in RStudio's Terminal (not Console): docker run -d -p 4445:4444 selenium/standalone-chrome
    The de facto way of checking that it succeeded is to run docker ps, which will show the container id of the Selenium container in its output.
    If Terminal is not visible in your panes, then use ALT + SHIFT + R to open it.
  3. Remember the anonymous browser that I mentioned earlier? We are about to invoke him now:
    remDr <- RSelenium::remoteDriver(remoteServerAddr = 'localhost',port = 4445L,browserName = 'chrome')
    remDr$open()
    remDr$navigate('https://fbref.com/en/comps/22/passing/Major-League-Soccer-Stats')
    remDr$screenshot(display = TRUE)
  4. The rest of the code involves rvest, where we dump the page's HTML source into read_html()
    require('dplyr')
    stats_passing_squads = xml2::read_html(remDr$getPageSource()[[1]]) %>%
      rvest::html_nodes('#stats_passing_squads') %>%
      rvest::html_table()
    stats_passing_squads = stats_passing_squads[[1]]

    Großartig! (Great!)

You - 'So Pranav, are you asking me to follow the above steps every time?'
Me - 'Absolutely not. Hence comes the most awaited part.'
We can create a function to automate all these steps (including running Docker from the Console). Here's how I do it.

FBref data at one click of a button

getFBrefStats = function(url, id){
  require(RSelenium)
  require(dplyr)
  # For some unspecified reason we are starting and stopping the docker container initially.
  # Similar to warming up the bike's engine before shifting the gears.
  system('docker run -d -p 4445:4444 selenium/standalone-chrome')
  t = system('docker ps', intern = TRUE)
  if(!is.na(as.character(strsplit(t[2], split = ' ')[[1]][1])))
  {
    system(paste('docker stop ', as.character(strsplit(t[2], split = ' ')[[1]][1]), sep = ''))
  }
  # To avoid starting docker in the Terminal
  system('docker run -d -p 4445:4444 selenium/standalone-chrome')
  Sys.sleep(3)
  remDr <- RSelenium::remoteDriver(remoteServerAddr = 'localhost', port = 4445L, browserName = 'chrome')
  # Automating the scraping initiation, considering that page navigation might crash sometimes in
  # RSelenium and we would have to start the process again. This retry loop keeps trying until
  # navigation succeeds.
  while (TRUE) {
    ok <- tryCatch({
      # Entering our URL gets the browser to navigate to the page
      remDr$open()
      remDr$navigate(as.character(url))
      # remDr$screenshot(display = TRUE) # This will take a screenshot and display it in the RStudio viewer
      TRUE
    }, error = function(e) {
      remDr$close()
      Sys.sleep(2)
      print('slept 2 seconds')
      FALSE
    })
    if (ok) break
  }
  # Scraping the required stats
  data <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
    rvest::html_nodes(id) %>%
    rvest::html_table()
  data = data[[1]]
  remDr$close()
  remove(remDr)
  # Automating the following steps:
  # 1. run 'docker ps' in the Terminal and get the container id from the output
  # 2. now run 'docker stop container_id', e.g. docker stop f59930f56e38
  t = system('docker ps', intern = TRUE)
  system(paste('docker stop ', as.character(strsplit(t[2], split = ' ')[[1]][1]), sep = ''))
  return(data)
}
Test Drive:
All you have to do is start 'Docker for Desktop' and call our function in RStudio.
bundesliga_players_fbref_shooting = getFBrefStats('https://fbref.com/en/comps/20/shooting/Bundesliga-Stats','#stats_shooting')
head(bundesliga_players_fbref_shooting)

Post-Match Conference:

Where can you make full use of such automation?
  • Building a Shiny app which makes something out of FBref data.
  • The only catch is that you can't run Docker on shinyapps.io. Instead, you can host your Shiny app in a cloud such as AWS or Azure and run Docker for Desktop 24 hrs a day, 365 days a year, or at least until you breach your free-tier limit.
If you wish to see me provide such an example for our community, just drop a message at @npranav10

Also a shout-out to Eliot McKinley for encouraging me to write this article.

Reference:

Callum Taylor: Using RSelenium and Docker To Webscrape In R – Using The WHO Snake Database
Apart from borrowing a few snippets from the above piece, I did manage to bring some automation into the procedure for gathering data.
If you feel there is another way to achieve this, don't hesitate to contact me.

Disclaimer:

Sports Reference LLC says : 'Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials.'

  • I would like to reiterate that with this procedure, the time taken to fetch data from any Sports Reference website is the same as that of someone copying it from a browser, because of the way RSelenium works.
  • Also, the function to get such data was made to help people who prefer getting the data at the click of a button rather than:
    1. Opening a page
    2. Scrolling through for a table
    3. Clicking 'Share & More'
    4. Choosing 'Get Table as CSV' and downloading it
    5. And then loading the CSV file in R
  • If you are a Sports Reference official and you would like me to avoid using FBref as an example, kindly message me.
Web Scraping with Python and Selenium

Web scraping is a very useful mechanism to either extract data or automate actions on websites. Normally we would use urllib or requests to do this, but things start to fail when websites use JavaScript to render the page rather than static HTML. For many websites the information is stored in static HTML files, but for others it is loaded dynamically through JavaScript (e.g. from AJAX calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the Selenium library can help.

What is web scraping?

To align with terms: web scraping, also known as web harvesting or web data extraction, is data scraping used for extracting data from websites. The web scraping script may access the URL directly using HTTP requests or by simulating a web browser. The second approach is exactly how Selenium works – it simulates a web browser. The big advantage of simulating the browser is that the website renders fully – whether it uses JavaScript or static HTML files.

What is selenium?

According to the official Selenium web page, it is a suite of tools for automating web browsers. The project is a member of the Software Freedom Conservancy. Selenium has three sub-projects, each providing different functionality; if you are interested, visit their official website. The scope of this blog is limited to the Selenium WebDriver project.

When should you use selenium?

Selenium provides us with the tools to perform web scraping, but when should it be used? You can generally use Selenium in the following scenarios:

  • When the data is loaded dynamically – for example Twitter. What you see in “view source” is different from what you see on the page. (The reason is that “view source” just shows the static HTML; if you want to see under the covers of a dynamic website, right-click and “inspect element” instead.)
  • When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page, and then more entries will show. What happens behind the scenes is that when you scroll to the bottom, JavaScript code calls the server to load more records on screen.

So why not use Selenium all the time? It is a bit slower than using requests and urllib. The reason is that Selenium simulates running a full browser, including the overhead that brings with it. There are also a few extra steps required to use Selenium, as you can see below.

Once you have the data extracted, you can still use similar approaches to process it (e.g. using tools such as BeautifulSoup).

Pre-requisites for using selenium

Step 1: Install selenium library

Before starting with a web scraping sample, ensure that all requirements have been set up. Selenium requires pip or pip3 to be installed; if you don't have it, you can follow the official guide to install it based on the operating system you have.


Once pip is installed you can proceed with the installation of Selenium with the pip command shown below.

Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
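A minimal sketch of both installation routes, assuming pip (or pip3) is already available on your system; selenium-x.x.x is a placeholder for whichever version you download:

pip install selenium

# or, from the downloaded PyPI source archive:
tar -xzf selenium-x.x.x.tar.gz
cd selenium-x.x.x
python setup.py install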

Step 2: Install web driver

Selenium simulates an actual browser. It won't use your Chrome installation directly; instead it uses a “driver”, which is the browser engine used to run a browser. Selenium supports multiple web browsers, so you may choose which one to use (read on).

Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.

The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or be initialized with a string containing the path to the downloaded web driver. Environment variables are out of the scope of this blog, so we are going to use the second option.

From here to the end, the Firefox web driver is going to be used. You are able to choose any supported web driver, but Firefox is recommended in order to follow this blog.

Download the driver to a common folder which is accessible. Your script will refer to this driver.

You can follow our guide on how to install the web driver here.

A Simple Selenium Example in Python

Ok, we’re all set. To begin with, let’s start with a quick starting example to ensure things are all working. Our first example will involve collecting a website title. In order to achieve this we are going to use Selenium; assuming it is already installed in your environment, just import webdriver from selenium in a Python file, as shown in the following.
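A minimal sketch of such a starting script, assuming the Selenium 3-style API used throughout this article and assuming the geckodriver executable sits next to the script (adjust the path to wherever you downloaded it):

from selenium import webdriver

# Path to the downloaded Firefox driver (assumed location; adjust as needed)
driver = webdriver.Firefox(executable_path='./geckodriver')

# Navigate to the page and collect its title
driver.get('https://www.google.com')
print(driver.title)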

Running this code will open a Firefox window, which looks a little bit different from your usual browser window, and then it prints the title of the website into the console – in this case it is collecting data from ‘Google’.

Note that this was run in the foreground so that you can see what is happening. For now we close the opened Firefox window manually; it was intentionally opened this way so you can see that the web driver actually navigates just like a human would. But now that this is known, we can add driver.quit() at the end so the window is closed automatically after the job is done. The code now looks like this:
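The same sketch, under the same assumptions, now closing the browser once the title has been printed:

from selenium import webdriver

driver = webdriver.Firefox(executable_path='./geckodriver')  # assumed driver path
driver.get('https://www.google.com')
print(driver.title)

# Close the browser window once the job is done
driver.quit()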

Now the sample will open the Firefox web driver, do its job and then close the window. With this little and simple example, we are ready to go deeper and work through a more complex sample.

How To Run Selenium in background

In case you are running your environment from a console only, or through PuTTY or another terminal, you may not have access to the GUI. Also, in an automated environment, you will certainly want to run Selenium without the browser popping up – e.g. in silent or headless mode. This is where you add the “options” object with the “--headless” argument at the start of the script.
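A sketch of that headless setup, again assuming the Firefox driver; Options and add_argument('--headless') are part of Selenium's standard API:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')  # run the browser without opening a window

driver = webdriver.Firefox(executable_path='./geckodriver', options=options)  # assumed driver path
driver.get('https://www.google.com')
print(driver.title)
driver.quit()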

The remaining examples will be run with the browser visible so that you can see what is happening, but you can add the above snippet if you prefer headless mode.

Example of Scraping a Dynamic Website in Python With Selenium

Up to here, we have figured out how to scrape data from a static website, and with a little bit of time and patience you are now able to collect data from static websites. Let’s now dive a little deeper into the topic and build a script to extract data from a webpage which is loaded dynamically.

Imagine that you were asked to collect a list of YouTube videos about “Selenium”. With that information, we know that we are going to gather data from YouTube and that we need the search results for “Selenium”, but these results are dynamic and will change all the time.

The first approach is to replicate what we have done with Google, but now with YouTube, so a new file needs to be created: yt-scraper.py

Now we have the YouTube page title retrieved and printed, but we are about to add some magic to the code. Our next step is to find the search box and fill it with the word that we are looking for, “Selenium”, by simulating a person typing it into the search. This is done by using the Keys class:

from selenium.webdriver.common.keys import Keys

The driver.quit() line is going to be commented out temporarily so we are able to see what we are doing, as in the sketch below.
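A sketch of what yt-scraper.py might look like at this point, assuming the Selenium 3-style API; the //input[@id="search"] XPath for YouTube's search box is the one discussed below and reflects YouTube's markup at the time this article was written:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path='./geckodriver')  # assumed driver path
driver.get('https://www.youtube.com')
print(driver.title)

# Locate the search box via XPath, type the search term and hit Enter
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys('Selenium')
search_box.send_keys(Keys.ENTER)

# driver.quit()  # commented out temporarily so we can watch what happens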

The YouTube page shows a list of videos from the search, as expected!

As you might notice, a new function has been called, named find_element_by_xpath, which could be a bit confusing at the moment as it uses strange XPath text. Let’s learn a little bit about XPath to understand it a bit more.


What is XPath?

XPath is an XML path used for navigation through the HTML structure of a page. It is a syntax for finding any element on a web page using an XML path expression. XPath can be used for both HTML and XML documents to find the location of any element on a webpage using the HTML DOM structure.

In the example above we had ‘//input[@id=”search”]’. This finds all <input> elements which have an attribute called “id” whose value is “search”. If you use “inspect element” on the search box on YouTube, you can see there is a tag <input id=”search” … >. That’s exactly the element we’re searching for with XPath.

There is a great variety of ways to find elements within a website; here is the full list, which is recommended reading if you want to master the web scraping technique.

Looping Through Elements with Selenium

Now that XPath has been explained, we are able to move to the next step: listing videos. Until now we have code that is able to open https://youtube.com, type the word “Selenium” into the search box and hit the Enter key so the search is performed by the YouTube engine, resulting in a bunch of videos related to Selenium – so let’s now list them.

Firstly, right-click and “inspect element” on the video section and find the element which is the start of the video section. You can see that it’s a <div> tag with id=’dismissable’.


We want to grab the title, so within the video, find the tag that covers the title. Again, right-click on the title and “inspect element” – here you can see the element with id=’video-title’. Within this tag, you can see the text of the title.

One last thing: remember that we are working with the internet and web browsing, so sometimes we need to wait for the data to become available. In this case, we are going to wait 5 seconds after the search is performed and then retrieve the data we are looking for, as in the sketch below. Keep in mind that the results could vary due to internet speed and device performance.
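A sketch of the listing step, continuing from the yt-scraper.py sketch above; the ‘dismissable’ and ‘video-title’ ids come from the inspection described in this section and reflect YouTube's markup at the time of writing, so they may have changed since:

import time

# Wait a fixed 5 seconds for the search results to render
time.sleep(5)

# Each search result was identified above as a <div id="dismissable"> element
videos = driver.find_elements_by_xpath('//div[@id="dismissable"]')
print(driver.title)
print('Videos found:', len(videos))

# Within each result, the title lives in an element with id="video-title"
for video in videos:
    title = video.find_element_by_xpath('.//*[@id="video-title"]')
    print(title.text)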

Once the code is executed you are going to see a printed list containing the videos collected from YouTube: it first prints the website title, then it tells us how many videos were collected and finally it lists those videos.

Waiting for 5 seconds works, but then you have to adjust it for each internet speed. There is another mechanism you can use, which is to wait for the actual element to be loaded – you can use this with a try/except block instead.

So instead of the time.sleep(5), you can replace the code with the following:
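A sketch of that explicit wait, using Selenium's standard WebDriverWait and expected_conditions helpers together with a try/except block; the locator is the same assumed ‘dismissable’ id as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # Wait up to 5 seconds for at least one video container to appear
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, '//div[@id="dismissable"]'))
    )
    videos = driver.find_elements_by_xpath('//div[@id="dismissable"]')
    print('Videos found:', len(videos))
except TimeoutException:
    print('Timed out waiting for the videos to load')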

This will wait up to a maximum of 5 seconds for the videos to load; otherwise it will time out.

Conclusion

With Selenium you are able to perform an endless number of tasks, from automation to automated testing – the sky is the limit here. You have learned how to scrape data from static and dynamic websites and how to perform JavaScript-driven actions like sending keys such as “Enter”. You can also look at BeautifulSoup next to extract and search through the data.
