<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>R | Luis José Zapata Blog</title><link>https://luis-zapatabobadilla.netlify.app/tag/r/</link><atom:link href="https://luis-zapatabobadilla.netlify.app/tag/r/index.xml" rel="self" type="application/rss+xml"/><description>R</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 19 Sep 2021 00:00:00 +0000</lastBuildDate><image><url>https://luis-zapatabobadilla.netlify.app/media/icon_hud7092cb2ef3850b88d56971ecbc6aabf_452_512x512_fill_lanczos_center_2.png</url><title>R</title><link>https://luis-zapatabobadilla.netlify.app/tag/r/</link></image><item><title>Webscraping 201</title><link>https://luis-zapatabobadilla.netlify.app/project/webscraping-201/</link><pubDate>Sun, 19 Sep 2021 00:00:00 +0000</pubDate><guid>https://luis-zapatabobadilla.netlify.app/project/webscraping-201/</guid><description>&lt;p>More and more central government institutions are centralizing their main websites on the &lt;a href="https://www.gob.pe">www.gob.pe&lt;/a> platform. While the government also has a dedicated data website (&lt;a href="https://www.datosabiertos.gob.pe/" target="_blank" rel="noopener">Open Data Platform&lt;/a>), unstructured information (like reports or PDFs) often contains more details.&lt;/p>
&lt;p>For this example, we will retrieve the links to daily temperature reports from the main ports of the country, which are published daily in the &lt;strong>Daily Oceanographic Bulletin&lt;/strong> by IMARPE and made available on the government website. Since individual links do not follow a simple query structure, it is not possible to predict the daily URL. Therefore, we will use the government&amp;rsquo;s search tool to obtain the daily links.&lt;/p>
&lt;h2 id="gobpe-search-tool">Gob.pe Search Tool&lt;/h2>
&lt;p>To scrape information from the government website, we will use its search tool. This tool also operates with a query mechanism, simplifying the search process.&lt;/p>
&lt;p>To review the parameters used, visit &lt;a href="http://www.gob.pe">www.gob.pe&lt;/a> and search for the desired information. At first glance, the query starts with the term &lt;strong>/busquedas?&lt;/strong>.&lt;/p>
&lt;p>For example, suppose we want to view IMARPE&amp;rsquo;s latest publications within a defined time interval. The query is built as follows:&lt;/p>
&lt;p>We start with &lt;strong>/busquedas?&lt;/strong> to indicate we are using the search tool, followed by the type of content, which in this case is publications: &lt;strong>contenido[]=publicaciones&lt;/strong>, and the institution: &lt;strong>institucion[]=imarpe&lt;/strong>.&lt;/p>
&lt;p>Next, we specify the search parameters. To define a time interval, we use the parameters &lt;code>desde&lt;/code> (from) and &lt;code>hasta&lt;/code> (to): &lt;strong>desde=04-07-2018&amp;amp;hasta=16-09-2021&lt;/strong>. Since IMARPE also publishes weekly reports, we include a term to filter results with the word &amp;ldquo;diario&amp;rdquo; (daily): &lt;strong>term=Diario&lt;/strong>.&lt;/p>
&lt;p>Thus, the data request URL would be as follows:&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://www.gob.pe/">https://www.gob.pe/&lt;/a>&lt;strong>busquedas?&lt;strong>contenido[]&lt;/strong>=publicaciones&lt;/strong>&amp;amp;institucion[]&lt;strong>=imarpe&lt;/strong>&amp;amp;desde**=04-09-2021**&amp;amp;hasta**=16-09-2021**&amp;amp;term=**Diario**&lt;/p>
&lt;/blockquote>
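&lt;p>As a sketch, the same query can be assembled programmatically in R, which makes it easy to change the dates later (the parameter names come from the query above; the construction itself is illustrative, not part of the original workflow):&lt;/p>
&lt;pre>&lt;code class="language-r"># Build the gob.pe search URL from its query parameters
base = &amp;quot;https://www.gob.pe/busquedas?&amp;quot;
params = c(&amp;quot;contenido[]&amp;quot;   = &amp;quot;publicaciones&amp;quot;,
           &amp;quot;institucion[]&amp;quot; = &amp;quot;imarpe&amp;quot;,
           &amp;quot;desde&amp;quot;         = &amp;quot;04-09-2021&amp;quot;,
           &amp;quot;hasta&amp;quot;         = &amp;quot;16-09-2021&amp;quot;,
           &amp;quot;term&amp;quot;          = &amp;quot;Diario&amp;quot;)
# Glue each name=value pair, separating pairs with &amp;quot;&amp;amp;&amp;quot;
path = paste0(base, paste(names(params), params, sep = &amp;quot;=&amp;quot;, collapse = &amp;quot;&amp;amp;&amp;quot;))
path
&lt;/code>&lt;/pre>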
&lt;p>Entering this address into a web browser directs us to IMARPE&amp;rsquo;s latest daily bulletin publications. The same approach can be adapted to other institutions and content types by experimenting with the search tool and inspecting the resulting query.&lt;/p>
&lt;h2 id="webscraping-with-rvest">Webscraping with Rvest&lt;/h2>
&lt;p>With the web address defined, we proceed to retrieve the links to each publication.&lt;/p>
&lt;pre>&lt;code class="language-r">path = &amp;quot;https://www.gob.pe/busquedas?contenido[]=publicaciones&amp;amp;institucion[]=imarpe&amp;amp;reason=sheet&amp;amp;sheet=1&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>For web scraping, we will use the &lt;strong>Rvest&lt;/strong> library, which provides tools to read the HTML content of a webpage and extract information, such as URLs. The tool to download HTML content from a web address is &lt;strong>read_html&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
library(lubridate)
library(stringr)
library(rvest)
imarpe = read_html(path)
imarpe
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## {html_document}
## &amp;lt;html lang=&amp;quot;es-pe&amp;quot;&amp;gt;
## [1] &amp;lt;head&amp;gt;\n&amp;lt;meta http-equiv=&amp;quot;Content-Type&amp;quot; content=&amp;quot;text/html; charset=UTF-8 ...
## [2] &amp;lt;body&amp;gt;\n&amp;lt;!-- Google Tag Manager (noscript) --&amp;gt;\n &amp;lt;noscript&amp;gt;&amp;lt;iframe s ...
&lt;/code>&lt;/pre>
&lt;p>Using the &lt;strong>read_html&lt;/strong> function on the Peruvian government results webpage, we obtain a set of HTML information stored in the variable &lt;em>imarpe&lt;/em>.&lt;/p>
&lt;p>The variable &lt;em>imarpe&lt;/em> now contains the page&amp;rsquo;s instructions for web browsers, organized into various tags. To select a tag, we use the &lt;strong>html_node&lt;/strong> function, and to retrieve the text within it, we use the &lt;strong>html_text&lt;/strong> function. Next, we use text-extraction functions (regular expressions; search for Regex tutorials on Google) to extract the URLs of IMARPE publications by matching text patterns.&lt;/p>
&lt;pre>&lt;code class="language-r">listas = imarpe %&amp;gt;%
html_node(&amp;quot;head&amp;quot;) %&amp;gt;% # Extract URLs from the head section
html_text() %&amp;gt;%
str_extract_all(paste(&amp;quot;href(.*?)&amp;quot;,year(today()),sep=&amp;quot;&amp;quot;)) %&amp;gt;% # Pattern: starts with href and ends with the current year
as.data.frame(col.names = &amp;quot;Text&amp;quot;) %&amp;gt;%
as_tibble() %&amp;gt;%
filter(str_detect(Text, pattern = &amp;quot;diario&amp;quot;)) %&amp;gt;% # Filter URLs containing &amp;quot;diario&amp;quot;
mutate(Text = paste(&amp;quot;https://www.gob.pe&amp;quot;,
str_remove_all(Text, &amp;quot;href=\\\\\\\&amp;quot;&amp;quot;),
sep = &amp;quot;&amp;quot;)) %&amp;gt;% # Convert to full web address
mutate(Date = dmy(str_sub(Text, -10, -1))) %&amp;gt;% # Extract the date
arrange(Date)
imarpe %&amp;gt;%
html_node(&amp;quot;head&amp;quot;) %&amp;gt;% # Extract URLs from the head section
html_text() %&amp;gt;%
str_extract_all(&amp;quot;href(.*?)2024&amp;quot;) %&amp;gt;% # Pattern: starts with href and ends with 2024
as.data.frame(col.names = &amp;quot;Text&amp;quot;) %&amp;gt;%
as_tibble() %&amp;gt;%
filter(str_detect(Text, pattern = &amp;quot;diario&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 2 × 1
## Text
## &amp;lt;chr&amp;gt;
## 1 &amp;quot;href=\\\&amp;quot;/institucion/imarpe/informes-publicaciones/6204259-boletin-diario-o…
## 2 &amp;quot;href=\\\&amp;quot;/institucion/imarpe/informes-publicaciones/6200693-boletin-diario-o…
&lt;/code>&lt;/pre>
&lt;p>To process the text, we must first review the HTML text and its characteristics. The text processing functions are as follows:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>First, extract text (&lt;strong>str_extract_all&lt;/strong>) that starts with &lt;strong>href&lt;/strong> (the tag indicating a URL in HTML) and ends with the year (the web addresses of interest include the year in their query). In Regex, the dot (.) matches any character, the asterisk (*) matches zero or more occurrences, and the question mark makes the quantifier lazy (non-greedy), so it captures as few characters as possible before reaching the year.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert the extracted vectors into a dataset.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Filter URLs containing the word &amp;ldquo;&lt;strong>diario&lt;/strong>&amp;rdquo; since we are looking for IMARPE&amp;rsquo;s &lt;strong>Daily Oceanographic Bulletins&lt;/strong>, which include this word in their URLs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove the &lt;strong>href&lt;/strong> HTML tag from the text using the &lt;strong>str_remove_all&lt;/strong> function.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract the dates by taking the last 10 characters of the URL (the query includes the date at the end of the URL).&lt;/p>
&lt;/li>
&lt;/ul>
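&lt;p>The effect of the lazy quantifier can be checked on a small, made-up string (illustrative only, not taken from the bulletin data):&lt;/p>
&lt;pre>&lt;code class="language-r">library(stringr)
x = 'href=&amp;quot;/a-2021&amp;quot; other href=&amp;quot;/b-2021&amp;quot;'
str_extract(x, &amp;quot;href(.*)2021&amp;quot;)  # greedy: runs all the way to the last 2021
str_extract(x, &amp;quot;href(.*?)2021&amp;quot;) # lazy: stops at the first 2021
&lt;/code>&lt;/pre>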
&lt;pre>&lt;code class="language-r">listas
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 2 × 2
## Text Date
## &amp;lt;chr&amp;gt; &amp;lt;date&amp;gt;
## 1 https://www.gob.pe/institucion/imarpe/informes-publicaciones/62006… 2024-11-19
## 2 https://www.gob.pe/institucion/imarpe/informes-publicaciones/62042… 2024-11-20
&lt;/code>&lt;/pre>
&lt;p>This provides a list of URLs, which can later be used to download each publication.&lt;/p>
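&lt;p>As a preview, a minimal download loop might look as follows. This is only a sketch: it assumes each page can be saved directly with &lt;strong>download.file&lt;/strong>, the destination filenames are invented for the example, and the real bulletins may require a second scraping step to locate the file on each page:&lt;/p>
&lt;pre>&lt;code class="language-r">library(purrr)
# Visit each URL in `listas` and save the page, naming files by date
walk2(listas$Text, listas$Date, function(url, fecha) {
  destino = paste0(&amp;quot;boletin_&amp;quot;, fecha, &amp;quot;.html&amp;quot;) # illustrative name
  try(download.file(url, destfile = destino, quiet = TRUE))
})
&lt;/code>&lt;/pre>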
&lt;blockquote>
&lt;p>Rvest is not the only option for web scraping in R. More advanced tools, such as headless browsers (simulated web browsers capable of clicking and entering data like a normal browser), can also be used. Examples include &lt;strong>RSelenium&lt;/strong> (more complex) and &lt;strong>webdriver&lt;/strong> (simpler).&lt;/p>
&lt;/blockquote>
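&lt;p>Within Rvest itself there is also a simpler route than regular expressions: query the anchor tags directly and read their &lt;code>href&lt;/code> attributes. This sketch reuses the &lt;em>imarpe&lt;/em> object from above and assumes rvest 1.0 or later (for &lt;strong>html_elements&lt;/strong>):&lt;/p>
&lt;pre>&lt;code class="language-r"># Pull every &amp;lt;a&amp;gt; node and read its href attribute directly
enlaces = imarpe %&amp;gt;%
  html_elements(&amp;quot;a&amp;quot;) %&amp;gt;%
  html_attr(&amp;quot;href&amp;quot;)
# Keep only the daily-bulletin links
enlaces[str_detect(enlaces, &amp;quot;diario&amp;quot;)]
&lt;/code>&lt;/pre>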
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Using web scraping techniques and text manipulation, it is possible to extract the URLs of IMARPE publications via the search tool on the government website (&lt;a href="http://www.gob.pe">www.gob.pe&lt;/a>).&lt;/p>
&lt;p>Using similar web scraping techniques, each publication can be accessed through a download loop to retrieve the uploaded information, which will be demonstrated in a future post.&lt;/p></description></item><item><title>Webscraping 101</title><link>https://luis-zapatabobadilla.netlify.app/project/webscraping-101/</link><pubDate>Sun, 28 Mar 2021 00:00:00 +0000</pubDate><guid>https://luis-zapatabobadilla.netlify.app/project/webscraping-101/</guid><description>&lt;p>The internet offers a wealth of free and easily accessible information. There are two types of available data: data that can be easily downloaded through a &amp;ldquo;&lt;em>download&lt;/em>&amp;rdquo; button and data visible on web pages.&lt;/p>
&lt;p>To obtain both types of data, it is not necessary to manually download them. It is possible to automate the process through code.&lt;/p>
&lt;p>In this post, I will introduce how to download the first type of data (data requiring only pressing a &amp;ldquo;&lt;em>download&lt;/em>&amp;rdquo; button) automatically via code. However, to achieve this, it is important to understand the mechanisms that enable downloading data online. Today, we will discuss two such mechanisms: the &lt;strong>POST&lt;/strong> and &lt;strong>GET&lt;/strong> methods.&lt;/p>
&lt;h2 id="http-get-and-post">HTTP: GET and POST&lt;/h2>
&lt;p>The &lt;strong>HTTP&lt;/strong> protocol is the communication protocol used by browsers to access web pages. HTTP regulates how the server (where the web page is hosted) sends resources to the client (the web browser). These resources contain the &amp;ldquo;&lt;em>instructions&lt;/em>&amp;rdquo; a web browser (e.g., Chrome) uses to display and interact with the web page.&lt;/p>
&lt;p>A &lt;strong>URL&lt;/strong> is a &lt;em>web address&lt;/em> that defines how a resource is located on the internet (e.g., &lt;a href="https://www.google.com">https://www.google.com&lt;/a>). When entering this address into a browser, it sends an HTTP request to the server, asking for the web resource (using a &lt;strong>GET&lt;/strong> or &lt;strong>POST&lt;/strong> method). The server receives the request, searches for the file in question (an HTML page, Excel file, etc.), and sends a &lt;strong>header&lt;/strong> back to the browser. The &lt;strong>header&lt;/strong> is essentially a message indicating whether the search was successful (whether the file exists). If successful, the server also sends the requested file.&lt;/p>
&lt;p>For data downloads, two HTTP methods are commonly used: &lt;strong>GET&lt;/strong> and &lt;strong>POST&lt;/strong>. The &lt;strong>GET&lt;/strong> method &amp;ldquo;&lt;em>retrieves&lt;/em>&amp;rdquo; resources from the server via a request. The &lt;strong>POST&lt;/strong> method &amp;ldquo;&lt;em>sends&lt;/em>&amp;rdquo; additional data to the server for processing, and, as we will see, can also be used to trigger downloads.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>GET&lt;/strong>: This HTTP method is used to &lt;em>retrieve information&lt;/em> from a server. It sends a request and receives a response in return (e.g., an Excel file or the data we want to download). A key feature of GET is that the data travels &amp;ldquo;visibly&amp;rdquo; in the browser through the URL, using &lt;strong>queries&lt;/strong>. A query is a message appended to the &lt;strong>URL&lt;/strong> specifying the required information. For example, suppose we want to request visitor data for November from a hypothetical website, &lt;a href="http://www.luis.com">www.luis.com&lt;/a>. The variables might be &lt;code>variable = visits&lt;/code> and &lt;code>month = november&lt;/code>. The request would look like: &lt;a href="http://www.luis.com/">www.luis.com/&lt;/a>&lt;strong>download?variable=visits&amp;amp;month=november&lt;/strong>. The bold section is the &lt;strong>query&lt;/strong>. Each website has its own syntax for structuring such requests.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>POST&lt;/strong>: Another HTTP method which, unlike GET, &lt;em>sends information&lt;/em> to be processed (and possibly stored) on the server. POST can also be used to download data. Its distinctive trait for downloads is the use of two URLs, which keeps the file location hidden until the request has been processed and thus improves server security. To download a file with POST, a &lt;strong>query&lt;/strong> is sent much as with GET; however, instead of answering immediately, the server opens a &lt;em>temporary link&lt;/em> at another URL from which the file can be downloaded.&lt;/p>
&lt;/li>
&lt;/ul>
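&lt;p>In R, the &lt;strong>httr&lt;/strong> package can assemble such a query from a named list, so the URL never has to be pasted together by hand (the website and parameters are the hypothetical ones from the example above):&lt;/p>
&lt;pre>&lt;code class="language-r">library(httr)
# Assemble the query string without sending a request
url = modify_url(&amp;quot;http://www.luis.com/download&amp;quot;,
                 query = list(variable = &amp;quot;visits&amp;quot;, month = &amp;quot;november&amp;quot;))
url # http://www.luis.com/download?variable=visits&amp;amp;month=november
&lt;/code>&lt;/pre>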
&lt;h3 id="example-of-downloading-with-get-and-post">Example of Downloading with GET and POST&lt;/h3>
&lt;p>To better understand the GET and POST protocols, we will use both methods to download data available on the &lt;a href="https://www.coes.org.pe/portal/" target="_blank" rel="noopener">COES&lt;/a> website. The site contains data that can be downloaded using both methods.&lt;/p>
&lt;h4 id="example-post-method">Example: POST Method&lt;/h4>
&lt;p>The COES Indicators Portal allows for downloading daily electricity production data up to the previous day.&lt;/p>
&lt;p>To access this section, go to the &lt;strong>Indicators Portal&lt;/strong> on the COES website. Clicking the export button sends a &lt;strong>request&lt;/strong> whose &lt;strong>query&lt;/strong> contains the selected date range, and in return an Excel file with the data is provided. &lt;em>How do we determine whether the method is POST or GET?&lt;/em> It is easy to check with web-analysis tools such as the &lt;strong>Chrome Developer Tools&lt;/strong>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/HTMLI.png" alt="Chrome Developer Tools" width="600px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-1">&lt;/span>Figure 1: Chrome Developer Tools&lt;/p>
&lt;/div>
&lt;p>Clicking on it reveals several tools, the most relevant being &lt;strong>Network&lt;/strong>. Selecting Network shows the interaction between our client (the browser) and the COES server. First, click &lt;em>Clear&lt;/em> to clean the workspace and focus on future interactions. Then click the &lt;strong>Export&lt;/strong> button on the COES webpage (after selecting the date range) to download the Excel file. Two interactions appear in the &lt;em>Network&lt;/em> area, which suggests a POST method (one URL for sending the request and another for downloading).&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/HTMLII.png" alt="Network" width="700px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-2">&lt;/span>Figure 2: Network&lt;/p>
&lt;/div>
&lt;p>Click the first interaction (&lt;strong>exportargeneracion&lt;/strong>) for more details. Since it’s the first interaction, it should contain the &lt;strong>request&lt;/strong> sent to the server.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/request.png" alt="Headers - POST" width="600px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-3">&lt;/span>Figure 3: Headers - POST&lt;/p>
&lt;/div>
&lt;p>We can see that the &lt;strong>Request Method&lt;/strong> is &lt;strong>POST&lt;/strong>. Additionally, the request URL is &lt;em>&amp;quot;&lt;a href="https://www.coes.org.pe/Portal/portalinformacion/exportargeneracion">https://www.coes.org.pe/Portal/portalinformacion/exportargeneracion&lt;/a>&amp;quot;&lt;/em>. Finally, the request sends three pieces of data: the start date (&lt;strong>fechaInicial&lt;/strong>), the end date (&lt;strong>fechaFinal&lt;/strong>), and an indicator (&lt;strong>indicador&lt;/strong>).&lt;/p>
&lt;p>Note the date format: &lt;em>day/month/year&lt;/em>, where both the day and month always have two digits. This format is crucial when sending the request.&lt;/p>
&lt;p>A &lt;strong>header&lt;/strong> acts as metadata (additional descriptive information) included in both the request and the response. There are 15 &lt;strong>request headers&lt;/strong> in the image. While not mandatory, they can be helpful, as we will see later.&lt;/p>
&lt;p>Finally, the &lt;strong>Status Code = 200&lt;/strong> indicates a successful interaction.&lt;/p>
&lt;p>Thus, we can conclude that the method for obtaining this information is &lt;strong>POST&lt;/strong>, including the requested date range in the query.&lt;/p>
&lt;blockquote>
&lt;p>Since it’s a POST method, the request does not provide an immediate response. We must check the second interaction to find the URL for downloading the data.&lt;/p>
&lt;/blockquote>
&lt;p>Clicking on the second interaction (&lt;strong>descargargeneracion&lt;/strong>) reveals the following information:&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/response.png" alt="Headers - GET" width="700px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-4">&lt;/span>Figure 4: Headers - GET&lt;/p>
&lt;/div>
&lt;p>Here, a &amp;ldquo;&lt;em>gateway&lt;/em>&amp;rdquo; to the data was opened at the URL: &amp;ldquo;&lt;a href="https://www.coes.org.pe/Portal/portalinformacion/descargargeneracion" target="_blank" rel="noopener">&lt;em>https://www.coes.org.pe/Portal/portalinformacion/descargargeneracion&lt;/em>&lt;/a>&amp;rdquo;. This interaction uses the GET method, meaning that entering the URL automatically retrieves the data from the server. Since the request data was already sent in the previous interaction, no additional &lt;strong>query&lt;/strong> is needed.&lt;/p>
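&lt;p>The two-step interaction can be sketched with the &lt;strong>httr&lt;/strong> package. The endpoint URLs and the field names (&lt;strong>fechaInicial&lt;/strong>, &lt;strong>fechaFinal&lt;/strong>, &lt;strong>indicador&lt;/strong>) are the ones inspected above, but the indicator value and the session handling are assumptions, not confirmed details:&lt;/p>
&lt;pre>&lt;code class="language-r">library(httr)
# Reuse one handle so both requests share the same session
h = handle(&amp;quot;https://www.coes.org.pe&amp;quot;)
# Step 1: POST the date range (day/month/year, two digits each)
POST(handle = h, path = &amp;quot;Portal/portalinformacion/exportargeneracion&amp;quot;,
     body = list(fechaInicial = &amp;quot;01/09/2020&amp;quot;,
                 fechaFinal   = &amp;quot;07/09/2020&amp;quot;,
                 indicador    = 0),      # assumed value
     encode = &amp;quot;form&amp;quot;)
# Step 2: GET the temporary download URL opened by the server
datos = GET(handle = h, path = &amp;quot;Portal/portalinformacion/descargargeneracion&amp;quot;)
writeBin(content(datos, &amp;quot;raw&amp;quot;), &amp;quot;generacion.xlsx&amp;quot;)
&lt;/code>&lt;/pre>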
&lt;h4 id="example-get-method">Example: GET Method&lt;/h4>
&lt;p>In this example, we will download data from the Daily Operation Evaluation Report (IEOD). This report provides more granular daily electricity consumption data, such as demand by geographic zones, large companies, resources, and more. To access this data, first, visit the IEOD platform and verify whether the download function uses &lt;strong>POST&lt;/strong> or &lt;strong>GET&lt;/strong>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/IEOD_GET.png" alt="IEOD" width="500px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-5">&lt;/span>Figure 5: IEOD&lt;/p>
&lt;/div>
&lt;p>After selecting a date on the IEOD platform and choosing the desired Excel file (&amp;quot;&lt;strong>Anexo1_Resumen_0709.xlsx&lt;/strong>&amp;quot;), &lt;strong>right-click on the link&lt;/strong> to copy the URL. In this case, the copied URL is: &amp;ldquo;&lt;a href="https://www.coes.org.pe/portal/browser/">https://www.coes.org.pe/portal/browser/&lt;/a>&lt;strong>download?url=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2020%2F09%20Setiembre%2F07%2FAnexo1_Resumen_0709.xlsx&lt;/strong>&amp;rdquo;. From the bold portion, it is evident that this is a &lt;strong>GET&lt;/strong> method (data added to the &lt;strong>request&lt;/strong> is visible in the URL, unlike POST). But which parts of the request change, and what do characters like &lt;strong>%2F&lt;/strong>, &lt;strong>%20&lt;/strong>, or &lt;strong>%C3%B3&lt;/strong> mean?&lt;/p>
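&lt;p>These percent codes are simply URL encoding: reserved and non-ASCII characters are written as &lt;strong>%&lt;/strong> followed by their byte value, so &lt;strong>%2F&lt;/strong> is &amp;ldquo;/&amp;rdquo;, &lt;strong>%20&lt;/strong> is a space, and &lt;strong>%C3%B3&lt;/strong> is the letter &amp;ldquo;ó&amp;rdquo;. Base R can translate in both directions:&lt;/p>
&lt;pre>&lt;code class="language-r"># Decode the percent-encoded query back into readable text
URLdecode(&amp;quot;Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2020&amp;quot;)
# Encode it again, including reserved characters such as &amp;quot;/&amp;quot;
URLencode(&amp;quot;Post Operación/Reportes/IEOD/2020&amp;quot;, reserved = TRUE)
&lt;/code>&lt;/pre>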
&lt;p>Using the &lt;strong>Network&lt;/strong> section of the &lt;strong>Chrome Developer Tools&lt;/strong>, we can see that the &lt;strong>query&lt;/strong> starts after &lt;em>download?url=&lt;/em>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/query1.png" alt="Headers - GET" width="500px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-6">&lt;/span>Figure 6: Headers - GET&lt;/p>
&lt;/div>
&lt;p>From this, we can infer that the &lt;strong>GET&lt;/strong> portion carrying the data, or the &lt;strong>query&lt;/strong>, is: &amp;ldquo;Post Operación/Reportes/IEOD/**2020&lt;/p></description></item></channel></rss>