<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>R | Luis José Zapata Blog</title><link>https://luis-zapatabobadilla.netlify.app/tag/r/</link><atom:link href="https://luis-zapatabobadilla.netlify.app/tag/r/index.xml" rel="self" type="application/rss+xml"/><description>R</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 19 Sep 2021 00:00:00 +0000</lastBuildDate><image><url>https://luis-zapatabobadilla.netlify.app/media/icon_hud7092cb2ef3850b88d56971ecbc6aabf_452_512x512_fill_lanczos_center_2.png</url><title>R</title><link>https://luis-zapatabobadilla.netlify.app/tag/r/</link></image><item><title>Webscraping 201</title><link>https://luis-zapatabobadilla.netlify.app/project/webscraping-201/</link><pubDate>Sun, 19 Sep 2021 00:00:00 +0000</pubDate><guid>https://luis-zapatabobadilla.netlify.app/project/webscraping-201/</guid><description>&lt;p>More and more central government institutions are centralizing their main websites on the &lt;a href="https://www.gob.pe">www.gob.pe&lt;/a> platform. While the government also has a dedicated data website (&lt;a href="https://www.datosabiertos.gob.pe/" target="_blank" rel="noopener">Open Data Platform&lt;/a>), unstructured information (like reports or PDFs) often contains more details.&lt;/p>
&lt;p>For this example, we will retrieve the links to daily temperature reports from the main ports of the country, which are published daily in the &lt;strong>Daily Oceanographic Bulletin&lt;/strong> by IMARPE and made available on the government website. Since individual links do not follow a simple query structure, it is not possible to predict the daily URL. Therefore, we will use the government&amp;rsquo;s search tool to obtain the daily links.&lt;/p>
&lt;h2 id="gobpe-search-tool">Gob.pe Search Tool&lt;/h2>
&lt;p>To scrape information from the government website, we will use its search tool. This tool also operates with a query mechanism, simplifying the search process.&lt;/p>
&lt;p>To review the parameters used, visit &lt;a href="http://www.gob.pe">www.gob.pe&lt;/a> and search for the desired information. At first glance, the query starts with the term &lt;strong>/busquedas?&lt;/strong>.&lt;/p>
&lt;p>For example, suppose we want to view IMARPE&amp;rsquo;s latest publications within a defined time interval. The query is built as follows:&lt;/p>
&lt;p>We start with &lt;strong>/busquedas?&lt;/strong> to indicate we are using the search tool, followed by the type of content, which in this case is publications: &lt;strong>contenido[]=publicaciones&lt;/strong>, and the institution: &lt;strong>institucion[]=imarpe&lt;/strong>.&lt;/p>
&lt;p>Next, we specify the search parameters. To define a time interval, we use the parameters &lt;code>desde&lt;/code> (from) and &lt;code>hasta&lt;/code> (to): &lt;strong>desde=04-07-2018&amp;amp;hasta=16-09-2021&lt;/strong>. Since IMARPE also publishes weekly reports, we include a term to filter results with the word &amp;ldquo;diario&amp;rdquo; (daily): &lt;strong>term=Diario&lt;/strong>.&lt;/p>
&lt;p>Thus, the data request URL would be as follows:&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://www.gob.pe/">https://www.gob.pe/&lt;/a>&lt;strong>busquedas?&lt;strong>contenido[]&lt;/strong>=publicaciones&lt;/strong>&amp;amp;institucion[]&lt;strong>=imarpe&lt;/strong>&amp;amp;desde**=04-09-2021**&amp;amp;hasta**=16-09-2021**&amp;amp;term=**Diario**&lt;/p>
&lt;/blockquote>
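&lt;p>As a sketch, the same query can be assembled programmatically in R, which makes it easy to change the dates later (the parameter names come from the query above; the construction itself is illustrative, not part of the original workflow):&lt;/p>
&lt;pre>&lt;code class="language-r"># Build the gob.pe search URL from its query parameters
base = &amp;quot;https://www.gob.pe/busquedas?&amp;quot;
params = c(&amp;quot;contenido[]&amp;quot;   = &amp;quot;publicaciones&amp;quot;,
           &amp;quot;institucion[]&amp;quot; = &amp;quot;imarpe&amp;quot;,
           &amp;quot;desde&amp;quot;         = &amp;quot;04-09-2021&amp;quot;,
           &amp;quot;hasta&amp;quot;         = &amp;quot;16-09-2021&amp;quot;,
           &amp;quot;term&amp;quot;          = &amp;quot;Diario&amp;quot;)
# Glue each name=value pair, separating pairs with &amp;quot;&amp;amp;&amp;quot;
path = paste0(base, paste(names(params), params, sep = &amp;quot;=&amp;quot;, collapse = &amp;quot;&amp;amp;&amp;quot;))
path
&lt;/code>&lt;/pre>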
&lt;p>Entering this address into a web browser directs us to IMARPE&amp;rsquo;s latest daily bulletin publications. The same approach can be adapted to other institutions and content types by experimenting with the search tool and inspecting the resulting query.&lt;/p>
&lt;h2 id="webscraping-with-rvest">Webscraping with Rvest&lt;/h2>
&lt;p>With the web address defined, we proceed to retrieve the links to each publication.&lt;/p>
&lt;pre>&lt;code class="language-r">path = &amp;quot;https://www.gob.pe/busquedas?contenido[]=publicaciones&amp;amp;institucion[]=imarpe&amp;amp;reason=sheet&amp;amp;sheet=1&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>For web scraping, we will use the &lt;strong>Rvest&lt;/strong> library, which provides tools to read the HTML content of a webpage and extract information, such as URLs. The tool to download HTML content from a web address is &lt;strong>read_html&lt;/strong>.&lt;/p>
&lt;pre>&lt;code class="language-r">library(tidyverse)
library(lubridate)
library(stringr)
library(rvest)
imarpe = read_html(path)
imarpe
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## {html_document}
## &amp;lt;html lang=&amp;quot;es-pe&amp;quot;&amp;gt;
## [1] &amp;lt;head&amp;gt;\n&amp;lt;meta http-equiv=&amp;quot;Content-Type&amp;quot; content=&amp;quot;text/html; charset=UTF-8 ...
## [2] &amp;lt;body&amp;gt;\n&amp;lt;!-- Google Tag Manager (noscript) --&amp;gt;\n &amp;lt;noscript&amp;gt;&amp;lt;iframe s ...
&lt;/code>&lt;/pre>
&lt;p>Using the &lt;strong>read_html&lt;/strong> function on the Peruvian government results webpage, we obtain a set of HTML information stored in the variable &lt;em>imarpe&lt;/em>.&lt;/p>
&lt;p>The variable &lt;em>imarpe&lt;/em> now contains the page&amp;rsquo;s instructions for web browsers, organized into various tags. To select a tag, we use the &lt;strong>html_node&lt;/strong> function, and to retrieve the text within it, we use the &lt;strong>html_text&lt;/strong> function. Next, we use text-extraction functions (regular expressions; search for Regex tutorials on Google) to extract the URLs of IMARPE publications by matching text patterns.&lt;/p>
&lt;pre>&lt;code class="language-r">listas = imarpe %&amp;gt;%
html_node(&amp;quot;head&amp;quot;) %&amp;gt;% # Extract URLs from the head section
html_text() %&amp;gt;%
str_extract_all(paste(&amp;quot;href(.*?)&amp;quot;,year(today()),sep=&amp;quot;&amp;quot;)) %&amp;gt;% # Pattern: starts with href and ends with the current year
as.data.frame(col.names = &amp;quot;Text&amp;quot;) %&amp;gt;%
as_tibble() %&amp;gt;%
filter(str_detect(Text, pattern = &amp;quot;diario&amp;quot;)) %&amp;gt;% # Filter URLs containing &amp;quot;diario&amp;quot;
mutate(Text = paste(&amp;quot;https://www.gob.pe&amp;quot;,
str_remove_all(Text, &amp;quot;href=\\\\\\\&amp;quot;&amp;quot;),
sep = &amp;quot;&amp;quot;)) %&amp;gt;% # Convert to full web address
mutate(Date = dmy(str_sub(Text, -10, -1))) %&amp;gt;% # Extract the date
arrange(Date)
imarpe %&amp;gt;%
html_node(&amp;quot;head&amp;quot;) %&amp;gt;% # Extract URLs from the head section
html_text() %&amp;gt;%
str_extract_all(&amp;quot;href(.*?)2024&amp;quot;) %&amp;gt;% # Pattern: starts with href and ends with 2024
as.data.frame(col.names = &amp;quot;Text&amp;quot;) %&amp;gt;%
as_tibble() %&amp;gt;%
filter(str_detect(Text, pattern = &amp;quot;diario&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 2 × 1
## Text
## &amp;lt;chr&amp;gt;
## 1 &amp;quot;href=\\\&amp;quot;/institucion/imarpe/informes-publicaciones/6204259-boletin-diario-o…
## 2 &amp;quot;href=\\\&amp;quot;/institucion/imarpe/informes-publicaciones/6200693-boletin-diario-o…
&lt;/code>&lt;/pre>
&lt;p>To process the text, we must first review the HTML text and its characteristics. The text processing functions are as follows:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>First, extract text (&lt;strong>str_extract_all&lt;/strong>) that starts with &lt;strong>href&lt;/strong> (the tag indicating a URL in HTML) and ends with the year (the web addresses of interest include the year in their query). In Regex, the dot (.) matches any character, the asterisk (*) matches zero or more occurrences, and the question mark makes the quantifier lazy (non-greedy), so it captures as few characters as possible before reaching the year.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Convert the extracted vectors into a dataset.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Filter URLs containing the word &amp;ldquo;&lt;strong>diario&lt;/strong>&amp;rdquo; since we are looking for IMARPE&amp;rsquo;s &lt;strong>Daily Oceanographic Bulletins&lt;/strong>, which include this word in their URLs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Remove the &lt;strong>href&lt;/strong> HTML tag from the text using the &lt;strong>str_remove_all&lt;/strong> function.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Extract the dates by taking the last 10 characters of the URL (the query includes the date at the end of the URL).&lt;/p>
&lt;/li>
&lt;/ul>
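&lt;p>The effect of the lazy quantifier can be checked on a small, made-up string (illustrative only, not taken from the bulletin data):&lt;/p>
&lt;pre>&lt;code class="language-r">library(stringr)
x = 'href=&amp;quot;/a-2021&amp;quot; other href=&amp;quot;/b-2021&amp;quot;'
str_extract(x, &amp;quot;href(.*)2021&amp;quot;)  # greedy: runs all the way to the last 2021
str_extract(x, &amp;quot;href(.*?)2021&amp;quot;) # lazy: stops at the first 2021
&lt;/code>&lt;/pre>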
&lt;pre>&lt;code class="language-r">listas
&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## # A tibble: 2 × 2
## Text Date
## &amp;lt;chr&amp;gt; &amp;lt;date&amp;gt;
## 1 https://www.gob.pe/institucion/imarpe/informes-publicaciones/62006… 2024-11-19
## 2 https://www.gob.pe/institucion/imarpe/informes-publicaciones/62042… 2024-11-20
&lt;/code>&lt;/pre>
&lt;p>This provides a list of URLs, which can later be used to download each publication.&lt;/p>
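&lt;p>As a preview, a minimal download loop might look as follows. This is only a sketch: it assumes each page can be saved directly with &lt;strong>download.file&lt;/strong>, the destination filenames are invented for the example, and the real bulletins may require a second scraping step to locate the file on each page:&lt;/p>
&lt;pre>&lt;code class="language-r">library(purrr)
# Visit each URL in `listas` and save the page, naming files by date
walk2(listas$Text, listas$Date, function(url, fecha) {
  destino = paste0(&amp;quot;boletin_&amp;quot;, fecha, &amp;quot;.html&amp;quot;) # illustrative name
  try(download.file(url, destfile = destino, quiet = TRUE))
})
&lt;/code>&lt;/pre>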
&lt;blockquote>
&lt;p>Rvest is not the only option for web scraping in R. More advanced tools, such as headless browsers (simulated web browsers capable of clicking and entering data like a normal browser), can also be used. Examples include &lt;strong>RSelenium&lt;/strong> (more complex) and &lt;strong>webdriver&lt;/strong> (simpler).&lt;/p>
&lt;/blockquote>
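&lt;p>Within Rvest itself there is also a simpler route than regular expressions: query the anchor tags directly and read their &lt;code>href&lt;/code> attributes. This sketch reuses the &lt;em>imarpe&lt;/em> object from above and assumes rvest 1.0 or later (for &lt;strong>html_elements&lt;/strong>):&lt;/p>
&lt;pre>&lt;code class="language-r"># Pull every &amp;lt;a&amp;gt; node and read its href attribute directly
enlaces = imarpe %&amp;gt;%
  html_elements(&amp;quot;a&amp;quot;) %&amp;gt;%
  html_attr(&amp;quot;href&amp;quot;)
# Keep only the daily-bulletin links
enlaces[str_detect(enlaces, &amp;quot;diario&amp;quot;)]
&lt;/code>&lt;/pre>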
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Using web scraping techniques and text manipulation, it is possible to extract the URLs of IMARPE publications via the search tool on the government website (&lt;a href="http://www.gob.pe">www.gob.pe&lt;/a>).&lt;/p>
&lt;p>Using similar web scraping techniques, each publication can be accessed through a download loop to retrieve the uploaded information, which will be demonstrated in a future post.&lt;/p></description></item><item><title>Webscraping 101</title><link>https://luis-zapatabobadilla.netlify.app/project/webscraping-101/</link><pubDate>Sun, 28 Mar 2021 00:00:00 +0000</pubDate><guid>https://luis-zapatabobadilla.netlify.app/project/webscraping-101/</guid><description>&lt;p>The internet offers a wealth of free and easily accessible information. There are two types of available data: data that can be easily downloaded through a &amp;ldquo;&lt;em>download&lt;/em>&amp;rdquo; button and data visible on web pages.&lt;/p>
&lt;p>To obtain both types of data, it is not necessary to manually download them. It is possible to automate the process through code.&lt;/p>
&lt;p>In this post, I will introduce how to download the first type of data (data requiring only pressing a &amp;ldquo;&lt;em>download&lt;/em>&amp;rdquo; button) automatically via code. However, to achieve this, it is important to understand the mechanisms that enable downloading data online. Today, we will discuss two such mechanisms: the &lt;strong>POST&lt;/strong> and &lt;strong>GET&lt;/strong> methods.&lt;/p>
&lt;h2 id="http-get-and-post">HTTP: GET and POST&lt;/h2>
&lt;p>The &lt;strong>HTTP&lt;/strong> protocol is the communication protocol used by browsers to access web pages. HTTP regulates how the server (where the web page is hosted) sends resources to the client (the web browser). These resources contain the &amp;ldquo;&lt;em>instructions&lt;/em>&amp;rdquo; a web browser (e.g., Chrome) uses to display and interact with the web page.&lt;/p>
&lt;p>A &lt;strong>URL&lt;/strong> is a &lt;em>web address&lt;/em> that defines how a resource is located on the internet (e.g., &lt;a href="https://www.google.com">https://www.google.com&lt;/a>). When entering this address into a browser, it sends an HTTP request to the server, asking for the web resource (using a &lt;strong>GET&lt;/strong> or &lt;strong>POST&lt;/strong> method). The server receives the request, searches for the file in question (an HTML page, Excel file, etc.), and sends a &lt;strong>header&lt;/strong> back to the browser. The &lt;strong>header&lt;/strong> is essentially a message indicating whether the search was successful (whether the file exists). If successful, the server also sends the requested file.&lt;/p>
&lt;p>For data downloads, two HTTP methods are commonly used: &lt;strong>GET&lt;/strong> and &lt;strong>POST&lt;/strong>. The &lt;strong>GET&lt;/strong> method &amp;ldquo;&lt;em>retrieves&lt;/em>&amp;rdquo; resources from the server via a request. The &lt;strong>POST&lt;/strong> method &amp;ldquo;&lt;em>sends&lt;/em>&amp;rdquo; additional data to the server for processing, and, as we will see, can also be used to trigger downloads.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>GET&lt;/strong>: This HTTP method is used to &lt;em>retrieve information&lt;/em> from a server. It sends a request and receives a response in return (e.g., an Excel file or the data we want to download). A key feature of GET is that the data travels &amp;ldquo;visibly&amp;rdquo; in the browser through the URL, using &lt;strong>queries&lt;/strong>. A query is a message appended to the &lt;strong>URL&lt;/strong> specifying the required information. For example, suppose we want to request visitor data for November from a hypothetical website, &lt;a href="http://www.luis.com">www.luis.com&lt;/a>. The variables might be &lt;code>variable = visits&lt;/code> and &lt;code>month = november&lt;/code>. The request would look like: &lt;a href="http://www.luis.com/">www.luis.com/&lt;/a>&lt;strong>download?variable=visits&amp;amp;month=november&lt;/strong>. The bold section is the &lt;strong>query&lt;/strong>. Each website has its own syntax for structuring such requests.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>POST&lt;/strong>: Another HTTP method which, unlike GET, &lt;em>sends information&lt;/em> to be processed (and possibly stored) on the server. POST can also be used to download data. Its distinctive trait for downloads is the use of two URLs, which keeps the file location hidden until the request has been processed and thus improves server security. To download a file with POST, a &lt;strong>query&lt;/strong> is sent much as with GET; however, instead of answering immediately, the server opens a &lt;em>temporary link&lt;/em> at another URL from which the file can be downloaded.&lt;/p>
&lt;/li>
&lt;/ul>
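&lt;p>In R, the &lt;strong>httr&lt;/strong> package can assemble such a query from a named list, so the URL never has to be pasted together by hand (the website and parameters are the hypothetical ones from the example above):&lt;/p>
&lt;pre>&lt;code class="language-r">library(httr)
# Assemble the query string without sending a request
url = modify_url(&amp;quot;http://www.luis.com/download&amp;quot;,
                 query = list(variable = &amp;quot;visits&amp;quot;, month = &amp;quot;november&amp;quot;))
url # http://www.luis.com/download?variable=visits&amp;amp;month=november
&lt;/code>&lt;/pre>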
&lt;h3 id="example-of-downloading-with-get-and-post">Example of Downloading with GET and POST&lt;/h3>
&lt;p>To better understand the GET and POST protocols, we will use both methods to download data available on the &lt;a href="https://www.coes.org.pe/portal/" target="_blank" rel="noopener">COES&lt;/a> website. The site contains data that can be downloaded using both methods.&lt;/p>
&lt;h4 id="example-post-method">Example: POST Method&lt;/h4>
&lt;p>The COES Indicators Portal allows for downloading daily electricity production data up to the previous day.&lt;/p>
&lt;p>To access this section, go to the &lt;strong>Indicators Portal&lt;/strong> on the COES website. Clicking the export button sends a &lt;strong>request&lt;/strong> whose &lt;strong>query&lt;/strong> contains the selected date range, and in return an Excel file with the data is provided. &lt;em>How do we determine whether the method is POST or GET?&lt;/em> It is easy to check with web-analysis tools such as the &lt;strong>Chrome Developer Tools&lt;/strong>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/HTMLI.png" alt="Chrome Developer Tools" width="600px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-1">&lt;/span>Figure 1: Chrome Developer Tools&lt;/p>
&lt;/div>
&lt;p>Clicking on it reveals several tools, the most relevant being &lt;strong>Network&lt;/strong>. Selecting Network shows the interaction between our client (the browser) and the COES server. First, click &lt;em>Clear&lt;/em> to clean the workspace and focus on future interactions. Then click the &lt;strong>Export&lt;/strong> button on the COES webpage (after selecting the date range) to download the Excel file. Two interactions appear in the &lt;em>Network&lt;/em> area, which suggests a POST method (one URL for sending the request and another for downloading).&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/HTMLII.png" alt="Network" width="700px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-2">&lt;/span>Figure 2: Network&lt;/p>
&lt;/div>
&lt;p>Click the first interaction (&lt;strong>exportargeneracion&lt;/strong>) for more details. Since it’s the first interaction, it should contain the &lt;strong>request&lt;/strong> sent to the server.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/request.png" alt="Headers - POST" width="600px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-3">&lt;/span>Figure 3: Headers - POST&lt;/p>
&lt;/div>
&lt;p>We can see that the &lt;strong>Request Method&lt;/strong> is &lt;strong>POST&lt;/strong>. Additionally, the request URL is &lt;em>&amp;quot;&lt;a href="https://www.coes.org.pe/Portal/portalinformacion/exportargeneracion">https://www.coes.org.pe/Portal/portalinformacion/exportargeneracion&lt;/a>&amp;quot;&lt;/em>. Finally, the request sends three pieces of data: the start date (&lt;strong>fechaInicial&lt;/strong>), the end date (&lt;strong>fechaFinal&lt;/strong>), and an indicator (&lt;strong>indicador&lt;/strong>).&lt;/p>
&lt;p>Note the date format: &lt;em>day/month/year&lt;/em>, where both the day and month always have two digits. This format is crucial when sending the request.&lt;/p>
&lt;p>A &lt;strong>header&lt;/strong> acts as metadata (additional descriptive information) included in both the request and the response. There are 15 &lt;strong>request headers&lt;/strong> in the image. While not mandatory, they can be helpful, as we will see later.&lt;/p>
&lt;p>Finally, the &lt;strong>Status Code = 200&lt;/strong> indicates a successful interaction.&lt;/p>
&lt;p>Thus, we can conclude that the method for obtaining this information is &lt;strong>POST&lt;/strong>, including the requested date range in the query.&lt;/p>
&lt;blockquote>
&lt;p>Since it’s a POST method, the request does not provide an immediate response. We must check the second interaction to find the URL for downloading the data.&lt;/p>
&lt;/blockquote>
&lt;p>Clicking on the second interaction (&lt;strong>descargargeneracion&lt;/strong>) reveals the following information:&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/response.png" alt="Headers - GET" width="700px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-4">&lt;/span>Figure 4: Headers - GET&lt;/p>
&lt;/div>
&lt;p>Here, a &amp;ldquo;&lt;em>gateway&lt;/em>&amp;rdquo; to the data was opened at the URL: &amp;ldquo;&lt;a href="https://www.coes.org.pe/Portal/portalinformacion/descargargeneracion" target="_blank" rel="noopener">&lt;em>https://www.coes.org.pe/Portal/portalinformacion/descargargeneracion&lt;/em>&lt;/a>&amp;rdquo;. This interaction uses the GET method, meaning that entering the URL automatically retrieves the data from the server. Since the request data was already sent in the previous interaction, no additional &lt;strong>query&lt;/strong> is needed.&lt;/p>
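&lt;p>The two-step interaction can be sketched with the &lt;strong>httr&lt;/strong> package. The endpoint URLs and the field names (&lt;strong>fechaInicial&lt;/strong>, &lt;strong>fechaFinal&lt;/strong>, &lt;strong>indicador&lt;/strong>) are the ones inspected above, but the indicator value and the session handling are assumptions, not confirmed details:&lt;/p>
&lt;pre>&lt;code class="language-r">library(httr)
# Reuse one handle so both requests share the same session
h = handle(&amp;quot;https://www.coes.org.pe&amp;quot;)
# Step 1: POST the date range (day/month/year, two digits each)
POST(handle = h, path = &amp;quot;Portal/portalinformacion/exportargeneracion&amp;quot;,
     body = list(fechaInicial = &amp;quot;01/09/2020&amp;quot;,
                 fechaFinal   = &amp;quot;07/09/2020&amp;quot;,
                 indicador    = 0),      # assumed value
     encode = &amp;quot;form&amp;quot;)
# Step 2: GET the temporary download URL opened by the server
datos = GET(handle = h, path = &amp;quot;Portal/portalinformacion/descargargeneracion&amp;quot;)
writeBin(content(datos, &amp;quot;raw&amp;quot;), &amp;quot;generacion.xlsx&amp;quot;)
&lt;/code>&lt;/pre>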
&lt;h4 id="example-get-method">Example: GET Method&lt;/h4>
&lt;p>In this example, we will download data from the Daily Operation Evaluation Report (IEOD). This report provides more granular daily electricity consumption data, such as demand by geographic zones, large companies, resources, and more. To access this data, first, visit the IEOD platform and verify whether the download function uses &lt;strong>POST&lt;/strong> or &lt;strong>GET&lt;/strong>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/IEOD_GET.png" alt="IEOD" width="500px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-5">&lt;/span>Figure 5: IEOD&lt;/p>
&lt;/div>
&lt;p>After selecting a date on the IEOD platform and choosing the desired Excel file (&amp;quot;&lt;strong>Anexo1_Resumen_0709.xlsx&lt;/strong>&amp;quot;), &lt;strong>right-click on the link&lt;/strong> to copy the URL. In this case, the copied URL is: &amp;ldquo;&lt;a href="https://www.coes.org.pe/portal/browser/">https://www.coes.org.pe/portal/browser/&lt;/a>&lt;strong>download?url=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2020%2F09%20Setiembre%2F07%2FAnexo1_Resumen_0709.xlsx&lt;/strong>&amp;rdquo;. From the bold portion, it is evident that this is a &lt;strong>GET&lt;/strong> method (data added to the &lt;strong>request&lt;/strong> is visible in the URL, unlike POST). But which parts of the request change, and what do characters like &lt;strong>%2F&lt;/strong>, &lt;strong>%20&lt;/strong>, or &lt;strong>%C3%B3&lt;/strong> mean?&lt;/p>
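&lt;p>These percent codes are simply URL encoding: reserved and non-ASCII characters are written as &lt;strong>%&lt;/strong> followed by their byte value, so &lt;strong>%2F&lt;/strong> is &amp;ldquo;/&amp;rdquo;, &lt;strong>%20&lt;/strong> is a space, and &lt;strong>%C3%B3&lt;/strong> is the letter &amp;ldquo;ó&amp;rdquo;. Base R can translate in both directions:&lt;/p>
&lt;pre>&lt;code class="language-r"># Decode the percent-encoded query back into readable text
URLdecode(&amp;quot;Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2020&amp;quot;)
# Encode it again, including reserved characters such as &amp;quot;/&amp;quot;
URLencode(&amp;quot;Post Operación/Reportes/IEOD/2020&amp;quot;, reserved = TRUE)
&lt;/code>&lt;/pre>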
&lt;p>Using the &lt;strong>Network&lt;/strong> section of the &lt;strong>Chrome Developer Tools&lt;/strong>, we can see that the &lt;strong>query&lt;/strong> starts after &lt;em>download?url=&lt;/em>.&lt;/p>
&lt;div class="figure" style="text-align: center">
&lt;img src="images/query1.png" alt="Headers - GET" width="500px" />
&lt;p class="caption">&lt;span id="fig:unnamed-chunk-6">&lt;/span>Figure 6: Headers - GET&lt;/p>
&lt;/div>
&lt;p>From this, we can infer that the &lt;strong>GET&lt;/strong> portion carrying the data, or the &lt;strong>query&lt;/strong>, is: &amp;ldquo;Post Operación/Reportes/IEOD/**2020&lt;/p></description></item></channel></rss>