How to scrape web pages with Julia
Julia can be used for fast web scraping, not just data analysis.
Updated May 26, 2023. Tested with Julia version 1.9.0.
Do you want to crawl and scrape web pages with the Julia language? This tutorial will show you how.
Ethical and legal web scraping
Before we begin, it goes without saying that you should only scrape web pages where you are not infringing any laws or rules by doing so. See the box below for a few websites you can scrape legally and ethically.
Introducing three Julia web scraping packages: HTTP.jl, Gumbo.jl and AbstractTrees.jl
We’re going to use three packages. Here’s what they each do:
HTTP.jl
Fetches the web pages we want to scrape.
Gumbo.jl
Parses these web pages after we’ve fetched them with HTTP.jl, so that we can more easily retrieve specific HTML elements (such as an <h1> heading).
AbstractTrees.jl
Allows us to extract specific HTML elements (again, such as an <h1> heading) by name, rather than by position.
Without AbstractTrees.jl, as far as I know, you’ll only be able to retrieve specific HTML elements by their position in the source code. For example, you’d need to know that the <h1> heading is, say, the 17th element in the source code. Of course, if you are scraping multiple websites, or even multiple page types from the same website, this won’t always be true, and your scrape won’t work.
Downloading and installing the packages
First, if you don’t have Julia itself installed, here’s how to install Julia on a Mac.
OK, once you have Julia installed, fire up your terminal of choice, and enter the Julia REPL by typing julia
at your command prompt.
You should see the green Julia command prompt:
julia>
Now it’s time to add the packages.
My favourite way to do this is to type ] (the right square bracket, located above the Return / Enter key on your keyboard) into the Julia terminal.
This changes the green prompt we saw above into a purple one that says:
pkg>
The three letters above stand for “package”. We’re now in Julia’s package manager.
We can now easily add the three web scraping and parsing packages we need, using the add
command:
add HTTP Gumbo AbstractTrees
Note that we separate the package names with spaces, rather than commas.
Hit Enter and let Julia do its thing.
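If you prefer to install packages from a script (or from the normal julia> prompt) instead of the package manager prompt, the same thing can be done with Julia’s built-in Pkg module:
using Pkg
Pkg.add(["HTTP", "Gumbo", "AbstractTrees"])   # equivalent to `add HTTP Gumbo AbstractTrees` at the pkg> prompt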
Once the packages have been installed, you can exit out of Julia’s package manager by pressing Delete / Backspace. You’ll then see the green prompt again:
julia>
“Using” the packages
Before we do any actual web scraping, we need to tell Julia we intend to use all three of our newly installed packages:
using HTTP, Gumbo, AbstractTrees
Unlike when we added the packages with the add command, we need to use commas to separate the package names in a using command.
Scraping example.com
Now that we’re all set up, let’s get to work scraping the homepage of example.com.
We’ll store it in a new variable we’ll create, and we’ll call this variable r (for “request”, as in “HTTP request”).
We’ll just use the HTTP package for now—we’ll use the others later.
We enter our command into the Julia REPL:
r = HTTP.get("https://example.com/")
The output
Here’s the output. We’ll walk through it below.
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Age: 433583
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sun, 02 Aug 2020 02:37:09 GMT
Etag: "3147526947+ident"
Expires: Sun, 09 Aug 2020 02:37:09 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (sjc/4E8D)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domai
⋮
1256-byte body
"""
Output walkthrough
Julia truncates the output, but we can see, in descending order:
- the protocol and status code (HTTP/1.1 200 OK). HTTP/1.1 is the protocol and 200 is the status code. A code of 200 means the page has loaded correctly, which is why we also see the word OK.
- the headers, which in this case consist of eleven key-value pairs, from Age: 433583 through to Content-Length: 1256.
- the first 1,000 or so characters of the source code of the actual HTML document. In this case, we see <!doctype html> through to just after <h1>Example Domain</h1>. Note that the full web page will be stored in our variable r; it’s just the output in the Julia REPL that’s truncated.
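As a quick preview, each of those pieces is available directly from the response object we stored in r (these are standard HTTP.jl response fields):
status  = r.status        # 200: the status code, as an integer
headers = r.headers       # the headers, as a vector of key-value pairs
html    = String(r.body)  # the HTML source as a String (String() empties r.body, so keep the result)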
Later on in this tutorial, we’ll explore techniques for laser-targeting just the status code or just the headers (see the table of contents at the top for links to these), but for now, let’s continue exploring how we can target individual HTML elements in the source code. First though, the next three sections show you how to set cookies and headers when web scraping with Julia and the HTTP.jl package. If you don’t need to set cookies or headers, feel free to skip past these.
How to set cookies when web scraping with Julia and the HTTP.jl package
This week, I needed to scrape our company’s staging server.
I knew that I would need my scraper to set a cookie to make our staged web application behave properly for the scrape.
To set the cookie, I started with the following (vastly simplified) code:
using HTTP
url = "https://staging.example.com"
r = HTTP.get(url; cookies=Dict("foo"=>123))
Notice that a semicolon separates the url argument from the cookies keyword argument—it’d be easy to misread this as a comma.
Below are some additional pointers on setting cookies when scraping websites with Julia.
How to set the value of a cookie to true with HTTP.jl
If you need to set the value of a cookie to true, set it to "true" (with the double quotes). For example:
r = HTTP.get(url; cookies=Dict("experiment"=>"true"))
(Don’t worry, it will come through correctly as true.) If you pass a bare true instead, you’ll get an error.
How to set multiple cookies with HTTP.jl
To set multiple cookies when scraping with Julia, you can do this:
r = HTTP.get(url; cookies=Dict("foo"=>123, "experiment"=>"true"))
You don’t need to set a cookie expiry with HTTP.jl
I didn’t bother setting a cookie expiry since my script sets the cookie each time the code scrapes a url.
So that wraps up the “how to” on setting cookies with Julia web scraping. Before we get back to the tutorial proper, a quick tip on setting headers.
How to set headers when web scraping with Julia and the HTTP.jl package
Today I needed to scrape a different staging server at work. This one didn’t require cookies to be set, but it did require a custom header. In this case, I needed to set an x-country header.
Here’s the relevant line of code which does that in Julia using the HTTP.jl package:
r = HTTP.get(url; headers=Dict("x-country" => "US"))
How to set both cookies and headers when scraping with Julia and the HTTP.jl package
If you need to set both cookies and headers when scraping with Julia’s HTTP.jl package, you can do so with something like this:
r = HTTP.get(url; headers=Dict("x-country" => "US"), cookies=Dict("ALLOW_ZONE_OVERRIDE"=>"true"))
And now, let’s continue our tutorial on scraping example.com.
How to scrape specific HTML elements with Gumbo.jl and Julia
Gumbo.jl is a Julia package that enables us to transform the relatively amorphous blob of HTML we crawled with HTTP.jl into a parseable HTML tree.
What this means in practice is that we can zero in on particular HTML elements, such as the <h1> heading. We’ll do just that below.
Turning our blob of HTML into a parseable HTML tree with Gumbo
Earlier, we stored our crawl of the homepage of example.com into a variable we called r (which stands for “request”, as in “HTTP request”), like this:
using HTTP, Gumbo, AbstractTrees
r = HTTP.get("https://example.com/")
Now we’ll make a “Gumbo” version of that r variable—one which will be easily traversable—and we’ll call it r_parsed.
The way to do that is below:
r_parsed = parsehtml(String(r.body))
The output from the Gumbo command
After inputting the command above, we get:
HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML>
<head>
<title>
Example Domain
</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
...
This output is also truncated, but we’re now ready to start digging down into our parsed HTML tree. We’ll do that right now.
Digging down into Gumbo’s parsed HTML tree
To dig down into Gumbo’s parsed HTML tree, we’ll need to append .root to r_parsed to start working with it. So we’ll be using r_parsed.root.
Now, let’s say we want to see just the <head> section of this source code, the source code of example.com.
Well, there are two main sections in the source code of any web page: the <head> (available at r_parsed.root[1]) and the <body> (which can be found at r_parsed.root[2]). (Remember that Julia uses 1-based indexing.)
Since we want the <head> section to start with, we’ll use this command:
head = r_parsed.root[1]
We get:
HTMLElement{:head}:<head>
<title>
Example Domain
</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
...
Now, if we want just the <body> section, we’ll use this command:
body = r_parsed.root[2]
… and we see this:
HTMLElement{:body}:<body>
<div>
<h1>
Example Domain
</h1>
<p>
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
</p>
<p>
<a href="https://www.iana.org/domains/example">
More information...
</a>
</p>
</div>
</body>
Getting the text of the h1 heading
OK, so how do we get the text of the <h1> heading?
Well, immediately inside the <body> we have a <div>, and then immediately inside that we have our <h1>. So that’s the first element inside the first element inside the <body>.
So we’ll do this:
h1 = body[1][1]
(Remember that we created the variable body above—it’s not built into Gumbo.)
… and we get this:
HTMLElement{:h1}:<h1>
Example Domain
</h1>
OK, so how do we get just the text of this <h1>?
Well, if we enter this command:
h1[1].text
… we get this:
"Example Domain"
Booyah!
Scraping a given HTML element by name in Julia using the AbstractTrees package
Okay, so the code above is all well and good if you’re scraping a particular page type where the template is relatively fixed, and thus the order of the HTML elements doesn’t change. Your company’s blog posts, for example.
But what if you want to scrape multiple websites, or even multiple page types on the same website? When the <h1> heading is, say, element number 17 on one page, and, say, element number 23 on another, using its position isn’t scalable.
Instead, let’s scrape our desired HTML element(s) by name.
And just to keep things interesting, this time let’s scrape the page <title> instead of the <h1> element.
How to scrape web page title elements using Julia
Below is the complete code to scrape the page <title> element from example.com.
using HTTP, Gumbo, AbstractTrees
r = HTTP.get("https://example.com/")
r_parsed = parsehtml(String(r.body))
root = r_parsed.root
for elem in PreOrderDFS(root)
    try
        if tag(elem) == :title
            println(AbstractTrees.children(elem)[1])
        end
    catch
        # Nothing needed here
    end
end
Running that code, we get:
Example Domain
… which, if you look at example.com, is the text in the page <title>. Boom!
Let’s walk through that code so we have it straight:
using HTTP, Gumbo, AbstractTrees
We’re using the two packages (HTTP and Gumbo) we covered earlier in this tutorial. We’re also using a package we haven’t yet used in this tutorial: AbstractTrees. It’s this package that will let us scrape by the name of the HTML element, rather than by its position.
r = HTTP.get("https://example.com/")
We used this exact line of code earlier in the tutorial. It uses the HTTP package to scrape example.com, but we’ll need to do more to get it into a usable form. Let’s keep going …
r_parsed = parsehtml(String(r.body))
We also used this exact line earlier in the tutorial. Here we’re using the parsehtml() function from the Gumbo package to make our scraped HTML more usable.
root = r_parsed.root
We haven’t used this line before in this tutorial. Here we’re creating a variable, root, which contains both the <head> and <body> sections from example.com.
for elem in PreOrderDFS(root)
Here we start a “for” loop. We’ll be iterating over each element (elem) in our root variable. We’ve wrapped root inside a function from the AbstractTrees package named PreOrderDFS(), which appears to be necessary to allow us to extract HTML elements by name.
try
Here we start a “try / catch” block. We need this because the tag() function in the next line of code accepts only HTMLElements, but our iterator will throw other things at it too—things that will break our script if we don’t wrap it in this “try / catch” block.
For example, our iterator will find the page <title>—which is an HTMLElement—but it will also find the text inside it, which has the type HTMLText. It’s this HTMLText which won’t be accepted by the tag() function on the next line—and would thus break our script if not for this “try / catch” block.
if tag(elem) == :title
Here we’re telling our code we only want the page <title> element. The tag() function comes from Gumbo; :title is simply the Julia symbol for the tag name we’re after.
println(AbstractTrees.children(elem)[1])
With the println() function, we are telling Julia to print the actual text in the page <title> of the example.com homepage to the terminal. Of course, you don’t have to print the page <title> to the terminal—you can write it to a dataframe, a CSV or a text file instead.
The rest of the code just closes out the if, the try/catch and the for loop, a necessary task.
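If you find yourself doing this for more than one element type, you could wrap the pattern above in a small helper function. This is just a sketch built from the same PreOrderDFS() / tag() approach (the name first_tag_text is my own, not part of any package):
using HTTP, Gumbo, AbstractTrees

# Return the text of the first element with the given tag name (e.g. :title or :h1),
# or nothing if no such element exists on the page
function first_tag_text(root, name::Symbol)
    for elem in PreOrderDFS(root)
        try
            if tag(elem) == name
                return AbstractTrees.children(elem)[1].text
            end
        catch
            # Skip nodes (such as HTMLText) that tag() doesn't accept
        end
    end
    return nothing
end

r = HTTP.get("https://example.com/")
root = parsehtml(String(r.body)).root
println(first_tag_text(root, :title))   # Example Domain
println(first_tag_text(root, :h1))      # Example Domain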
Let’s look at how to scrape meta descriptions next.
How to scrape web page meta descriptions using Julia
Since example.com doesn’t have a meta description, let’s use the meta description of the page you’re reading right now. If you look in the source code, you’ll see:
<meta name=description content="Julia can be used for fast web scraping, not just data analysis.">
But before we go on, there’s an important difference between the meta description and the title element that we need to take into account.
Generally, web pages have only one <title>, but they usually have multiple meta elements. See the box below for examples.
So the page we’re going to scrape—the one you’re reading right now—has multiple meta elements.
This means we can’t just scrape our meta description based on the fact that it’s a meta element—we’ll need to do more to target the meta description.
It’s easy when you know how, though. Here’s the code, followed by a code walkthrough:
using HTTP, Gumbo, AbstractTrees
url = "https://julia.school/julia/scraping/"
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
head = r_parsed.root[1]
for elem in PreOrderDFS(head)
    try
        if getattr(elem, "name") == "description"
            content = getattr(elem, "content")
            println(content)
        end
    catch
        # Nothing needed
    end
end
So here’s the code walkthrough:
using HTTP, Gumbo, AbstractTrees
Here we’re using the three scraping packages we’ve been using earlier in this tutorial.
url = "https://julia.school/julia/scraping/"
Here we put our target url into a variable named url for code elegance.
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
head = r_parsed.root[1]
This is the same use of the HTTP and Gumbo Julia packages as earlier in this tutorial—take a look above for an explanation of these lines.
for elem in PreOrderDFS(head)
This is an identical line to one in the page <title> scraping example earlier in this tutorial. We’re using the AbstractTrees Julia package to allow us to iterate over the elements in the <head> section of the webpage.
try
We need a “try / catch” block here because not all of the elements in the <head> will pass the conditional logic in the forthcoming lines. Without this “try / catch” block, our script will break upon encountering a failing condition.
if getattr(elem, "name") == "description"
Here we are using the getattr() function from the Gumbo Julia package to create some conditional logic.
We’re asking the script to check whether the name attribute of a given meta element (any element, really) is equal to description. If it is, then we’ve found our meta description.
(If you’re not familiar with meta descriptions, they take the form <meta name="description" content="foo bar"/>.)
content = getattr(elem, "content")
Here we’re using the same getattr() function from Gumbo to get the value of the meta description—which lives in its content attribute—and assign it to a variable named content.
println(content)
In the simple example code above, I have the script print the value of content—the actual text of the meta description—to the terminal. However, you can do whatever you want with it: add it to an array, write it to a dataframe, or append it to a new line in a text file, and so on.
The rest of the code is just closing out the loops. There’s no need to put anything in the catch statement; we only need it there so that the script skips all the elements—meta or otherwise—that lack an attribute of name="description".
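If you wanted to reuse this on multiple pages, you could wrap those steps into a function. This is only a sketch (the name scrape_meta_description is my own), and it assumes the page has a standard <meta name="description" ...> tag in its <head>:
using HTTP, Gumbo, AbstractTrees

# Return the meta description of a page, or nothing if the page doesn't have one
function scrape_meta_description(url::AbstractString)
    r = HTTP.get(url)
    head = parsehtml(String(r.body)).root[1]   # the <head> section
    for elem in PreOrderDFS(head)
        try
            if getattr(elem, "name") == "description"
                return getattr(elem, "content")
            end
        catch
            # Elements without a "name" attribute end up here; just skip them
        end
    end
    return nothing
end

println(scrape_meta_description("https://julia.school/julia/scraping/"))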
How to scrape JSON in inline JavaScript on web pages to get the value of a specific key
Today at work I needed to scrape some of our own webpages.
I wanted to get the internal ID for about 600 pages.
The ID was not included in the HTML, only in JSON within an inline JS script on the page.
Within the JSON, the ID had the form:
"id":[{"code":"12345",
… where 12345 was the desired ID. These IDs could vary in length but appeared to be either five or six digits.
My desired output was a table with two columns, one for the url, and one for the ID, like this:
| URL                            | ID     |
|--------------------------------|--------|
| example.com/product/moon-rover | 12345  |
| example.com/product/space-dust | 123456 |
I could have just asked one of our developers to do this for me, but I figured I’d get the results sooner if I did it myself.
The problem
If you’re familiar with Julia and its regex flavour (PCRE), then you may be able to immediately see the issue with using regex to find the JSON snippet—there were double quotes in the desired string, which would need to be escaped.
And unfortunately, using a leading backslash or even multiple leading backslashes to try to escape the double quotes does not work. Read on for the solution!
The solution
Here’s the code and the walkthrough. Note that this code is just for one url, rather than multiple, but you could wrap this code in a for loop to iterate over your array of urls.
using HTTP
Import the HTTP package we need
url = "https://www.example.com"
Create a variable named url and populate it
s = HTTP.get(url)
Scrape the url and store the results in a new variable, s
s = String(s)
Here we convert our scrape results into a string and overwrite the value of s with this string
s = replace(s, "\n" => "")
Here we delete newlines—expressed as \n—using Julia’s inbuilt replace() function. This allows the PCRE regex engine used by Julia to work smoothly. As before, we overwrite the value of s with the output.
s = replace(s, r""".*id":\[\{"code":""" => "")
In our s string, we now delete everything up to and including:
id":[{"code":
We only want what comes immediately afterwards, which in our example will be "12345".
The way we do this is with a regular expression, denoted by the r just before the first double quote.
Normally, we would just use regular double quotes like this:
s = replace(s, r"foo" => "bar")
… but if you recall the string we’re trying to match:
"id":[{"code":"12345",
… then you can see there are double quotes within it. So we need to escape those double quotes for the PCRE regex engine.
And the only way I’ve found to escape double quotes using the PCRE regex engine is to wrap the desired string in triple double quotes. Using a backslash or multiple backslashes to escape the double quotes in PCRE does not work.
So that’s why we are using the form:
s = replace(s, r"""foo""" => "bar")
Note that we don’t need triple quotes around "bar", since we’re not escaping anything there—we only need them for """foo""".
Our result will be a string that begins with "12345",
Here’s the thing: the fact that our newly truncated string s now starts with a double quote presents its own obstacle—which we can also solve—see below.
s = chop(s, head = 1, tail = 0)
Here we are going to chop off the leading double quote, so that our string starts with 12345 rather than "12345.
We use Julia’s inbuilt function chop() to do this. The arguments head and tail tell Julia how many characters to chop off the head and the tail of the string respectively. They default to head = 0 and tail = 1, so we explicitly set head to one and tail to zero.
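As a quick sanity check on how chop() behaves with those arguments, here is a small standalone example (the literal string is just for illustration):
# head = 1 removes the leading quote; tail = 0 leaves the end of the string alone
chop("\"12345\",", head = 1, tail = 0)   # result: 12345",  (the leading quote is gone)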
id = match(r"[0-9]+", s)
Finally, we use another regex match to get the numeric value of our ID, and store it in a variable named id. We don’t need to use triple double quotes in our Julia regex this time, because there aren’t any double quotes to escape—our match will have the form 12345 or 123456.
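For reference, here are those steps assembled into one sketch for a single url. The url below is a placeholder, and the code assumes the page’s inline JSON contains the "id":[{"code": pattern described above:
using HTTP

url = "https://www.example.com"   # placeholder url

s = HTTP.get(url)
s = String(s)                                    # the whole response as one string
s = replace(s, "\n" => "")                       # remove newlines so .* can match across lines
s = replace(s, r""".*id":\[\{"code":""" => "")   # delete everything up to and including the marker
s = chop(s, head = 1, tail = 0)                  # drop the leading double quote
id = match(r"[0-9]+", s)                         # match the numeric ID (returns nothing if absent)

println(id.match)   # e.g. 12345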
Getting just the status code of web pages
If you’re not sure what a status code is, check out the box below.
If you know what status codes are, let’s jump straight into some examples—scroll past the boxes to begin!
Getting the status code of a working page (status code 200)
Let’s say we want to get the status code of https://example.com/. This is a working page, which therefore has the status code 200.
To do this, we’ll have Julia visit the url above and assign the result to a variable r, like this:
r = HTTP.head("https://example.com/")
Now we can trivially get the status code:
r.status
You should see the following output:
200
As explained in the box above, 200 is the status code that indicates the page has loaded correctly.
Getting the status code of a 404 page
If we want to get the status code of a 404 page, then we need to add a bit more to our code.
Let’s say we want the status code of https://example.com/404, which has the status code 404, meaning the page could not be found.
And let’s say we want to store the status code in a new variable, r2.
If we just try this:
r2 = HTTP.head("https://example.com/404")
… then we get this error:
ERROR: HTTP.ExceptionRequest.StatusError
The way to avoid this error is by adding the argument status_exception=false to our command:
r2 = HTTP.head("https://example.com/404", status_exception=false)
Now we can get the status code:
r2.status
… and we get our answer:
404
Getting the status code of a 500 page
When a page cannot be loaded due to a server error, there’s a good chance it has the status code 500.
For example, this page returns a (deliberate) 500 error:
https://httpstat.us/500
Let’s say we want to store the status code of this page in a variable called r3.
Similarly to the 404 example above, we need to use the status_exception=false argument in our code:
r3 = HTTP.head("https://httpstat.us/500", status_exception=false)
… so that we can ask for the status code:
r3.status
… and get our answer:
500
By contrast, if we had left out the status_exception=false argument in our code above, we would have received the same sort of error we saw earlier in our 404 example:
ERROR: HTTP.ExceptionRequest.StatusError
Getting the status code of a 301 redirect
If you want to know which pages on a website have been 301 redirected, you may need to use HTTP.get() rather than HTTP.head(). As we’re about to see, the latter doesn’t work on every site.
For example, let’s say you want the status code for http://balloon.com, which 301 redirects to https://LABalloons.com, an LA-based balloon supply company.
If we simply run:
r = HTTP.head("http://balloon.com")
… we get an error, which reads, in part:
ERROR: HTTP.ExceptionRequest.StatusError(405, "HEAD", "/", HTTP.Messages.Response:
"""
HTTP/1.1 405 Method Not Allowed
We get the same 405 status error even if we run the command with an additional argument preventing redirects:
r = HTTP.head("http://balloon.com", redirect=false)
Instead, we need to use a full HTTP.get() request rather than HTTP.head().
We’ll also need to disallow redirects with the redirect=false argument:
r = HTTP.get("http://balloon.com", redirect=false)
We get, in part:
HTTP.Messages.Response:
"""
HTTP/1.1 301 Moved Permanently
Date: Fri, 12 Nov 2021 12:20:46 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 57
Connection: keep-alive
Location: https://laballoons.com
We can now see that our original request for http://balloon.com was 301 redirected to https://laballoons.com (the new URL is in the last line of the truncated output above).
On that note, if you want to see a worked example where I scrape 301 redirects and capture the destination URL we end up at, scroll a little further down the page.
Getting the status code of a 302 redirect
The process is very similar for a 302 redirect (a temporary redirect).
Let’s take a look at an example.
Did you know that Jeff Bezos owns relentless.com and 302 redirects it to Amazon?
So if we have Julia crawl it:
relentless_com = HTTP.get("http://relentless.com/", redirect=false)
We get, in part:
HTTP.Messages.Response:
"""
HTTP/1.1 302 Moved Temporarily
Date: Fri, 12 Nov 2021 12:32:33 GMT
Server: Server
Location: http://www.amazon.com
We can see that a 302 redirect (a temporary redirect) has occurred, and we have ended up on http://www.amazon.com. As an aside, of course Amazon is going to then redirect that non-secure amazon.com url to the secure https://www.amazon.com. We can have our script follow these rabbit trails to make sure we are getting the final url—in this case, the secure https://www.amazon.com.
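If you want to follow those rabbit trails programmatically, one approach is to keep requesting the Location header until the status is no longer a redirect. Here is a minimal sketch, assuming each redirecting response includes a Location header (it uses HTTP.jl's HTTP.header helper; final_url is my own name, not a library function):
using HTTP

# Follow 301/302 redirects manually and return the final URL
function final_url(url::AbstractString; max_hops::Int = 10)
    for _ in 1:max_hops
        r = HTTP.get(url; redirect = false, status_exception = false)
        if r.status in (301, 302)
            url = HTTP.header(r, "Location")   # hop to the URL the server points us to
        else
            return url
        end
    end
    return url   # give up after max_hops redirects
end

println(final_url("http://relentless.com/"))   # should end up at an amazon.com URL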
How to capture the target URL of a 301 or 302 redirect
Let’s say we want a table like this:
| url                     | status | target_url          |
|-------------------------|--------|---------------------|
| https://httpstat.us/301 | 301    | https://httpstat.us |
| https://httpstat.us/302 | 302    | https://httpstat.us |
If we give Julia the URLs in the first column, how can we have it fill out the status and target_url columns?
If you haven’t seen those httpstat.us URLs before, they’re a handy way to test different HTTP status codes. So https://httpstat.us/301 will always give you a 301 redirect (back to that site’s homepage), https://httpstat.us/302 will always give you a 302 redirect, and so on.
Let’s dive in, step by step:
Step 1: Import the Julia HTTP package
As in the other parts of this tutorial, we’ll need the HTTP.jl package to get the status code and target URL for each URL in our table.
If you don’t have the HTTP package installed already, simply hit the right square bracket ] at the Julia prompt in your terminal, then input add HTTP. Hit the delete key once done to exit the package manager.
Now we need to tell Julia we want to use the HTTP package:
using HTTP
Right, hopefully that was easy enough. Onto Step 2.
Step 2: Create an array of the URLs to be scraped
Now we need to put the URLs we want to work with—the two httpstat.us URLs—into an array so that we can iterate over them to get the status and target_url for each.
That way, we’ll be able to make the table we saw at the start of this section.
We can make our URL array like this:
url_array = ["https://httpstat.us/301", "https://httpstat.us/302"]
I’ll add some line breaks to make the code more readable (this isn’t necessary in production code), so in this tutorial our array will look like this:
url_array = [
    "https://httpstat.us/301",
    "https://httpstat.us/302"
]
Step 3: Create the “for loop” structure
The basic structure of the code we’re writing will be a “for loop”. When we add the for loop, our code so far will look like this:
using HTTP

url_array = [
    "https://httpstat.us/301",
    "https://httpstat.us/302"
]

for url in url_array
    # do something
end
We haven’t actually done any scraping yet; we’ll put that in to replace the # do something comment. First, though, we need to do something else.
Step 4: Add a try/catch block to handle errors
We don’t want our script to break if one of the URLs is unreachable—we want it to continue scraping the remaining urls in our array.
Let’s handle that case by adding a try/catch block:
using HTTP

url_array = [
    "https://httpstat.us/301",
    "https://httpstat.us/302"
]

for url in url_array
    try
        # do something
    catch
        # do something else if things go wrong
    end
end
Now we’re ready to add our scraper code! We’ll do that in the very next step.
Step 5: Get the status code
Earlier, we saw we could get the status code of a given web page from the .status field of the HTTP response.
We’ll use that now to store the status code in a variable we’ll call status:
...

for url in url_array
    try
        r = HTTP.head(url, redirect=false)
        status = r.status
        println(status)
    catch
        println("Something went wrong")
    end
end
The new line is status = r.status, in case you’re wondering. I also added an extra line just below that one to have the code show us the status code in the terminal for each url as it iterates over the array.
When we run the code, we get:
301
302
Okay that’s great! Let’s move on to getting the target URL of each URL in our array, then we’ll add them all into a table for the finished product.
Step 6a: Get the target URL
To get the target URL, we need to do a bit more work, since the response doesn’t expose it as a simple field the way it does the status code.
Let’s first take a look at what the value of r looks like for httpstat.us/301:
HTTP.Messages.Response:
"""
HTTP/1.1 301 Moved Permanently
Content-Length: 21
Content-Type: text/plain
Date: Thu, 25 May 2023 12:35:54 GMT
Server: Kestrel
Location: https://httpstat.us
Set-Cookie: ARRAffinity=
(truncated)
This is what we need—we can see our target URL in the line Location: https://httpstat.us.
We want to store this in a variable we’ll call target_url. However, we can’t actually extract our target URL from r just yet.
If we have Julia tell us the type of our variable r:
typeof(r)
… we get:
HTTP.Messages.Response
We need to turn this into a string so that we can extract our target URL.
Step 6b: Turn r into a string
Here’s how we turn the value of r into a string:
r_string = String(r)
Let’s put that into our script:
...

for url in url_array
    try
        r = HTTP.head(url, redirect=false)
        status = r.status
        println(status)
        r_string = String(r)
    catch
        println("Something went wrong")
    end
end
Let’s take a look at what r_string looks like.
When we input r_string into the terminal, for https://httpstat.us/301, it looks something like this:
"HTTP/1.1 301 Moved Permanently\r\nContent-Length: 21\r\nContent-Type: text/plain\r\nDate: Thu, 25 May 2023 12:35:54 GMT\r\nServer: Kestrel\r\nLocation: https://httpstat.us\r\nSet-Cookie: ARRAffinity=
(truncated)
Now—because we’re dealing with a string and not an HTTP.Messages.Response—we can use Julia’s replace() function to get rid of everything except the target URL. We’ll do that next.
Step 6c: Remove extraneous text from r_string
Now it’s time to remove the excess text from our newly created string, r_string.
The first thing we want to do is get rid of all the \n (newline) characters:
r_string = replace(r_string, "\n" => "")
Now our r_string variable looks like this:
"HTTP/1.1 301 Moved Permanently\rContent-Length: 21\rContent-Type: text/plain\rDate: Thu, 25 May 2023 12:35:54 GMT\rServer: Kestrel\rLocation: https://httpstat.us\rSet-Cookie: ARRAffinity=
(truncated)
We needed to get rid of those \n newline characters or else the rest of our forthcoming find-and-replace code won’t work properly. That’s because in the regex flavour used by Julia—PCRE—the “match all” expression .* doesn’t match newlines by default. You don’t necessarily need to understand that fully—just know that we needed to delete the \n newline characters before doing the next steps.
Right, that’s that.
Now we need to delete everything up to and including Location: (including the space after the colon).
We do that like this:
r_string = replace(r_string, r".*Location: " => "")
We get:
"https://httpstat.us\rSet-Cookie: ARRAffinity=
(truncated)
We’re getting there—we can see our target URL at the start of our variable. Now all we need to do is remove everything after it. We do that by matching \r and everything that comes after it:
r_string = replace(r_string, r"\r.*" => "")
And we get what we wanted the whole time, our target URL:
"https://httpstat.us"
Okay, let’s put that into our code:
...

for url in url_array
    try
        r = HTTP.head(url, redirect=false)
        status = r.status
        # Code to get target URL below:
        r_string = String(r)
        r_string = replace(r_string, "\n" => "")
        r_string = replace(r_string, r".*Location: " => "")
        r_string = replace(r_string, r"\r.*" => "")
    catch
        println("Something went wrong")
    end
end
So, our status code lives in our variable status, and our target URL lives in our variable r_string.
Okay, that was a big step. Now for the home stretch—it’s time to build our table:
Step 7: Write the CSV
If you recall, at the beginning of this section, we said we wanted a table that looks like this:
| url                     | status | target_url          |
|-------------------------|--------|---------------------|
| https://httpstat.us/301 | 301    | https://httpstat.us |
| https://httpstat.us/302 | 302    | https://httpstat.us |
Well, now we have everything we need. We just need to write our url, status and r_string (target URL) variables out to a CSV, row by row, to build our table.
Let’s amend our code accordingly. Here’s the full finished script, with commentary below:
using HTTP

url_array = [
    "https://example.com",
    "https://httpstat.us/301",
    "https://httpstat.us/302"
]

open("/Users/ron/Desktop/url-statuses.csv", "a") do io
    write(io, "url,status,target_url")
end

for url in url_array
    try
        r = HTTP.head(url, redirect=false)
        status = r.status
        # Code to get target URL below:
        r_string = String(r)
        r_string = replace(r_string, "\n" => "")
        r_string = replace(r_string, r".*Location: " => "")
        r_string = replace(r_string, r"\r.*" => "")
        open("/Users/ron/Desktop/url-statuses.csv", "a") do io
            write(io, "\n$url,$status,$r_string")
        end
    catch
        open("/Users/ron/Desktop/url-statuses.csv", "a") do io
            write(io, "\n$url,unknown,unknown")
        end
    end
end
Let’s go through the changes one at a time:
open("/Users/ron/Desktop/url-statuses.csv", "a") do io
write(io, "url,status,target_url")
end
This is where we write the column headers (url,status,target_url) into a new CSV we’re calling url-statuses.csv.
Because we only do this once, this code goes outside our for loop.
Then, inside our for loop, in our try block, we write the url, the status code and the target URL to our CSV:
open("/Users/ron/Desktop/url-statuses.csv", "a") do io
write(io, "\n$url,$status,$r_string")
end
And finally, in our catch block, we write some fallback values (unknown,unknown) if our script can’t get the status code and target URL:
open("/Users/ron/Desktop/url-statuses.csv", "a") do io
write(io, "\n$url,unknown,unknown")
end
That’s it! That was a big exercise—great job!
Of course, we could do more, such as writing in cases for when the URL doesn’t redirect—i.e. when it has a status code of 200, 404, and so on—but I didn’t want to make this tutorial too long!
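That said, if you did want to handle those non-redirect cases, one option is to only run the Location-extraction code when the status is a 301 or 302, and write a placeholder otherwise. Here is a sketch of how the inside of the try block could change (it also adds status_exception=false so that 404s and 500s don't throw); everything else stays the same:
r = HTTP.head(url, redirect=false, status_exception=false)
status = r.status
if status in (301, 302)
    # A redirect: pull the target URL out of the Location header as before
    r_string = String(r)
    r_string = replace(r_string, "\n" => "")
    r_string = replace(r_string, r".*Location: " => "")
    r_string = replace(r_string, r"\r.*" => "")
else
    # Not a redirect (200, 404, 500, ...): there's no target URL to record
    r_string = "none"
end
open("/Users/ron/Desktop/url-statuses.csv", "a") do io
    write(io, "\n$url,$status,$r_string")
end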
Crawling just the headers
If you’re not sure what headers are, check out the box below.
Worked example
Let’s fetch the headers from example.com.
As mentioned earlier, because we aren’t fetching the whole page, we don’t need to use HTTP.get(), which fetches the headers and all the HTML. Instead, we can save bandwidth by using HTTP.head(), which fetches only the headers.
Okay, moving on. First we fetch the page:
r = HTTP.head("https://example.com/")
… and then we request the headers:
headers = r.headers
This is the output I got:
12-element Array{Pair{SubString{String},SubString{String}},1}:
"Accept-Ranges" => "bytes"
"Age" => "390198"
"Cache-Control" => "max-age=604800"
"Content-Type" => "text/html; charset=UTF-8"
"Date" => "Sat, 01 Aug 2020 03:25:23 GMT"
"Etag" => "\"3147526947\""
"Expires" => "Sat, 08 Aug 2020 03:25:23 GMT"
"Last-Modified" => "Thu, 17 Oct 2019 07:18:26 GMT"
"Server" => "ECS (sjc/4E74)"
"Vary" => "Accept-Encoding"
"X-Cache" => "HIT"
"Content-Length" => "1256"
Note that some or all of the dates in your output may be different.
Extracting the relevant headers from the Julia array by index
You can then pull the relevant key-value pair(s) from this array as needed. For instance, if you want the Last-Modified header (which is the eighth element in the headers array we created), you could do this:
last_modified = headers[8]
… and you’d get this:
"Last-Modified" => "Thu, 17 Oct 2019 07:18:26 GMT"
Or if you want just the value, you could do this:
last_modified = headers[8][2]
… and you’d get this:
"Thu, 17 Oct 2019 07:18:26 GMT"
Extracting the relevant headers from the Julia array by key name
It’s probably a better idea to extract the value of a given header by its name (e.g. "Last-Modified") rather than by its position in the array.
For example, if you’re scraping multiple web pages, some might have 11-element header arrays, and some might have 9, 10, 12, and so on.
Here’s how to get the value of a given header (we’ll work with "Last-Modified") by name, using the same headers array we created above:
for header in headers
    if header[1] == "Last-Modified"
        println(header[2])
    end
end
Above, we iterate over each header in the headers array (for header in headers).
Then we say that if the header key (name) is “Last-Modified” (if header[1] == "Last-Modified"), then print the value of that header (println(header[2])). Of course, you could save it to another array or a dataframe; I’ve just provided you with a simple example above.