How to scrape web pages with Julia

Julia can be used for fast web scraping, not just data analysis.

Ron Erdos
Updated May 26, 2023
Tested with Julia version 1.9.0

Do you want to crawl and scrape web pages with the Julia language? This tutorial will show you how.

Before we begin, it goes without saying that you should only scrape web pages where you are not infringing any laws or rules by doing so. See the box below for a few websites you can scrape legally and ethically.

Introducing three Julia web scraping packages: HTTP.jl, Gumbo.jl and AbstractTrees.jl

We’re going to use three packages. Here’s what they each do:

HTTP.jl

Fetches (scrapes) web pages over HTTP

Gumbo.jl

Parses these web pages after we’ve scraped them with HTTP.jl, so that we can more easily retrieve specific HTML elements (such as an <h1> heading).

AbstractTrees.jl

Allows us to extract specific HTML elements (again, such as an <h1> heading) by name, rather than position.

Without AbstractTrees.jl, as far as I know, you’ll only be able to retrieve specific HTML elements by their position in the source code.

For example, you’d need to know that the <h1> heading is, say, the 17th element in the source code. Of course, if you are scraping multiple websites, or even multiple page types from the same websites, this won’t always be true, and your scrape won’t work.

Downloading and installing the packages

(First, if you don’t have Julia itself installed, here’s how to install Julia on a Mac.)

OK, once you have Julia installed, fire up your terminal of choice, and enter the Julia REPL by typing julia at your command prompt.

You should see the green Julia command prompt:

julia>

Now it’s time to add the packages.

My favourite way to do this is to type the ] (right square bracket, located above the Return / Enter key on your keyboard) into the Julia terminal.

This changes the green prompt we saw above into a purple one that says:

pkg>

The three letters above stand for “package”. We’re now in Julia’s package manager.

We can now easily add the three web scraping and parsing packages we need, using the add command:

add HTTP Gumbo AbstractTrees

Note that we separate the package names with spaces, rather than commas.

Hit Enter and let Julia do its thing.

Once the packages have been installed, you can exit out of Julia’s package manager by pressing Delete / Backspace. You’ll then see the green prompt again:

julia>
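If you would rather install the packages from a script than from the interactive pkg> prompt, Julia's standard Pkg library offers a functional equivalent. A minimal sketch, assuming the same three packages and a working internet connection:

```julia
# Programmatic equivalent of typing `add HTTP Gumbo AbstractTrees` at the pkg> prompt.
import Pkg

Pkg.add(["HTTP", "Gumbo", "AbstractTrees"])
```

This form is handy at the top of a setup script; for interactive use, the pkg> prompt shown above does the same job.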

“Using” the packages

Before we do any actual web scraping, we need to tell Julia we intend to use all three of our newly installed packages:

using HTTP, Gumbo, AbstractTrees

Unlike when we added the packages, we need to use commas to separate the package names in a using command.

Scraping example.com

Now that we’re all set up, let’s get to work scraping the homepage of example.com.

We’ll store it in a new variable we’ll create, and we’ll call this variable r (for “request”, as in “HTTP request”).

We’ll just use the HTTP package for now—we’ll use the others later.

We enter our command into the Julia REPL:

r = HTTP.get("https://example.com/")

The output

Here’s the output. We’ll walk through it below.

HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Age: 433583
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sun, 02 Aug 2020 02:37:09 GMT
Etag: "3147526947+ident"
Expires: Sun, 09 Aug 2020 02:37:09 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (sjc/4E8D)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1256

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai
⋮
1256-byte body
"""

Output walkthrough

Julia truncates the output, but we can see, from top to bottom:

  1. the protocol and status code (HTTP/1.1 200 OK). HTTP/1.1 is the protocol and 200 is the status code. A code of 200 means the page has loaded correctly, which is why we also see the word OK.
  2. the headers, which in this case consist of eleven key-value pairs from Age: 433583 through to Content-Length: 1256
  3. the first 1000 or so characters of the source code of the actual HTML document. In this case, we see <!doctype html> through to just after <h1>Example Domain</h1>. Note that the full web page will be stored in our variable r; it’s just the output in the Julia REPL that’s truncated.

Later on in this tutorial, we’ll explore techniques for laser-targeting just the status code or just the headers (see the table of contents at the top for links to these), but for now, let’s continue exploring how we can target individual HTML elements in the source code. First though, the next three sections show you how to set cookies and headers when web scraping with Julia and the HTTP.jl package. If you don’t need to set cookies or headers, feel free to skip past these.

How to set cookies when web scraping with Julia and the HTTP.jl package

This week, I needed to scrape our company’s staging server.

I knew that I would need my scraper to set a cookie to make our staged web application behave properly for the scrape.

To set the cookie, I started with the following (vastly simplified) code:

using HTTP
url = "https://staging.example.com"
r = HTTP.get(url; cookies=Dict("foo"=>123))

Notice that a semicolon separates the positional argument (url) from the cookies keyword argument. It’d be easy to misread this as a comma.
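To see what the semicolon does without touching the network, here is a small sketch using a made-up function, describe_request (hypothetical, not part of HTTP.jl), that takes a positional url plus keyword arguments in the same style as HTTP.get:

```julia
# `describe_request` is a hypothetical stand-in for illustration; it is not part of HTTP.jl.
# In the definition, everything before the semicolon is positional and
# everything after it is a keyword argument.
function describe_request(url; cookies=Dict{String,String}(), headers=Dict{String,String}())
    return "GET $url with $(length(cookies)) cookie(s) and $(length(headers)) header(s)"
end

# At the call site, keyword arguments may follow either a semicolon or a comma.
with_semicolon = describe_request("https://staging.example.com"; cookies=Dict("foo" => "123"))
with_comma = describe_request("https://staging.example.com", cookies=Dict("foo" => "123"))

println(with_semicolon)
println(with_semicolon == with_comma)
```

The semicolon is optional when calling a function, but many Julia programmers keep it because it makes the split between positional and keyword arguments obvious.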

Below are some additional pointers on setting cookies when scraping websites with Julia.

If you need to set the value of a cookie to true, set it to "true" (with the double quotes). For example:

r = HTTP.get(url; cookies=Dict("experiment"=>"true"))

(Don’t worry, it will come through correctly as true.)

If you pass the bare Boolean true instead of the string "true", you’ll get an error.

How to set multiple cookies with HTTP.jl

To set multiple cookies when scraping with Julia, you can do this:

r = HTTP.get(url; cookies=Dict("foo"=>123, "experiment"=>"true"))

I didn’t bother setting a cookie expiry since my script sets the cookie each time the code scrapes a url.

So that wraps up the “how to” on setting cookies with Julia web scraping. Before we get back to the tutorial proper, a quick tip on setting headers.

How to set headers when web scraping with Julia and the HTTP.jl package

Today I needed to scrape a different staging server at work. This didn’t require cookies to be set, but it did require a custom header. In this case, I needed to set an x-country header.

Here’s the relevant line of code which does that in Julia using the HTTP.jl package:

r = HTTP.get(url; headers=Dict("x-country" => "US"))

How to set both cookies and headers when scraping with Julia and the HTTP.jl package

If you need to set both cookies and headers when scraping with Julia’s HTTP.jl package, you can do so with something like this:

r = HTTP.get(url; headers=Dict("x-country" => "US"), cookies=Dict("ALLOW_ZONE_OVERRIDE"=>"true"))

And now, let’s continue our tutorial on scraping example.com.

How to scrape specific HTML elements with Gumbo.jl and Julia

Gumbo.jl is a Julia package that enables us to transform the relatively amorphous blob of HTML we crawled with HTTP.jl into a parseable HTML tree.

What this means in practice is we can zero in on particular HTML elements, such as the <h1> heading. We’ll do just that below.

Turning our blob of HTML into a parseable HTML tree with Gumbo

Earlier, we stored our crawl of the homepage of example.com into a variable we called r (which stands for “request”, as in “HTTP request”), like this:

using HTTP, Gumbo, AbstractTrees
r = HTTP.get("https://example.com/")

Now we’ll make a “Gumbo” version—which will be easily traversable—of that r variable, and we’ll call it r_parsed.

The way to do that is below:

r_parsed = parsehtml(String(r.body))

The output from the Gumbo command

After inputting the command above, we get:

HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML>
  <head>
    <title>
      Example Domain
    </title>
    <meta charset="utf-8"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
...

This output is also truncated, but we’re now ready to start digging down into our parsed HTML tree. We’ll do that right now.

Digging down into Gumbo’s parsed HTML tree

To dig down into Gumbo’s parsed HTML tree, we’ll need to append .root to r_parsed to start working with it. So we’ll be using r_parsed.root.

Now, let’s say we want to see just the <head> section of this source code, the source code of example.com.

Well, there are two main sections in the source code of any web page: the <head> (available at r_parsed.root[1]), and the <body> (which can be found at r_parsed.root[2]). (Remember that Julia uses 1-based indexing.)

Since we want the <head> section to start with, we’ll use this command:

head = r_parsed.root[1]

We get:

HTMLElement{:head}:<head>
  <title>
    Example Domain
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
...

Now, if we want just the <body> section, we’ll use this command:

body = r_parsed.root[2]

… and we see this:

HTMLElement{:body}:<body>
  <div>
    <h1>
      Example Domain
    </h1>
    <p>
      This domain is for use in illustrative examples in documents. You may use this
      domain in literature without prior coordination or asking for permission.
    </p>
    <p>
      <a href="https://www.iana.org/domains/example">
        More information...
      </a>
    </p>
  </div>
</body>

Getting the text of the h1 heading

OK, so how do we get the text of the <h1> heading?

Well, immediately inside the <body>, we have a <div>, and then immediately inside that, we have our <h1>. So that’s the first element inside the first element inside the <body>.

So we’ll do this:

h1 = body[1][1]

(Remember that we created the variable body above—it’s not built into Gumbo.)

… and we get this:

HTMLElement{:h1}:<h1>
  Example Domain
</h1>

OK, so how do we get just the text of this <h1>?

Well, if we enter this command:

h1[1].text

… we get this:

"Example Domain"

Booyah!

Scraping a given HTML element by name in Julia using the AbstractTrees package

Okay, so the code above is all well and good if you’re scraping a particular page type where the template is relatively fixed, and thus the order of the HTML elements doesn’t change. Your company’s blog posts, for example.

But what if you want to scrape multiple websites, or even multiple page types on the same website? When the <h1> heading is, say, element number 17 on one page, and, say, element number 23 on another page, using its position isn’t scalable.

Instead, let’s scrape our desired HTML element(s) by name.

And just to keep things interesting, this time, let’s scrape the page <title> instead of the <h1> element.

How to scrape web page title elements using Julia

Below is the complete code to scrape the page <title> element from example.com.

using HTTP, Gumbo, AbstractTrees

r = HTTP.get("https://example.com/")
r_parsed = parsehtml(String(r.body))
root = r_parsed.root

for elem in PreOrderDFS(root)
    try
        if tag(elem) == :title
            println(AbstractTrees.children(elem)[1])
        end
    catch
        # Nothing needed here
    end
end

Running that code, we get:

Example Domain

… which, if you look at example.com, is the text in the page <title>. Boom!

Let’s walk through that code so we have it straight:

using HTTP, Gumbo, AbstractTrees

We’re using the two packages (HTTP and Gumbo) we covered earlier in this tutorial. We’re also using a package we haven’t yet used in this tutorial: AbstractTrees. It’s this package that will let us scrape by the name of the HTML element, rather than its position.

r = HTTP.get("https://example.com/")

We used this exact line of code earlier in the tutorial. It uses the HTTP package to scrape example.com, but we’ll need to do more to get it into a usable form. Let’s keep going …

r_parsed = parsehtml(String(r.body))

We also used this exact line earlier in the tutorial. Here we’re using the parsehtml() function from the Gumbo package to make our scraped HTML more usable.

root = r_parsed.root

We haven’t used this line before in this tutorial. Here we’re creating a variable, root, which contains both the <head> and <body> sections from example.com.

for elem in PreOrderDFS(root)

Here we start a “for” loop. We’ll be iterating over each element (elem) in our root variable. We’ve wrapped root inside a function from the AbstractTrees package named PreOrderDFS(), which visits every node in the tree depth-first, parents before their children. That lets us test each node by name, wherever it sits in the page.

try

Here we start a “try / catch” block. We need this because the tag() function in the next line of code accepts only HTMLElements, but our iterator will throw other things at it too: things that will break our script if we don’t wrap it in this “try / catch” block.

For example, our iterator will find the page <title>, which is an HTMLElement, but it will also find the text inside it, which has the type HTMLText. It’s this HTMLText which won’t be accepted by the tag() function on the next line, and which would break our script if not for this “try / catch” block.

if tag(elem) == :title

Here we’re telling our code we only want the page <title> element. The tag() function is from Gumbo; it returns the element’s tag name as a Julia Symbol, which we compare with the symbol :title.

println(AbstractTrees.children(elem)[1])

With the println() function, we are telling Julia to print to the terminal the actual text in the page <title> of the example.com homepage. Of course, you don’t have to print the page <title> to the terminal; you can write it to a dataframe, a CSV or a text file.

The rest of the code just closes the if, try and for blocks with end keywords; a necessary task.
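As an alternative to the “try / catch” block, you can check each node’s type before calling tag(). The sketch below assumes Gumbo and AbstractTrees are installed, and it parses a small hard-coded HTML string instead of a live page so that it runs offline. first_text_by_tag is a hypothetical helper name, not a Gumbo function.

```julia
using Gumbo, AbstractTrees

# A small hard-coded page so the example runs without a network request.
html = "<html><head><title>Example Domain</title></head><body><div><h1>Hello</h1></div></body></html>"
root = parsehtml(html).root

# Hypothetical helper: return the text of the first element with the given tag,
# or `nothing` if no such element (or no text inside it) is found.
function first_text_by_tag(root, name::Symbol)
    for elem in PreOrderDFS(root)
        elem isa HTMLElement || continue      # skip HTMLText nodes, which have no tag
        tag(elem) == name || continue
        for child in elem.children
            child isa HTMLText && return child.text
        end
    end
    return nothing
end

println(first_text_by_tag(root, :title))
println(first_text_by_tag(root, :h1))
```

Checking the type up front means that genuine errors (a typo in a function name, say) are no longer silently swallowed by an empty catch block.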

Let’s look at how to scrape meta descriptions next.

How to scrape web page meta descriptions using Julia

Since example.com doesn’t have a meta description, let’s use the meta description of the page you’re reading right now. If you look in the source code, you’ll see:

<meta name=description content="Julia can be used for fast web scraping, not just data analysis.">

But before we go on, there’s an important difference between the meta description and the title element that we need to take into account.

Generally, web pages have only one <title>, but they do have multiple meta elements. See box below for examples.

So the page we’re going to scrape—the one you’re reading right now—has multiple meta elements.

This means we can’t just scrape our meta description based on the fact that it’s a meta element—we’ll need to do more to target the meta description.

It’s easy when you know how, though. Here’s the code, followed by a code walkthrough:

using HTTP, Gumbo, AbstractTrees

url = "https://julia.school/julia/scraping/"

r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
head = r_parsed.root[1]

for elem in PreOrderDFS(head)
	try
		if getattr(elem, "name") == "description"
			content = getattr(elem, "content")
			println(content)
		end
	catch
		# Nothing needed
	end
end

So here’s the code walkthrough:

using HTTP, Gumbo, AbstractTrees

Here we’re using the three scraping packages we’ve been using earlier in this tutorial.

url = "https://julia.school/julia/scraping/"

Here we put our target url into a variable named url for code elegance.

r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
head = r_parsed.root[1]

This is the same use of the HTTP and Gumbo Julia packages as earlier in this tutorial. Take a look above for an explanation of these lines.

for elem in PreOrderDFS(head)

This is nearly identical to a line in the page <title> scraping example earlier in this tutorial, except that we iterate over head rather than root. We’re using the AbstractTrees Julia package to allow us to iterate over the elements in the <head> section of the webpage.

try

We need a “try / catch” block here because getattr() throws an error for any node that has no name attribute (the <title> element and the text inside it, for example). Without this “try / catch” block, our script would stop at the first such node.

if getattr(elem, "name") == "description"

Here we are using the getattr() function from the Gumbo Julia package to create some conditional logic.

We’re asking the script to check if the name attribute of a given meta element (any element, really) is equal to description.

If it is, then we’ve found our meta description.

(If you’re not familiar with meta descriptions, they take the form <meta name="description" content="foo bar"/>).

content = getattr(elem, "content")

Here we’re using the same getattr() function from Gumbo to get the value of the meta description, which lives in its content attribute, and assign it to a variable named content.

println(content)

In the simple example code above, I have the script print the value of content (the actual text of the meta description) to the terminal. However, you can do whatever you want: add it to an array, write it to a dataframe, append it as a new line in a text file, and so on.

The rest of the code is just closing out the loops. There’s no need to put anything in the catch statement as we only need it there so that it skips all the elements—meta or otherwise—that lack an attribute of name="description".
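Because a page usually carries several meta elements, it can help to gather them all in one pass and then look up the one you want. This sketch assumes Gumbo and AbstractTrees are installed, and it parses a hard-coded HTML string rather than a live page. collect_meta is a hypothetical helper name.

```julia
using Gumbo, AbstractTrees

# Hard-coded sample page with three meta elements, so no network request is needed.
html = """
<html><head>
<title>Demo page</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Julia can be used for fast web scraping.">
</head><body><p>Hello</p></body></html>
"""
head = parsehtml(html).root[1]

# Hypothetical helper: map each meta element's name attribute to its content attribute.
function collect_meta(head)
    found = Dict{String,String}()
    for elem in PreOrderDFS(head)
        elem isa HTMLElement || continue          # text nodes have no attributes
        tag(elem) == :meta || continue
        attributes = attrs(elem)                  # Gumbo's dictionary of attributes
        if haskey(attributes, "name") && haskey(attributes, "content")
            found[attributes["name"]] = attributes["content"]
        end
    end
    return found
end

meta = collect_meta(head)
println(get(meta, "description", "(no meta description found)"))
```

Meta elements without both a name and a content attribute, such as the charset one, are simply skipped.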

How to scrape JSON in inline JavaScript on web pages to get the value of a specific key

Today at work I needed to scrape some of our own webpages.

I wanted to get the internal ID for about 600 pages.

The ID was not included in the HTML, only in JSON within an inline JS script on the page.

Within the JSON, the ID had the form:

"id":[{"code":"12345",

… where 12345 was the desired ID.

These IDs could vary in length but appeared to be either five or six digits.

My desired output was a table with two columns, one for the url, and one for the ID, like this:

URL                               ID
example.com/product/moon-rover    12345
example.com/product/space-dust    123456

I could have just asked one of our developers to do this for me, but I figured I’d get the results sooner if I did it myself.

The problem

If you’re familiar with Julia’s regex literals, then you may be able to immediately see the issue with using regex to find the JSON snippet: there were double quotes in the desired string, and because a regex literal of the form r"..." is itself delimited by double quotes, those inner quotes need to be escaped.

You can escape each one with a leading backslash (\"), but with this many double quotes the pattern quickly becomes hard to read. Read on for a tidier solution!

The solution

Here’s the code and the walkthrough. Note that this code is just for one url, rather than multiple, but you could wrap this code in a for loop to iterate over your array of urls.

using HTTP

Import the HTTP package we need


url = "https://www.example.com"

Create a variable named url and populate it


s = HTTP.get(url)

Scrape the url and store the results in a new variable, s


s = String(s)

Here we convert our scrape results into a string and overwrite the value of s with this string


s = replace(s, "\n" => "")

Here we delete newlines, expressed as \n, using Julia’s inbuilt replace() function. We do this because, by default, the . in a regex does not match a newline, so the .* in the next step would otherwise be unable to reach back across line breaks.

As before, we overwrite the value of s with the output.


s = replace(s, r""".*id":\[\{"code":""" => "")

In our s string, we now delete everything up to and including:

id":[{"code":

We only want what comes immediately afterwards, which in our example will be "12345".

The way we do this is with a regular expression, denoted by the r just before the first double quote.

Normally, we would just use regular double quotes like this:

s = replace(s, r"foo" => "bar")

… but if you recall the string we’re trying to match:

"id":[{"code":"12345",

… then you can see there are double quotes within it. Left as they are, the first of them would end the regex literal early, so we need to deal with them.

The tidiest way I’ve found to do this is to wrap the pattern in triple double quotes.

Escaping each double quote with a backslash (\") also works, but it makes a pattern like this one much harder to read.

So that’s why we are using the form:

s = replace(s, r"""foo""" => "bar")

Note that we don’t need triple quotes around "bar" since we’re not escaping anything there—we only need it for """foo""".

Our result will be a string that begins with "12345",

Here’s the thing: the fact that our newly-truncated string s now starts with a double quote presents its own obstacle—which we can also solve—see below.


s = chop(s, head = 1, tail = 0)

Here we are going to chop off the leading double quote, so that our string starts with 12345 rather than "12345

We use Julia’s inbuilt function chop() to do this. The keyword arguments head and tail tell Julia how many characters to chop off the head and tail respectively. By default, head is zero and tail is one, so we set head to one to remove the leading double quote, and explicitly set tail to zero to leave the end of the string alone.
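A quick check of how chop() behaves on a string shaped like ours (the sample value is made up for illustration):

```julia
# Made-up sample: a string that starts with a double quote, like our truncated s.
s = "\"12345\",\"name\""

no_keywords = chop(s)                      # no keywords given: only the last character goes
front_only = chop(s, head = 1, tail = 0)   # drop one character from the front only
both_ends = chop(s, head = 1, tail = 1)    # drop one character from each end

println(no_keywords)
println(front_only)
println(both_ends)
```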


id = match(r"[0-9]+", s)

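Finally, we use another regex match to get the numeric value of our ID, and store the result in a variable named id. Note that match() returns a RegexMatch object rather than a plain string; the matched text itself is available as id.match. We don’t need to use triple double quotes in our Julia regex this time, because there aren’t any double quotes to escape. Our match will have the form 12345 or 123456.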
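The delete-then-match steps above can also be collapsed into a single match() call with a capture group. The sketch below runs on a made-up snippet of page source rather than a live page, and extract_id is a hypothetical helper name:

```julia
# Made-up page source containing inline JSON of the shape described above.
page_source = """<script>window.data = {"id":[{"code":"12345","label":"moon-rover"}]};</script>"""

# Triple-quoted regex, so the double quotes inside the pattern need no escaping.
# The parentheses capture the run of digits that follows "code":"
id_pattern = r"""id":\[\{"code":"([0-9]+)"""

# Hypothetical helper: return the ID as a String, or `nothing` if the pattern is absent.
function extract_id(source::AbstractString)
    m = match(id_pattern, source)
    return m === nothing ? nothing : String(m.captures[1])
end

println(extract_id(page_source))
println(extract_id("<p>no JSON here</p>"))
```

Because the pattern does not begin with .*, there is no need to strip newlines first, and a page without the JSON snippet yields nothing instead of a stray match. To build the two-column table, you could call extract_id on the source of each url inside a for loop.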

Getting just the status code of web pages

If you’re not sure what a status code is, check out the box below.

If you know what status codes are, let’s jump straight into some examples—scroll past the boxes to begin!

Getting the status code of a working page (status code 200)

Let’s say we want to get the status code of https://example.com/. This is a working page, which therefore has the status code of 200.

To do this, we’ll have Julia request the url above and assign the response to a variable r, like this:

r = HTTP.head("https://example.com/")

Now we can trivially get the status code:

r.status

You should see the following output:

200

As explained in the box above, 200 is the status code that indicates the page has loaded correctly.

Getting the status code of a 404 page

If we want to get the status code of a 404 page, then we need to add a bit more to our code.

Let’s say we want the status code of https://example.com/404, which has the status code of 404, which means the page could not be found.

And let’s say we want to store the status code in a new variable, r2.

If we just try this:

r2 = HTTP.head("https://example.com/404")

… then we get this error:

ERROR: HTTP.ExceptionRequest.StatusError

The way to avoid this error is by adding the argument status_exception=false to our command:

r2 = HTTP.head("https://example.com/404", status_exception=false)

Now we can get the status code:

r2.status

… and we get our answer:

404

Getting the status code of a 500 page

When a page cannot be loaded due to a server error, there’s a good chance it has the status code of 500.

For example, this page has a (deliberate) 500 error:

https://httpstat.us/500

Let’s say we want to store the status code of this page in a variable called r3.

Similarly to the 404 example above, we need to use the status_exception=false argument in our code:

r3 = HTTP.head("https://httpstat.us/500", status_exception=false)

… so that we can ask for the status code:

r3.status

… and get our answer:

500

By contrast, if we had left out the status_exception=false argument in our code above, we would have received the same sort of error we saw earlier in our 404 example:

ERROR: HTTP.ExceptionRequest.StatusError

Getting the status code of a 301 redirect

If you want to know which pages on a website have been 301 redirected, you may need to use HTTP.get() rather than HTTP.head(). Some servers reject HEAD requests outright, as the example below shows.

For example, let’s say you want the status code for http://balloon.com, which 301 redirects to https://LABalloons.com, an LA-based balloon supply company.

If we simply run:

r = HTTP.head("http://balloon.com")

… we get an error, which reads, in part:

ERROR: HTTP.ExceptionRequest.StatusError(405, "HEAD", "/", HTTP.Messages.Response:
"""
HTTP/1.1 405 Method Not Allowed

We get the same 405 status error even if we run the command with an additional argument preventing redirects:

r = HTTP.head("http://balloon.com", redirect=false)

Instead, we need to use a full HTTP.get() request instead of HTTP.head().

We’ll also need to disallow redirects with that redirect=false argument:

r = HTTP.get("http://balloon.com", redirect=false)

We get, in part:

HTTP.Messages.Response:
"""
HTTP/1.1 301 Moved Permanently
Date: Fri, 12 Nov 2021 12:20:46 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 57
Connection: keep-alive
Location: https://laballoons.com

We can now see that our original request for http://balloon.com was 301 redirected to https://laballoons.com (the new URL is in the last line of the truncated output above).

On that note, if you want to see a worked example where I scrape 301 redirects and capture the destination URL we end up at, scroll a little further down the page.

Getting the status code of a 302 redirect

The process is very similar for a 302 redirect (a temporary redirect).

Let’s take a look at an example.

Did you know that Jeff Bezos owns relentless.com and 302 redirects it to Amazon?

So if we have Julia crawl it:

relentless_com = HTTP.get("http://relentless.com/", redirect=false)

We get, in part:

HTTP.Messages.Response:
"""
HTTP/1.1 302 Moved Temporarily
Date: Fri, 12 Nov 2021 12:32:33 GMT
Server: Server
Location: http://www.amazon.com

We can see that a 302 redirect (a temporary redirect) has occurred, and we have ended up on http://www.amazon.com. As an aside, of course Amazon is going to then redirect that non-secure Amazon.com url to the secure https://www.amazon.com. We can have our script follow these rabbit trails to make sure we are getting the final url—in this case, the secure https://www.amazon.com.
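To tie the status code examples together, here is a small sketch of a helper that sorts a status code into the broad classes seen above. status_class is a hypothetical name; the commented-out lines show how it might be combined with HTTP.jl, using the same redirect and status_exception arguments as the examples above.

```julia
# Hypothetical helper: classify an HTTP status code by its range.
function status_class(code::Integer)
    if 200 <= code <= 299
        return "success"
    elseif 300 <= code <= 399
        return "redirect"
    elseif 400 <= code <= 499
        return "client error"
    elseif 500 <= code <= 599
        return "server error"
    else
        return "other"
    end
end

# The five status codes covered in this section.
for code in (200, 301, 302, 404, 500)
    println(code, " => ", status_class(code))
end

# With HTTP.jl loaded, the same helper could be applied to a live response, e.g.:
# r = HTTP.get("https://example.com/", redirect=false, status_exception=false)
# println(status_class(r.status))
```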

How to capture the target URL of a 301 or 302 redirect

Let’s say we want a table like this:

url                        status    target_url
https://httpstat.us/301    301       https://httpstat.us
https://httpstat.us/302    302       https://httpstat.us

If we give Julia the URLs in the first column, how can we have it fill out the status and target_url columns?

If you haven’t seen those httpstat.us URLs before, they’re a handy way to test different HTTP status codes.

So https://httpstat.us/301 will always give you a 301 redirect (back to that site’s homepage), https://httpstat.us/302 will always give you a 302 redirect, and so on.

Let’s dive in, step by step:

Step 1: Import the Julia HTTP package

As in the other parts of this tutorial, we’ll need the HTTP.jl package to get the status code and target URL for each URL in our table.

If you don’t have the HTTP package installed already, simply hit the right square bracket ] at the Julia prompt in your terminal, then input add HTTP. Hit the delete key once done to exit the package manager.

Now we need to tell Julia we want to use the HTTP package:

using HTTP

Right, hopefully that was easy enough. Onto Step 2.

Step 2: Create an array of the URLs to be scraped

Now we need to put the two httpstat.us URLs we want to work with into an array so that we can iterate over them to get the status and target_url for each.

That way, we’ll be able to make the table we saw at the start of this section.

We can make our URL array like this:

url_array = ["https://httpstat.us/301", "https://httpstat.us/302"]

In this tutorial, I’ll add some line breaks to make the code more readable, but this isn’t necessary in production code. So in this tutorial, our array will look like this:

url_array = [
  "https://httpstat.us/301",
  "https://httpstat.us/302"
]

Step 3: Create the “for loop” structure

The basic structure of the code we’re writing will be a “for loop”. When we add the for loop, our code so far will look like this:

using HTTP

url_array = [
  "https://httpstat.us/301",
  "https://httpstat.us/302"
]

for url in url_array
  # do something
end

We haven’t actually done any scraping yet; we’ll put that in to replace the # do something comment. First, though, we need to do something else.

Step 4: Add a try/catch block to handle errors

We don’t want our script to break if one of the URLs is unreachable—we want it to continue scraping the remaining urls in our array.

Let’s handle that case by adding a try/catch block:

using HTTP

url_array = [
  "https://httpstat.us/301",
  "https://httpstat.us/302"
]

for url in url_array
  try
    # do something
  catch
    # do something else if things go wrong
  end
end

Now we’re ready to add our scraper code! We’ll do that in the very next step.

Step 5: Get the status code

Earlier, we saw we could get the status code of a given web page by reading the .status field of the response that the HTTP package returns.

We’ll use that now to store the status code in a variable we’ll call status:

...

for url in url_array
  try
    r = HTTP.head(url, redirect=false)
    status = r.status
    println(status)
  catch
    println("Something went wrong")
  end
end

The new lines are r = HTTP.head(url, redirect=false), which makes the request without following redirects, and status = r.status, which stores the status code.

I also added in an extra line just below that one to have the code show us the status code in the terminal for each url as it iterates over the array.

When we run the code, we get:

301
302

Okay that’s great! Let’s move on to getting the target URL of each URL in our array, then we’ll add them all into a table for the finished product.

Step 6a: Get the target URL

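To get the target URL, we need to do a bit more work. The response doesn’t expose the target URL as a simple field the way it does .status, so in the steps below we’ll dig it out of the response text ourselves. (HTTP.jl also provides HTTP.header(r, "Location") for reading a single header by name, which you may prefer in your own scripts.)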

Let’s first take a look at what the value of r looks like for httpstat.us/301:

HTTP.Messages.Response:
"""
HTTP/1.1 301 Moved Permanently
Content-Length: 21
Content-Type: text/plain
Date: Thu, 25 May 2023 12:35:54 GMT
Server: Kestrel
Location: https://httpstat.us
Set-Cookie: ARRAffinity=

(truncated)

This is what we need—we can see our target URL in the line

Location: https://httpstat.us

We want to store this in a variable we’ll call target_url.

However, we can’t actually extract our target URL from r just yet.

If we have Julia tell us the type of our variable r:

typeof(r)

… we get:

HTTP.Messages.Response

We need to turn this into a string so that we can extract our target URL.

Step 6b: Turn r into a string

Here’s how we turn the value of r into a string:

r_string = String(r)

Let’s put that into our script:

...

for url in url_array
  try
    r = HTTP.head(url, redirect=false)
    status = r.status
    println(status)
    r_string = String(r)
  catch
    println("Something went wrong")
  end
end

Let’s take a look at what r_string looks like.

When we input r_string into the terminal, for https://httpstat.us/301, it looks something like this:

"HTTP/1.1 301 Moved Permanently\r\nContent-Length: 21\r\nContent-Type: text/plain\r\nDate: Thu, 25 May 2023 12:35:54 GMT\r\nServer: Kestrel\r\nLocation: https://httpstat.us\r\nSet-Cookie: ARRAffinity=(truncated)

Now, because we’re dealing with a string and not an HTTP.Messages.Response, we can use Julia’s replace() function to get rid of everything except the target URL. We’ll do that next.

Step 6c: Remove extraneous text from r_string

Now it’s time to remove the excess text from our newly created string, r_string.

The first thing we want to do is get rid of all the \n (newline) characters:

r_string = replace(r_string, "\n" => "")

Now our r_string variable looks like this:

"HTTP/1.1 301 Moved Permanently\rContent-Length: 21\rContent-Type: text/plain\rDate: Thu, 25 May 2023 12:35:54 GMT\rServer: Kestrel\rLocation: https://httpstat.us\rSet-Cookie: ARRAffinity=(truncated)

We needed to get rid of those \n newline characters, or else the rest of our forthcoming find-and-replace code won’t work properly. That’s because in the regex flavour used by Julia (PCRE), the dot in the “match all” expression .* doesn’t match \n by default, so .* stops at the end of each line. You don’t necessarily need to understand that fully; just know that we needed to delete the \n newline characters before doing the next steps.
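If you’d like to see that behaviour for yourself, here’s a tiny sketch. The sample string is made up for illustration (it isn’t a real response), and the s flag shown at the end is simply another option Julia’s regexes offer, not something the rest of this tutorial relies on.

```julia
# A made-up two-line string, just for this demonstration.
sample = "first line\nLocation: https://httpstat.us"

# By default, . does not match \n, so .* stops at the end of the first line.
default_match = match(r".*", sample).match

# With the s flag, . also matches \n, so .* runs to the end of the string.
dotall_match = match(r".*"s, sample).match

println(repr(default_match))
println(repr(dotall_match))
```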

Right, that’s that.

Now we need to delete everything up to and including Location: .

We do that like this:

r_string = replace(r_string, r".*Location: " => "")

We get:

"https://httpstat.us\rSet-Cookie: ARRAffinity=(truncated)

We’re getting there: we can see our target URL at the start of our variable. Now all we need to do is to remove everything after it. We do that by matching \r and everything that comes after it:

r_string = replace(r_string, r"\r.*" => "")

And we get what we wanted the whole time, our target URL:

"https://httpstat.us"

Okay, let’s put that into our code:

...

for url in url_array
  try
    r = HTTP.head(url, redirect=false)
    status = r.status
    # Code to get target URL below:
    r_string = String(r)
    r_string = replace(r_string, "\n" => "")
    r_string = replace(r_string, r".*Location: " => "")
    r_string = replace(r_string, r"\r.*" => "")
  catch
    println("Something went wrong")
  end
end

So, our status code lives in our variable status, and our target URL lives in our variable r_string.
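One thing to keep in mind: the three replace() calls assume the response actually contains a Location header. As a rough sketch of another way to do it, you could split the raw string into lines and look for the one that starts with "Location: ". The function name find_location and the two sample strings below are hypothetical, and I’m assuming the header is spelled exactly as in the output above.

```julia
# Hypothetical helper: return the Location value from a raw response string,
# or nothing if no line starts with "Location: ".
# Assumption: the header name is spelled exactly "Location" (same case).
function find_location(raw::AbstractString)
  prefix = "Location: "
  for line in split(raw, "\r\n")
    if startswith(line, prefix)
      # Keep everything after the prefix (the prefix is plain ASCII).
      return String(line[(length(prefix) + 1):end])
    end
  end
  return nothing
end

# Made-up sample responses, shaped like the output shown earlier.
redirect_raw = "HTTP/1.1 301 Moved Permanently\r\nServer: Kestrel\r\nLocation: https://httpstat.us\r\nSet-Cookie: ARRAffinity=abc\r\n\r\n"
ok_raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"

println(find_location(redirect_raw))
println(find_location(ok_raw))
```

Returning nothing for a response without a Location header lets the calling code decide what to write in that case, rather than ending up with leftover text in r_string.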

Okay, that was a big step. Now for the home stretch: it’s time to build our table.

Step 7: Write the CSV

If you recall, at the beginning of this section, we said we wanted a table that looks like this:

url                      status  target_url
https://httpstat.us/301  301     https://httpstat.us
https://httpstat.us/302  302     https://httpstat.us

Well, now we have everything we need. We just need to write our url, status and r_string (target URL) variables to a CSV file, which gives us our table.

Let’s amend our code accordingly. Here’s the full finished script, with commentary below:

using HTTP

url_array = [
  "https://example.com",
  "https://httpstat.us/301",
  "https://httpstat.us/302"
]

open("/Users/ron/Desktop/url-statuses.csv", "a") do io
  write(io, "url,status,target_url")
end

for url in url_array
  try
    r = HTTP.head(url, redirect=false)
    status = r.status
    # Code to get target URL below:
    r_string = String(r)
    r_string = replace(r_string, "\n" => "")
    r_string = replace(r_string, r".*Location: " => "")
    r_string = replace(r_string, r"\r.*" => "")
    open("/Users/ron/Desktop/url-statuses.csv", "a") do io
      write(io, "\n$url,$status,$r_string")
    end
  catch
    open("/Users/ron/Desktop/url-statuses.csv", "a") do io
      write(io, "\n$url,unknown,unknown")
    end
  end
end

Let’s go through the changes one at a time:

open("/Users/ron/Desktop/url-statuses.csv", "a") do io
  write(io, "url,status,target_url")
end

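This is where we write the column headers (url,status,target_url) into a new CSV we’re calling url-statuses.csv. Note that the "a" mode appends to the file if it already exists, so if you run the script more than once, delete the old CSV first or you’ll end up with a second header row.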

Because we only do this once, this code goes outside our for loop.
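If you’d rather have the script start from a clean file on every run, one option is to open the file in "w" (write) mode for the header and keep "a" (append) mode for the rows. Here’s a minimal sketch of that idea. I’m assuming a throwaway temporary directory instead of the Desktop path, and the function name write_report is just for illustration.

```julia
# Assumption: a temporary directory stands in for the Desktop path used above.
csv_path = joinpath(mktempdir(), "url-statuses.csv")

# Hypothetical helper that writes the header plus one sample row.
function write_report(path)
  # "w" truncates any existing file, so the header appears once per run.
  open(path, "w") do io
    write(io, "url,status,target_url")
  end
  # "a" appends, which is what we want for each data row.
  open(path, "a") do io
    write(io, "\nhttps://httpstat.us/301,301,https://httpstat.us")
  end
end

# Run it twice to mimic running the script twice.
write_report(csv_path)
write_report(csv_path)

lines = readlines(csv_path)
println(lines)
```

Even after two runs, the file holds a single header row and a single data row, because the "w" open at the start of each run wipes the previous contents.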

Then, inside our for loop, in our try block, we write the url, the status code and the target URL to our CSV:

open("/Users/ron/Desktop/url-statuses.csv", "a") do io
  write(io, "\n$url,$status,$r_string")
end

And finally, in our catch block, we write some fallback values (unknown,unknown) if our script can’t get the status code and target URL:

open("/Users/ron/Desktop/url-statuses.csv", "a") do io
  write(io, "\n$url,unknown,unknown")
end

That’s it! That was a big exercise—great job!

Of course, we could do more, such as writing in cases for when the URL doesn’t redirect (i.e. when it has a status code of 200, 404, and so on), but I didn’t want to make this tutorial too long!
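To give you a starting point, here’s a rough sketch of one such extension: a function that builds a CSV row and leaves the target_url column blank unless the status code is in the 3xx range. The name csv_row and the use of nothing to mean “no target URL found” are my own assumptions, not something from the script above.

```julia
# Hypothetical helper: build one CSV row as a string.
# The target_url column is left empty unless the status code is 3xx
# and a location value was actually supplied.
function csv_row(url, status, location)
  has_target = 300 <= status <= 399 && location !== nothing
  target = has_target ? location : ""
  return "$url,$status,$target"
end

println(csv_row("https://httpstat.us/301", 301, "https://httpstat.us"))
println(csv_row("https://example.com", 200, nothing))
```

You could then write "\n" followed by the returned row to the CSV in place of the interpolated string in the try block.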

Crawling just the headers

If you’re not sure what headers are, check out the box below.

Worked example

Let’s fetch the headers from example.com.

As mentioned earlier, because we aren’t fetching the whole page, we don’t need to use HTTP.get(), which fetches the headers and all the HTML.

Instead, we can save bandwidth by using HTTP.head(), which fetches only the headers.

Okay, moving on. First we fetch the page:

r = HTTP.head("https://example.com/")

… and then we read the headers from the response:

headers = r.headers

This is the output I got:

12-element Array{Pair{SubString{String},SubString{String}},1}:
  "Accept-Ranges" => "bytes"
            "Age" => "390198"
  "Cache-Control" => "max-age=604800"
   "Content-Type" => "text/html; charset=UTF-8"
           "Date" => "Sat, 01 Aug 2020 03:25:23 GMT"
           "Etag" => "\"3147526947\""
        "Expires" => "Sat, 08 Aug 2020 03:25:23 GMT"
  "Last-Modified" => "Thu, 17 Oct 2019 07:18:26 GMT"
         "Server" => "ECS (sjc/4E74)"
           "Vary" => "Accept-Encoding"
        "X-Cache" => "HIT"
 "Content-Length" => "1256"

Note that some or all of the dates in your output may be different.

Extracting the relevant headers from the Julia array by index

You can then pull the relevant key-value pair(s) from this array as needed. For instance, if you want the Last-Modified header (which is the eighth element in the headers array we created), you could do this:

last_modified = headers[8]

… and you’d get this:

"Last-Modified" => "Thu, 17 Oct 2019 07:18:26 GMT"

Or if you want just the value, you could do this:

last_modified = headers[8][2]

… and you’d get this:

"Thu, 17 Oct 2019 07:18:26 GMT"

Extracting the relevant headers from the Julia array by key name

It’s probably a better idea to extract the value of a given header by its name (e.g. "Last-Modified") rather than its position in the array.

For example, if you’re scraping multiple web pages, some might have 11-element header arrays, and some might have 9, 10, 12, and so on, so "Last-Modified" won’t always be the eighth element.

Here’s how to get the value of a given header (we’ll work with "Last-Modified") by name.

Using our same headers array we created above:

for header in headers
  if header[1] == "Last-Modified"
    println(header[2])
  end
end

Above, we iterate over each header in the headers array (for header in headers).

Then we say that if the header key (name) is “Last-Modified” (if header[1] == "Last-Modified"), then print the value of that header (println(header[2])). Of course, you could save it to another array or a dataframe; I’ve just provided you with a simple example above.
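One caveat with the loop above is that it compares names exactly, whereas HTTP header names are case-insensitive, so a server could just as well send "last-modified". Here’s a sketch of a lookup that ignores case and returns the value instead of printing it. The function name header_value is hypothetical, and the headers array is a shortened copy of the output shown earlier.

```julia
# Sample data copied from the output above (shortened to three headers).
headers = [
  "Content-Type" => "text/html; charset=UTF-8",
  "Last-Modified" => "Thu, 17 Oct 2019 07:18:26 GMT",
  "Content-Length" => "1256",
]

# Hypothetical helper: return the value of the first header whose name
# matches, ignoring case, or nothing if there is no such header.
function header_value(headers, name)
  for (key, value) in headers
    if lowercase(key) == lowercase(name)
      return value
    end
  end
  return nothing
end

println(header_value(headers, "last-modified"))
println(header_value(headers, "X-Missing"))
```

Returning nothing for a missing header makes it easy to check whether a page sent that header at all before you store the value.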
