How to encode URLs using Julia (the proper way)

There is a Julia package that will encode strings for you, but it doesn’t handle url encoding properly. This code fixes that.

Ron Erdos
Updated January 22, 2024
Tested with Julia version 1.9.4

The goal

Last week at work, I needed to encode some urls that used non-Latin characters.

One of the languages was Traditional Chinese.

A (made-up) example is:

https://example.com/例子

When faced with this task, I of course turned to Julia.

I learned Julia can encode strings but it doesn’t—as far as I can see—handle url encoding perfectly.

So I wrote a script, which you can find below, which picks up where Julia and the current package ecosystem leaves off.

Let’s see what Julia does out of the box. We’re going to use the best out-of-the-box solution I could find, the HTTP package:

using HTTP
url = "https://example.com/例子"
encoded_url = HTTP.escapeuri(url)

We get the following percent-encoded result:

"https%3A%2F%2Fexample.com%2F%E4%BE%8B%E5%AD%90"

You can even decode it to check:

encoded_url = "https%3A%2F%2Fexample.com%2F%E4%BE%8B%E5%AD%90"
decoded_url = HTTP.unescapeuri(encoded_url)

We get:

https://example.com/例子

… which was our original url.

The problem with that out-of-the-box solution

However, when it comes to URLs, we don’t want to encode the whole thing.

We actually want our end result to look like this:

https://example.com/%E4%BE%8B%E5%AD%90

… where the colon and forward slashes are written in the normal way (decoded) and the Traditional Chinese characters are percent-encoded.

This is because, when it comes to url encoding, there are two groups of characters that we want to remain decoded (i.e. written out normally).

Let’s cover that in the next two sections.

Valid characters: The first group of characters that should remain decoded when encoding urls

The first group of characters that should remain decoded when encoding urls are known as “valid characters”.

They are:

A-Z (uppercase alphabetical)

a-z (lowercase alphabetical)

0-9 (numbers 0 to 9)

- (hyphen a.k.a. dash)

_ (underscore)

. (full stop or period)

~ (tilde)

Let’s try an example url that contains all of these to see what the HTTP Julia package does:

using HTTP
url = "https://example.com/Aa123-_~"
encoded_url = HTTP.escapeuri(url)

We get:

"https%3A%2F%2Fexample.com%2FAa123-_%7E"

… when what we want is:

"https://example.com/Aa123-_~"

Julia’s HTTP package encoder keeps the full stop (period), the letters, numbers, hyphen and underscore decoded, but it encodes the forward slashes and the tilde.

Before we code up our fix for this, let’s look at the second group of characters that should remain decoded when encoding urls.

Reserved characters: The second group of characters that should (usually) remain decoded when encoding urls

This group of characters should also remain decoded when encoding urls, but only when they are used for their traditional url purposes.

For example, a colon should go after http or https, a question mark should lead the first parameter, and so on. Basically if you use these characters conventionally in urls, they should be decoded.

Here are the reserved characters when url encoding, followed by commentary and examples:

Colon `:`

Found after https and also in some other cases.

Forward slash `/`

A double forward slash comes after https:

Single forward slashes divide subfolders a.k.a. subdirectories.

Question mark `?`

Found before the first parameter in a url (if present).

Example: https://example.com/?post=123

Equals sign `=`

Found between a parameter and its value.

Example: https://example.com/?post=123

Ampersand `&`

Found before the second parameter and every parameter from then on (if present).

Example: https://example.com/?size=11&color=green

Hash or pound sign `#`

Found when using anchor links.

Example: https://example.com/?post=123#read-this-part

Plus sign `+`

Sometimes replaces spaces in words.

Example: https://example.com/pretty+url

Note: I prefer having the web application remove spaces and replace them with dashes. This is better from an SEO and readability point of view.

Percent sign `%`

The percent sign is a reserved character when it’s been used in percent encoding.

Example: https://example.com/foo%20%bar

Above, the %20% is the encoding of a space; but once again, I prefer to have the web application remove spaces and replace them with dashes.

“At” symbol `@`

I haven’t used this, but it’s for authentication details (username, password) on a protocol level.

Other reserved characters

And the final reserved characters, presented without explanation or examples, are:

[ ] ! $ ( ) * , ;

Okay, so what’s the solution?

So we need to use Julia’s HTTP package for encoding, but we’re only going to restrict it from encoding characters that should remain decoded.

Now, some of these valid characters and reserved characters are already skipped by the Julia HTTP package, but I don’t want my script to break in the future if the HTTP package maintainers change the rules.

Dollar signs and urls

In my experience, it’s rare for urls to contain dollar signs, however, they are allowed.

The problem is, Julia uses the dollar sign for string interpolation:

lunch = "meat"
println("I am having $lunch for lunch today.")

Running the code above, we get: I am having meat for lunch today.

It’s for this reason that my script below doesn’t handle dollar signs.

I would manually check the urls you want to encode for dollar signs before running the solution script below. I may add dollar sign handling later.

My script

Here’s my script; scroll horizontally to see the full thing:

using HTTP

url = "https://example.com/例子"
# made up example with Traditional Chinese script

url_split = split(url, "")
# The pair of double quotes splits the string at each character

encoded_url_array = String[]
# create empty array so we can reconstruct the encoded url

for char in url_split
	
	if char ∉ [":", "/", "?", "=", "&", "#", "+", "%", "@", "[", "]", "!", "(", ")", "*", ",", ";", "-", "_", ".", "~"]
	# The ∉ character means "not in"
	# You type it by typing \notin and then pressing tab
		
		encoded_char = HTTP.escapeuri(char)
		# If the character is not in the array above, encode it ...
		
		push!(encoded_url_array, encoded_char)
		# ... and push it to our array
		# NB: alphanumeric characters remain decoded by default
	
	else
		push!(encoded_url_array, char)
		# But if the character is in the array above
		# keep it decoded and push it to our array
	end
end

encoded_url = join(encoded_url_array)
# Re-join all our characters into a single string

println(encoded_url)

Of course, you can have that script iterate over a list of urls and also write the results to a file, rather than just printing to the terminal. All the best!

How to encode URLs using Julia (the proper way)

The goal

The problem with that out-of-the-box solution

Valid characters: The first group of characters that should remain decoded when encoding urls

Reserved characters: The second group of characters that should (usually) remain decoded when encoding urls

Colon :

Forward slash /

Question mark ?

Equals sign =

Ampersand &

Hash or pound sign #

Plus sign +

Percent sign %

“At” symbol @