How to encode URLs using Julia (the proper way)
There is a Julia package that will encode strings for you, but it doesn’t handle url encoding properly. This code fixes that.
Updated January 22, 2024
Tested with Julia version 1.9.4
The goal
Last week at work, I needed to encode some urls that used non-Latin characters.
One of the languages was Traditional Chinese.
A (made-up) example is:
https://example.com/例子
When faced with this task, I of course turned to Julia.
I learned that Julia can encode strings but, as far as I can see, it doesn’t handle url encoding perfectly.
So I wrote a script, which you can find below, that picks up where Julia and the current package ecosystem leave off.
Let’s see what Julia does out of the box. We’re going to use the best out-of-the-box solution I could find, the HTTP package:
using HTTP
url = "https://example.com/例子"
encoded_url = HTTP.escapeuri(url)
We get the following percent-encoded result:
"https%3A%2F%2Fexample.com%2F%E4%BE%8B%E5%AD%90"
You can even decode it to check:
encoded_url = "https%3A%2F%2Fexample.com%2F%E4%BE%8B%E5%AD%90"
decoded_url = HTTP.unescapeuri(encoded_url)
We get:
https://example.com/例子
… which was our original url.
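To see where those percent codes come from, here is a minimal sketch that uses only Julia's standard library (no HTTP package). It relies on the fact that a Julia String is stored as UTF-8, and it writes each byte as a percent sign plus two hexadecimal digits. The variable names are my own, chosen for illustration.

```julia
# Percent-encode every byte of a string by hand (illustration only).
text = "例子"

# A Julia String is UTF-8, so codeunits gives the raw bytes.
bytes = codeunits(text)

# Each byte becomes "%" followed by its two-digit uppercase hex value.
encoded = join("%" * uppercase(string(b, base = 16, pad = 2)) for b in bytes)

println(encoded)
```

Each of the two characters takes three bytes in UTF-8, which is why the encoded path ends in six percent codes.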
The problem with that out-of-the-box solution
However, when it comes to URLs, we don’t want to encode the whole thing.
We actually want our end result to look like this:
https://example.com/%E4%BE%8B%E5%AD%90
… where the colon and forward slashes are written in the normal way (decoded) and the Traditional Chinese characters are percent-encoded.
This is because, when it comes to url encoding, there are two groups of characters that we want to remain decoded (i.e. written out normally).
Let’s cover that in the next two sections.
Valid characters: The first group of characters that should remain decoded when encoding urls
The first group of characters that should remain decoded when encoding urls are known as “valid characters” (RFC 3986, the URI standard, calls them “unreserved characters”).
They are:
A-Z (uppercase alphabetical)
a-z (lowercase alphabetical)
0-9 (numbers 0 to 9)
- (hyphen a.k.a. dash)
_ (underscore)
. (full stop or period)
~ (tilde)
Let’s try an example url that contains all of these to see what the HTTP Julia package does:
using HTTP
url = "https://example.com/Aa123-_~"
encoded_url = HTTP.escapeuri(url)
We get:
"https%3A%2F%2Fexample.com%2FAa123-_%7E"
… when what we want is:
"https://example.com/Aa123-_~"
Julia’s HTTP package encoder keeps the full stop (period), the letters, numbers, hyphen and underscore decoded, but it encodes the colon, the forward slashes and the tilde.
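As a rough sketch, the first group of characters can be written as a small predicate using only Julia's standard library. The function name is_unreserved is my own (hypothetical) choice, not something from the HTTP package.

```julia
# Hypothetical helper: true for the "valid characters" listed above
# (ASCII letters, ASCII digits, hyphen, underscore, full stop, tilde).
function is_unreserved(c::Char)
    return isascii(c) && (isletter(c) || isdigit(c) || c in ('-', '_', '.', '~'))
end

sample = "Aa123-_.~"
println(all(is_unreserved, sample))   # checks every character in the sample
println(is_unreserved('/'))           # the forward slash is not in this first group
println(is_unreserved('例'))          # non-ASCII characters are not in it either
```

The isascii check matters because isletter on its own is also true for non-Latin letters, which are exactly the characters we want to encode.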
Before we code up our fix for this, let’s look at the second group of characters that should remain decoded when encoding urls.
Reserved characters: The second group of characters that should (usually) remain decoded when encoding urls
This group of characters should also remain decoded when encoding urls, but only when they are used for their traditional url purposes.
For example, a colon should go after http or https, a question mark should lead the first parameter, and so on. Basically, if you use these characters conventionally in urls, they should remain decoded.
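The flip side is that a reserved character used as ordinary data should be encoded. Here is a minimal sketch of that case, assuming a made-up parameter value that contains an ampersand. The helper encode_component and the parameter names are hypothetical, and the sketch uses only Julia's standard library.

```julia
# Hypothetical helper: percent-encode every byte except ASCII letters,
# ASCII digits, hyphen, underscore, full stop and tilde.
function encode_component(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if b < 0x80 && (isletter(c) || isdigit(c) || c in ('-', '_', '.', '~'))
            print(io, c)
        else
            print(io, '%', uppercase(string(b, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

# Made-up parameters: the "&" inside the value is data, not a separator.
params = ["q" => "salt & pepper", "lang" => "zh"]

# Encode each name and value fully, then join them with the real separators.
query = join((encode_component(k) * "=" * encode_component(v) for (k, v) in params), "&")
url = "https://example.com/?" * query
println(url)
```

Only the ampersand that separates the two parameters stays decoded; the one inside the value is percent-encoded so it cannot be mistaken for a separator.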
Here are the reserved characters when url encoding, followed by commentary and examples:
Colon :
Found after https and also in some other cases.
Forward slash /
A double forward slash comes after https:
Single forward slashes divide subfolders a.k.a. subdirectories.
Question mark ?
Found before the first parameter in a url (if present).
Example: https://example.com/?post=123
Equals sign =
Found between a parameter and its value.
Example: https://example.com/?post=123
Ampersand &
Found before the second parameter and every parameter from then on (if present).
Example: https://example.com/?size=11&color=green
Hash or pound sign #
Found when using anchor links.
Example: https://example.com/?post=123#read-this-part
Plus sign +
Sometimes replaces spaces in words.
Example: https://example.com/pretty+url
Note: I prefer having the web application remove spaces and replace them with dashes. This is better from an SEO and readability point of view.
Percent sign %
The percent sign is a reserved character when it’s been used in percent encoding.
Example: https://example.com/foo%20bar
Above, the %20 is the encoding of a space; but once again, I prefer to have the web application remove spaces and replace them with dashes.
“At” symbol @
I haven’t used this, but it’s for authentication details (username, password) on a protocol level.
Other reserved characters
And the final reserved characters, presented without explanation or examples, are:
[ ] ! $ ( ) * , ;
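As a quick illustration, the reserved characters from this section can be collected into a set and used to pick out the structural characters of a url. This is only a sketch using Julia's standard library; the variable names and the example url are my own.

```julia
# The reserved characters from this section, as a set of Char.
# "\$" is how a literal dollar sign is written inside a Julia string.
reserved = Set(":/?=&#+%@[]!\$()*,;")

url = "https://example.com/?size=11&color=green#top"

# Keep only the characters of the url that are in the reserved set.
found = filter(in(reserved), url)
println(found)
```

Everything that filter returns here is doing a structural job in the url, so none of it should be percent-encoded.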
Okay, so what’s the solution?
So we’ll still use Julia’s HTTP package for encoding, but we’ll stop it from encoding the characters that should remain decoded.
Now, some of these valid characters and reserved characters are already skipped by the Julia HTTP package, but I don’t want my script to break in the future if the HTTP package maintainers change the rules.
My script
Here’s my script:
using HTTP
# Made-up example with Traditional Chinese script
url = "https://example.com/例子"
# The pair of double quotes splits the string at each character
url_split = split(url, "")
# Create an empty array so we can reconstruct the encoded url
encoded_url_array = String[]
for char in url_split
    # The ∉ character means "not in"
    # You type it by typing \notin and then pressing tab
    # "\$" is a literal dollar sign (a bare $ would start string interpolation)
    if char ∉ [":", "/", "?", "=", "&", "#", "+", "%", "@", "[", "]", "!", "\$", "(", ")", "*", ",", ";", "-", "_", ".", "~"]
        # If the character is not in the array above, encode it ...
        encoded_char = HTTP.escapeuri(char)
        # ... and push it to our array
        # NB: alphanumeric characters remain decoded by default
        push!(encoded_url_array, encoded_char)
    else
        # But if the character is in the array above,
        # keep it decoded and push it to our array
        push!(encoded_url_array, char)
    end
end
# Re-join all our characters into a single string
encoded_url = join(encoded_url_array)
println(encoded_url)
Of course, you can have that script iterate over a list of urls and also write the results to a file, rather than just printing to the terminal. All the best!
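Here is one way that could look, as a hedged sketch rather than a finished tool. To keep it free of package dependencies it percent-encodes the UTF-8 bytes by hand instead of calling the HTTP package; the helper name encode_url, the variable names, the example list and the output file name are all my own assumptions.

```julia
# Characters to leave as they are: the reserved and valid characters from this article.
keep_chars = Set(":/?=&#+%@[]!\$()*,;-_.~")

# Hypothetical helper: percent-encode the UTF-8 bytes of every character
# that is not an ASCII letter, an ASCII digit, or in keep_chars.
function encode_url(url::AbstractString)
    io = IOBuffer()
    for c in url
        if (isascii(c) && (isletter(c) || isdigit(c))) || c in keep_chars
            print(io, c)
        else
            for b in codeunits(string(c))
                print(io, '%', uppercase(string(b, base = 16, pad = 2)))
            end
        end
    end
    return String(take!(io))
end

# Made-up list of urls to encode.
urls = ["https://example.com/例子", "https://example.com/Aa123-_~"]
encoded_urls = encode_url.(urls)

# Write one encoded url per line to a file in a temporary directory.
output_path = joinpath(mktempdir(), "encoded_urls.txt")
open(output_path, "w") do io
    for u in encoded_urls
        println(io, u)
    end
end

println(read(output_path, String))
```

Because the percent sign is in the keep list, a url that is already percent-encoded passes through unchanged instead of being encoded a second time.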