Service Growth: Implementing the shortener // Work Dad Dev

Now that we have a storage API that abstracts away the database, we can implement the URL Shortener business logic. The idea is that all the semantics of the url shortener should be done at this layer, leaving upper layers the task to present the service to the user (web api / website / command line interface).

Creating a short name

The core functionality of an URL Shortener is to create a short name and store the mapping between it and an URL. But how long should that name be? We want it to be short enough so that it makes sense to use it but it can’t be too short or we won’t have enough names for URLs.

With that in mind, I chose to create short names as base64 encoded 6 bytes (48 bits), resulting in an 8 character string for every short name with the 6 bytes being read from a cryptographically secure random number generator.

With 6 bytes we can create 281 trillion different names, which should be more than enough. And by having the name be a fixed size, all our short urls will always have the same number of characters. The code to generate the name is very simple, thanks to Go’s crypto/rand and base64 packages.

Name clashing

While 48 bits gives us trillions of possible names, there is still the chance that we might randomly generate the same name. The odds of this happening increases as we create more names and surprisingly, due to the birthday paradox, we get a 50% chance of name clashing at around 20 million (24 bits) names generated. To account for this, we need to generate a new name in case there is a clash.

The database schema gives us the means to identify this issue: I added an unique index to the short name. If we try to insert the same short name twice, a constraint violation is triggered and the insert fails. The code for creating and storing the name is run in a loop with a set number of retries in case we trigger a violation.

URL Validation

Shortened URLs can be a source of abuse because they mask the original URL. Attackers can disguise their intent using the shortener (http://sho.rt/ruwklD93k vs http://www.bad-bank.com) and trick users to e.g download malware, enter their bank credentials in a phishing site, steal sensitive cookies (XSS/CSRF attacks) among other abuses. A production-ready URL Shortener must handle abusive URLs and this is the correct layer to handle this.

But, anti-abuse is its own domain and there are a multitude of strategies to adopt in order to remove as many sources of abuse as possible. The topic deserves an article on its own so for now the only validation that we will do is to check if the URL string is empty or not and if it is less than 8k characters.

The limit was chosen to give a large enough room for most URLs while allowing me to calculate space usage. In the worst case I would need 8200 bytes to store a mapping, giving me ~130,000 mappings per gigabyte. Typical URLs will be much smaller, in the range of a few dozens to a few hundred characters, so we are looking at 1.3 - 10 million mappings per gigabyte.

Putting it all together

We are now ready to create the APIs:

package service

type Shortener struct {
  // ...
}

func New(/*...*/) *Shortener {
  // ...
}

// ShortNameFor stores and returns a short name for the given long.
func (s *Shortener) ShortNameFor(
    ctx context.Context,
    long string,
) (string, error) {
    // ...
}


// LongFrom returns the long name for the given short.
func (s *Shortener) LongFrom(
    ctx context.Context,
    short string,
) (string, error) {
    // ...
}

The service implements 2 methods: ShortNameFor and LongFrom. The former creates the short name, stores the mapping and returns it and the latter retrieves the long name given the shortname.

The Shortner struct

The service depends on the storage API, not the underlying database. This is evidenced by the fields of the struct:

type Shortener struct {
	queries     *storage.Queries
	randSource  io.Reader
	existsRetry int
}

We have a dependency on *storage.Queries, which is the struct that sqlc generated to add the methods we created from SQL queries. randSource is a source of random bytes which we use to generate the random name and existsRetry is the number of times we can retry in case we have a name clash.

Notice also that I depend on the concrete type *storage.Queries, not an interface and there are a couple of reasons I do this. The first one is SQLite; as I mentioned before I can create an in-memory version of the database and so it is easy to test my storage API backed by the SQLite database. This has the added benefit that I don’t need to mock or create and maintain a fake for my database.

The second and more Go-related reason is that while I do depend on the concrete type, strictly speaking I only depend on the method calls. Go makes it easy to swap out the concrete type and introduce an interface in its place if I need or want to do so. This allows me to delay this decision and only create the interface if I actually need it e.g I want to add extra logic to the storage.SetShort / storage.GetLong api calls.

Create, store and retrieve

Most of the work is done when creating and storing the URL short name. The ShortNameFrom implementation is straightforward:

Validate the URL.
Create a short name.
Call storage.SetShort with the short name and the URL.
If there are no errors, return the short name.
If storage.SetShort fails because of a name clash, retry from (2) until we succeed or we exhaust retries.
For any other errors, return them

The LongFrom method just checks if the short name has exactly 8 characters and then performs the storage.GetLong call to retrieve the URL. If there is no name stored, we just return a not found error.

Both methods have extensive testing to make sure we cover all relevant scenarios.

Benchmarking

One of the best things about Go is its tooling. In particular, creating benchmarks for your code is very easy and because of that I tend to add benchmarks to most of my code, giving me a number to look at if I ever need optimize it. Creating a benchmark in Go is as easy as creating a test:

func BenchmarkCode(b *testing.B) {
  for i := 0; i < b.N; i++ {
    // Run the code you want to benchmark.
  }
}

The loop using the b.N variable is the main thing to remember. b.N is the number of iterations and it is set by the go benchmark tool. It changes dynamically based on how fast the code runs, and when the benchmark is over the number of iterations is reported, along with the time the code takes to run.

Let’s look at a benchmark for the GenerateShort function:

package service_test

import (
  // ...
  "crypto/rand"
  "testing"
)
var globalShort string

func BenchmarkGenerateShort(b *testing.B) {
	var localShort string
	var err error

	for i := 0; i < b.N; i++ {
		localShort, err = service.GenerateShort(rand.Reader)
		if err != nil {
			panic(err)
		}
	}
	globalShort = localShort
}

The benchmark just calls the GenerateShort function in a b.N loop and checks for errors. The assignment to a global variable is to ensure that the return value from GenerateShort doesn’t get optimized away. The results on my chromebook are as follows:

goos: linux
goarch: amd64
pkg: github.com/avalonbits/shortener/service
cpu: 11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00GHz
BenchmarkGenerateShort-4         2550715               452.0 ns/op            24 B/op          2 allocs/op

It takes only 0.5 microseconds to generate a short name! You can run the other benchmarks as well, which measure how long it takes for ShortNameFor and LongFrom to perform their actions using an in-memory SQLite database. On my chromebook they take ~6 microseconds.

We have a service

With this code we now have a package that fully implements the core functionality of an URL Shortener, with proper tests and benchmarks. The package can be used to store short names for any string, not just URLs.

In the next post we will expose this service as a web app and enforce that only URLs are being shortened.