Kostya on Programming

Erlang: Functions (Part 3).

Recursion gets handy when you work with complex data types. Before we move forward let me remind you some basic concepts on lists.

Erlang has a few built-in functions allowing you to manipulate lists. The main 2 are hd and tl. The first one returns you the head of the list while the second one returns you a tail. Let's illustrate it.

1> hd([1, 2, 3]).
1
2> tl([1, 2, 3]).
[2,3]
3> tl([1]).
[]

So, for those of you who are not familiar with that concept, the tail is actually the list minus the very first element. It means, if you apply that function to a list of a single element you would end up with the empty list. If you try and apply the tl function to an empty list you would get en error (so don't do this!).

We could detect an empty list by simply using the pattern matching and in most cases an empty list would be the base case for any recursion you write when dealing with lists. Knowing this we could try and illustrate our first example.

-module(lesson3).
-export([sum/1]).

sum([]) -> 0;
sum(L) -> hd(L) + sum(tl(L)).

This function returns you a sum of all numbers in the list (assuming you provide a proper homogeneous list of numbers). The first clause is a base case that terminates the recursion. The second clause simply states that the sum of numbers is actually the first element (the head!) of the list plus the sum of the tail.

While this is a valid function it's recommended to use a special pattern matching syntax on separating the head from the tail. Let's implement another version of the sum function to demonstrate it.

sum2([]) -> 0;
sum2([H|T]) -> H + sum2(T).

As you could see, the base case stays the same; the only difference is how we're dealing with the head and tail.

You could match against the _ variable as well if you, for instance, don't need the exact value of the head but rather you rely on the fact the tail is one element shorter now. This property could be used when implementing functions like the length function.

len([]) -> 0;
len([_|T]) -> 1 + len(T).

The length of the empty list is 0; otherwise we add 1 to the final result each time the head could be taken out from the list.

We could go further and pattern match the head against some real values to distinguish clauses one from another.

dna_calc([], A, T, G, C) ->
    {A, T, G, C, (G + C) / (A + T + G + C) * 100};
dna_calc([a|Tail], A, T, G, C) -> dna_calc(Tail, A + 1, T, G, C);
dna_calc([t|Tail], A, T, G, C) -> dna_calc(Tail, A, T + 1, G, C);
dna_calc([g|Tail], A, T, G, C) -> dna_calc(Tail, A, T, G + 1, C);
dna_calc([c|Tail], A, T, G, C) -> dna_calc(Tail, A, T, G, C + 1).

This is a DNA calc function that accepts a list of atoms, where each atom supposed to be either a, t, g or c, representing the 4 nucleobases forming the DNA. The 4 following parameters are the initial states of each nucleobase, each assumed to be equal 0 whenever you call that function. The result of the function is a 5-elements tuple, where the very last element is a GC-content.

It's common to write functions requiring initial state this way but it's not a good practice. Instead, it's better to make the dna_calc function private and export its unary version instead.

dna_calc(L) -> dna_calc(L, 0, 0, 0, 0).

Now we could call it and see the result.

1> lesson3:dna_calc([a, t, a, c, g, c, a, t, a, c, a, g, c, a, g, t, g, g]).
{6,3,5,4,50.0}

You could not only match against a single value but actually against a sequence of values separated by a comma. For instance, let's count the amount of the CAG sequences (3 sequential nucleobases, forming a codon).

cag_count([]) -> 0;
cag_count([c, a, g|T]) -> 1 + cag_count(T);
cag_count([_, _, _|T]) -> cag_count(T).

The interesting fact here is that we're not looking for all possible CAG combinations in the list, but actually each time skipping the exact 3 nucleobases if they don't match our pattern in the second clause.

1> lesson3:cag_count([a, t, a, c, g, c, a, g, a, c, a, g, c, a, g, t, g, g]).
2

Now let's talk about the list synthesis. You could only add a new element (or elements) to the head. Let me provide you with a few examples.

1> [0|[]].
[0]
2> [1, 2|[3, 4]].
[1,2,3,4]
3> [0|[1, 2|[5, 6]]].
[0,1,2,5,6]

In the first case, we apply an element to an empty list, so the result is a single element list. In the second case, we apply elements 1 and 2 to the list [3, 4]. The result is a list with all the elements in order. The third case demonstrates you could actually use this syntax recursively where the very inner expression is evaluated first. So, to summarize, the syntax is the same as decomposing the list using the pattern matching. The only difference here is that we're creating a list, not decomposing it.

This syntax could be generalized as [Elem1, ..., ElemN|List]. You could have as much elements separated by a comma as you need but there should be at least one element. Combined with recursion this property gives you a lot of interesting things. Let's generate a sequence of numbers.

seq(0) -> [];
seq(N) -> [N|seq(N - 1)].

This function does work for any positive integer so if we call it like lesson3:seq(3) we would get a list of 3 elements [3, 2, 1]. What's interesting here is the base case which returns you an empty list. But what if we make it some other list? Like a list we want to concatenate with? Let's implement it.

concat([], B) -> B;
concat([H|T], B) -> [H|concat(T, B)].

So, the concat function follows the same pattern as the seq function, but in the base case instead of returning an empty list we return the B variable itself. What's also interesting is that the strings are actually lists of characters. It means, our concat function would work with strings too. Let's see.

1> lesson3:concat("abc", "def").
"abcdef"
2> "abc" ++ "def".
"abcdef"
3> lesson3:concat([1, 2, 3], [4, 5, 6]).
[1,2,3,4,5,6]
4> [1, 2, 3] ++ [4, 5, 6].
[1,2,3,4,5,6]

As you could see from the previous example Erlang has a special operator that concatenates lists (strings are also lists, remember?). This operator is associative, means the order of evaluation when there are two or more such operators in the row doesn't really matter as long as the operands keep their order. That's because ([1, 2] ++ [3, 4]) ++ [5, 6] gives you the same result as [1, 2] ++ ([3, 4] ++ [5, 6]). We will discuss it in detail in further articles.

What's actually interesting about that operator is that you could use it in pattern matching. There are some limitations though.

strip_abc("abc" ++ T) -> T;
strip_abc(L) -> L.

The function strips off the abc prefix (if any) of its argument. Unfortunately, you can only do this when matching the initial symbols of a string or list. You can't, for instance, match against prefix and suffix at the same time, like this "abc" ++ T ++ "cba". This considered illegal.

The above function is in fact a short for this:

strip_abc2([$a, $b, $c|T]) -> T;
strip_abc2(L) -> L.

To me, the main disappointment is that you can't do as in the following example, because the first part of the expression should be constant.

strip_prefix(P, P ++ T) -> T;
strip_prefix(_, L) -> L.

Let's now get back to the strip_abc function. As it was mentioned above you can't really pattern match it in the way so you have both prefix and suffix and some variable in between. One of the possible solutions would be to match the prefix, reverse the string and then match against the suffix (also reversed, just in case).

We can't reverse a string using the technique from the seq and concat functions; but what we could do is to involve an additional argument that we would use as an accumulator.

reverse([], Acc) -> Acc;
reverse([H|T], Acc) -> reverse(T, [H|Acc]).

When there is no elements in the first argument (which is a list, obviously), the result is the second argument itself. Otherwise, we recursively take out the head from the first argument and combine it together with the second argument, so that if you repeat this process you would end up with the list reversed.

The second argument of the reverse function supposed to be an empty list each time calling this function. So, we utilize the same approach as we did for the dna_calc function; the binary version of the reverse function is now local, we only export its unary version.

reverse(L) -> reverse(L, []).

This approach is very common and widely used; when you have to check some preconditions or to provide some initial state you create a wrapper that's in turn calling the "main" function that forms the recursion.

We conclude this article with the new syntax and approach that helps you to create more sophisticated functions.

Consider a function that allows you to insert an element to the exact position where the element would be less than the next one in the list but greater than the previous element: so that if the element we insert is L[N] then the following equation is correct L[N-1] < L[N] < L[N+1].

verbose_insert(X, []) -> [X];
verbose_insert(X, [H|T] = L) ->
    if
        X < H -> [X|L];
        true -> [H|verbose_insert(X, T)]
    end.

This version of the insert function is kind of verbose, you'll see why. But despite of being verbose it introduces some new technique. That's for sure the use of the equal sign (=) in pattern matching. This technique allows you to represent both an association with the entire argument and some pattern at the same time. If you just need to distinguish a clause from another one or for some reason to have both the argument and its components, like in this example we're getting the tail, the head and the list itself then this technique is for you. Please keep in mind, it's always preferable to utilize the argument itself rather than reconstructing it using its components due to performance reasons.

Let's see how this function works.

1> lesson3:verbose_insert(3, [1, 2, 4, 5]).
[1,2,3,4,5]

Now, if you take a look, the second clause of the verbose_insert function contains an if-expression with 2 clauses where the first one is a boolean expression checking whether the X argument is smaller than the head (H) of the list and the second argument is just an atom true that would actually trap any other ("else") cases. This could be simplified with the use of Guards.

Unfortunately guards is one of those topics that are too big to explain in a few sentences but too small to create a dedicated article. For now I decided to explain it by examples. We would face cases with guards and I will explain each of them. I'll probably generalize the syntax of guards in future articles, especially when discussing the control flow and its connection to functions. Now let's proceed with the first example.

insert(X, []) -> [X];
insert(X, [H|_] = L) when X < H -> [X|L];
insert(X, [H|T]) -> [H|insert(X, T)].

In this version of the insert function clauses 2 and 3 have the same patterns (even though they're slightly different the main idea behind each of them is "there is a list with at least one element in it"), so there is no way to distinguish them in a usual way. Though, if you take a look the clause number 2 is now annotated with the guard expression which adds an additional layer of control and states this particular clause should be chosen if only the X argument is smaller than the head (H) of the second argument.

Now, an interesting fact here, if we repeatedly apply this function to each element of a list we would get the list sorted! Let's try and illustrate this.

ins_sort([]) -> [];
ins_sort([H|T]) -> insert(H, ins_sort(T)).

This is a functional-style version of the insertion sort. Let's call it!

1> lesson3:ins_sort([4, 2, 7, 1]).
[1,2,4,7]

That's it for now. Play with the examples and see you in the next articles where we discuss new advanced topics.

P.S.: Don't forget to inspect and download the sources.

July 24, 2020

Konstantin Duman

Erlang: Functions (Part 3).