Good feature engineering is crucial in improving model accuracy. Depending on the context, it can be useful to create and test out some unconventional features that provides your model a different perspective without being too correlated to other features that are already being used.

Lets say you are attempting to model some time-series data that involves the notion of queues/waiting to be served. Typically, applying some queueing theory models might help but queueing theory has a lot of assumptions that makes it hard to use in real-world scenarios. Perhaps the server has some preferences for certain characteristics of a queuing-node such that it might choose to serve a node that arrived later, first. This then breaks the first-in-first-out rule. Thus, how might we numerically measure the adherence to FIFO?

Examples

#1 This is an example of perfect FIFO:

Arrivals:  ['A', 'B', 'C', 'D', 'E', 'F']
Departure: ['A', 'B', 'C', 'D', 'E', 'F']

           ------------------------------> Time

#2 Here, C is served before A and B:

Arrivals:  ['A', 'B', 'C', 'D', 'E', 'F']
Departure: ['C', 'B', 'A', 'D', 'E', 'F']

           ------------------------------> Time

#3 Here, C is served before A and B but B is served after A:

Arrivals:  ['A', 'B', 'C', 'D', 'E', 'F']
Departure: ['C', 'A', 'B', 'D', 'E', 'F']

           ------------------------------> Time

In my opinion, example #2 is worse than #3 in terms of FIFO-ness as 2 items are out of order compared to just 1 in example #3. We will use these examples to evaluate the goodness of the FIFO metric functions.

I have explored 5 different ways of calculating FIFO-ness and I will be discussing the pros and cons of each method.

Levenshtein Distance (Edit Distance)

Levenshtein Distance, also known as Edit Distance, is a measure of the difference between two sequences of characters. It is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one sequence into the other. This measure can be used to compare the similarity of two strings or to evaluate the adherence to FIFO in a queueing process. However, we will not be dealing with insertions or deletions as the number of arrivals and departure will be the same.

By using the textdistance package we can calculate the levenshtein distance as see that we example #2 and #3 would give the same score.

arrivals = ['A', 'B', 'C', 'D', 'E', 'F']
dep1 = ['C', 'B', 'A', 'D', 'E', 'F']
dep2 = ['C', 'A', 'B', 'D', 'E', 'F']

textdistance.levenshtein.normalized_similarity(arrivals, dep1)
>>> 0.6666666666666667
textdistance.levenshtein.normalized_similarity(arrivals, dep2)
>>> 0.6666666666666667

While these results might be suitable for some cases, I believe that levenshtein distance does not fully capture the idea of FIFO well.

Index distance

This is method simply makes use of the differences in the indices of a particular item in the arrival and departure arrays to calculate the FIFO-ness. It is possible to take the absolute difference or the squared difference. Using the squared distance heavily penalises cases where a node was able to skip ahead of many other nodes to be served.

Here is an example of squared index distance:

def fifo_rating_squared_index_distance(arrival_order, departure_order):
    """
    This function calculates the squared index distance between the 
		arrivals order and the departure order.
    """
		n = len(waiting_order)
    assert set(arrival_order) == set(departure_order)
		assert len(arrival_order) == len(departure_order)
    distance = 0
    
    for idx, imo in enumerate(arrival_order):
        distance += (idx - departure_order.index(imo))**2

    return distance / n

Delay based distance

Sometimes, simply using the order or indices of the arrivals and departure may not capture the FIFO-ness. If the exact arrival and departure time is available, then we can use the time difference to calculate the FIFO-ness.

def fifo_rating_delay_based(arrival_order, departure_order, departure_time):
    """
    This function calculates the average "delay" experienced by a node due 
		to other nodes being served before it.
    """
    n = len(arrival_order)
    assert set(arrival_order) == set(departure_order)
    total_delay = 0
    # iterate over the arrival order
    for i, node in enumerate(arrival_order):
        if node != departure_order[i]:
            this_node_departure_position = departure_order.index(node)
            # iterate over all node which started waiting after the current node
            for j in range(i+1, n):
                node_wait_after = arrival_order[j]
                # compare the departure orders of the current node and all the nodes that started 
								# waiting after the current node
                node_wait_after_departure_position = departure_order.index(node_wait_after)
                if node_wait_after_departure_position < this_node_departure_position:
                    # this means there is a node that started waiting after this node
                    # but departed before this node
                    delay = (departure_time[this_node_departure_position] - departure_time[node_wait_after_departure_position]).seconds / 3600
                    total_delay += delay
    # the total delay is the total delay for all nodes compared to all nodes that started waiting after
    # that's why we divide by the total n
    final_metric = total_delay / n

    return final_metric

Sort-ratio algorithm

The sort-ratio algorithm takes a different approach to the other methods. It since we know that if FIFO is obeyed, then the departure order must be the same as the arrival order, we can measure how “disorderly” the departure order is by tracking the number of swaps need to sort the array.

Particularly, I have chosen to go with the most optimized version of the bubble sort algorithm. It is important that we do not use the basic implementation because the number of comparisons will always be O(n^2). The optimized version avoids any unnecessary comparisons.

I found the sort-ration algorithm to perform the best in terms of feature importances to my model. I think it is best to try out all the methods listed and run SHAP to understand the impact each has on the model.

def bubble_sort_optimized(array):
    """
    Helper function for fifo_rating_sort_ratio
    """
    comparisons = 1
    swaps = 0
    swapped = True
    n = len(array)
    while swapped:
        swapped = False
        new_n = 0
        for j in range(n - 1):
            comparisons += 1
            if array[j] > array[j + 1]:
                array[j], array[j+1] = array[j+1], array[j]
                swapped = True
                swaps += 1 
                new_n = j + 1
        n = new_n
    return comparisons, swaps

def fifo_rating_sort_ratio(arrival_order, departure_order):
    """
    This function calculates the ratio of the number of swaps needed to make the mooring order 
    the same as the waiting order, to the number of comparisons needed to sort the mooring order.
    """
    n = len(arrival_order)
    assert set(arrival_order) == set(departure_order)
		assert len(arrival_order) == len(departure_order)
    index_array = [arrival_order.index(node) for node in departure_order]
    # comparisons are the total number of pair comparisons needed to sort the departure order according to the
    # the bubble sort algorithm
    comparisons, swaps = bubble_sort_optimized(index_array)
    return (swaps/comparisons)