Comparison of Corsort variants

This notebook shows how to use the corsort package to compare different variants of Corsorts. In details we tested the following possibilities:

  • Internal scorer: \(\Delta\) or \(\rho\)

  • Information measure: \(I(i)+I(j)\) or \(\max(I(i), I(j))\)

  • External scorer: \(\Delta\) or \(\rho\)

There are much more variants available. We restrain to these 8 for conciseness.

First we load some packages.

[1]:
import numpy as np
from matplotlib import pyplot as plt
from multiprocess.pool import Pool

from corsort import *

We populate the list of corsorts we want to study.

[2]:
sort_list = [
    JitCorsortDeltaSumDelta(compute_history=True),
    JitCorsortDeltaMaxDelta(compute_history=True),
    JitCorsortRhoSumDelta(compute_history=True),
    JitCorsortRhoMaxDelta(compute_history=True),
    JitCorsortRhoSumRho(compute_history=True),
    JitCorsortRhoMaxRho(compute_history=True),
    JitCorsortDeltaSumRho(compute_history=True),
    JitCorsortDeltaMaxRho(compute_history=True)]
legends = {'corsort_delta_sum_delta': r'Corsort-$\Delta\sum$ $\Delta$',
 'corsort_delta_max_delta': r'Corsort-$\Delta$Max $\Delta$',
 'corsort_rho_sum_delta': r'Corsort-$\rho\sum$ $\Delta$',
 'corsort_rho_max_delta': r'Corsort-$\rho$Max $\Delta$',
 'corsort_rho_sum_rho': r'Corsort-$\rho\sum$ $\rho$',
 'corsort_rho_max_rho': r'Corsort-$\rho$Max $\rho$',
 'corsort_delta_sum_rho': r'Corsort-$\Delta\sum$ $\rho$',
 'corsort_delta_max_rho': r'Corsort-$\Delta$Max $\rho$'}

We decide the value of \(n\) to study and the number of trials.

[3]:
n = 500
nt = 10000

We now run the experiments, with some multi-processing to speed-up things.

[4]:
with Pool() as p:
    convergence = evaluate_convergence(sort_list=sort_list, n=n, nt=nt, pool=p)
Evaluate convergence of corsort_delta_sum_delta for n = 500
100%|██████████| 10000/10000 [28:06<00:00,  5.93it/s]
Evaluate convergence of corsort_delta_max_delta for n = 500
100%|██████████| 10000/10000 [27:59<00:00,  5.96it/s]
Evaluate convergence of corsort_rho_sum_delta for n = 500
100%|██████████| 10000/10000 [30:50<00:00,  5.40it/s]
Evaluate convergence of corsort_rho_max_delta for n = 500
100%|██████████| 10000/10000 [30:01<00:00,  5.55it/s]
Evaluate convergence of corsort_rho_sum_rho for n = 500
100%|██████████| 10000/10000 [31:49<00:00,  5.24it/s]
Evaluate convergence of corsort_rho_max_rho for n = 500
100%|██████████| 10000/10000 [30:45<00:00,  5.42it/s]
Evaluate convergence of corsort_delta_sum_rho for n = 500
100%|██████████| 10000/10000 [28:10<00:00,  5.91it/s]
Evaluate convergence of corsort_delta_max_rho for n = 500
100%|██████████| 10000/10000 [27:48<00:00,  5.99it/s]

We save the raw results.

[5]:
import dill as pickle
from pathlib import Path

fn = Path(f"corsorts_n_{n}_nt_{nt}.pkl")
if fn.exists():
    with open(fn, 'rb') as f:
        convergence = pickle.load(f)
else:
    with open(fn, 'wb') as f:
        pickle.dump(convergence, f)

Then we display the results.

[6]:
m = n*(n-1)/2
fig = plt.figure(figsize=(15, 10))
ax = plt.axes()
decim=40
color_dict = auto_colors(sort_list)
for name in legends:
    ref = convergence[name]
    color = color_dict[name]
    p_m = ref.shape[1]
    x = np.arange(p_m)[::decim]
    ref = ref[:, ::decim]
    q = np.zeros((5, ref.shape[1]))
    for i, per in enumerate([2.5, 50, 97.5]):
        q[i, :] = np.percentile(ref, per, axis=0)
    q = q/m
    ax.plot(x, q[1, :], label=legends[name], color=color)
    ax.fill_between(x, q[0, :], q[2, :], alpha=.2, color=color)
plt.legend()
plt.grid()
ax.tick_params(labelright=True, right=True)
plt.ylabel('Distance to sorted list')
plt.xlabel('Comparisons performed')
plt.ylim([0, None])
plt.xlim([0, None])
plt.show()
../_images/notebooks_corsorts_13_0.png

Quick interpretation :

  • max and plus variants (for measure of information) are very close to each other, with a slight advantage to max.

  • Using a \(\Delta\) external scorer creates a kind of bump. The \(\rho\) external scorer does not have this issue.

  • Most of the time (the exception been around the bump from previous item), the \(\Delta\) internal scorer is better.

In the end, we selected Corsort-\(\Delta\)Max \(\rho\):

  • Internal scorer: \(\Delta\)

  • Information measure: \(\max(I(i), I(j))\)

  • External scorer: \(\rho\)