Benchmarks

We compare

  • a Fortran implementation due to Sébastien Motsch using the Verlet list method with double precision floating-point numbers, run on the NextGen compute cluster at Imperial College London;

  • the CPU version of SiSyPHE with double precision tensors (float64) and the block-sparse reduction method (BSR), run on an Intel MacBook Pro (2 GHz Intel Core i5 processor with 8 GB of memory);

  • the GPU version of SiSyPHE with single precision tensors (float32), with and without the block-sparse reduction method (BSR), run on a GPU cluster at Imperial College London using an NVIDIA RTX 2080 Ti GPU;

  • the GPU version of SiSyPHE with double precision tensors (float64), with and without the block-sparse reduction method (BSR), run on a GPU cluster at Imperial College London using an NVIDIA RTX 2080 Ti GPU.

Note

The choice of single or double precision tensors is automatically made according to the floating point precision of the input tensors.
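
For instance, the precision is selected simply by casting the initial condition (a minimal sketch, assuming a CUDA-enabled installation of PyTorch; on a CPU, use torch.FloatTensor or torch.DoubleTensor instead):

import torch

N, L = 100000, 100.

# float32 initial condition: the whole simulation runs in single precision.
pos = L*torch.rand((N,2)).type(torch.cuda.FloatTensor)

# float64 initial condition: the whole simulation runs in double precision.
# pos = L*torch.rand((N,2)).type(torch.cuda.DoubleTensor)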

We run a Vicsek model in a square periodic box with fixed parameters \(L=100\), \(\nu=5\), \(\sigma=1\), \(dt=0.01\), with the speed set to \(c=R\), and various choices of \(N\) and \(R\). The simulation is run for 10 units of time (i.e. 1000 iterations).

For each value of \(N\), three values of \(R\) are tested which correspond to a dilute regime, a moderate regime and a dense mean-field regime. When the particles are uniformly scattered in the box, the average number of neighbours in the three regimes is respectively \(\sim3\), \(\sim30\) and \(\sim300\). The regime has a strong effect on the efficiency of the Verlet list and block-sparse reduction methods.
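
These figures follow from the uniform density \(N/L^2\): a disc of radius \(R\) contains on average \(N\pi R^2/L^2\) particles. A quick sanity check in Python:

import math

L = 100.
regimes = {10**4: (1., 3., 10.), 10**5: (.3, 1., 3.), 10**6: (.1, .3, 1.)}
for N, radii in regimes.items():
    for R in radii:
        # Expected number of neighbours of a particle, for a uniform density.
        print(f"N = {N}, R = {R}: ~{N*math.pi*R**2/L**2:.0f} neighbours")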

In the table below, the total computation times are given in seconds, except when they exceed 12 hours. In each row, the best time is marked with an asterisk (*) and the worst with a dagger (†).

                     Fortran    sisyphe64    sisyphe32    sisyphe32    sisyphe64    sisyphe64
                                CPU BSR      GPU          GPU BSR      GPU          GPU BSR

N = 10k    R = 1     19s        26s †        2.1s *       3.3s         13s          3.6s
           R = 3     59s †      29s          2.1s *       3.4s         13s          3.9s
           R = 10    494s †     69s          2.1s *       3.4s         13s          7.9s

N = 100k   R = 0.3   309s       323s         29s          4.3s *       973s †       9.3s
           R = 1     1522s †    384s         29s          4.5s *       973s         11s
           R = 3     3286s †    796s         29s          4.9s *       973s         28s

N = 1M     R = 0.1   >12h †     6711s        2738s        22s *        >12h †       120s
           R = 0.3   >12h †     6992s        2738s        23s *        >12h †       135s
           R = 1     >12h †     9245s        2738s        26s *        >12h †       194s

The GPU implementation is at least 5 times faster than the other methods in the dilute regime and outperforms them by three orders of magnitude in the mean-field regime with large values of \(N\). Without the block-sparse reduction method, the GPU implementation suffers from the quadratic complexity of the full pairwise interaction, which is why its computation times in the table do not depend on \(R\). The block-sparse reduction method is less sensitive to the density of particles than the traditional Verlet list method. It comes at the price of a higher memory cost (see Tutorial 04: Block-sparse reduction) but makes it possible to run large-scale simulations even on a CPU. On a CPU, performance is not critically affected by the floating-point precision, so only simulations with double precision tensors are shown.
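
Both the Verlet list and the block-sparse reduction rely on the same cell-based idea: particles are binned into square cells of side at least \(R\), so that all the neighbours of a particle lie in the 3×3 block of cells around it. The following self-contained sketch illustrates the principle with NumPy; it is a schematic illustration only, not SiSyPHE's actual implementation, and periodic boundary conditions are ignored for brevity.

import numpy as np

rng = np.random.default_rng(0)
N, L, R = 1000, 100.0, 3.0
pos = L * rng.random((N, 2))

# Brute force: all N^2 pairwise distances.
diff = pos[:, None, :] - pos[None, :, :]
brute = int((np.linalg.norm(diff, axis=-1) < R).sum()) - N  # drop self-pairs

# Cell-based method: bin the particles into cells of side >= R.
n_cells = int(L // R)  # cells per dimension; side = L/n_cells >= R
cell = np.minimum((pos / (L / n_cells)).astype(int), n_cells - 1)
buckets = {}
for i, (cx, cy) in enumerate(cell):
    buckets.setdefault((int(cx), int(cy)), []).append(i)

count = 0
for (cx, cy), members in buckets.items():
    # Candidate neighbours all live in the 3x3 block of surrounding cells.
    cand = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            cand += buckets.get((cx + dx, cy + dy), [])
    d = np.linalg.norm(pos[members][:, None, :] - pos[cand][None, :, :], axis=-1)
    count += int((d < R).sum()) - len(members)  # drop self-pairs

print(brute == count)  # True: same counts with far fewer distance evaluations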

Note

The parameters of the block-sparse reduction method are chosen using the method best_blocksparse_parameters() (see Tutorial 04: Block-sparse reduction). As a guideline, in this example and on the GPU, the best number of cells is respectively \(\sim 15^2\), \(\sim 42^2\) and \(\sim 70^2\) for \(N=10^4\), \(N=10^5\) and \(N=10^6\) with sisyphe32, and \(\sim 25^2\), \(\sim 60^2\) and \(\sim 110^2\) with sisyphe64. Note that the number of cells is automatically capped at \((L/R)^2\). For the last case (sisyphe64 with \(R=1\)), the maximal number of cells depends on the available memory.
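
For instance, with the guideline for sisyphe64 and \(N=10^5\), the cap works out as follows (a sketch of the arithmetic only; in practice best_blocksparse_parameters() performs this search automatically):

L = 100.
R = 1.
suggested = 60**2       # guideline above for sisyphe64 with N = 10^5
cap = int(L/R)**2       # the number of cells is capped at (L/R)^2
number_of_cells = min(suggested, cap)
print(number_of_cells)  # 3600: here the guideline lies below the cap of 10000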

As an example, the case sisyphe32 GPU BSR with \(N=10^5\) and \(R=1\) is obtained with the following script.

import time
import math
import torch
import sisyphe.models as models
from sisyphe.display import save

dtype = torch.cuda.FloatTensor

# Replace by the following line for double precision tensors.
# dtype = torch.cuda.DoubleTensor

N = 100000
L = 100.
dt = .01

nu = 5.
sigma = 1.

R = 1.
c = R

# The type of the initial condition will determine the floating point precision.

pos = L*torch.rand((N,2)).type(dtype)
vel = torch.randn(N,2).type(dtype)
vel = vel/torch.norm(vel,dim=1).reshape((N,1))

# Vicsek model with block-sparse reduction on a 42 x 42 grid of cells.
simu = models.Vicsek(
    pos = pos,
    vel = vel,
    v = R,
    sigma = sigma,
    nu = nu,
    interaction_radius = R,
    box_size = L,
    dt = dt,
    block_sparse_reduction = True,
    number_of_cells = 42**2)

next(simu) # GPU warmup

s = time.time()
# Run the simulation until time t = 10 and measure the elapsed time.
data = save(simu, [10.], [], [])
e = time.time()

print(e - s)  # total computation time in seconds
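
The corresponding case without the block-sparse reduction method differs only in the construction of the model. With the same initial conditions and parameters as above, a sketch:

# Same benchmark without block-sparse reduction (sisyphe32 GPU):
# the full pairwise interaction is computed at each step.
simu = models.Vicsek(
    pos = pos,
    vel = vel,
    v = R,
    sigma = sigma,
    nu = nu,
    interaction_radius = R,
    box_size = L,
    dt = dt,
    block_sparse_reduction = False)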