Skip to content

Enable user to take ownership of MPI#722

Open
otbrown wants to merge 21 commits intoQuEST-Kit:develfrom
otbrown:mpi-update
Open

Enable user to take ownership of MPI#722
otbrown wants to merge 21 commits intoQuEST-Kit:develfrom
otbrown:mpi-update

Conversation

@otbrown
Copy link
Copy Markdown
Collaborator

@otbrown otbrown commented Apr 14, 2026

Closes #712.

  • QuEST no longer uses MPI_COMM_WORLD everywhere. Instead it uses mpiCommQuest (which will very often just be a duplicate of MPI_COMM_WORLD). This aligns with best practise for library developers, as it helps avoid avoid clashes with user MPI calls.
  • Added a new custom environment constructor which lest the user positively declare that they want to take ownership of MPI. QuEST relinquishes responsibility, and will not attempt to initialise or finalise MPI itself. It will still throw if there are an incompatible number of MPI processes.
  • Added some minimal testing... We should probably look at environment testing in more detail at some point! In fairness, I think if the mpiCommQuest stuff was broken essentially any QuEST program would also be broken.
  • Swapped Irecv and Isend in the comms routine as is the first thing I was asked about when we encountered issues with slingshot-11. It doesn't make any difference.
  • New Subcommunicator interface lets the user create a custom QuESTEnv on communicator that is not MPI_COMM_WORLD. This interface is guarded behind compile-time flags as it requires bringing the MPI header into the interface, which means all consumers of libQuEST need to include MPI as well. This interface is not yet documented or tested. I think this is tolerable in the short term as it should only appeal to advanced users.

Speaking of tests! Test results on my laptop:

otbrown@oliver-win:~/quantum/quest/source/QuEST/build$ mpirun -n 4 ./tests/tests

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 0
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       4

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   10
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  CPU + OMP + MPI

Randomness seeded to: 2200603699
===============================================================================
All tests passed (293841 assertions in 275 test cases)

and min_example output:

Date:
2026-04-13T18:54:07+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled....................1
    isMpiSubCommunicatorCompiled.....0
    isGpuCompiled....................0
    isOmpCompiled....................1
    isCuQuantumCompiled..............0
  [deployment]
    isMpiEnabled............1
    doesUserOwnMpi..........0
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........754.7 GiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............35
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........37
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............17
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]

Qureg:
  [deployment]
    isMpiEnabled.....1
    isGpuEnabled.....0
    isOmpEnabled.....1
  [dimension]
    isDensMatr.....0
    numQubits......20
    numCols........N/A
    numAmps........2^20 = 1048576
  [distribution]
    numNodes.....2^3 = 8
    numCols......N/A
    numAmps......2^17 = 131072 per node
  [memory]
    cpuAmps...........2 MiB per node
    gpuAmps...........N/A
    cpuCommBuffer.....2 MiB per node
    gpuCommBuffer.....N/A
    globalTotal.......32 MiB

Qureg (20 qubit statevector, 1048576 qcomps over 8 nodes, 4 MiB per node):
    -0.00062991+0.00025551i   |0⟩        [node 0]
    (8.3557e-5)-0.00063022i   |1⟩
    -0.00053343-0.00011416i   |2⟩
    0.00028553-0.00018421i    |3⟩
    -0.00094126-0.00097974i   |4⟩
    0.00016001+0.00042058i    |5⟩
    0.00073393-0.002013i      |6⟩
    0.00061578+0.00046316i    |7⟩
    0.00048523+0.00068937i    |8⟩
    -0.00081878-0.00060096i   |9⟩
    0.0012373+0.00049652i     |10⟩
    0.0001958+0.00014134i     |11⟩
    0.0013159-0.00040911i     |12⟩
    -0.00061634+0.00094838i   |13⟩
    0.00019392-0.00049259i    |14⟩
    -0.00031263+0.00018272i   |15⟩
                ⋮
    -0.00053361+0.00010033i   |1048560⟩  [node 7]
    0.0017006-0.00053801i     |1048561⟩
    0.0010772-(1.5397e-5)i    |1048562⟩
    -0.00085736-0.00062965i   |1048563⟩
    0.00083238-0.00053071i    |1048564⟩
    -0.00025421+0.00064579i   |1048565⟩
    -0.00069857-0.00043407i   |1048566⟩
    (-9.9973e-5)-0.00094092i  |1048567⟩
    0.00092175-0.00068456i    |1048568⟩
    -0.00026758-0.00027345i   |1048569⟩
    0.00034301+0.00011848i    |1048570⟩
    0.00026923-0.00043647i    |1048571⟩
    -0.00088784-0.00011044i   |1048572⟩
    -0.00037642-0.00087563i   |1048573⟩
    0.00031388+0.00031749i    |1048574⟩
    0.00014548+0.00030294i    |1048575⟩

Total probability: 1

JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
161633       QuEST_min+   00:00:04
161633.batch      batch   00:00:04
161633.exte+     extern   00:00:04
161633.0     min_examp+   00:00:01

There is a caveat here: distributed tests on Cirrus keep failing, but this appears to be true of devel as well as this branch, so I'll do some more digging and raise it as a separate issue once I have a better of idea of what is going on.

otbrown and others added 21 commits February 27, 2026 15:53
…for SubComm, because of course it enforces CXX
…ything but it makes MPI library implementers less nervous
@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented Apr 16, 2026

On the plus side, a distributed QFT benchmark circuit runs perfectly correctly:

QFT Benchmark on 34 Qubits
Date:
2026-04-14T15:24:08+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST QFT Benchmark

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled...........1
    isGpuCompiled...........0
    isOmpCompiled...........1
    isCuQuantumCompiled.....0
  [deployment]
    isMpiEnabled............1
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........1.475 TiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............36
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........38
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............18
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]


Performing QFT on 34 qubits.
Time to run QFT Circuit: 420.544s.

Correctness checks:
Statevector norm: 1
First amplitude: (7.62939e-06, 0)
JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
162993       QFT_bench+   00:07:08
162993.batch      batch   00:07:08
162993.exte+     extern   00:07:08
162993.0            qft   00:07:05

@otbrown otbrown requested a review from JPRichings April 16, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant