Ticket #11126 (closed: fixed)
Reorganize remote algorithms so that they can use different job managers
Reported by: | Federico M Pouzols | Owned by: | Federico M Pouzols |
---|---|---|---|
Priority: | major | Milestone: | Release 3.4 |
Component: | Framework | Keywords: | |
Cc: | | Blocked By: | #11064, #11122, #11123, #11124, #11392 |
Blocking: | #9277, #11361, #11373, #11538 | Tester: | Martyn Gigg |
Description (last modified by Federico M Pouzols)
The idea is that the remote algorithms should be able to use different web service APIs or underlying mechanisms (ssh, etc.) to control remote jobs on compute resources. Examples: LSF through the IBM PAC (Platform Application Center), or SLURM.
If we follow the design/diagram on slide 19 of https://github.com/mantidproject/documents/blob/master/Presentations/SOS18/Mantid%20HPC%20Challenges.pptx, we could move the code that is specific to the Mantid web service API (http://www.mantidproject.org/Remote_Job_Submission_API) into a class that extends IRemoteJobManager. This class could be called MantidWSAPIJobManager, for example. The code in question includes the code that submits HTTP requests, which currently lives in RemoteJobManager, and the code that processes parameters, response codes and messages, which currently lives in the individual remote algorithms.
This would make it possible to support other web services, such as those provided by the SLURM and LSF cluster schedulers / resource managers.
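A minimal sketch of the split described above, written in Python for brevity (the framework code itself is C++). The method names on IRemoteJobManager, the endpoint names, the parameter keys and the injected `http_post` callable are all assumptions made for illustration; they are not taken from the Mantid web service API or from the actual interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport signature: (url, params) -> (HTTP status, decoded body).
HttpPost = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, Any]]]


class IRemoteJobManager(ABC):
    """Scheduler-agnostic job control; the method names are assumptions."""

    @abstractmethod
    def authenticate(self, username: str, password: str) -> None:
        ...

    @abstractmethod
    def submit_remote_job(self, transaction_id: str, script_name: str,
                          script_body: str) -> str:
        ...

    @abstractmethod
    def abort_remote_job(self, job_id: str) -> None:
        ...


class MantidWSAPIJobManager(IRemoteJobManager):
    """Holds the HTTP requests and response handling for one web service."""

    def __init__(self, base_url: str, http_post: HttpPost) -> None:
        self._base_url = base_url.rstrip("/")
        self._post = http_post

    def _call(self, endpoint: str, params: Dict[str, str]) -> Dict[str, Any]:
        # Endpoint names and the "error" key are placeholders, not the real API.
        status, body = self._post(f"{self._base_url}/{endpoint}", params)
        if status != 200:
            message = body.get("error", "")
            raise RuntimeError(f"{endpoint} failed (HTTP {status}): {message}")
        return body

    def authenticate(self, username: str, password: str) -> None:
        self._call("authenticate", {"user": username, "password": password})

    def submit_remote_job(self, transaction_id: str, script_name: str,
                          script_body: str) -> str:
        body = self._call("submit", {
            "transaction_id": transaction_id,
            "script_name": script_name,
            "script_body": script_body,
        })
        return str(body["job_id"])

    def abort_remote_job(self, job_id: str) -> None:
        self._call("abort", {"job_id": job_id})
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and lets the class be exercised without a network; a SLURM or LSF manager would implement the same three methods over its own mechanism.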
We still need to clarify a few points, but in principle this would require moving the code that currently lives in the class RemoteJobManager (HTTP requests) and in the remote algorithms (all current implementations, including Authenticate, AbortRemoteJob, StartRemoteTransaction, SubmitRemoteJob, etc.).
Once this is done, the remote algorithms (Authenticate, SubmitRemoteJob, etc.) will rely only on methods from IRemoteJobManager. The remote algorithms will need testing at SNS, where they are currently used in several interfaces with the Fermi cluster. These are the scripts included in the Mantid distribution that currently use remote algorithms:
- https://github.com/mantidproject/mantid/blob/master/Code/Mantid/scripts/Interface/reduction_gui/reduction/scripter.py
- https://github.com/mantidproject/mantid/blob/master/Code/Mantid/scripts/Interface/reduction_gui/widgets/cluster_status.py
These scripts are imported/used in the following interfaces:
- Diffraction -> Powder Diffraction Reduction
- Direct -> DGS Reduction
- SANS -> ORNL SANS
- Reflectometry -> REFL reduction
- Reflectometry -> REFM reduction
When all this works, it would then be possible to add a new specific RemoteJobManager for tomography jobs (and/or other types) on SCARF: SCARF_LSFRemoteJobManager or similar.
Note: when all this is done, make sure that the algorithms do not use FacilityInfo::getRemoteJobManager(), which would then be removed (#11373).
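One way the end state could look, sketched in Python under stated assumptions: the algorithm body looks up a job manager by compute resource name and delegates to it, so adding SCARF support only means registering another manager. The registry functions, the single `submit_remote_job` method signature and the returned strings are hypothetical; only the class names and the Fermi and SCARF resource names come from this ticket.

```python
from typing import Callable, Dict


class IRemoteJobManager:
    """Minimal stand-in for the interface; the method signature is an assumption."""

    def submit_remote_job(self, script_name: str) -> str:
        raise NotImplementedError


class MantidWSAPIJobManager(IRemoteJobManager):
    def submit_remote_job(self, script_name: str) -> str:
        # Placeholder result; a real manager would talk to the web service.
        return f"mantid-ws:{script_name}"


class SCARF_LSFRemoteJobManager(IRemoteJobManager):
    def submit_remote_job(self, script_name: str) -> str:
        # Placeholder result; a real manager would talk to LSF.
        return f"scarf-lsf:{script_name}"


# Hypothetical registry: compute resource name -> job manager factory.
_JOB_MANAGERS: Dict[str, Callable[[], IRemoteJobManager]] = {}


def register_job_manager(resource: str,
                         factory: Callable[[], IRemoteJobManager]) -> None:
    _JOB_MANAGERS[resource] = factory


def create_job_manager(resource: str) -> IRemoteJobManager:
    try:
        return _JOB_MANAGERS[resource]()
    except KeyError:
        raise ValueError(f"no job manager registered for '{resource}'") from None


def submit_remote_job(compute_resource: str, script_name: str) -> str:
    """Algorithm body: no scheduler-specific code, only interface calls."""
    return create_job_manager(compute_resource).submit_remote_job(script_name)


register_job_manager("Fermi", MantidWSAPIJobManager)
register_job_manager("SCARF", SCARF_LSFRemoteJobManager)
```

With this arrangement the algorithm never asks the facility description for a concrete manager, which is the dependency the note above wants removed.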
Change History
comment:2 Changed 6 years ago by Federico M Pouzols
- Blocking 9277 added
(In #9277) Added a few blocking tickets (needed to make the RemoteJobManager generic to different web services and underlying job control mechanisms).
comment:10 Changed 6 years ago by Federico M Pouzols
- Summary changed from Reorganize remote algorithms so that they can use different web service APIs to Reorganize remote algorithms so that they can use different job managers
comment:11 Changed 6 years ago by Federico Montesino Pouzols
- Status changed from assigned to inprogress
new v2 of remote algorithms, re #11126
Changeset: b342af5253445e146626dcd66288f90c1197abdf
comment:12 Changed 6 years ago by Federico Montesino Pouzols
updated v1 tests and added v2 tests, re #11126
Changeset: 6b80a62057f2345c9be5e054b53d264a1404c0d1
comment:13 Changed 6 years ago by Federico Montesino Pouzols
use remote algorithms v1 in reduction gui(s), re #11126
Changeset: 1cdec780e7a634dac78b99d6bd4e09dd5e20cfac
comment:14 Changed 6 years ago by Federico Montesino Pouzols
add rst docs for the v2 remote algorithms, re #11126
Changeset: e237fe28d1d008a8e7f1a91a1b9dcd814e4586f6
comment:15 Changed 6 years ago by Federico Montesino Pouzols
- Status changed from inprogress to verify
- Resolution set to fixed
This is being verified as pull request #525.
comment:16 Changed 5 years ago by Federico Montesino Pouzols
Add note on v1 differences in v2 algorithms rst doc, re #11126
Changeset: 13cbaa18fda653290b06cba681d6c1ec9bc62f9e
comment:18 Changed 5 years ago by Martyn Gigg
- Status changed from verify to verifying
- Tester set to Martyn Gigg
comment:19 Changed 5 years ago by Federico Montesino Pouzols
use v1 also for StartRemoteTransaction, re #11126
Changeset: 7c66bb2256c5f1aa94f0f5adad8cc62dab789480
comment:20 Changed 5 years ago by Martyn Gigg
This all seems okay to me now as the new versions shouldn't affect any existing code.
comment:21 Changed 5 years ago by Martyn Gigg
- Status changed from verifying to closed
Merge pull request #525 from mantidproject/11126_remote_algorithms_use_different_job_managers
Modify remote algorithms to support different job managers (for Fermi, SCARF, etc.)
Full changeset: 72029d8edbd3fbf9aa6fcbe81892d81bb96fdaec
comment:22 Changed 5 years ago by Stuart Campbell
This ticket has been transferred to github issue 11965