calypso.dispatchers.orchestrator.scheduler.slurm module

class calypso.dispatchers.orchestrator.scheduler.slurm.Slurm(name: str, executor: BaseExecutor, **kwargs)

Bases: BaseScheduler

generate_job_script(job: Job)

Public entry point that generates the Slurm submission script for the given job.

Parameters:

job (Job) – A Job object containing the job information.

Examples

>>> job = Job(
        name=name,
        job=job,
        job_id=None,
        state=JobStatus.unsubmitted,
        remote_root=machine.remoteroot,
        local_root=machine.localroot,
        machine_idx=machine.machine_idx,
    )

Returns:

The shell script of this job submitting to Slurm.

Return type:

str

generate_task_env()

Environment variables required by the script should be set up in this function.

Returns:

The command lines that set the environment variables.

Return type:

str
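As an illustration of what such an environment block could look like, here is a minimal sketch. The helper name `build_task_env` and the variables passed to it are hypothetical and are not part of this module; the actual method takes no arguments and may produce its output differently.

```python
import shlex


def build_task_env(env: dict[str, str]) -> str:
    """Hypothetical helper: render one `export` line per environment variable."""
    # shlex.quote protects values containing spaces or shell metacharacters.
    return "".join(
        f"export {key}={shlex.quote(value)}\n" for key, value in env.items()
    )


# Example values chosen for illustration only.
env_lines = build_task_env({"OMP_NUM_THREADS": "4", "MKL_NUM_THREADS": "1"})
print(env_lines, end="")
```

The returned string can be placed in the submission script after the SBATCH directives and before the commands that run the task.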

generate_task_head(name: str) → str

The interpreter for the Slurm submission script should always be set to #!/bin/bash.

The script must also contain SBATCH directives for resource allocation, such as:

#SBATCH --nodes=2  # This requests two nodes to be allocated for the job.

Returns:

The interpreter line and the SBATCH directives for resource allocation.

Return type:

str
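A script head of the kind described above might be assembled as in the following sketch. The function `build_task_head` and its `nodes` and `ntasks_per_node` parameters are assumptions made for illustration; the real method receives only `name` and presumably takes its resource settings from the scheduler configuration.

```python
def build_task_head(name: str, nodes: int = 2, ntasks_per_node: int = 1) -> str:
    """Hypothetical helper: return the shebang plus a few SBATCH directives."""
    lines = [
        "#!/bin/bash",                                   # interpreter line
        f"#SBATCH --job-name={name}",                    # name shown in the queue
        f"#SBATCH --nodes={nodes}",                      # number of nodes to allocate
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",  # tasks started on each node
    ]
    return "\n".join(lines) + "\n"


head = build_task_head("calypso_task")
print(head, end="")
```

Slurm only reads `#SBATCH` lines that appear before the first executable command, so the head has to come first in the generated script.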

generate_task_script(jobs: list[Job], submit_script_name: str)

kill(job: Job)

Kill the given job, identified by its job id.

Parameters:

job (Job) – The job to be killed; its job id identifies it to Slurm.

Returns:

A tuple containing the output, the error output, and the return code.

Return type:

tuple

Raises:
  • RuntimeError – If the return code is not zero.

  • RuntimeError – If the string “error” appears in the error output.

query(job: Job)

Check the job state. First check whether the job has finished according to a file tag; if it has not, check the status by job id.

Parameters:

job (Job) – A Job object containing the job information.

Returns:

The current status of the job.

Return type:

JobStatus

Raises:

RuntimeError – If the return code is not zero or the string “error” appears in the error output.
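The two-step lookup described for query (file tag first, scheduler state second) can be sketched as follows. Everything here is hypothetical: `SketchStatus` stands in for the real JobStatus, whose members other than `unsubmitted` are not documented in this section, and the tag file name is invented. The state codes `R`, `CG` and `PD` are the compact codes that Slurm's `squeue` reports for running, completing and pending jobs.

```python
import tempfile
from enum import Enum
from pathlib import Path
from typing import Optional


class SketchStatus(Enum):
    """Hypothetical stand-in for JobStatus."""
    finished = "finished"
    running = "running"
    waiting = "waiting"
    unknown = "unknown"


def resolve_status(tag_file: Path, scheduler_state: Optional[str]) -> SketchStatus:
    # Step 1: a tag file left behind by the job means it has finished.
    if tag_file.exists():
        return SketchStatus.finished
    # Step 2: otherwise fall back to the state reported for the job id.
    if scheduler_state in ("R", "CG"):
        return SketchStatus.running
    if scheduler_state == "PD":
        return SketchStatus.waiting
    return SketchStatus.unknown


with tempfile.TemporaryDirectory() as workdir:
    tag = Path(workdir) / "job_finished.tag"  # hypothetical tag file name
    before = resolve_status(tag, "R")
    tag.touch()
    after = resolve_status(tag, "R")

print(before, after)
```

Checking the tag file first avoids a scheduler call for jobs that have already left the queue.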

submit(jobs: list[Job], timeout=3600)

Submit the jobs and return the job id. Raise an error if the return code is not zero or an error message appears in stderr or the log.

Parameters:

jobs (list[Job]) – The jobs to submit.

Returns:

job_id – Parent job id.

Return type:

int

Raises:
  • RuntimeError – If the return code is not zero.

  • RuntimeError – If “error”, “command not found” or “No such file or directory” appears in stderr or the log.
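The checks listed above can be sketched as a small function. This is an illustration under assumptions, not the module's implementation: `parse_submission` is a hypothetical name, and the sketch assumes the default `sbatch` output line `Submitted batch job <id>`; the class may obtain the job id differently.

```python
import re

# Substrings that the documentation above treats as failures.
ERROR_MARKERS = ("error", "command not found", "No such file or directory")


def parse_submission(out: str, err: str, returncode: int) -> int:
    """Hypothetical helper: validate a submission result and extract the job id."""
    if returncode != 0:
        raise RuntimeError(f"submission failed with return code {returncode}: {err}")
    for marker in ERROR_MARKERS:
        if marker in err:
            raise RuntimeError(f"submission reported {marker!r}: {err}")
    # Assumes the default sbatch message format.
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError(f"no job id found in output: {out!r}")
    return int(match.group(1))


job_id = parse_submission("Submitted batch job 12345\n", "", 0)
print(job_id)
```

The return code is checked before the text scan because some failures leave stderr empty, while others exit with code zero and only show up as a message.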