1. administrate
1.1. scheduling
1.1.1. slurm
1.1.1.1. over K8s
1.1.2. kubernetes
1.1.3. your_scheduler
1.2. container support
1.2.1. docker
1.2.2. nvidia
1.2.3. podman
1.3. migration
1.3.1. how to migrate progressively from old to new
1.4. hw/security
1.4.1. backup fs
1.4.2. which node for service and compute?
1.4.2.1. clear for slurm
1.4.2.2. not clear for K8s
1.4.3. use of Juju
1.4.4. deploy os and connect to cluster
1.4.5. cope with node failure
1.4.6. correct drivers installed for GPU
1.4.7. how to cope with root inside a container?
1.4.8. burst in and out from/to the cloud
1.4.9. or DMZ?
1.4.10. is it important to have the latest GPU/CPU generation?
1.4.10.1. when do we care about
1.4.10.1.1. performance
1.4.10.1.2. ease of use
1.4.10.1.3. ease of moving from target to target
1.4.11. what is the impact of adding gated QC?
2. make it available to external users
2.1. as a whole environment on their premises, only submitting jobs to our resources via API
2.2. via a bus on the cloud
2.3. train
2.3.1. internal
2.3.2. external
3. experiment tracking
3.1. what to save?
3.1.1. when to archive
3.1.2. what/when to dump?
3.2. how to track? lock it? which checksum?
3.3. how to inherit from experiments
3.4. where to save
3.5. how to compare, list
3.6. how to share data with others
3.7. ensure reproducibility
3.8. how to publish
4. sw stack
4.1. how to install, deploy, maintain, and track on premises, in the cloud, and on the user's premises
4.2. specific to user, domain?
4.3. specific to architecture
4.4. cloud compatible
4.5. all containerized?
4.6. dedicated tools
4.6.1. decimate
4.6.2. ktf
5. interface
5.1. web interface
5.2. gitlab interface
5.3. command line interface
5.3.1. compatible with slurm and k8s?
5.3.2. simple enough to be used from unix
5.3.3. windows client needed?