Discussed
- Use Slurm 16:
- The cluster on Engage1 is using slurm 16. So the new node will be consistent with the existing cluster.
- Get a block of IPs from Engage1:
- these IP will be used by the virtual nodes as a part of the cluster.
- Slurm.conf will have the nodes with the IPs from the block configured to federate the virtual nodes in the cluster.
- Setup a tunnel using sshuttle:
- Create a gateway with the public IP which can be reached from the controller and create tunnels using sshuttle to bridge the networks
- Configure the virtual machine on Kaizen as compute node
- Provision a VM and configure it with basic features using Slurm 16 so that it could be federated to the cluster.
- Already have a new slurm controller configured on Engage1 on a VM
- Not being much used by others
- Can use this to add this new node
- It’s on a VM. So can be replicated as well
- Open points:
- Is there a dns server? If not how will the nodes be reached out dynamically in future
- Specifc block IPs - how will this be mapped and kept consistent with the nodes and in slurm.conf
- Ask slurm controller not to poll the nodes - polling the suspended nodes leads to down state of the node and the jobs are cleared off the queue
- Salt master location - same as that of the cluster or use a local master with config syned from the main master or repo
- Check cvmfs:
- It’s a specific file system that contains packages used by OSG jobs
- Requires specifc config to reach out and mount the FS on the node
- Check the link https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallCvmfs to install the client on the node
To-do
- Document the configs done so far:
- Document the sshuttle config to create the gateway.
- sshfs setup
- Upgrade the slurm node with slurm 16 and check the working