Discussed

  • Use Slurm 16:
    • The cluster on Engage1 runs Slurm 16, so the new node will stay consistent with the existing cluster.
  • Get a block of IPs from Engage1:
    • These IPs will be used by the virtual nodes that join the cluster.
    • slurm.conf will list the virtual nodes with IPs from this block so that they are federated into the cluster.
  • Set up a tunnel using sshuttle:
    • Create a gateway with a public IP that can be reached from the controller, and create tunnels using sshuttle to bridge the networks.
  • Configure the virtual machine on Kaizen as a compute node:
    • Provision a VM and configure it with basic features using Slurm 16 so that it can be federated into the cluster.
  • A new Slurm controller is already configured on a VM on Engage1:
    • Not used much by others
    • Can be used to add the new node
    • It’s on a VM, so it can be replicated as well
  • Open points:
    • Is there a DNS server? If not, how will the nodes be reached dynamically in the future?
    • Specific block of IPs - how will it be mapped and kept consistent with the nodes and with slurm.conf?
    • Ask the Slurm controller not to poll the nodes - polling suspended nodes puts them in the down state, and their jobs are cleared from the queue
    • Salt master location - same as the cluster's, or use a local master with config synced from the main master or a repo?
  • Check cvmfs:
    • It’s a specific file system that contains packages used by OSG jobs
    • Requires specific config to reach and mount the FS on the node
    • Check the link https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallCvmfs to install the client on the node
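
One possible way to keep the IP block and slurm.conf consistent (see the open point above) is to generate the node entries from the block itself. The sketch below is only an illustration: the function name node_lines, the node name prefix "vnode" and the 192.0.2.0/29 documentation range are all assumptions, not the actual Engage1 values. NodeName and NodeAddr are standard slurm.conf parameters; any further node attributes would still have to be added by hand.

```python
import ipaddress


def node_lines(cidr, prefix="vnode", count=None):
    """Return one slurm.conf NodeName line per usable host address in cidr.

    prefix and the numbering scheme are hypothetical; adjust to the real
    naming convention of the cluster.
    """
    hosts = list(ipaddress.ip_network(cidr).hosts())
    if count is not None:
        hosts = hosts[:count]
    return [
        f"NodeName={prefix}{i:02d} NodeAddr={ip}"
        for i, ip in enumerate(hosts, start=1)
    ]


if __name__ == "__main__":
    # 192.0.2.0/29 is a placeholder block, not the one allocated by Engage1.
    for line in node_lines("192.0.2.0/29", count=3):
        print(line)
```

Because the lines are derived from the block, regenerating them after the block changes keeps the node names and addresses in step; the output would be pasted or included into slurm.conf on the controller and the nodes.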

To-do

  • Document the configs done so far:
    • Document the sshuttle config to create the gateway.
    • sshfs setup
  • Upgrade the Slurm node to Slurm 16 and verify that it works
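
As a starting point for documenting the sshuttle config, the sketch below builds the command line for a tunnel through the gateway. It is an assumption about how the tunnel might be invoked, not the config actually in use: the helper name sshuttle_argv, the gateway address user@gateway.example.org and the subnet are placeholders. The only sshuttle syntax relied on is the -r option for the remote host followed by the subnets to route.

```python
import ipaddress


def sshuttle_argv(remote, subnets):
    """Build the argument list for an sshuttle tunnel via `remote`.

    remote  : SSH target of the gateway, e.g. "user@host" (placeholder).
    subnets : CIDR strings to route through the tunnel.
    """
    if not subnets:
        raise ValueError("at least one subnet is required")
    for net in subnets:
        # Raises ValueError if the string is not a valid network.
        ipaddress.ip_network(net)
    return ["sshuttle", "-r", remote, *subnets]


if __name__ == "__main__":
    # Placeholder gateway and subnet; print the command instead of running it.
    argv = sshuttle_argv("user@gateway.example.org", ["192.0.2.0/29"])
    print(" ".join(argv))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the resulting line can be copied into the documentation once the real gateway and block are filled in.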