###Discussed:###

  • alternative approach to DHCP
    • have a script that takes the DNS server, slurm.conf file location, hostname, and IP as input and creates a VM for the cluster
    • have static files that store the hostname-to-IP mapping; these are checked every time a VM is created to pick a free name and IP for it. Each entry carries a status flag that marks whether the name/IP is available
      • has to be single-threaded to avoid concurrent updates and ambiguity
  • updating the config across the cluster when the cluster changes
    • restarting the controller is fine as long as the jobs in the queue don’t get killed
    • use Slurm’s restore config to update the config across the compute nodes
  • suspend jobs and bring them back
    • hold the state of the VM during job suspend
      • suspend vs snapshot - suspend keeps the process states in memory for a specific VM, whereas a snapshot keeps the state of the VM but not the process states, so every process begins as a new one when restored
      • suspend vs pause vs shelve
  • checkpointing - why it’s not preferred for restoring jobs
    • it doesn’t work with all the applications/jobs running on the cluster
    • requires bookkeeping by the end user
    • requires the user to design the application so that it can be checkpointed, which adds overhead
    • checkpointing doesn’t support GPU-related jobs
  • target OSG jobs and set them up as the basic use case
    • add more features later for general use cases
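
A minimal sketch of the static hostname-to-IP mapping discussed above, assuming the mapping lives in a single JSON file and that an exclusive file lock is an acceptable way to keep updates serialized. The file layout, the `in_use` flag name, and the function names `allocate` and `release` are hypothetical, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
import json
from contextlib import contextmanager


@contextmanager
def locked_table(path):
    """Load the mapping under an exclusive lock; write it back on a clean exit."""
    with open(path, "r+") as f:
        # Blocks until no other process holds the lock, so updates are serialized.
        fcntl.flock(f, fcntl.LOCK_EX)
        table = json.load(f)
        yield table
        # Reached only if the caller's block did not raise.
        f.seek(0)
        json.dump(table, f, indent=2)
        f.truncate()
        # The lock is released when the file is closed.


def allocate(path):
    """Mark the first free entry as in use and return its (hostname, ip)."""
    with locked_table(path) as table:
        for entry in table:
            if not entry["in_use"]:
                entry["in_use"] = True
                return entry["hostname"], entry["ip"]
    raise RuntimeError("no free hostname/IP left")


def release(path, hostname):
    """Mark the entry for `hostname` as available again."""
    with locked_table(path) as table:
        for entry in table:
            if entry["hostname"] == hostname:
                entry["in_use"] = False
```

The VM-creation script would call `allocate` before creating a VM and `release` when the VM is deleted; the returned hostname and IP could then be passed to the script together with the DNS server and slurm.conf location.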

###To-do:###

  • set up DHCP for virtual subnets on OpenStack. If it doesn’t work as required, work on the alternative approach
  • find the limitations of OpenStack wrt resource management
    • check if resources like CPU, memory, etc. are released when a VM is suspended so that they can be allocated to other processes
    • ask OpenStack people here at MOC for this
  • figure out how to suspend the state of the VM, including the process states
    • find out if there is something like a live snapshot, similar to live migration, that stores the process states so that they can be placed on disk and brought back later