Usage guide
Once the master board is initialized and at least one slave node is configured, you can start using the cluster: running scripts in parallel, viewing node metrics and network status, and managing the system remotely from the master.
Cluster execution
The system includes Slurm, a workload manager that provides a queue through which users submit and manage processes on the cluster. Some of the functions it offers are described below:
- You can check the status of the cluster with the 'sinfo' command, which shows the nodes you have and their state. If the state is 'idle', the node is operational and available. If it is 'down', something has failed and the node cannot be used; check whether it is a configuration problem or whether the node has hung.
- To run scripts, use the 'srun' command. It accepts options for, among other things, the number of processes to create, the number of nodes to use and where to redirect the output.
- After you launch a script, use the 'squeue' command to check its state, its elapsed execution time, its job identifier (JOBID) and the user who launched it.
- To cancel a running job, use the 'scancel' command followed by the job identifier (JOBID) shown by 'squeue'.
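The commands above can be combined into a short session like the one below. This is only a sketch: the job name 'demo', the output file 'result.txt' and the 'hostname' and 'sleep' commands are placeholders chosen for illustration, and the block does nothing except print a message when Slurm is not installed on the machine where it runs.

```shell
#!/bin/sh
# Sketch of a Slurm session on the master board.
# Assumptions: "demo", "result.txt", "hostname" and "sleep 60" are
# illustrative placeholders, not part of the system described here.

if command -v srun >/dev/null 2>&1; then
    # Show the nodes and their state (idle, down, ...).
    sinfo

    # Run 2 tasks on 1 node and write their output to result.txt.
    srun --nodes=1 --ntasks=2 --output=result.txt hostname

    # Launch a long-running job in the background so it can be inspected.
    srun --job-name=demo --ntasks=1 sleep 60 &
    sleep 2   # give the scheduler a moment to register the job

    # List the queue: JOBID, USER, ST (state), TIME, ...
    squeue

    # Look up the JOBID of the "demo" job and cancel it.
    jobid=$(squeue --noheader --name=demo --format=%i | head -n 1)
    [ -n "$jobid" ] && scancel "$jobid"

    wait      # collect the cancelled background srun
else
    echo "Slurm not found; run this on the master board"
fi
```

Looking the job up by name avoids copying the JOBID by hand; on the command line you would normally read it from the 'squeue' output and pass it to 'scancel' directly.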
For more information on the Slurm tool and how to use it, you can consult its documentation on its website: https://slurm.schedmd.com/documentation.html.
System monitoring
The system includes Ganglia, a monitoring tool that is installed and configured during system initialization so that you can monitor the cluster in detail. To see the web page with the metrics of your cluster, open a browser on the master board:
- First, connect to the master board. If you connect through SSH, add the '-Y' option to enable X11 forwarding, so that the browser window is displayed on your own machine. For example: 'ssh -Y odroid@192.168.X.X', using the IP address of your master board.
- Once you are on the master board, open a web browser. The system has the Midori browser installed.
- Once it is open, enter the following address in the address bar: http://localhost/ganglia.
- The main Ganglia page is displayed. A drop-down list in the centre-left part of the page shows all the nodes of your cluster. To view the data for a node, click its name and the page shows all the graphs generated for that node.
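The steps above can also be run as a single command from your own machine. The sketch below assumes that the 'midori' executable accepts a URL as its argument and that your machine runs an X server; MASTER_IP is a hypothetical variable that you set to the address of your master board.

```shell
#!/bin/sh
# Sketch: open the Ganglia page over SSH with X11 forwarding.
# Assumptions: MASTER_IP is a hypothetical variable holding the master
# board's IP address; midori is assumed to accept a URL argument.

GANGLIA_URL="http://localhost/ganglia"

if [ -n "${MASTER_IP:-}" ]; then
    # -Y enables trusted X11 forwarding, so the browser started on the
    # master board is displayed on the local screen.
    ssh -Y "odroid@${MASTER_IP}" midori "$GANGLIA_URL"
else
    echo "Set MASTER_IP to the address of your master board"
fi
```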
If you want to know more about the possibilities offered by this tool, you can take a look at its website: http://ganglia.sourceforge.net/.
Remote administration
The system includes scripts for cluster management and maintenance that run commands on all cluster nodes remotely from the master board. This lets you check the status of services or run update or restart commands quickly and centrally, without connecting to each node of the cluster one by one.
Two scripts are provided for this task, 'global_execute_seq' and 'global_execute_par'. Each runs the command, or pipeline of commands, passed as its parameter on all nodes of the cluster.
Some issues to consider:
- The list of cluster nodes used by these scripts is stored in the '/opt/scripts/odroid.par' file. This file is updated automatically after each slave node initialization to include the new node. To exclude one or more nodes from these remote administration scripts, comment out the corresponding line in '/opt/scripts/odroid.par', that is, add a '#' at the start of that line.
- Because these scripts are located under the 'bin/' directory, they can be run from any directory. For example, 'global_execute_seq hostname' returns the name of each of the assigned slave nodes.
- The difference between the two scripts is that the one with the suffix 'seq' runs the command sequentially, that is, it waits for the command to finish on one slave node before launching it on the next. The one with the suffix 'par' launches all the executions without waiting for any of them to finish. Use the sequential script for tasks whose network traffic passes through the master, such as 'apt update': because the master node acts as a firewall for the internal network, running such a command in parallel would overload the master. For commands that do not send traffic through the master, the parallel script can be used.
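The points above can be sketched as follows. The first part shows how a node is commented out, using a throwaway file rather than the real one: the one-name-per-line layout and the names 'node1' to 'node3' are assumptions made for illustration, and the actual '/opt/scripts/odroid.par' may look different. The second part calls both scripts with single-word commands and is skipped when the scripts are not installed.

```shell
#!/bin/sh
# --- Excluding a node, demonstrated on a temporary sample file ---
# Assumption: one node name per line; node1..node3 are hypothetical.
par_file=$(mktemp)
printf '%s\n' 'node1' 'node2' 'node3' > "$par_file"

# Put '#' at the start of the line for node2 so it would be skipped.
sed 's/^node2$/#&/' "$par_file" > "$par_file.new" && mv "$par_file.new" "$par_file"
cat "$par_file"

# --- Running commands on all nodes from the master board ---
if command -v global_execute_seq >/dev/null 2>&1; then
    global_execute_seq hostname   # one node at a time, in order
    global_execute_par uptime     # all nodes at once, without waiting
else
    echo "global_execute_* not found; run this on the master board"
fi
```

On the real file the same 'sed' expression would need the node's line exactly as it appears there, and editing it may require administrator privileges.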