https://github.com/roscisz/TensorHive
TensorHive is an open source system for managing and monitoring your computing resources across multiple hosts. It solves the most common problems of accessing and sharing AI-focused infrastructure among multiple, often competing users.
It's designed with flexibility, lightness and configuration-friendliness in mind.
nvidia-smi is required.
pip3 install tensorhive
git clone https://github.com/roscisz/TensorHive.git
cd TensorHive
pip install .
If you want to also build the web app manually:
(cd tensorhive/app/web/dev && npm install && npm run build)
First, you must tell TensorHive how to establish SSH connections to the hosts you want to work with.
You can do this by editing ~/.config/TensorHive/hosts_config.ini
(see example)
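Before starting TensorHive, it can help to confirm that each configured host accepts a non-interactive SSH login. The sketch below is only an illustration: it assumes a hypothetical layout with one INI section per hostname and `user` and `port` keys, which may not match the real file (the linked example is authoritative). The function name `ssh_check_commands` and the sample hosts are made up for this example.

```python
import configparser

# Hypothetical layout (an assumption, not the documented format):
# one section per host, with `user` and `port` keys.
SAMPLE = """
[gpu-node-1]
user = alice
port = 22

[gpu-node-2]
user = alice
port = 2222
"""


def ssh_check_commands(ini_text):
    """Return one non-interactive ssh command (as an argv list) per host section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    commands = {}
    for host in parser.sections():
        user = parser.get(host, "user")
        port = parser.getint(host, "port", fallback=22)
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        commands[host] = [
            "ssh", "-o", "BatchMode=yes", "-p", str(port), f"{user}@{host}", "true",
        ]
    return commands


if __name__ == "__main__":
    for host, argv in ssh_check_commands(SAMPLE).items():
        print(host, " ".join(argv))
```

Running each printed command by hand (or through `subprocess.run`) should exit with status 0 when key-based login works for that host.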
tensorhive
Sample output:
The Web application and API Documentation can be accessed through the given URLs.
If you need the Web application to be accessible from remote machines, set the host and port fields in the [web_app.server] section in ~/.config/TensorHive/main_config.ini. The host field should be set to a hostname or IP that resolves to an external network interface.
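The change described above can also be scripted. The following is a minimal sketch using Python's standard `configparser`; the section name `[web_app.server]` and the `host` and `port` fields come from the text above, while the function name `set_web_app_address`, the sample address `192.0.2.10` and the port numbers are assumptions chosen for illustration.

```python
import configparser
import io


def set_web_app_address(ini_text, host, port):
    """Return a copy of the INI text with host/port set in [web_app.server]."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("web_app.server"):
        parser.add_section("web_app.server")
    parser.set("web_app.server", "host", host)
    parser.set("web_app.server", "port", str(port))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical starting content; a real main_config.ini has more sections.
    original = "[web_app.server]\nhost = localhost\nport = 5000\n"
    print(set_web_app_address(original, "192.0.2.10", 8080))
```

Note that `configparser` discards comments when it rewrites a file, so editing main_config.ini by hand is the safer choice if you want to keep them.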
The available infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
The "Add watch" button adds a new chart, which can be configured to show chosen metrics of the selected devices. Currently, the metrics include GPU metrics from nvidia-smi and a process overview with corresponding usernames.
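To give a feel for the kind of GPU metrics nvidia-smi exposes, here is a rough sketch that queries a few fields in CSV form and parses them into dictionaries. It is not TensorHive's own collection code, which may work differently; the field selection and the function names `parse_gpu_csv` and `query_gpus` are assumptions made for this example.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; chosen for illustration only.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi locally and return the parsed metrics (needs an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

On a machine with nvidia-smi installed, `query_gpus()` returns a list with one entry per GPU; all values are kept as strings so that non-numeric placeholders in the output do not break parsing.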
The computing resource reservations can be viewed and managed in the Reservations overview tab. Sample screenshot:
The select boxes at the bottom of the page (easily accessible via the Adjust Filters button) let you specify which nodes or devices are visible in the view. To add a reservation, select a time interval and fill in the reservation details in the form. The reservation owner and the admin user can cancel a reservation by clicking on it and confirming the cancellation.
You can fully customize TensorHive behaviour from ~/.config/TensorHive/main_config.ini
(see example)
Along with TensorHive, we are developing a set of sample deep neural network training applications in Distributed TensorFlow, which will be used as test applications for the system. They can also serve as benchmarks for various GPU, distributed multi-GPU and distributed multi-node architectures. For each example, a full set of instructions for reproducing it is provided.
We'd :heart: to collect your observations, issues and pull requests.
You can do this by making use of our issue template.
Project created and maintained by: