The step-by-step instructions on this page list the requirements to operate our reference implementation of the concept described by Weise et al. 2022 and guide you through an assisted cloud set-up.
Deploying our secure data infrastructure places some hardware and software requirements on the Data Provider or Carrier.
The set-up takes about 30-60 minutes, depending on whether you have all the necessary information at hand. The checklist below lists the required information so that you can prepare if you are planning to deploy OSSDIP or are looking for compatibility indicators.
Checklist to operate the secure data infrastructure
- At least two persons operating the core infrastructure components (database administrator, system administrator).
- At least two hosts for test deployments (Data Node, Node containing the rest), at least three hosts for production deployments (Data Node, Monitoring Node, Node containing the rest), preferably nine (one for each component).
- At least two fixed IPv4 addresses (OpenStack terminology: two floating IPs) that are reachable from the open Internet: one for the VPN Node and one for the Identity Node (we use an FQDN, but an FQDN is not required).
- Sufficient hardware for operating the infrastructure according to your use-case. In our reference implementation running on a single host, we currently occupy 24 vCPUs, 48 GB RAM and 180 GB disk space to operate the core components plus one Data Owner-VM, one Remote Desktop-VM and one Analyst-VM.
- About 30-60 minutes of time in which the system administrator configures the set-up and provides the database administrator with the credentials to access their nodes.
Clone the source code repository to your local work laptop/computer/virtual environment:
git clone ssh://email@example.com:822/martin.weise/ossdip.git
We refer to the system administrator's notebook/computer/virtual machine as the control node. It is important to understand that the control node sits outside the infrastructure and manages it.
The following Ansible collections must be installed on the control node:
ansible-galaxy collection install ansible.posix community.crypto openstack.cloud
The following tools must also be installed on the control node. The package manager commands are given below.
Install the packages with apt:
apt install git ansible python3 pwgen python3-openstacksdk
Install the packages with dnf:
dnf install git ansible python3 pwgen openstacksdk
Security-Enhanced Linux on RHEL/Fedora/CentOS/Rocky
Ansible currently has problems with templates when SELinux is enabled on the control node, so you need to disable it. First find out if it is enabled:
If it prints anything other than Disabled, temporarily disable enforcement by setting it to permissive:
Permanently disable it by modifying the configuration file /etc/selinux/config and rebooting your control node:
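The three steps above can be sketched as follows, assuming the standard SELinux tools `getenforce` and `setenforce` are present on the control node. The helper names `selinux_mode` and `selinux_disable_in_config` are hypothetical and only serve this illustration. The config file path is a parameter so the edit can be tried on a copy first.

```shell
# Print the current SELinux mode (Enforcing, Permissive or Disabled),
# or "unavailable" if the SELinux tools are not installed.
selinux_mode() {
  if command -v getenforce >/dev/null 2>&1; then
    getenforce
  else
    echo "unavailable"
  fi
}

# Rewrite the SELINUX= line of the given config file to "disabled".
# SELINUXTYPE= and commented lines are left untouched.
selinux_disable_in_config() {
  local config_file="$1"
  local tmp
  tmp="$(mktemp)"
  # Write back with cat so the original file keeps its permissions.
  sed 's/^SELINUX=.*/SELINUX=disabled/' "$config_file" > "$tmp" \
    && cat "$tmp" > "$config_file"
  rm -f "$tmp"
}

# Typical use on the control node (as root):
#   selinux_mode                                   # show the current mode
#   setenforce 0                                   # permissive until the next reboot
#   selinux_disable_in_config /etc/selinux/config  # permanent, takes effect after reboot
#   reboot
```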
Install the packages with homebrew:
brew install git ansible python pwgen
The OpenStack SDK is not available in brew; install it through the Python package installer:
pip3 install openstacksdk
In this guide, we use the OpenStack Xena cloud operating system. Please check with your organization whether there is already a cluster in staging or production that you can use. An OpenStack cluster is a mandatory requirement; see the official getting-started guide.
To connect to your cluster, create a clouds.yaml file from the provided example:
cp clouds.yaml.example clouds.yaml
Then fill it with the credentials of your cluster. You can find your credentials in the openrc.sh file or in the OpenStack web interface under Identity > Projects and Identity > Users. The file is required by the Ansible playbooks that manage the cloud, i.e. set-up, creation of Analyst-VMs, creation of Owner-VMs, etc. Usually only the following variables need to be adapted:
clouds:
  ossdip:
    auth:
      auth_url: https://remote-host.org:5000
      project_id: 12345
      username: foo
      password: bar
      user_domain_name: Default # in many cases this does not need to be changed
      project_domain_name: Default # in many cases this does not need to be changed
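Before running any playbook, it can help to confirm that the adapted file actually contains a value for each of the keys listed above. The following is a minimal sketch using plain `grep`. The helper `check_clouds_yaml` is hypothetical and not part of OSSDIP, and it only checks that the keys are present and non-empty, not that the credentials are valid.

```shell
# Report which of the expected auth keys are missing or empty in a clouds.yaml file.
# Returns 0 if all keys have a value, 1 otherwise.
check_clouds_yaml() {
  local file="$1" key missing=0
  for key in auth_url project_id username password; do
    # An indented "key:" followed by at least one non-blank, non-comment character.
    if ! grep -qE "^[[:space:]]+${key}:[[:space:]]*[^[:space:]#]" "$file"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical use in the repository root:
#   check_clouds_yaml clouds.yaml && chmod 600 clouds.yaml
```

Restricting the file to mode 600 is a suggestion, since the file holds a plaintext password.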
The infrastructure hardware requirements depend heavily on the envisioned deployment use-case. In the guides, we configure the infrastructure so that researchers can work with the data and carry out standard data science tasks such as executing R scripts and performing KNIME data analytics.
The identity node is responsible for identity management and authentication of nodes in the infrastructure through SSSD and Kerberos. We follow the official guide for installing Identity Management and set up the identity server without integrated DNS and with an integrated CA as the root CA.
...
idm_domain: id.yourdomain.com
...
idm_realm: ID.YOURDOMAIN.COM
...
No FQDN available
It is possible to set up the infrastructure without registering a public FQDN for your deployment of OSSDIP. By default, we modify the /etc/hosts file of all nodes so that they can reach the (self-claimed) FQDN.
During the set-up, the system administrator and database administrator accounts are created on the identity node. You will need to change the default information, or you can omit this step for test deployments. You should at least change the two passwords for the directory manager and the default admin account:
...
idm_dm_passwd: ossdipat
...
idm_adm_passwd: ossdipat
...
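Since pwgen is already among the tools required on the control node, one way to replace the two default passwords is to generate random ones. This is a minimal sketch: the helper `gen_passwd` is hypothetical, and the fallback to /dev/urandom is an assumption for machines where pwgen is missing.

```shell
# Generate one random alphanumeric password of the given length (default 32).
# Uses pwgen when installed; otherwise falls back to /dev/urandom.
gen_passwd() {
  local length="${1:-32}"
  if command -v pwgen >/dev/null 2>&1; then
    # -s: fully random password, then the length, then how many to print.
    pwgen -s "$length" 1
  else
    head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-"$length"
  fi
}

# Print two lines that can be pasted into vars/secure_plain.yml.
echo "idm_dm_passwd: $(gen_passwd)"
echo "idm_adm_passwd: $(gen_passwd)"
```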
We use the default admin account only to enroll nodes with the identity node when a node is created; its password is taken automatically from vars/secure.yml. You can always change it in the identity manager after the set-up.
When the configuration is finished, encrypt the vars/secure_plain.yml file and delete the plaintext file afterwards, as it contains sensitive data:
ansible-vault encrypt vars/secure_plain.yml --output vars/secure.yml
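To avoid deleting the plaintext file before the encrypted copy exists, the deletion can be guarded by a check for the Ansible Vault header, which is the first line of every vault-encrypted file and begins with `$ANSIBLE_VAULT;`. This is a sketch only, and both helper names are hypothetical.

```shell
# Succeed if the file begins with the Ansible Vault header.
is_vault_encrypted() {
  head -n 1 "$1" | grep -q '^\$ANSIBLE_VAULT;'
}

# Delete the plaintext file only once the encrypted copy is confirmed.
remove_plaintext_if_encrypted() {
  local encrypted="$1" plaintext="$2"
  if [ -f "$encrypted" ] && is_vault_encrypted "$encrypted"; then
    rm -f "$plaintext"
  else
    echo "refusing to delete $plaintext: $encrypted is not vault-encrypted" >&2
    return 1
  fi
}

# Typical use after running ansible-vault encrypt:
#   remove_plaintext_if_encrypted vars/secure.yml vars/secure_plain.yml
```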
Physical security that restricts access to the server hardware is the first control that protects the sensitive data in the infrastructure. We recommend placing the required hardware in a dedicated, locked server rack that only a designated and certified operator can open. Following the four-eyes principle, entering the server room requires a key card that is held solely by another operator.
We use only encrypted volume types in the reference implementation. This requires a key manager to be configured in OpenStack. The name of the encrypted volume type depends on your organization's naming convention, so you need to specify it:
The VPN Node needs to be reachable from the open Internet through an IPv4 address, as it provides the way for all roles to connect to the infrastructure. This IP address needs to be available as a floating IP in OpenStack. Manually reserve a floating IP from a pool in the OpenStack web interface and specify it:
The reference implementation allows TCP port 22 (SSH) to the VPN Node only from within our organization's network. You will need to modify the range to reflect your organization's IP range:
ssh_range: 18.104.22.168/15 # range
Or a single authenticated machine with a fixed IP:
ssh_range: 22.214.171.124/32 # single host
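The prefix length after the slash determines how many addresses an ssh_range value admits: a /32 covers exactly one machine, while a /0 covers every IPv4 address. As a small worked example (the helper `cidr_address_count` is hypothetical):

```shell
# Print how many IPv4 addresses a CIDR prefix length (0-32) covers.
cidr_address_count() {
  local prefix="$1"
  echo $(( 1 << (32 - prefix) ))
}

cidr_address_count 32   # 1 address: a single machine
cidr_address_count 15   # 131072 addresses: an organization-sized range
cidr_address_count 0    # 4294967296 addresses: the whole IPv4 Internet
```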
The reference implementation is configured to use pre-defined flavors that vary from deployment to deployment. It uses four flavors, depending on the use-case of the nodes:
||4||8||Analyst-VM running the analysis tools|
||4||8||Remote Desktop-VMs running the VNC server|
||4||8||Owner-VMs running some data science tools|
You will need to change the flavors for the nodes to match your organization's flavor names:
core_flavor: m1a.medium
analyst_flavor: m1a.xlarge
desktop_flavor: m1a.xlarge
owner_flavor: m1a.xlarge
Once the configuration above is finished, the system administrator can start the cloud set-up playbook on the control node. The database administrator should also supervise the set-up. The playbook configures the OpenStack firewall, creates encrypted volumes, downloads the images, assigns floating IPs and installs the six core nodes:
Furthermore, it configures two default users:
ansible-playbook setup_cloud.yml -e @vars/secure.yml --ask-vault-pass
When the set-up of the core nodes completes, immediately change the passwords of both the system administrator and the database administrator on the identity node, either via the web interface or through self-service.
Change administrator passwords
This step can be executed with the least effort through the graphical interface accessible via the domain you specified.
No domain needed
The infrastructure does not rely on a real domain name. A domain is only needed when the Identity Node should be exposed to other users via SSL/TLS encryption on top of the OpenVPN encryption or as a standalone endpoint.
For a test deployment it is sufficient to add the IP/domain pair to the hosts file on the Control Node, as the website redirects automatically and the connection is secured through OpenVPN encryption:
...
126.96.36.199 id.ossdip.at # domain from idm_domain
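Adding the entry can be made repeatable, so that running it twice does not duplicate the line. This is a minimal sketch: the helper `add_hosts_entry` is hypothetical, and writing to /etc/hosts requires root privileges.

```shell
# Append "IP HOSTNAME" to a hosts file unless the hostname is already mapped.
# The hosts file defaults to /etc/hosts; pass a third argument to use another file.
add_hosts_entry() {
  local ip="$1" hostname="$2" hosts_file="${3:-/etc/hosts}"
  # Look for the hostname as an exact field on any non-comment line.
  if awk -v h="$hostname" \
      '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 } END { exit !found }' \
      "$hosts_file"; then
    echo "$hostname already present in $hosts_file"
  else
    printf '%s %s\n' "$ip" "$hostname" >> "$hosts_file"
  fi
}

# Typical use (as root), with the IP of your Identity Node and your idm_domain:
#   add_hosts_entry <identity-node-ip> <idm_domain>
```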
If you haven't updated the default credentials in vars/secure_plain.yml, they are:
- sysadmin:ossdipat, and
Open the web browser on the control node and authenticate with the default credentials.
Connect to the Identity Node through OpenVPN or through other means.
ipa user-mod sysadmin --password # system administrator
ipa user-mod dbadmin --password # let database administrator type the new password
Afterwards, download the OpenVPN profiles from the VPN Node located at, and distribute the dbadmin.ovpn file securely to the Database Administrator without saving it to the System Administrator Control Node:
The set-up of the infrastructure is now finished. The system administrator can now:
- Create a user account for an owner/analyst
- Create an Owner-VM to ingest data
- Create an Analyst-VM to let analysts work with the data
In case you have problems getting started, please contact the developer. You can find the contact details on the Contact page.