
Getting Started

Abstract

This step-by-step guide lists the requirements for operating our reference implementation of the concept described by Weise et al. 2022 and walks you through an assisted cloud setup.

Requirements

Deploying our secure data infrastructure places some hardware and software requirements on the Data Provider or Carrier.

Duration

The set-up takes about 30-60 minutes, depending on whether you have all the necessary information at hand. The checklist below lists the required information so that you can prepare if you plan to deploy OSSDIP and want to check compatibility.

Checklist to operate the secure data infrastructure

  1. At least two persons operating the core infrastructure components (database administrator, system administrator).
  2. At least two hosts for test deployments (Data Node, Node containing the rest), at least three hosts for production deployments (Data Node, Monitoring Node, Node containing the rest), preferably nine (one for each component).
  3. At least two fixed IPv4 addresses (OpenStack terminology: two floating IPs) that are reachable from the open Internet: one for the VPN Node and one for the Identity Node (we use an FQDN for it, but an FQDN is not required).
  4. Sufficient hardware for operating the infrastructure according to your use-case. In our reference implementation running on a single host, we currently occupy 24 vCPUs, 48 GB RAM and 180 GB disk space to operate the core components plus one Data Owner-VM, one Remote Desktop-VM and one Analyst-VM.
  5. About 30-60 minutes during which the system administrator configures the set-up and provides the database administrator with the credentials to access their nodes.

Clone the source code repository to your local work laptop/computer/virtual environment:

git clone ssh://git@gitlab.tuwien.ac.at:822/martin.weise/ossdip.git

Control Node

We refer to the system administrator's notebook/computer/virtual machine as the control node. It is important to understand that the control node sits outside the infrastructure and manages it from there.

System Administrator managing the infrastructure through the Control Node, using the Gate Node as jump host

The following Ansible collections must be installed on the control node:

ansible-galaxy collection install ansible.posix community.crypto openstack.cloud

The following tools must also be installed on the control node. The commands for each package manager are given below:

Install the packages with apt:

apt install git ansible python3 pwgen python3-openstacksdk

Install the packages with dnf or yum:

dnf install git ansible python3 pwgen openstacksdk

Security-Enhanced Linux on RHEL/Fedora/CentOS/Rocky

Currently, Ansible has problems with templates when SELinux is enforcing on the control node, so you need to disable enforcement. First, find out whether it is enabled:

sestatus

If it prints anything other than disabled, temporarily disable enforcement by setting SELinux to permissive:

setenforce 0

To make the change permanent, set SELinux to permissive in the configuration file /etc/selinux/config and reboot your control node:

SELINUX=permissive
SELINUXTYPE=targeted

Install the packages with homebrew:

brew install git ansible python pwgen

The OpenStack SDK is not available in Homebrew; install it through the Python package installer:

pip3 install openstacksdk

Cluster Requirements

In this guide, we use the OpenStack Xena cloud operating system. Please check with your organization whether a staging or production cluster already exists that you can use. A cluster is a mandatory requirement; if you do not have one, check the official how-to-get-started guide.

To connect to your cluster, copy the clouds.yaml.example file:

cp clouds.yaml.example clouds.yaml

Then fill it with the credentials of your cluster. You can find them in the openrc.sh file or in the OpenStack web interface under Identity > Projects and Identity > Users. The file is required by the Ansible playbooks that manage the cloud, i.e. set-up, creation of Analyst-VMs, creation of Owner-VMs, etc. Usually only the following variables need to be adapted:

clouds.yaml:

clouds:
  ossdip:
    auth:
      auth_url: https://remote-host.org:5000
      project_id: 12345
      username: foo
      password: bar
      user_domain_name: Default    # in many cases this does not need to be changed
      project_domain_name: Default # in many cases this does not need to be changed

Hardware Requirements

The hardware requirements of the infrastructure depend heavily on the envisioned deployment use-case. The guides configure the infrastructure so that researchers can work with the data and carry out standard data science tasks such as executing R scripts and performing KNIME data analytics.

Identity Node

The identity node is responsible for identity management and authentication of nodes in the infrastructure through SSSD and Kerberos. We follow the official guide for installing Identity Management and set up the identity server without integrated DNS and with an integrated CA as the root CA.

vars/auth.yml:

...
idm_domain: id.yourdomain.com
...
idm_realm: ID.YOURDOMAIN.COM
...

No FQDN available

It is possible to set up the infrastructure without registering a public FQDN for your deployment of OSSDIP. By default, we modify the /etc/hosts file of all nodes so that they can reach the (self-claimed) FQDN.

During the set-up, the system administrator and database administrator accounts are created on the identity node. Change the default information beforehand; for test deployments you may omit this step, but you should at least change the two passwords for the directory manager and the default admin account:

vars/secure_plain.yml:

...
idm_dm_passwd: ossdipat
...
idm_adm_passwd: ossdipat
...
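The two passwords can be generated rather than invented by hand. The following is a minimal sketch using Python's standard `secrets` module as a stand-in for `pwgen`; the function name `generate_password` and the length of 32 characters are illustrative assumptions, not part of OSSDIP.

```python
import secrets
import string

# Letters and digits only, which avoids characters that need escaping in YAML.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Variable names as they appear in vars/secure_plain.yml.
    for name in ("idm_dm_passwd", "idm_adm_passwd"):
        print(f"{name}: {generate_password()}")
```

The printed lines can be pasted into vars/secure_plain.yml in place of the defaults.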

Password management

We use the default admin account only for enrolling nodes with the identity node when a node is created. Its password is taken automatically from vars/secure.yml. You can always change it in the identity manager after the set-up completes.

When the configuration is finished, encrypt the vars/secure_plain.yml file into vars/secure.yml and then delete vars/secure_plain.yml, as it contains sensitive data:

ansible-vault encrypt vars/secure_plain.yml --output vars/secure.yml
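Before deleting the plaintext file, it is worth confirming that the encrypted file was actually written. The sketch below checks for the `$ANSIBLE_VAULT;` header that Ansible Vault writes as the first line of an encrypted file, and removes the plaintext file only if that check passes. The helper names `is_vault_encrypted` and `remove_plaintext` are hypothetical; the check does not prove that the file decrypts with your password, which you can verify with `ansible-vault view vars/secure.yml`.

```python
from pathlib import Path

# First line of every Ansible Vault file starts with this marker.
VAULT_HEADER = "$ANSIBLE_VAULT;"


def is_vault_encrypted(path: str) -> bool:
    """Return True if the file starts with the Ansible Vault header."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return fh.readline().startswith(VAULT_HEADER)


def remove_plaintext(plain: str = "vars/secure_plain.yml",
                     encrypted: str = "vars/secure.yml") -> bool:
    """Delete the plaintext file only if the encrypted file looks valid."""
    if not (Path(encrypted).is_file() and is_vault_encrypted(encrypted)):
        return False  # keep the plaintext file, encryption did not succeed
    Path(plain).unlink(missing_ok=True)
    return True


if __name__ == "__main__":
    print("plaintext removed" if remove_plaintext() else "encrypted file missing or invalid")
```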

Server Room

Physical security that restricts access to the server hardware is the first control protecting the sensitive data in the infrastructure. We recommend placing the required hardware in a dedicated, locked server rack that only a designated and certified operator can open. Following the four-eyes principle, the key card for the server room should be held solely by another operator.

Storage

We use only encrypted volume types in the reference implementation. This requires that the key manager is configured in OpenStack. The name of the encrypted volume type depends on your organization's naming convention, so you need to specify it:

vars/vms.yml: (example)

vol_enc_type: __DEFAULT__

Network

The VPN Node needs to be reachable from the open Internet through an IPv4 address, as it is the entry point through which all roles connect to the infrastructure. This IP address needs to be available as a floating IP in OpenStack. Manually reserve a floating IP from a pool in the OpenStack web interface and specify it:

vars/vms.yml: (example)

vpn_ip: 128.130.202.124

The reference implementation allows TCP port 22 (SSH) to the VPN Node only from within our organization's network. Modify the value to reflect your organization's IP range:

vars/vms.yml: (example)

ssh_range: 128.130.0.0/15     # range

Or restrict access to a single authenticated machine with a fixed IP:

vars/vms.yml: (example)

ssh_range: 128.130.202.30/32  # single host
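The prefix length decides how many source addresses may reach the SSH port: /32 admits exactly one host, whereas /0 admits every IPv4 address. The following sketch uses Python's standard `ipaddress` module to check a candidate `ssh_range` value before writing it to vars/vms.yml; the helper name `allowed_addresses` is an illustrative assumption.

```python
import ipaddress


def allowed_addresses(cidr: str) -> int:
    """Return the number of IPv4 addresses covered by a CIDR range."""
    # strict=False tolerates set host bits, e.g. 128.130.202.30/0 is read as 0.0.0.0/0.
    return ipaddress.ip_network(cidr, strict=False).num_addresses


if __name__ == "__main__":
    for cidr in ("128.130.0.0/15", "128.130.202.30/32", "128.130.202.30/0"):
        print(f"{cidr}: {allowed_addresses(cidr)} addresses")
```

A single-host entry should therefore report one address; a value in the billions means the rule is open to the whole Internet.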

Computing

The reference implementation is configured to use pre-defined flavors that vary from deployment to deployment. It uses four flavors, depending on the use-case of the nodes:

| Flavor Name | Variable | vCPU | RAM (GB) | Description |
| --- | --- | --- | --- | --- |
| Core | core_flavor | 2 | 4 | General purpose |
| Analyst | analyst_flavor | 4 | 8 | Analyst-VM running the analysis tools |
| Desktop | desktop_flavor | 4 | 8 | Remote Desktop-VMs running the VNC server |
| Owner | owner_flavor | 4 | 8 | Owner-VMs running some data science tools |

You will need to change the flavors for the nodes to match your organization's flavor names:

vars/vms.yml: (example)

core_flavor: m1a.medium
analyst_flavor: m1a.xlarge
desktop_flavor: m1a.xlarge
owner_flavor: m1a.xlarge
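To estimate the quota a deployment needs, the per-flavor figures can be multiplied by the number of nodes of each kind. The sketch below assumes that the six core nodes all use the Core flavor, that RAM is given in GB, and that one Analyst-VM, one Remote Desktop-VM and one Owner-VM are running; these assumptions are ours, and the function name `total_resources` is hypothetical.

```python
# (vCPU, RAM in GB) per flavor, taken from the flavor table above.
FLAVORS = {
    "core": (2, 4),
    "analyst": (4, 8),
    "desktop": (4, 8),
    "owner": (4, 8),
}


def total_resources(counts: dict) -> tuple:
    """Sum vCPUs and RAM over all nodes, given the node count per flavor."""
    vcpus = sum(FLAVORS[flavor][0] * n for flavor, n in counts.items())
    ram_gb = sum(FLAVORS[flavor][1] * n for flavor, n in counts.items())
    return vcpus, ram_gb


if __name__ == "__main__":
    # Assumed layout: six core nodes plus one VM of each other kind.
    vcpus, ram_gb = total_resources({"core": 6, "analyst": 1, "desktop": 1, "owner": 1})
    print(f"{vcpus} vCPUs, {ram_gb} GB RAM")  # 24 vCPUs, 48 GB RAM
```

Under these assumptions the result matches the 24 vCPUs and 48 GB RAM quoted in the checklist; adjust the counts for additional Analyst-VMs or Owner-VMs.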

Cloud Setup

Once the configuration above is finished, the system administrator can start the cloud setup playbook on the control node. The database administrator should also supervise the set-up. The playbook configures the OpenStack firewall, creates encrypted volumes, downloads the images, assigns floating IPs and installs the six core nodes:

  1. Data Node
  2. Data Monitoring Node
  3. Gate Node
  4. Monitoring Node
  5. VPN Node
  6. Identity Node

Furthermore, it configures two default users:

  1. System Administrator
  2. Database Administrator

Core Nodes

ansible-playbook setup_cloud.yml -e @vars/secure.yml --ask-vault-pass

User Setup

When the setup of the core nodes completes, immediately change the passwords for both system- and database administrator on the identity node via the web interface or through self-service.

Change administrator passwords

This step can be executed with the least effort through the graphical interface accessible via the domain you specified at idm_domain in vars/auth.yml.

No domain needed

The infrastructure does not rely on a real domain name. A domain is only needed when the Identity Node should be exposed to other users via SSL/TLS encryption on top of the OpenVPN encryption, or as a standalone endpoint.

For a test deployment, it is sufficient to add the IP/domain pair to the hosts file on the Control Node, as the website redirects automatically and the connection is secured through OpenVPN encryption:

/etc/hosts: (example)

...
128.130.202.59   id.ossdip.at   # domain from idm_domain

If you haven't updated the default credentials in vars/secure_plain.yml, they are:

  • sysadmin:ossdipat, and
  • dbadmin:ossdipat

Open the web browser on the control node and authenticate with the default credentials.

Reset password for sysadmin and dbadmin accounts

Alternatively, connect to the Identity Node through OpenVPN or other means and set the new passwords on the command line:

ipa user-mod sysadmin --password   # system administrator
ipa user-mod dbadmin --password    # let database administrator type the new password

Afterwards, download the OpenVPN profiles from the VPN Node at the locations listed below, and distribute the dbadmin.ovpn file securely to the Database Administrator without saving it on the System Administrator's Control Node:

| Account | Location |
| --- | --- |
| System Administrator | /home/sysadmin/sysadmin.ovpn |
| Database Administrator | /root/dbadmin.ovpn |

The setup of the infrastructure is now finished.

Troubleshooting

If you have problems getting started, please contact the developer. You can find the contact details on the Contact page.