In this post, we will discuss a more robust and scalable way of capturing and persisting data for your Home Assistant installation plus a few other optimizations and tricks.

( Photo by h heyerlein )

In previous episodes, we discussed how to run Home Assistant on Docker and how to set up and integrate a basic Zigbee sensor network. In my case, this has allowed me to get up and running with a few useful automations and I ended up ordering a few other sensors for my installation. While waiting for these to be delivered, I have started wondering how to be ready to collect all the wealth of data that these devices will be duly noting into the Home Assistant DB and, more importantly, how to have more flexible tools to slice and dice the pile of raw information that will eventually get collected.

A basic Home Assistant set up uses SQLite for persistent data storage. While this is convenient to get up and running quickly, it uses a regular file for storing the data and does not necessarily scale up well as soon as you want to retain data for longer periods. This can be improved by pointing Home Assistant to use a relational database instead - and I will explain shortly how to achieve this.

Having addressed the Home Assistant local DB configuration, we can still go one step further and send the sensor data to an additional data storage (InfluxDB) that is more suited for collecting the type of records that a sensor network captures over time, and that can be connected to act efficiently as a source for dedicated visualization tools such as Grafana.

Let's do this!


Configure Home Assistant to use MariaDB

As I mentioned above, the SQLite support that comes out of the box with Home Assistant can only go so far concerning enabling a reliable and scalable database infrastructure for the data collected in your home.

For sure, this works fine when you experiment with a few sensors. However, with more sensors and over time you can capture a pretty large amount of data. You may want to run some analysis or perform some more intelligent operation based on historical analysis of your sensor data. This would be quite sluggish if performed on the standard SQLite DB.

The first thing to notice are the data retention configurations in Home Assistant. The component that takes care of this is called recorder and it comes with a few parameters that are of interest to us, purge_keep_days and purge_interval

recorder:
  purge_keep_days: 7
  purge_interval: 1
configuration.yaml

With the above values, we are telling recorder to retain 7 days worth of data and purge it on a daily interval. It is also possible to invoke a service recorder.purge to perform this based on specific conditions/events. Finally, you can also disable the purging function, however, this would leave you with an ever-growing data file!

The first thing we want to do is to move away from the SQLite solution and use a dedicated RDBMS. Home Assistant uses SQLAlchemy which is a database abstraction layer capable of supporting multiple options, detailed in the documentation page I linked above.

I personally decided to use MariaDB, mainly because I want to host it on my Synology NAS and there is a ready to use package to deploy it. The database can be hosted anywhere, as long as your Home Assistant installation can reach it and as long as you are happy with figuring out the alternative approaches!

Once the database is installed, you will need to modify your configuration.yaml file to point to the database via a connection string. In the example, I am storing the actual connection string in the usual secrets.yaml file.

recorder:
  db_url: !secret mariadb_conn_string
  purge_keep_days: 7
  purge_interval: 1
configuration.yaml

The below explanation assume that you have already created a database in MariaDB and a user with a password that is authorised to connect to the MariaDB instance from the Home Assistant host, as well as with appropriate read and write rights on the specified database name.

To do that, you will have to connect to the DB instance with your mysql client. On Synology you can ssh on the NAS and then use the mysql client to open a SQL shell. Finally, perform the below operations. Remember to substitute the placeholders and also to change the % character with the actual host of your Home Assistant installation if you want to be more restrictive than allowing any host to connect to MariaDB with these credentials.

CREATE_DATABASE <database_name>;
GRANT ALL PRIVILEGES ON <database_name>.* TO '<username>'@'%' IDENTIFIED BY '<password>';
FLUSH PRIVILEGES;

The connection string is quite simple

mariadb_conn_string: mysql://<username>:<password>@<mariadb_host>:<mariadb_port>/<database_name>?charset=utf8
secrets.yaml

Specifically:

  • username and password must be modified with your MariaDB username and password.
  • mariadb_host is the hostname where your MariaDB instance is running
  • mariadb_port is the port where the DB listens for incoming connections (usually 3307)
  • database_name is the database you created for Home Assistant

You can now restart your Home Assistant server and it will start storing data in MariaDB. You can watch the startup log to ensure that there are no errors - typical problems can be failed connections to MariaDB which are frequently due to incorrect networking or wrong privilege set up for the user in the database.

One last consideration: any existing historical data that you had in SQLite wouldn't be migrated automatically in MariaDB. I haven't bothered doing this, as any way with the above configuration I am planning to store only 7 days worth of data in it.

The next step is to spin up InfluxDB, which will be the secondary, dedicated time-series storage for the sensor data collected by Home Assistant.

NASA Picture of auroras from ISS
( Photo by NASA )


Set up InfluxDB

In our set up, InfluxDB will collect and persist long term the sensor data obtained from Home Assistant. For the setup, I opted to use the Docker image available.

The image itself is pretty simple to configure. As I wanted to host it on my Synology NAS, I simply created a folder and mapped it as the /var/lib/influxdb read/write mount point in the container image. This is where the data will be stored.

Once the container is up and running, the first task is to create an InfluxDB database (home_assistant) for the Home Assistant data. This can be achieved by running the Influx client on the InfluxDB container just launched

docker exec -it influxdb influx

Connected to http://localhost:8086 version 1.7.9
InfluxDB shell version: 1.7.9
>  CREATE DATABASE home_assistant

In general, the InfluxDB uses its own Influx Query Language (InfluxQL). You can find out more about it on their documentation page. Our objective, for now, is to run Grafana as the frontend interface to the data so that we will not have to run raw queries.

With a default configuration where the credentials are not changed and the InfluxDB instance is running on the same host where Home Assistant is (which is my case), all we need to do is add one line in the Home Assistant configuration file and then restart it. Find out more here.

influxdb:

After the restart, Home Assistant will begin sending sensor data to InfluxDB (in addition to saving it in the MariaDB database we configured earlier). Again, when you restart check the logs of your Home Assistant container to see if there are any issues around connectivity to the InfluxDB host.

It is also possible to set up username and password credentials for accessing the DB which I am skipping for now. If you do set these up, remember these will be needed in the next step.


Set up Grafana

To recap, we have moved the Home Assistant backend storage away from SQLite and onto MariaDB. We have also configured Home Assistant to push data into InfluxDB as well, which will be the long term storage of our sensor information.

The last step is to install Grafana, which we will use as a visualization tool. You can think of this as a way to conjure "dashboards on steroids", compared to the ones provided out of the box by Home Assistant.

Firstly, let's spin up a separate Docker container for it. The Grafana documentation does a pretty good job already on explaining this. In my case, it was even simpler as I installed and ran the image from the Synology Docker UI. A key aspect to note is to map a read/write folder on your Synology to the container's /var/lib/grafana to provide persistent storage.

Grafana runs on port 3000 by default and after having run the container successfully you can navigate to http://<your_host>:3000 to begin setting up your instance. Log in with the default credentials admin/admin and you will be prompted to modify them. With that out of the way, you need to tell Grafana where to find the sensors data, that links it to your InfluxDB database. This is done in the Data Sources section ( http://<your_host>:3000/datasources )

Fill the URL section with the location of your InfluxDB instance (in my case http://localhost:8086), then provide the InfluxDB details (Database: home_assistant and also username and password if you decided to set that up when spinning up InfluxDB). Save and Test will tell you immediately if the connection was successful. If it was, you are ready to run queries and build dashboards based on your sensor data.

Here's an example query and standard graph visualisation using my temperature sensor:

Grafana query editor

And a dashboard widget

Grafana gauge widget

These dashboards can be then integrated directly into Home Assistant by using cards. If you are serving Home Assistant over HTTPS, then your Grafana installation also needs to be served over HTTPS to be embedded successfully.

Achieving this is farily easy if you followed my prior blog articles where I explained how to obtain a wildcard certificate using Let's Encrypt. For Grafana, all you need to do is to stop the container and add a mount point where your certificates reside, then pass additional environment variables to set them up (the full list of configuration values available is here)

GF_SERVER_DOMAIN    = grafana.<yourdomain>
GF_SERVER_ROOT_URL  = https://grafana.<yourdomain>
GF_SERVER_CERT_KEY  = <your certificates folder mount point>/privkey.pem
GF_SERVER_CERT_FILE = <your certificates folder mount point>/cert.pem
GF_SERVER_PROTOCOL  = h2
GF_SERVER_ENFORCE_DOMAIN = true
GF_SECURITY_ALLOW_EMBEDDING = true

A couple of notes here:

  • You can replace h2 with https if you don't want to use HTTP/2)
  • GF_SERVER_ENFORCE_DOMAIN redirects to the correct domain if the host header does not match the domain. Prevents DNS rebinding attacks.
  • GF_SECURITY_ALLOW_EMBEDDING allows embedding Grafana into frames (which is what you may want to do from Home Assistant). By default, Grafana returns X-Frame-Options: deny in all responses, indicating browsers to disallow embedding otherwise.

You are now ready to just restart your container with the new settings. Pretty neat isn't it?


InfluxDB Retention Policies

Before we wrap up this article, I wanted to clarify an aspect around InfluxDB that was not exactly clear to me when I started looking into it. In fact, as a part of my day to day job, I am always wary when I run into generic statements such as "long term / infinite storage" and even so, I ended up using these myself in my own blog post! An explanation is therefore due to you, my reader 😎

The fact here is that firstly, we want to store a big enough slice of data from our sensors to run a useful analysis. At the same time, we want to make sure that we are not attempting to store an infinite amount of data because in that scenario we would eventually find out that either:

  • The InfluxDB host you are using (Synology NAS in my case) does not have an infinite hard-disk.
  • The InfluxDB engine cannot efficiently run queries on the sheer amount of data we are storing and our dashboards don't work anymore.

Let's go one step further: we still want to store a humongous amount of data and run efficient queries on it. Then, we could decide to spin up an InfluxDB Cluster on Amazon Web Services Elastic Container Service and point our Home Assistant integration to that, it would probably give you a little more runway than the average Synology NAS 🤣

Even then, does this even make sense in our home automation scenario? Are you really interested in having a fine-grained record of temperatures in your bedroom from 7 years ago? Maybe you are, but also happy with just retaining fine-grained values for the last week (month, year...) and then hourly (daily, weekly...) averages for older timeframes.

That's where the InfluxDB retention policies come into play. When you created your database for Home Assistant, a default retention policy was automatically assigned to it. You can see that with the following command in the Influx console:

> show retention policies on home_assistant
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

Which means the following:

  • You are using the autogenerated retention policy on your database
  • The Duration is zero seconds (also known as infinite)
  • The Shard Group Duration is 168 hours (also known as one week)
  • The Number of Replicas for the data points is 1

Backtracking a bit: an Influx DB retention policy is...

The part of InfluxDB’s data structure that describes for how long InfluxDB keeps data (duration), how many copies of this data is stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). RPs are unique per database and along with the measurement and tag set define a series.

(In this tutorial we are running one InfluxDB node so you can ignore replication, number of replicas and just be content with the fact that it will be set to one).

A shard ...

... contains the actual encoded and compressed data, and is represented by a TSM file on disk.

And a shard group ...

... is a logical containers for shards. Shard groups are organized by time and retention policy.

Now, to run queries efficiently, these retention policies can be used to create and tweak the data partitions to support such a goal. In other words, depending on what kind of queries you want to run on your data, you may want to tweak the shard group duration to longer or shorter timeframes (represented by shard group duration).

Let's imagine you are running a query which looks at the last 6 months of data. At a high level, that query will need to access roughly 26 shard groups (if the shard group duration is 7 days / one week as in the default policy), but would instead access roughly 6 shard groups only if the shard group duration was set to one month.

Bookcase
( Photo by Susan Yin )

Typically each time you have a collection of data in any scenario and you partition it, it comes with overheads. For example, if you were to take your city's phone book and split it in multiple books, one for each alphabetical letter, you would then need an additional index telling you where every single book is, when you need to look up someone's number by name - rather than just opening the single aggregated phone book.

Likewise, if you are running a query looking at a 2 hours timespan, and your shard duration group is 1 month, it means that InfluxDB needs to access the whole monthly shard and then sift through a larger amount of data to find exactly what you need.

To optimise this, you can read the InfluxDB Retention Policies Best Practices article on their blog. You can also explore downsampling strategies depending on the type of visualisations you want to whip together.

For now, I am leaving the "infinite" retention policy while I experiment with a small number of sensors and will update in the future if I find an optimal way that works for me.

Thanks for reading! If you enjoyed the article and you haven't already, have a look at the other parts of this series: