Syncing PostgreSQL tables with ElasticSearch using Docker

Published Aug 7, 2021

Source code is available here


If you’re reading this, you probably already know what PostgreSQL, Elasticsearch, and Docker are. I hope so 😎 because I won’t be covering the basics here.

The aim of this article is to show you how you can easily sync data between PostgreSQL and ElasticSearch. By syncing I mean that every time you create, update, or delete data in the database, those changes are immediately mirrored in Elastic.

So, let’s get started!

Create Elastic and Kibana services

Kibana is an open-source data visualization UI for Elastic. It provides a variety of tools for exploring and manipulating data, and it also gives us a console for making direct queries to the Elastic instance. It’s similar to GraphQL Playground (if you’ve ever worked with that).

So, our docker-compose could look like this:

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.4
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - 'ES_JAVA_OPTS=-Xms512m -Xmx512m'
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - '9200:9200'

  kibana:
    image: docker.elastic.co/kibana/kibana:7.13.4
    depends_on:
      - elasticsearch
    ports:
      - '5601:5601'
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

Of course, you could run Elastic alone. But with the help of Kibana you can easily make requests to Elastic and view (and even create) data through a UI.
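For example, once everything is up and running, you can open Kibana at http://localhost:5601, head to Dev Tools, and talk to Elastic straight from the console (users here is an illustrative index name we’ll define in the PGSync schema later on):

GET _cat/indices?v

GET users/_search
{
  "query": { "match_all": {} }
}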

PostgreSQL service

We’ll need to customize it a bit later, so instead of using the base image directly in the docker-compose file, create a PostgresDockerfile and paste this:

FROM postgres:13

Then add this to the docker-compose file under services:

database:
    build:
      context: .
      dockerfile: PostgresDockerfile
    container_name: database
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_PASSWORD=postgres
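
We’ll also need a table to sync. For the rest of this article I’ll use a hypothetical users table as a running example (the name and columns are purely illustrative, adapt them to your own data):

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL
);

You can create it with psql once the container is up, or ship it as another init script in /docker-entrypoint-initdb.d (more on that directory below).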

PGSync

PGSync is the tool that’ll get us to our goal! You can read more about it here. All we need is Redis (PGSync uses it for its pub/sub mechanism) and a schema file. So, add these lines to the docker-compose file:

redis:
    image: 'redis:alpine'
    container_name: redis
    ports:
      - '6379:6379'

This tool is written in Python, so we’ll need to create a Python container. As we can see from the docs, we should also change some settings in postgresql.conf.

So, create a pgsync directory and a PgsyncDockerfile at the root of the project.

FROM python:3.7-slim

WORKDIR /usr/src/app

# Copy the schema file and entrypoint script into the image
COPY ./pgsync ./

RUN chmod +x ./entrypoint.sh

RUN pip install pgsync==2.1.1

# moreutils provides sponge, jq lets us edit the schema file,
# and wait-for-it lets us wait for dependent services to be ready
RUN apt update \
    && apt install -y moreutils \
    && apt install -y jq \
    && apt install -y wait-for-it

ENTRYPOINT ["bash", "./entrypoint.sh"]

To apply additional configuration to the postgres container, we can create as many SQL files as we want and copy them into the /docker-entrypoint-initdb.d directory of the container. Nice! They will be executed when the container is first initialized.

So, let’s create conf.sql with the following content:

ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 1;
-- at least 1 is required; PGSync needs a logical replication slot
-- per database it syncs, so increase this if you sync more databases

And change PostgresDockerfile to this:

FROM postgres:13

COPY ./pgsync/conf.sql /docker-entrypoint-initdb.d/

RUN chmod a+r /docker-entrypoint-initdb.d/conf.sql
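
Once the database container is built and running, you can verify that the setting took effect (assuming the container name database from the compose file above):

docker exec -it database psql -U postgres -c 'SHOW wal_level;'

It should print logical. Note that init scripts only run against an empty data directory, so if you’ve started the container before, remove the old volume first.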

Ok, as you saw, we should also have an entrypoint.sh in our pgsync directory. This is the main file, and it will be executed when we run the pgsync container.

#!/usr/bin/env bash

# Block until each dependency accepts connections (60-second timeout each)
wait-for-it $PG_HOST:5432 -t 60
wait-for-it $REDIS_HOST:6379 -t 60
wait-for-it $ELASTICSEARCH_HOST:9200 -t 60

# Inject the database name from $PG_DATABASE into every entry of schema.json
jq '.[].database = env.PG_DATABASE' schema.json | sponge schema.json

# Create the Elastic indices and mappings, then start syncing as a daemon
bootstrap --config ./schema.json
pgsync --config ./schema.json -d

We use the jq utility here to set the database name in our schema file dynamically, based on the env variable PG_DATABASE, when the container starts.
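
To illustrate what that jq line does, here it is run against a minimal file (sponge, from moreutils, simply writes jq’s output back into the same file):

$ echo '[{"database": "db_name"}]' > demo.json
$ PG_DATABASE=postgres jq '.[].database = env.PG_DATABASE' demo.json
[
  {
    "database": "postgres"
  }
]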

We also use the wait-for-it script to wait until our services are actually ready. Just writing depends_on in the docker-compose file won’t always work, believe me: depends_on only waits for a container to start, not for the service inside it to accept connections!

Schema file

Create schema.json inside the pgsync directory.

[
  {
    "database": "db_name",
    "index": "elastic-index-name",
    "nodes": {
      "table": "table-name",
      "schema": "public",
      "columns": []
    }
  }
]

More info about schema definition is available here.
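
For instance, a schema for the hypothetical users table from earlier, synced into an index called users (both names are illustrative), could look like this:

[
  {
    "database": "postgres",
    "index": "users",
    "nodes": {
      "table": "users",
      "schema": "public",
      "columns": ["id", "name", "email"]
    }
  }
]

An empty "columns" list, as in the template above, syncs every column. And the "database" value here doesn’t matter much, since entrypoint.sh overwrites it with $PG_DATABASE anyway.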

Service

Add these to docker-compose:

pgsync:
    build:
      context: .
      dockerfile: PgsyncDockerfile
    container_name: pgsync
    environment:
      - PG_USER=postgres
      - PG_PASSWORD=postgres
      - PG_DATABASE=postgres
      - PG_HOST=database
      - ELASTICSEARCH_HOST=elasticsearch
      - REDIS_HOST=redis

Run it!

That’s it. Just run everything with docker-compose up -d. Now if you create, update, or delete rows in the table you’re syncing, the changes will be mirrored to Elastic almost immediately.
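
To see it in action with the hypothetical users table and index from earlier, insert a row and then search the index:

docker exec -it database psql -U postgres -c "INSERT INTO users (name, email) VALUES ('Jane', 'jane@example.com');"

curl 'http://localhost:9200/users/_search?pretty'

The search response should include the new row as a document within a second or two.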

Of course, there are other techniques for syncing data, but this one is the easiest in my opinion. Maybe we’ll talk about them next time 🤙

You can find the complete example in my GitHub repository