Rsync-Dir – automatic directory replication

Rsync-dir is a program (available for Windows and Linux) that automatically replicates a specified directory to a data store. The data store can be:

  • etcd – distributed key-value data store,
  • redis – popular NoSQL database (also distributed key-value cache),
  • minio – object storage compatible with Amazon S3. This can be either the AWS (Amazon) S3 cloud service or a Minio S3 server instance (local or cloud),
  • dbproxy – any database for which a JDBC driver is available (ssldbproxy.jar component is required).

Replication can cover either only the files directly in the specified directory or the entire directory tree; the recursive parameter controls this. The application also lets you specify which file types are replicated and which are ignored. Two replication modes are available:

  • instant replication (start) – the file is sent to the data store immediately after a change is detected,
  • cyclic replication (mtime) – changes in files are detected and sent to the data store cyclically, at specified time intervals.

Both replication modes can be active at the same time and work together while monitoring the same specified directory.
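To make the cyclic (mtime) mode more concrete, here is a minimal Python sketch of one polling pass. It is not rsync-dir's actual implementation: the function names, the include/ignore pattern arguments, and the snapshot dictionary are all assumptions made for illustration. Only the ideas of a recursive switch, file-type filters, and periodic change detection come from the description above.

```python
import fnmatch
from pathlib import Path


def select_files(root, recursive=True, include=("*",), ignore=()):
    """Return the files under root that are subject to replication."""
    root = Path(root)
    # recursive=True walks the whole tree; False looks only at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue
        if any(fnmatch.fnmatch(path.name, pat) for pat in ignore):
            continue  # assumption: ignore patterns win over include patterns
        if any(fnmatch.fnmatch(path.name, pat) for pat in include):
            selected.append(path)
    return sorted(selected)


def detect_changes(root, previous, **filters):
    """One polling pass: compare current mtimes with the previous snapshot.

    Returns (changed, removed, snapshot); keys are paths relative to root.
    """
    root = Path(root)
    snapshot = {
        path.relative_to(root).as_posix(): path.stat().st_mtime_ns
        for path in select_files(root, **filters)
    }
    changed = sorted(k for k, mtime in snapshot.items() if previous.get(k) != mtime)
    removed = sorted(k for k in previous if k not in snapshot)
    return changed, removed, snapshot
```

A caller would invoke detect_changes at each interval, send the changed files to the data store, and keep the returned snapshot for the next pass. Instant replication would typically rely on operating-system file notifications instead of polling.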

To make transfers more efficient, the program divides files into smaller fragments (called chunks) and calculates a checksum for each one. Only the chunks that have changed are sent to the data store, not entire files. When a file is deleted from the monitored directory, it can also be deleted from the data store if the goal is for the data store to mirror the state of the directory; this is controlled by the delete action in the actions parameter. The program also handles file renames.
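The chunk mechanism can be illustrated with a short Python sketch. The fixed chunk size, the SHA-256 checksum, and the function names are assumptions chosen for the example; the description above does not state which chunking scheme or checksum rsync-dir uses.

```python
import hashlib


def chunk_checksums(data, chunk_size):
    """Split data into fixed-size chunks and return one checksum per chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(stored_sums, new_data, chunk_size):
    """Return ({index: chunk_bytes}, new_sums) for chunks that must be re-sent.

    stored_sums are the checksums recorded when the file was last replicated.
    """
    new_sums = chunk_checksums(new_data, chunk_size)
    to_send = {}
    for index, checksum in enumerate(new_sums):
        # A chunk is sent if it is new or its checksum differs from the stored one.
        if index >= len(stored_sums) or stored_sums[index] != checksum:
            start = index * chunk_size
            to_send[index] = new_data[start:start + chunk_size]
    return to_send, new_sums
```

With fixed-size chunks, an in-place modification or an append touches only the affected chunks, while an insertion in the middle of a file shifts every later chunk and causes them all to be re-sent. That trade-off is a property of this simplified sketch, not a statement about the product.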

The application supports the following actions:

  • start – basic action to start monitoring the specified directory – activating instant replication, in which the file is sent to the data store immediately after detecting a change,
  • mtime – activating cyclic replication – changes in files are detected cyclically,
  • restore – restoring the state of the directory based on the data store. This action can be run on its own, as the only action taken by the application. It can be used to periodically restore the state of the directory on another server/node/pod,
  • delete – deleting all files from the data store. This action can also be run on its own, as the only action taken by the application. It can be used to clear previously saved data from the data store.

The restore and delete actions can be combined with the start and mtime actions. In that case, the actions are executed in the following order:

  • restore – first, the application recreates the directory based on the records in the data store,
  • delete – then the data store is cleared,
  • start – the continuous directory monitoring process is started,
  • mtime – the process of detecting changes in the directory is run cyclically.
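The ordering rule above can be expressed in a few lines of Python. This is only a sketch: the function name and the error handling are assumptions, and only the four action names and their execution order come from the description.

```python
# Fixed execution order taken from the list above.
EXECUTION_ORDER = ("restore", "delete", "start", "mtime")


def order_actions(requested):
    """Return the requested actions sorted into their execution order."""
    unknown = sorted(set(requested) - set(EXECUTION_ORDER))
    if unknown:
        raise ValueError("unknown action(s): " + ", ".join(unknown))
    # The order in which actions were requested does not matter.
    return [action for action in EXECUTION_ORDER if action in requested]
```

For example, order_actions(["mtime", "start", "restore"]) yields restore first, then start, then mtime, so the directory is rebuilt from the data store before monitoring begins.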


Application architecture

[Figure: rsync-dir architecture]


Examples of applications

The rsync-dir program can be used in many ways. Here are some examples of its use.

• in Kubernetes clusters without persistent volume support – pods are ephemeral, and all data that an application saves to disk is lost when the pod fails. If the application needs access to previously created files after a failure, rsync-dir can continuously replicate the files it generates to the selected data store. After a pod failure, an initialization container (for example) can restore the replicated files to their state before the failure, and then the main container can start the application with access to those files,

• for replication of transaction files in application servers (e.g. JBoss, Tomcat, WebLogic) working in standalone mode – when the application is started on another server after a failure, access to the transaction files can allow distributed transactions to be handled correctly,

• for replication of logical database log files, e.g. ongoing replication of WAL files in PostgreSQL – especially the current file, which is still being written to and has not yet been archived by the command defined in the archive_command parameter. If the database system has to be restored from a backup, access to this file can minimize the amount of lost data,

• for backup of Linux server configuration files – rsync-dir monitors changes in /etc/* files and backs up each file immediately after it changes, so in the event of a server failure a complete set of current configuration files is available,

• for High Availability setups in which the application generates files that it needs in order to work (e.g. application status files, session files, files with recently handled messages) – after the primary server fails, the files replicated by rsync-dir can first be restored on the backup server, and then the application can be started there; it reads the status files and resumes work from a point shortly before the failure,

• for transferring files between separated systems – files created on the primary server are saved to, for example, S3 storage, then read cyclically on another server and made available to another application, e.g. a web application.

These are just a few examples. We expect that our users will find many other uses. We would be very happy to hear your ideas, and the most interesting ones may be rewarded with discounts on license purchases.