Introduction: Why backups are important.
The backup of database data is crucial for an organization primarily because it serves as a safety measure to protect critical information and maintain operational continuity.
Databases store essential business information—customer records, financial transactions, operational data, etc. If this data is lost due to unforeseen events such as system failures, cyber-attacks, hardware malfunctions, or human error, the result can be severe setbacks for the organization. Without a proper backup, recovering this information becomes nearly impossible or, at best, a lengthy process. While HeavyDB is an OLAP database not designed for use as a system of record, a proper process for backing up database contents remains essential.
Backups are handy for database migrations to new systems or environments and to prevent data loss after operations like hardware, OS, or software upgrades. They can also be used to create new test environments for development and testing purposes.
Regularly backing up the database minimizes downtime in case of a failure, preserving essential business operations and mitigating potential financial losses that might occur without this safety measure.
Finally, the historical data stored in backups serves a critical purpose by enabling organizations to analyze past trends.
In essence, backing up database data is fundamental for business, risk management, user satisfaction, and security. It acts as a safeguard, ensuring that the organization's most valuable asset—the data—is protected, preserved, and readily available when needed.
Different Types of Backups: Physical vs Logical.
For HeavyDB, as with any other DBMS, the choice between physical and logical backups depends on the database size, recovery requirements, and the ease of backup management. Physical backups ensure a more comprehensive and faster recovery process, but they might be more complex to manage. Logical backups, on the other hand, offer flexibility and portability but tend to be slower and less efficient for large-scale data restoration.
In summary, these are the most important characteristics of each backup type.
Physical Backup
An approach that involves copying the physical files and structures that constitute the entire instance.
- Key Aspects:
- This type of backup captures the precise state of the entire database, including all tables and other associated objects, at a specific point in time.
- It includes data files, catalog files, and licenses in their binary format.
- It's faster to perform and is usually preferred for larger databases.
- Restoration involves replacing entire data and objects rather than individual records, tables, or dashboards.
- During the backup, the database must be either closed or read-only, restricting or limiting normal operations such as data ingestion or running queries.
Logical Backup
Logical database backup techniques typically involve extracting the database's logical contents, often utilizing a different granularity than physical backups.
- Key Aspects:
- It offers flexibility by allowing selective backup and restoration of specific tables, data, or other database objects.
- It is slower to back up or restore a significant number of objects compared to physical backups, especially for large databases.
- It includes SQL statements, scripts, or exports representing the database's schema, data, or both, depending on the approach used.
- The system remains open during backup or recovery operations, ensuring business continuity. However, it's important to note that these operations might cause some system slowdown.
Both types of backups are usually used in combination to achieve comprehensive data protection strategies, providing multiple layers of redundancy and disaster recovery options. The choice between physical and logical backups should be based on the specific needs, recovery objectives, and database characteristics of the HeavyDB.
Physical Backups on HeavyDB: Understanding the file structure of the data storage.
Understanding the structure of HeavyDB data storage is very important, especially in the context of backups, and is essential for creating a robust and effective backup strategy. It enables administrators to identify the specific components that need to be backed up, ranging from data files and log files to configurations and license files.
The layout and organization of the storage directory significantly impact the decisions that must be made about the backup process, its efficiency, and the overall success of data recovery.
In HeavyDB, the primary folder housing nearly all of the persistent data is the base directory. This directory typically holds the system configuration file named heavy.conf and, notably, an essential directory named storage, which houses critical components necessary for the system's proper functioning. The storage directory consists of the license file and various subdirectories, each containing vital data required for the system's operations.
The storage directory has the following contents:
- data: This sub-directory encompasses the actual data of the various tables used within the system. These files hold the primary information required for operations and processing.
- catalogs: Another crucial sub-directory containing system metadata that helps manage and organize the data. This metadata is vital for system functionality, and it contains all the information about the system structures, such as databases, tables, views, users, permissions, dashboards, and any other object created in the system.
- disk_cache: Contains the table data and sync metadata of foreign tables.
- log: Contains the log files that track system activities, errors, and user interactions. They provide a record of system events, aiding in monitoring, auditing, and troubleshooting any issues that might arise.
- Other components: In addition to the elements mentioned above, additional sub-directories or files may be recommended for the smooth operation of the system, such as the heavyai.license file and the key_store and lock_files directories, containing the license key and the certificates generated by the system.
Additionally, some directories are not strictly necessary for the system's recovery, such as the export and import directories, which typically contain files created by Heavy Immerse using the Import Wizard or exports initiated by users.
The default base directory is situated at /var/lib/heavyai, leading to the complete path for the storage directory as /var/lib/heavyai/storage. Although both paths can be altered, it's highly advisable to retain these default paths, especially in scenarios involving multiple administered systems, to facilitate software upgrades and streamline daily administrative tasks.
While it's possible to modify the base directory's position according to specific naming conventions or internal organizational standards, it's essential to note that the name of the storage directory should remain unchanged in all cases.
Foreign tables important note: To ensure an effective backup strategy, particularly when utilizing Heavy Connect in our system, it's essential to back up the files and directories within the base and storage paths and to consider including additional local directories. These directories are used to store the source files for foreign tables defined within the databases.
Finding the exact location of these files might be challenging due to the absence of a straightforward method for their determination. One practical approach is to inspect the 'wrapper_metadata.json' files within the 'disk_cache' directory. These files contain names and relevant data of the files used for populating the tables, serving as a helpful starting point for locating these necessary files.
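One hedged way to automate that starting point is to scan the disk_cache directory for wrapper_metadata.json files and print any quoted absolute paths they contain; this assumes the source-file locations appear as absolute paths inside those JSON files, which should be verified on your system.

# Illustrative sketch: list candidate source paths referenced by foreign-table cache metadata.
storage_dir="/var/lib/heavyai/storage"
find "${storage_dir}/disk_cache" -name 'wrapper_metadata.json' -print0 \
  | xargs -0 -r grep -hoE '"/[^"]+"' \
  | sort -u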
Physical Backups on HeavyDB: Establishing an Effective Backup Strategy Over Time
To perform a comprehensive database backup, it's necessary to completely halt the database operations or place the database in read-only mode to ensure data consistency during the backup process.
Once the database is stopped (or placed in read-only), operating system commands, such as 'tar' or other relevant backup tools, can be used to generate a snapshot of the files and directories mentioned in the preceding section. This snapshot will capture the entire database state at that specific point in time, thereby ensuring a reliable copy of the data.
Unlike exceptional events like system maintenance or software upgrades, which can be scheduled in coordination with users and involve extended system downtime, the same does not apply to daily backups. Daily backups must be performed swiftly, using automated and repeatable processes, ensuring timely completion and minimizing system downtime. Storing the backup files in a remote destination is also a good idea to ensure data security and availability.
To satisfy all those requirements, the following steps have to be taken:
- Selection of the directories and the files needed for backups.
As seen in "Understanding the file structure of the data storage", it is crucial to locate the path of base and storage directories and, if any, the additional paths used to store Heavy Connect local files.
On a running system, an excellent way to do that is by running the pgrep -a heavydb command. Depending on whether you are using systemd or Docker, you will get an output in one of these formats:

PID /inst_path/bin/heavydb --config /base_directory/heavy.conf
PID /inst_path/bin/heavydb [--data] /storage_directory

For example:

2349960 /opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf
2349960 /opt/heavyai/bin/heavydb /var/lib/heavyai/storage

In both examples, /opt/heavyai is the installation directory; in the first, /var/lib/heavyai is the base directory, and in the second, /var/lib/heavyai/storage is the storage directory.
If we only have the configuration file, the storage directory can be determined by referencing the data parameter within the configuration file, which specifies the path of the storage directory.
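As a minimal sketch (assuming the default config location and a data = "..." entry in heavy.conf; adjust both to your installation), the storage path can be extracted from the configuration file like this:

# Hedged sketch: derive the storage directory from the data parameter in heavy.conf.
# Assumes a line such as:  data = "/var/lib/heavyai/storage"
conf_file="/var/lib/heavyai/heavy.conf"
storage_dir=$(grep -E '^[[:space:]]*data[[:space:]]*=' "${conf_file}" \
  | sed -E 's/^[^=]*=[[:space:]]*"?([^"]*)"?[[:space:]]*$/\1/')
echo "Storage directory: ${storage_dir}"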
After obtaining the storage directory's name, it is crucial to validate that it is a legitimate storage location. This verification can be easily achieved by checking for the existence of essential directories and mandatory files, utilizing a straightforward script for validation.
The rule of thumb is that the entire storage directory, along with the paths containing the files for the foreign tables, should be included in the backup. Still, if the size of optional directories like import or export is big enough to significantly increase the storage footprint and the total time of the backup, the contents of both can be omitted.

storage_dir="/var/lib/heavyai/storage"
is_storage_dir=true
for i in data catalogs disk_cache export log lockfiles; do
  if ! [ -d "${storage_dir}/$i" ]; then is_storage_dir=false; break; fi;
done;
if ! [ -f "${storage_dir}/catalogs/system_catalogs" ]; then is_storage_dir=false; fi;
if [ "${is_storage_dir}" == "false" ]; then echo "The storage dir ${storage_dir} is invalid"; fi;
- Compression of the files created during the backup.
Data compression significantly speeds up the backup, reduces the disk footprint, and minimizes network transfers. Commonly used compressors such as gzip, xz, and others available by default on most systems achieve decent compression ratios. However, their performance does not always align with the optimal efficiency we aim for in backup operations.
Below is a comparative table that assesses the performance of the essential 'tar' utility in combination with gzip and other available compressors specifically for Ubuntu and CentOS systems. This comparison serves to gauge their efficiency and speed, helping in the selection of the most suitable compressor for backup procedures.
compressor    read throughput   compression ratio   CPU usage for compression   package name (installed)
plain tar     1000 MB/sec       no compression      0 cores                     tar (Y)
gzip          58 MB/sec         6.4x                1.0 cores                   gzip (Y)
zstd -T0      1000 MB/sec       7.5x                2.1 cores                   zstd (N)
lz4           670 MB/sec        5.1x                1.0 cores                   lz4 (N)
pigz -1       538 MB/sec        6.4x                6.5 cores                   pigz (N)
pixz -1       302 MB/sec        8.2x                14.1 cores                  pixz (N)
Z Standard (zstd) stands out as the fastest compressor, boasting a throughput equivalent to that of the plain archiver (like 'tar') and 20x faster than the default compressor ('gzip'), with a better compression ratio. It achieves a balanced use of the CPU cores within the system, ensuring efficient compression without compromising speed or storage efficiency.
To use zstd or any other compressor along with tar, it is only necessary to add the -I (--use-compress-program) switch followed by the desired compression program.
if [ -x "$(which zstd)" ]; then
  # use zstd as the compression program
  tar --use-compress-program "zstd -T0" \
      -cf bkp_mydatabase_2023_45_9_0400_full.tar.zstd \
      /path_to_storage_dir /other_paths
else
  # zstd not found: fall back to serial gzip compression
  tar -zcf bkp_mydatabase_2023_45_9_0400_full.tar.gz \
      /path_to_storage_dir /other_paths
fi;

- Use of Incremental and Differential backups along with Full Backups.
Using incremental and/or differential backups balances storage and network efficiency, faster backup, and a more comprehensive and frequent backup schedule, offering a more robust and efficient data protection and recovery backup strategy.
We can run the full backups when the database activity is lighter, for example during the weekends, and run the incremental backups on the remaining days.
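For example, such a schedule could be expressed in cron; the wrapper script name and times below are purely illustrative and assume a script that accepts full or incremental as an argument (an automation sketch is given later in this article).

# Hypothetical crontab entries: full backup on Sundays at 04:00,
# incremental backups on the other days at the same time.
0 4 * * 0   /usr/local/bin/heavydb_backup.sh full
0 4 * * 1-6 /usr/local/bin/heavydb_backup.sh incremental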
To perform incremental backups using the 'tar' command, utilizing the --listed-incremental and --level switches while creating each snapshot is essential. The --listed-incremental switch captures the necessary filenames for tracking changes in the source data. Simultaneously, the --level switch is employed during the initial backup to create the comprehensive full backup, establishing the baseline for subsequent incremental backups.
Example of a full backup. The --level 0 switch is used to initialize the snar file.

tar --use-compress-program "zstd -T0" \
    --level 0 --listed-incremental mydatabase.snar \
    -cf $(date +"/backup_dir/bkp_mydatabase_%Y_%U_%u_%H%M_full.tar.zst") \
    $storage_dir
Example of an incremental backup. The --level 0 switch is removed, and the snar file is used to track the changes.

tar --use-compress-program "zstd -T0" \
    --listed-incremental mydatabase.snar \
    -cf $(date +"/backup_dir/bkp_mydatabase_%Y_%U_%u_%H%M_incremental.tar.zst") \
    $storage_dir
Important notes:
When implementing an incremental backup strategy for database restoration, it is essential to restore the initial full backup and all subsequent incremental backups. Although this process can be slower and more intricate, particularly for large databases, the advantages of time and resource utilization become evident.
In contrast, a full backup is consistently preferred for specific activities, such as software upgrades, as it provides a snapshot at the point in time preceding the upgrade. This full backup allows for a potential rollback to that state in case of any issues or complications during the upgrade process. Also, a full backup of the upgraded database is highly recommended once the upgrade has been validated. This post-upgrade full backup establishes a new baseline, capturing the database in its updated state and enabling a fresh set of backups for the latest version.

- Automate the backup process and conduct periodic verifications to ensure the effectiveness of the backup strategy.
While the sequence of checks and operations for a backup might appear straightforward, the repetitive nature of this task leaves it vulnerable to human error. Therefore, automating the backup process, such as using ad-hoc scripts, is crucial to ensure the consistent and reliable protection of essential data. Implementing automated backup procedures not only streamlines the process but also significantly reduces the risk of errors, thereby enhancing the overall reliability of data backups.
Also, periodic verification of the backup strategy's effectiveness is important. Regular checks and evaluations ensure that the backup protocols remain relevant, up-to-date, and capable of restoring data effectively in case of an unexpected event or system failure. This approach not only streamlines the backup procedure but also provides peace of mind regarding the reliability of our data backup and recovery system.
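As a rough sketch of such a periodic check (paths and filenames are illustrative and follow the naming convention used in the examples below), the integrity of the most recent archive can be tested and its contents spot-checked:

# Verify that the newest archive decompresses cleanly and contains the catalogs directory.
latest=$(ls -t /local_backups/bkp_mydatabase_*.tar.zst | head -1)
zstd -t "${latest}" \
  && tar --use-compress-program "zstd -T0" -tf "${latest}" | grep -q "catalogs/" \
  && echo "Backup ${latest} passed the basic verification" \
  || echo "Backup ${latest} failed verification"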
Physical Backups on HeavyDB: Practical Examples of Data Backup and Restore.
A series of steps must be diligently followed to ensure a reliable physical full or incremental backup. This includes verifying the absence of long-running activities on the database before halting it and confirming that the data to be backed up is not in use.
Backup the database
The basic steps to take the backups are as follows:
- On the host where the database resides, connect to the database using heavysql and run the show queries or show user sessions SQL commands to inspect any ongoing activity within the connected databases. For instance, you can verify if there are any long-running queries.
Example of a simple command to check if there are database queries running.
admin@localhost:~$ [ "$(echo "show queries;" | heavysql -p [password])" ] \
  && echo "No queries are running. The database can be stopped." \
  || echo "One or more queries are running."
No queries are running. The database can be stopped.
This check helps prevent the backup process from halting long-running activities such as ETL batches, which often consume a significant amount of time and modify a substantial volume of data.

- Utilize the systemctl stop or docker container stop command to halt the database service, ensuring a consistent snapshot is maintained throughout the backup process.
Examples of commands to stop the heavydb service or the docker container.
sudo systemctl stop heavydb
sudo docker container stop my_heavyai_container
Important note: Heavy Immerse will also be stopped if your docker container runs the Heavy Web Server.

- Use the lsof command to ensure that all the heavydb instances are stopped and the storage directory is no longer in use.
Example showing that the storage directory isn't in use by any heavydb instance.

admin@localhost:~$ lsof /var/lib/heavyai/storage/heavydb_pid.lck | grep -q heavydb_pid.lck \
  && echo "Storage directory in use. The backup could be inconsistent." \
  || echo "Storage directory isn't actually used."
Storage directory isn't actually used.

If the storage directory is in use, re-execute the command lsof /var/lib/heavyai/storage/heavydb_pid.lck independently to identify the processes locking the file. Stop the identified processes before proceeding with the backup.
Example showing the storage directory is currently being locked by both the heavydb and java (calcite) processes.
COMMAND     PID USER FD  TYPE DEVICE SIZE/OFF NODE     NAME
heavydb 2414408 mapd 8uW  REG  253,0        7 85069947 /var/lib/heavyai/storage/heavydb_pid.lck
java    2414456 mapd 8u   REG  253,0        7 85069947 /var/lib/heavyai/storage/heavydb_pid.lck
- Execute the tar command to create a full or incremental database backup. It would be best to designate a remote destination for the backup. However, if network conditions are anticipated to significantly extend the backup duration, potentially affecting system availability, opt for a local destination. Subsequently, transfer the files to the remote destination at a later time.
Full backup using local storage to store the tar and snar files. The --level 0 switch is used to initialize the snar file; used along with --listed-incremental, it is crucial for a backup strategy mixing full and incremental backups.
tar --use-compress-program "zstd -T0" \
    --level 0 --listed-incremental /local_backups/mydatabase.snar \
    -cf $(date +"/local_backups/bkp_mydatabase_%Y_%U_%u_%H%M_full.tar.zst") \
    /var/lib/heavyai/storage
Using SSH to perform a full backup and store the tar file in a remote location while keeping the snar file in a local path.
tar --use-compress-program "zstd -T0" \
    --level 0 --listed-incremental /local_backups/mydatabase.snar \
    -cf - /var/lib/heavyai/storage \
  | ssh admin@remote.host \
    "cat > $(date +/remote_backup/bkp_mydatabase_%Y_%U_%u_%H%M_full.tar.zst)"
An Incremental backup can be taken after the full backup. The required checks and the command are nearly identical, except for the removal of the --level 0 switch from the tar command.
tar --use-compress-program "zstd -T0" \
    --listed-incremental /local_backups/mydatabase.snar \
    -cf $(date +"/local_backups/bkp_mydatabase_%Y_%U_%u_%H%M_incremental.tar.zst") \
    /var/lib/heavyai/storage
By omitting the --level switch, tar will employ the snar file to track modifications since the last backup, backing up only the modified files. This method accelerates the backup, conserves system resources, and yields a more compact backup file set. Consequently, compared to a strategy solely relying on full backups, this method notably reduces system downtime and conserves backup storage space, enabling the feasibility of conducting more frequent backups.
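To confirm what an incremental run actually captured, one illustrative option is to simply list the archive contents (the filename below follows the naming convention of these examples):

# Optional sanity check: list the files stored in the latest incremental archive.
tar --use-compress-program "zstd -T0" \
    -tvf /local_backups/bkp_mydatabase_2023_45_5_0400_incremental.tar.zst | head -20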
Important notes: For a comprehensive disaster recovery strategy, storing the snar and tar files in a remote location is advisable. In the case of an incremental backup, it is highly recommended to maintain records of the backup files taken between two full backups. This ensures complete certainty when restoring data, guaranteeing the availability of all backup files taken up to the desired point in time.

- The database services can now be restarted using the systemctl start or the docker container start command.
sudo systemctl start heavydb
sudo docker container start my_heavyai_container
The steps for restoring a database are often similar, if not identical, to those used for creating backups. However, the restoration process may vary and become more complex, especially in full and incremental backup strategies.
It's crucial to emphasize that using a previously validated restore process is strongly recommended. Verifying within the records that all necessary files are available for restoring the database to the desired point in time is essential for a successful restoration.
Restore the database
The steps for restoring a database to its original location typically include halting the services before commencing the restoration process. However, this step is deferred or unnecessary if you're utilizing a different storage directory to minimize system downtime or setting up a new environment.
The steps needed to restore are as follows (steps 2 and 3 are strictly required only if you are restoring to the original base directory):
- Verify the necessary files for restoration and confirm their availability on the chosen host for the intended point-in-time restore.
When the intended point-in-time for the restore coincides with a full backup, only the single archive containing the full backup is necessary. However, if it doesn't align, the requirement involves restoring the preceding full backup and all incremental backups until the desired point in time.
Consider this scenario: A database was accidentally dropped on the night of November 10th, just before the incremental backup was run. A restore to the backup preceding that event is necessary to recover the data. In this case, the only required archive is the full backup taken on the night of November 9th.
admin@localhost:$ ls -ltr /local_backups/bkp_mydatabase_2023_45*
-rw-rw-r-- 1 heavyai heavyai 5164M nov  5 04:01 bkp_mydatabase_2023_45_0_0400_full.tar.zst
-rw-rw-r-- 1 heavyai heavyai  572M nov  6 04:00 bkp_mydatabase_2023_45_1_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai  440M nov  7 04:00 bkp_mydatabase_2023_45_2_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai   23M nov  8 04:00 bkp_mydatabase_2023_45_3_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai 5854M nov  9 04:01 bkp_mydatabase_2023_45_4_0400_full.tar.zst
-rw-rw-r-- 1 heavyai heavyai  350M nov 10 04:00 bkp_mydatabase_2023_45_5_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai    1M nov 10 04:01 bkp_mydatabase.snar
Had the database been dropped after the incremental backup, correct data recovery would require the archives of both the full backup and the subsequent incremental backup, as highlighted in the provided list.
admin@localhost:$ ls -ltr /local_backups/bkp_mydatabase_2023_45*
-rw-rw-r-- 1 heavyai heavyai 5164M nov  5 04:01 bkp_mydatabase_2023_45_0_0400_full.tar.zst
-rw-rw-r-- 1 heavyai heavyai  572M nov  6 04:00 bkp_mydatabase_2023_45_1_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai  440M nov  7 04:00 bkp_mydatabase_2023_45_2_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai   23M nov  8 04:00 bkp_mydatabase_2023_45_3_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai 5854M nov  9 04:01 bkp_mydatabase_2023_45_4_0400_full.tar.zst
-rw-rw-r-- 1 heavyai heavyai  350M nov 10 04:00 bkp_mydatabase_2023_45_5_0400_incremental.tar.zst
-rw-rw-r-- 1 heavyai heavyai    1M nov 10 04:01 bkp_mydatabase.snar
After the needed archives are identified, they can be moved to the host where the recovery will be performed.
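For instance, with purely illustrative host and paths (reusing the filenames from the listing above), the required archives could be pulled from a remote backup host with rsync:

# Copy the full backup and the subsequent incremental to the host performing the restore.
rsync -av \
  admin@remote.host:/remote_backup/bkp_mydatabase_2023_45_4_0400_full.tar.zst \
  admin@remote.host:/remote_backup/bkp_mydatabase_2023_45_5_0400_incremental.tar.zst \
  /local_backups/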
- Use the systemctl or docker command to halt the database service.
sudo systemctl stop heavydb
sudo docker container stop my_heavyai_container
- Ensure all heavydb instances are stopped and the storage directory is no longer in use by using the lsof command.
admin@localhost:~$ lsof /var/lib/heavyai/storage/heavydb_pid.lck | grep -q heavydb_pid.lck \
  && echo "Storage directory in use. The backup could be inconsistent." \
  || echo "Storage directory isn't actually used."
Storage directory isn't actually used.

- Restore the database data, utilizing the 'tar' command, following the sequence of the full and incremental backups in the order they were taken.
Restore the full backup to recover the database as of November 9th.

tar --use-compress-program "zstd -T0" \
    --listed-incremental=/dev/null -C / \
    -xf /local_backups/bkp_mydatabase_2023_45_4_0400_full.tar.zst
Then, restore the incremental to move the contents to the desired date of November 10th.

tar --use-compress-program "zstd -T0" \
    --listed-incremental=/dev/null -C / \
    -xf /local_backups/bkp_mydatabase_2023_45_5_0400_incremental.tar.zst
Important note: When restoring the database over the original base directory, removing the existing files is essential before initiating the recovery process.

- Now that the restore is completed, the database services can be restarted using the systemctl start or docker container start command.
sudo systemctl start heavydb
sudo docker container start my_heavyai_container
Physical Backups on HeavyDB: Automating the process and deploying into multiple systems.
A critical component of ensuring data integrity and security is the implementation of automated physical backups. This becomes especially vital when managing HeavyDB across multiple systems.
Administrators often use bash scripting to automate backups and perform pre-run checks to improve the reliability of the process. These checks help ensure the backup operation is successful and the copied data is correct.
Bash scripts are versatile tools that can be customized based on specific system requirements, which makes them an essential tool for administrators who want to optimize their database environment.
Incorporating pre-run checks introduces an additional layer of sophistication to the backup strategy. Before initiating the backup process, the scripts thoroughly examine the database's health, identify potential issues, and ensure the data slated for copying is consistent for a successful backup. This preventative approach significantly reduces the risk of data inconsistencies, fostering a more resilient and effective data protection strategy without human intervention.
With a standardized and automated approach that includes preemptive checks, administrators can confidently navigate the complexities of data protection across diverse HeavyDB environments.
To establish a dependable backup strategy, it is crucial to create a bash shell script that incorporates all the necessary checks, archiving techniques, and concepts discussed in the previous chapters. By doing so, it ensures that all important files and directories on the system are backed up regularly, securely, and accurately.
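The sketch below illustrates the overall shape of such a script under the assumptions used throughout this article (default storage path, a systemd-managed heavydb service, zstd installed); it is a hedged starting point, not the heavybackup.sh script described below.

#!/bin/bash
# Minimal automation sketch: stop the service, verify the storage directory is unused,
# archive it incrementally, and restart the service. All paths are illustrative.
storage_dir="/var/lib/heavyai/storage"
backup_dir="/local_backups"
snar="${backup_dir}/mydatabase.snar"

sudo systemctl stop heavydb

# Pre-run check: abort (and restart the service) if the storage directory is still locked.
if lsof "${storage_dir}/heavydb_pid.lck" >/dev/null 2>&1; then
    echo "Storage directory still in use; aborting backup."
    sudo systemctl start heavydb
    exit 1
fi

# Incremental backup; add --level 0 on the runs that should produce a full backup.
tar --use-compress-program "zstd -T0" \
    --listed-incremental "${snar}" \
    -cf "$(date +"${backup_dir}/bkp_mydatabase_%Y_%U_%u_%H%M_incremental.tar.zst")" \
    "${storage_dir}"

sudo systemctl start heavydb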
To make it more user-friendly and flexible for different hosts and environments, a light interface can be added to manage the necessary parameterizations. This interface allows users to customize the backup process according to their preferences, such as selecting the backup frequency, backup location, and the files to include or exclude from the backup.
Furthermore, to simplify the process of creating a backup strategy, a bash script named "heavybackup.sh" is available. This script can be used as a solid starting point to develop a backup strategy on any system without special OS features or dependencies. It supports taking full and incremental compressed backups in a secure manner. It also can be used to estimate the space and time required for backing up a specific database without using any storage space.
The heavybackup.sh script is a customizable tool that can be used to back up the database storage and any additional paths needed by your instance. It supports incremental backups, which means that only the changes made since the previous backup will be backed up. As previously discussed, this feature saves storage space and reduces the backup time, making the backup process faster and more efficient.
In addition, the script creates a detailed log file for each backup operation, which includes the date and time of the backup, the size of the backup, and any errors encountered during the process. This log file can be used to monitor the backup process and troubleshoot any issues that may arise.
Overall, the heavybackup.sh script is a reliable and efficient tool to develop a solid backup strategy on any system. It provides many useful features and options that can be customized to suit the user's needs, making the backup process easier, faster, and more secure.