In most cases, HeavyDB manages files on disk without any need for manual intervention. However, certain situations may cause HeavyDB to be unable to perform its normal operations with respect to files, such as running out of disk space while heavydb is operational, losing connectivity to attached storage device, etc. For this reason, we've documented the following process for detection of unused storage folders/files.
Detecting Orphaned Data Files
You will need terminal access to the HEAVY.AI server, the application sqlite3, and ability to copy files to/from your server for analysis, and the spreadsheet software of your choice. The directions in this article will use Google Sheets, but the same formulas and approach
Prepare the list of folders
Navigate to the data subfolder of your HEAVY.AI storage folder. This is typically
/var/lib/heavyai/storage but the location can vary based on your configuration.
Run the following command within the data folder:
ls -1 > ~/data_folder_list.csv
This will prepare a list of files/folders in your "data" folder, in a single column into the data_folder_list.csv file in your user's home directory.
Use your preferred file transfer medium to download this file to your local machine, and save for subsequent steps.
Prepare a list of databases
/var/lib/heavyai/storage/catalogs (or equivalent path based on your custom configuration). Within this folder run the following command:
sqlite3 system_catalog 'select dbid, name from mapd_databases;' > ~/database_list.csv
Download files & Prepare Spreadsheet
Use SFTP, SCP or similar to download data_folder_list.csv and database_list.csv to your local machine or other accessible location.
A template with formulas to make subsequent steps easier has been attached to this article, and also you can click here to open this template in Google Sheets and select "File > Make a Copy".
Subsequent directions will apply specifically to Google Sheets, but can be reasonably interpreted for any desired spreadsheet software.
Using Google Sheets, open the "data_folder_list" tab, select File > Import. Upload data_folder_list.csv file and select "Append to Current Sheet" as Import Location. You can detect separator type automatically (default option). Use of the option to convert dates/formulas will have no consequence, you may leave it checked or unchecked. Carry down formulas from B2 and C2 into the range of your imported data, and after doing so delete row two. Note that the results of column C are not accurate at this time.
Next, open the "database_list" tab, and select File > Import. This time, upload database_list.csv and select "Append to Current Sheet". For Separator Type, choose Custom and type | (pipe character). Once again, carry down formulas in B2 and C2 through the range of your data, and then proceed to delete row two.
Prepare catalog report
At this point, we now have all of the data of what exists in your file system ready to evaluate. We now need to prepare the list of what the HEAVY.AI platform expects to exist within each database. To do this, formulas in columns B and C of the database_list tab of the spreadsheet prepare output of each expected dictionary and table within each database, and append results to a file in your home folder catalog_report.txt. To proceed, open up a terminal connection to your server, and navigate to the catalogd subfolder of your $HEAVYAI_STORAGE folder, this path is
/var/lib/heavyai/storage/catalogs/ in a typical installation. Copy each command from applicable rows in columns B and C within the database_list tab of your spreadsheet, and execute in your terminal client. Here are two example commands for reference:
sqlite3 exampleDatabaseName 'select "DB_1_DICT_" || dictid from mapd_dictionaries' >> ~/catalog_report.txt
sqlite3 example 'select "table_1_" || tableid from mapd_tables' >> ~/catalog_report.txt
Once all applicable commands have been executed, use SFTP, SCP or similar to download catalog_report.txt to your local machine or other accessible location. Open the catalog_report tab of your Spreadsheet, and select File > Import, upload catalog_report.txt and select "Append to Current Sheet" as Import Location, and click Import.
Identify Orphaned Folders
Return to the data_folder_list tab of your spreadsheet. Place your cursor in the range of data, and click the Filter icon to activate filters. Filter column C to only include rows with 0 as a result. Ideally, there will be no results. However, if there are, we can proceed to remove these.
With the filter applied, the rows in column a represent folders that are not currently used by the catalog of your database. If you have an unexpectedly large amount of results, double check that for each database you ran two commands to populate the catalog_report.txt file. If after verifying the same was completed properly, you still have large amount of results, we invite you to submit a Request to HEAVY.AI Support (Enterprise Customers) or create a new topic in our forum (Free/Open Source/AWS customers) before proceeding with any removal operation.
Removing Orphaned Folders
While following this guide properly will yield identification of folders that that are not in use by the current catalog, there is significant opportunity for human error when following this process. As such, we strongly recommend to prepare a physical backup of your HEAVY.AI environment prior to proceeding with any removal operation, to ensure that in any scenario reverting to the present state is possible.
With your backup prepared and confirmed to be complete, proceed to remove each of the folders listed using the filter created in the "Identify Orphaned Folders" section of this guide. You may proceed to do this either using terminal commands like
rm, or through any available user interface tool. This article does not take a prescriptive approach for deletion, other than to insist on the availability of a backup prior to proceeding.
The number of files to be removed should not typically be so large that any sort of automation is desirable, but if you like, you might consider using your spreadsheet to prepare a command to execute for each result to remove the folder. An example of this approach might be adding a column to your data_folder_list tab and populating the column with the following formula.
="rm -r "&A2&"/"
A2 represents the cell that has a folder identified for removal. As a reminder, the formula should only be copied and executed to remove folders which this process has confirmed to be no longer used by the current database catalogs.
Following this guide through to conclusion should remove any unused folders from your $HEAVYAI_STORAGE/data folder, thereby removing any unnecessary disk space usage. The normal operation of your database will not however create such situations where this process is necessary to be repeated. System administrators are further encouraged to actively monitor disk space as a system resource, and in general monitor their heavydb environment and report any crashes or unusual behavior to HEAVY.AI Support.