Starting Percona XtraDB Cluster From Cold Boot or Crash

Posted Oct 4, 2024

By Kevin Schwickrath

7 min read

A Percona XtraDB Cluster can be a resilient option for high-uptime databases. What if you lose all your servers due to a power failure? Restarting a cluster can be a pain if it is not shut down properly.

Check state file

Start by validating each server’s status by checking the state file.

sudo cat /var/lib/mysql/grastate.dat

Look for the safe_to_bootstrap value. If all servers have safe_to_bootstrap: 0 the cluster didn’t shut down cleanly. The cluster must be recovered before you can start the database.

The grastate.dat file does NOT represent the “current status” of a cluster member. The safe_to_bootstrap value is only updated on a clean shutdown of the last cluster member, not while the cluster is running. The safe-to-bootstrap flag is set to 0 on the startup of any node to prevent accidental bootstraps should a node die and restart.

serv01:~$ sudo cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    ae3351be-251c-11ee-a69f-7615072053ca
seqno:   -1
safe_to_bootstrap: 0

serv02:~$ sudo cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    ae3351be-251c-11ee-a69f-7615072053ca
seqno:   -1
safe_to_bootstrap: 0

serv03:~$ sudo cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    ae3351be-251c-11ee-a69f-7615072053ca
seqno:   -1
safe_to_bootstrap: 0

Run wsrep-recover

Run wsrep-recover on all three servers before moving to the next step.

This command can take a while to run. Get some coffee and read on so you know what comes next.

If your servers are on a NAS, start one server at a time to reduce the impact on the storage server disk I/O.

serv01:~$ sudo mysqld_safe --wsrep-recover
2024-02-12T15:27:49.997853Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:27:50.000716Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:27:50.075530Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2024-02-12T15:27:50.093710Z mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.2oP7Ar' --pid-file='/var/lib/mysql/-recover.pid'
2024-02-12T15:28:22.822359Z mysqld_safe WSREP: Recovered position ae3351be-251c-11ee-a69f-7615072053ca:431531
2024-02-12T15:28:37.529143Z mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

serv02:~$ sudo mysqld_safe --wsrep-recover
2024-02-12T15:27:10.421298Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:27:10.423939Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:27:10.449003Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2024-02-12T15:27:10.462989Z mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.UvkweL' --pid-file='/var/lib/mysql/-recover.pid'
2024-02-12T15:27:26.098264Z mysqld_safe WSREP: Recovered position ae3351be-251c-11ee-a69f-7615072053ca:431531
2024-02-12T15:27:32.371749Z mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

serv03:~$ sudo mysqld_safe --wsrep-recover
2024-02-12T15:28:47.458666Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:28:47.461340Z mysqld_safe Logging to '/var/log/mysql/error.log'.
2024-02-12T15:28:47.499902Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
2024-02-12T15:28:47.516263Z mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.cz5gOk' --pid-file='/var/lib/mysql/-recover.pid'
2024-02-12T15:28:56.972248Z mysqld_safe WSREP: Recovered position ae3351be-251c-11ee-a69f-7615072053ca:431531
2024-02-12T15:29:02.958667Z mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Note the recovered position for each of the three servers (431531 in this case). If one server is more advanced than the others use that server as the startup server. If all are equal, use your best hardware or have the servers do a few rounds of rock paper scissors.

Bootstrap a primary server

On your selected server, edit the grastate.dat file. Change the safe_to_bootstrap from a 0 to a 1. I’m going to bootstrap serv02 because it has better hardware.

Before

serv02:~$ sudo vi /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    ae3351be-251c-11ee-a69f-7615072053ca
seqno:   -1
safe_to_bootstrap: 0

After

serv02:~$ sudo vi /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    ae3351be-251c-11ee-a69f-7615072053ca
seqno:   -1
safe_to_bootstrap: 1

Start Bootstrap process

Run the bootstrap process on the primary server only. The remaining servers will start later using another command (they don’t need to bootstrap).

1
sudo systemctl start [email protected]

This command can take a while to run. Get some more coffee and read on so you know what comes next.

Once the primary server is started, validate the status. Log into SQL and show the status.

serv02:~$ mysql -u root -p 
Enter password: 
mysql> SHOW STATUS LIKE 'wsrep_%';

Look for the following indicators:

wsrep_cluster_size: This indicates the current number of nodes in the cluster. It should reflect the expected number of nodes in the cluster. (Only one at the moment)
wsrep_cluster_status: This variable shows the status of the cluster component, which can be “Primary,” “Non-Primary,” or “Disconnected.” In a healthy cluster, the status should be “Primary” for at least one node.
wsrep_connected: This variable shows if the node is connected to the cluster. The value should be “ON” to indicate that the node is connected to the cluster components.
wsrep_local_state_comment: This variable provides a human-readable description of the local state. In a healthy cluster, this should indicate that the node is synchronized with the cluster.
wsrep_ready: This variable shows if the node is ready to accept queries. The value should be “ON” to indicate that the node is ready.

wsrep_cluster_size  1
wsrep_cluster_status Primary
wsrep_connected ON
wsrep_local_state_comment Synced
wsrep_ready ON

Check the database is returning expected data in a query.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| test_db1           |
| test_db2           |
| performance_schema |
| sys                |
+--------------------+
6 rows in set (0.00 sec)

mysql> use test_db2;
Database changed
mysql> show tables;
+----------------------------------------------+
| Tables_in_test_db2                           |
+----------------------------------------------+
| address                                      |
| api_customer                                 |
| batch                                        |
| custom_key_data                              |
| doctrine_migration_versions                  |
| ext_log_entries                              |
| hours                                        |
| lease_file                                   |
| post_eng11                                   |
| property                                     |
| property_raw                                 |
| space_option                                 |
+----------------------------------------------+
29 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM batch;
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.01 sec)

Start The Remaining Servers

Once one server is operational, you can restart the remaining servers. If your servers are on a NAS, start one server at a time to reduce the impact on the storage server disk I/O.

Wait a few minutes for the server process to start and sync with the primary.

serv01:~$ sudo service mysql start

Check the cluster size to verify the cluster count increased.

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 2     |
+--------------------+-------+
1 row in set (0.00 sec)

Repeat the process for the third and any remaining servers.

serv03:~$ sudo service mysql start

With each server start, verify the cluster size to ensure all cluster members are reporting in.

mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
1 row in set (0.00 sec)

As a final step, you should stop and restart your bootstrap server. Is this required? I don’t think so. I have personally forgotten to do this step, and the servers have run fine for months. I’d put out all the other fires before I do this.

serv02:~$ sudo service mysql stop

# wait for the process to stop completely

serv02:~$ sudo service mysql start

Sources / Linkage

Crash Recovery - Percona XtraDB Cluster
Restarting the Cluster — Galera Cluster Documentation

[How to Bootstrap MySQL or MariaDB Galera Cluster - Updated

Severalnines](https://severalnines.com/blog/updated-how-bootstrap-mysql-or-mariadb-galera-cluster/)

Getting Started with MariaDB Galera Cluster - MariaDB Knowledge Base
After power loss mysql can’t start, bootstrap works - MySQL & MariaDB / Percona XtraDB Cluster 5.x - Percona Community Forum
All nodes say safe_to_bootstrap: 0 in freshly deployed cluster - MySQL & MariaDB / Percona Operator for MySQL - Percona Community Forum

sql

Percona XtraDB SQL Cluster PXC High Availability MySQL

This post is licensed under CC BY 4.0 by the author.