database migration issues in large instances upgrading to 14.5 or later (merge_request_diff_commits, Command timed out after 3600s)
### Summary

A support ticket was raised for an Omnibus GitLab upgrade from 14.1 to 14.7 that failed as follows. GitLab team members can read more in the [ticket](https://gitlab.zendesk.com/agent/tickets/266806).

```
FATAL: Mixlib::ShellOut::CommandTimeout: rails_migration[gitlab-rails] (gitlab::database_migrations line 51) had an error: Mixlib::ShellOut::CommandTimeout: bash[migrate gitlab-rails database] (/opt/gitlab/embedded/cookbooks/cache/cookbooks/gitlab/resources/rails_migration.rb line 16) had an error: Mixlib::ShellOut::CommandTimeout: Command timed out after 3600s:
```

See https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6677 for how to work around this in general, but note that if database migrations exceeded 60 minutes, their full run time may be much longer, for example over 30 hours.

This issue is about the [code in a 14.5 migration](https://gitlab.com/gitlab-org/gitlab/-/blob/v14.5.4-ee/db/migrate/20211012134316_clean_up_migrate_merge_request_diff_commit_users.rb#L26) (`20211012134316`, `class CleanUpMigrateMergeRequestDiffCommitUsers`) which executes pending `MigrateMergeRequestDiffCommitUsers` background migrations.

Related: `CleanUpMigrateMergeRequestDiffCommitUsers` issue and MR:

- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73068
- https://gitlab.com/gitlab-org/gitlab/-/issues/334394

### Steps to reproduce

- Upgrade to GitLab 14.1.
- Not all `MigrateMergeRequestDiffCommitUsers` batches complete.
- Upgrade to 14.5 or later. The instance has to be large enough for the batches to exceed the one hour timeout.

### What is the current *bug* behavior?

Upgrade fails as it's [assumed that](https://gitlab.com/gitlab-org/gitlab/-/blob/v14.5.4-ee/db/migrate/20211012134316_clean_up_migrate_merge_request_diff_commit_users.rb#L19)

> self-hosted instances should have their
> migrations finished a long time ago.

However, self-managed instances

1. Can observe the same issues that occurred on GitLab.com.
2. Don't all follow zero downtime upgrades.

Instances which don't upgrade to 14.3 and then wait for the background migrations to complete will not get the benefit of the fix that GitLab.com received in 14.3 (plus the manual work to complete batches using the 14.1 code).

### What is the expected *correct* behavior?

Upgrade completes. However, this issue is about documenting how to diagnose and resolve this issue.

### Possible fixes and workarounds.

#### Summary

More than likely, [fix forward](#fix-forward) will be the required approach.

If it's possible to detect this proactively, and avoid a fix forward, then

- Establish that an instance is likely to have this issue. Either by assessing the size of the database, or by discovering outstanding `MigrateMergeRequestDiffCommitUsers` batches on a 14.1 or 14.2 instance.
- Upgrade to 14.3
- Wait for all batches in the revised `MigrateMergeRequestDiffCommitUsers` 14.3 migration to complete
- Upgrade to a later release.

**The 14.1 / 14.3 migrations have to complete before the 14.5 migration can run.** So, once an instance is upgrading to 14.5, the only options are

- Back out
- [fix forward](#fix-forward)

#### Fix forward.

##### If GitLab is working OK:

1. Run:

   ```shell
   sudo gitlab-rake db:migrate
   ```

1. Wait. This may take **a long time**. The customer tried this, and cancelled it after 31 hours.

---

##### If GitLab is not working correctly:

For example, merge request approval rules and merging are broken, which has been reported to GitLab support. ([Link for team members](https://gitlab.zendesk.com/agent/tickets/272200)).

This is caused by database changes being queued up behind the long-running `merge_request_diff_commits` changes, and the GitLab code being unable to run correctly with the current state of the database.

1. Set aside the problematic migrations:

   ```shell
   sudo gitlab-rake gitlab:db:mark_migration_complete[20211012134316]
   sudo gitlab-rake gitlab:db:mark_migration_complete[20211012143815]
   ```

1. Run all the rest of the outstanding migrations.

   ```shell
   sudo gitlab-rake db:migrate
   ```

1. Back out the first step, via the PostgreSQL database console (`sudo gitlab-psql`)

   ```SQL
   DELETE FROM schema_migrations WHERE version IN ('20211012134316', '20211012143815');
   ```

1. Run these two migrations. This will take a long time.

   ```shell
   sudo gitlab-rake db:migrate
   ```

1. Please comment on the issue about your success, or otherwise, with this workaround.

#### Proactive

##### Database size assessment.

- The relevant table is `merge_request_diff_commits`
- Run a database console

  ```shell
  sudo gitlab-psql
  ```

- Query:

  ```sql
  select n.nspname as table_schema, c.relname as table_name, c.reltuples as rows
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
  where c.relkind = 'r'
    and n.nspname not in ('information_schema','pg_catalog')
  order by c.reltuples desc
  limit 50;
  ```

- On the ticket, the `merge_request_diff_commits` table was around 9 million rows (`9.123456e+06`), the sixth largest unique table (ignoring `web_hook_logs_archived`), and only ten tables had more than 1 million rows.

##### Check for pending 14.1 batches

- run in the database console; `sudo gitlab-psql`

  ```sql
  select status, count(*) from background_migration_jobs
  where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;
  ```

- when completed, the batches (eight in total) will move from status `0` to status `1`. Here, only 25% are complete.

  ```
   status | count
  --------+-------
        0 |     6
        1 |     2
  ```

- [from GitLab.com issue](https://gitlab.com/gitlab-org/gitlab/-/issues/334394#note_625083373)

##### GitLab 14.3

https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68769 was introduced [to resolve the final 12% of the batches on GitLab.com](https://gitlab.com/gitlab-org/gitlab/-/issues/334394#note_657875373). Migration `20210901153324` is added; it reduces the batch size and makes other optimizations.

This can be used on self-managed instances to complete the migrations before upgrading to 14.5 or later. All "failed" batches from 14.1 will be cancelled, and the work rescheduled by the new migration.

Use the same query to monitor the work

- run in the database console; `sudo gitlab-psql`

  ```sql
  select status, count(*) from background_migration_jobs
  where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;
  ```

- when completed, the batches will move from status `0` to status `1`.
- Work still to do:

  ```
   status | count
  --------+-------
        0 |    16
        1 |     8
  ```

- All done

  ```
   status | count
  --------+-------
        1 |    24
  ```

##### Monitor the GitLab 14.5 migration

There is a [14.5 migration referred to in the upgrade notes](https://docs.gitlab.com/ee/update/#1450). Optionally, monitor it with the following query; the batches will be complete when all records have a status of `1`.

```sql
select status, count(*) from background_migration_jobs
where class_name = 'FixMergeRequestDiffCommitUsers' group by status;
```

- [from GitLab.com issue](https://gitlab.com/gitlab-org/gitlab/-/issues/344080#note_717570207)

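The status queries above return one row per status, so progress has to be worked out by hand. As a minimal sketch, the helper below turns that output into a single progress line. It is not part of GitLab: the function name `batch_progress` is hypothetical, and it assumes the rows arrive on stdin as `status|count`, which is the form `psql` prints in unaligned, tuples-only mode (`-A -t`). It treats status `1` as complete, as described above, and counts every other status as outstanding.

```shell
#!/bin/sh
# batch_progress: hypothetical helper, not shipped with GitLab.
# Reads "status|count" rows on stdin and prints how many batches are complete.
# Assumption: status 1 means complete; any other status is still outstanding.
batch_progress() {
  awk -F'|' '
    NF == 2 {
      total += $2
      if ($1 == 1) finished += $2
    }
    END {
      if (total == 0) { print "no batches found"; exit }
      printf "%d of %d batches complete (%d%%)\n", finished, total, finished * 100 / total
    }'
}

# Example with the counts shown for the pending 14.1 batches:
printf '0|6\n1|2\n' | batch_progress

# Possible use against a live instance (assumes gitlab-psql passes the
# -A, -t and -c options through to psql):
#   sudo gitlab-psql -At -c "select status, count(*) from background_migration_jobs where class_name = 'MigrateMergeRequestDiffCommitUsers' group by status;" | batch_progress
```

The percentage is truncated to a whole number, so a run is only finished when the line reports every batch complete, not when it first shows a high percentage.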