Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V8: Speed up content migration #6191

Closed

Conversation

stevemegson
Copy link
Contributor

@stevemegson stevemegson commented Aug 25, 2019

Some work on #5803.

Speed improvements

  • Create indexes on umbracoPropertyData to speed up updates in later migrations
  • Use sp_rename for columns on SQL Server (mainly for umbracoPropertyData.textValue)
  • Use UPDATE...FROM for umbracoPropertyData, umbracoContentVersion, umbracoMediaVersion on SQL Server
  • VariantsMigration: Check for edited property data before copying versions where "published=newest" - we know that the copied versions won't be edited
  • VariantsMigration: Create extra versions for "published=newest" with a few large queries rather than a foreach loop
  • ConvertTinyMceAndGridMediaUrlsToLocalLink: only update if value changed

Other fixes

  • VariantsMigration: Unpublished documents should have edited=1 (fixes warnings "Skip item id=x, both draft and published data are null.")
  • VariantsMigration: Handle NULL languageId and segment when checking for edited property data
  • VariantsMigration: Published documents with unpublished changes were left in an unpublished state (added update of published column after "reduce document to 1 row per content")
  • AddTypedLabels: clear varcharValue when setting int/date value

This item has been added to our backlog AB#4555

@KDharmindar
Copy link

There is cmsPropertyData and not umbracoPropertyData. Are you talking about the same table ?

@stevemegson
Copy link
Contributor Author

@KDharmindar it's the same table, though with the columns renamed and changed around a bit. It's renamed during the migration to V8.

@KDharmindar
Copy link

@stevemegson my point is that I need to convert from v7 to v8 and the cmsPropertyData contains 2.1 million records. And it is taking days to update that table. I was monitoring the updates in SQL Profiler. Any suggestions to improve things?

@stevemegson
Copy link
Contributor Author

@KDharmindar it sounds like you're stuck on updates to the versionId2 column, which happen before the table is renamed. These changes improve that significantly - that step takes a minute on my site, which is a similar size to yours.

I could share a custom build of 8.1.4 with these changes included if you're interested in testing it on your database? It only changes Umbraco.Core.dll.

@KDharmindar
Copy link

@stevemegson sure. That would be a big favor if I get able to speed up things with this.

@stevemegson
Copy link
Contributor Author

No problem, here's the modified DLL. If you could share the log file after testing it, that would help me see where other improvements could be made.

Umbraco.Core.dll.zip

@KDharmindar
Copy link

Database migration failed duplicate published document version. This is the error now.

@KDharmindar
Copy link

KDharmindar commented Sep 16, 2019

This is the log file

UmbracoTraceLog.DESKTOP-3VNLVK3.20190916.zip

@stevemegson
Copy link
Contributor Author

Probably best to move to the forum to discuss any unrelated errors you're getting, but I think that error can only happen if there are already duplicate published versions in your database before the migration. This SQL should find the offending pages:

SELECT nodeId, count(*)
FROM cmsDocument
WHERE published=1
GROUP BY nodeId
HAVING count(*) > 1

@KDharmindar
Copy link

Duplicate issue resolved. Now the new one is as follows

exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'default',@timeout=-1800

This query is being called repeatedly.

@matijagrcic
Copy link

Hi @stevemegson, I've tested your changes on several huge sites, more than 2 million rows in cmsPropertyData (umbracoPropertyData ) and the improvement is great. For some sites there was a need to add aliases otherwise the ToDictionary() call failed, did it thru console app and also now
noticed you have fixed that in #6651, also referenced in #6588.

One of the site can't upgrade given the

At {OrigState} {3E44F712-E2E3-473A-AE49-5D7F8E67CE3F} AddTypedLabels

 

Exception

System.FormatException: String was not recognized as a valid DateTime.

   at System.DateTimeParse.Parse(String s, DateTimeFormatInfo dtfi, DateTimeStyles styles)

   at Umbraco.Core.Migrations.Upgrade.V_8_0_0.AddTypedLabels.<>c__DisplayClass1_1.<Migrate>b__22(SqlUpd`1 u)

   at Umbraco.Core.Persistence.NPocoSqlExtensions.Update[TDto](Sql`1 sql, Func`2 updates)

   at Umbraco.Core.Migrations.Upgrade.V_8_0_0.AddTypedLabels.Migrate()

   at Umbraco.Core.Migrations.MigrationBase.Umbraco.Core.Migrations.IMigration.Migrate()

   at Umbraco.Core.Migrations.MigrationPlan.Execute(IScope scope, String fromState, IMigrationBuilder migrationBuilder, ILogger logger)

   at Umbraco.Core.Migrations.Upgrade.Upgrader.Execute(IScopeProvider scopeProvider, IMigrationBuilder migrationBuilder, IKeyValueService keyValueService, ILogger logger)

   at Umbraco.Core.Migrations.Install.DatabaseBuilder.UpgradeSchemaAndData(MigrationPlan plan)

I've tested all the data in the dataDate and date format is fine. Would you know where to look for the offending data as the SQLProfiler doesn't give the exact location?

@stevemegson
Copy link
Contributor Author

@matijagrcic This SQL should give you all the values that it's trying to parse:

select contentNodeId, versionId, propertytypeid, dataNvarchar
from cmsPropertyData
where propertytypeid in (
	select id from cmsPropertyType
	where Alias in ('umbracoMemberLastLockoutDate', 'umbracoMemberLastLogin', 'umbracoMemberLastPasswordChangeDate')
	)
and dataNvarchar is not null

@matijagrcic
Copy link

Thanks @stevemegson but running this query doesn't return any rows.
Any idea?

@stevemegson
Copy link
Contributor Author

@matijagrcic Ah, I'm thinking of AddTypedLabels' behaviour in 8.2.0. In 8.1.x it used the IDs of properties rather than aliases, which causes problems with sites which have been upgraded from older versions and have other properties using those IDs. Try this SQL:

select contentNodeId, versionId, propertytypeid, dataNvarchar
from cmsPropertyData
where propertytypeid in (32, 33, 34)
and dataNvarchar is not null

@matijagrcic
Copy link

Thanks 32, 33, 34 are

id	Alias
32	umbracoUrlAlias
33	image
34	rolloverImage

so certainly not dates, mostly ints, strings and some null values.
What would be an approach to mitigate this? - is it a possibility to skip this https://github.com/umbraco/Umbraco-CMS/pull/6191/files#diff-f2acb0a0db185446e914b4360bbc0bd6R101 or add try catch and ignore?

@stevemegson
Copy link
Contributor Author

@matijagrcic AddTypedLabels should be fixed in 8.2.0, so the easiest solution would be to upgrade to that. Here's a build of 8.2.0 with the changes from this PR applied:

Umbraco.Core.dll.zip

@matijagrcic
Copy link

Thanks @stevemegson will test this out and let you know the result.

@matijagrcic
Copy link

matijagrcic commented Oct 21, 2019

This passed the AddTypedLabels now but failed when doing

CreateKeysAndIndexes

so almost there. Seems the index already exist or hasn't been dropped?

ALTER TABLE [umbracoNode] ADD CONSTRAINT [FK_umbracoNode_umbracoNode_id] FOREIGN KEY ([parentId]) REFERENCES [umbracoNode] ([id])
 

System.Data.SqlClient.SqlException (0x80131904): The ALTER TABLE statement conflicted with the FOREIGN KEY SAME TABLE constraint "FK_umbracoNode_umbracoNode_id".

Umbraco.Web.Install.InstallException: The database failed to upgrade. ERROR: The database configuration failed with the following message: The ALTER TABLE statement conflicted with the FOREIGN KEY SAME TABLE constraint "FK_umbracoNode_umbracoNode_id". The conflict occurred in database "Project_v714", table "dbo.umbracoNode", column 'id'.

Found the node breaking the integrity, will delete it and rerun this again.

@stevemegson
Copy link
Contributor Author

That means that you have data which conflicts with the foreign key it's trying to create - a node in umbracoNode whose parentID doesn't match another node. That's rather odd because that foreign key should have already existed before the migration, and I can't think of anything in the migration process which would create invalid data after the key was dropped. I guess your database may not have the key for some reason.

@matijagrcic
Copy link

matijagrcic commented Oct 21, 2019

If anyone interested found the offending node by running

 select * from [umbracoNode]
WHERE [parentId] NOT IN
(SELECT [id] from [umbracoNode])

then removed it thru SQL using GetDeleteClauses() statements you can find on

protected override IEnumerable<string> GetDeleteClauses()

@stevemegson Your changes are now tested on several projects where cmsDataProperty table is over 1 or 2 millions or rows. Some DB went thru years from version 6 to v7 and now v8. Total time is about 2hrs.

The current official solution was running for 2-4 days, depending on the DB and didn't completed.

@leethomascook
Copy link

This PR is a lifesaver. Migration was taking multiple days, but got it down to circa 2 hours, with 5 million rows in cmsDataProperty. 👏

@stevemegson
Copy link
Contributor Author

@leethomascook I'm glad it helped. Would you be able to share the log from the migration? I'm always keen to see where there might still be bottlenecks for large databases.

@kiasyn
Copy link
Contributor

kiasyn commented Jan 9, 2020

Just ran this on an umbraco site with 2 mil+ rows in cmsPropertyData and the migration took only a couple of minutes vs the HOURS it was running before

It would be great if this could be merged in

@hemraker hemraker added the state/sprint-candidate We're trying to get this in a sprint at HQ in the next few weeks label Jan 15, 2020
@nul800sebastiaan nul800sebastiaan added community/pr and removed state/sprint-candidate We're trying to get this in a sprint at HQ in the next few weeks labels Jan 23, 2020
@nul800sebastiaan nul800sebastiaan added the state/in-sprint We've committed to work on this during the sprint indicated in the milestone label Jan 27, 2020
@nul800sebastiaan nul800sebastiaan added this to the sprint129 milestone Jan 27, 2020
Copy link
Contributor

@Shazwazza Shazwazza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots to try to review here :) I've added some notes/comments inline for a few items. Some other notes here:

Use sp_rename for columns on SQL Server (mainly for umbracoPropertyData.textValue)

I don't see this in this PR?

VariantsMigration: Check for edited property data before copying versions where "published=newest" - we know that the copied versions won't be edited

I think this is the updatedIds HashSet change? I'm just trying to follow along here. I know the original code was probably super painful to debug and work with so ideally if you can add detailed code comments about what is happening it will be most helpful for reviewers and anyone else that needs to dive into this code in the future.

VariantsMigration: Create extra versions for "published=newest" with a few large queries rather than a foreach loop

Yep this one will take some time to figure out. It's quite difficult to figure out what is going on when reviewing this, as above some code comments would go a long way to help.

VariantsMigration: Unpublished documents should have edited=1 (fixes warnings "Skip item id=x, both draft and published data are null.")

Is this the edited=~published change?

Ideally, most of these variable names are updated to be meaningful. Sorry you've had to deal with these names like temp, temp3, etc... which are totally meaningless :/

{
var tableDefinition = Persistence.DatabaseModelDefinitions.DefinitionFactory.GetTableDefinition(typeof(PropertyDataDto), SqlSyntax);
Execute.Sql(SqlSyntax.FormatPrimaryKey(tableDefinition)).Do();
Execute.Sql("CREATE UNIQUE NONCLUSTERED INDEX[IX_umbracoPropertyData_VersionId] ON [umbracoPropertyData]([versionId],[propertyTypeId],[languageId],[segment])").Do();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to rethink these indexes.

There's currently 4x indexes on this table: languageId, propertyTypeId, segment, versionId. It is most likely very true that all of these should be encapsulated under a single index which will significantly improve performance but we need to deal with those indexes first. There are no changes in this PR for the PropertyDataDto class.

We need this index defined at the Dto level, then I thought the Delete.KeysAndIndexes(Constants.DatabaseSchema.Tables.PropertyData).Do(); in KeysAndIndexes would do what is required without manually doing this?

Was there a calculated reason behind the decision for the ordering of these columns in this new index or just adding all columns that are used in a query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have made it clear here that this is a temporary index for the benefit of the remaining migrations, since we've got no indexes at this stage. The index is deleted in CreateKeysAndIndexes.cs before we create the indexes defined by PropertyDataDto.

If memory serves, I ended up with this particular index for the "set 'edited' to true whenever a 'non-published' property data is != a published one" step in VariantsMigration, and it covers what's needed for all the Database.Update(propertyDataDto) calls too.

// 'cos INSERT above has inserted the 'published' document version
// and v8 always has a 'edited' document version too
Database.Execute($@"
INSERT INTO [cmsContentVersion] (nodeId, versionId, versionDate, userId, [current], text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this SQL needs to have non-hard coded table names, there is also a lot of column names with square brackets which should be dealt with with GetQuotedTableName. More ideally we'd use strongly typed SQL but at least we can not hard code table names or sql syntax. Same goes for NEWID(), this needs to be resolved from FormatSystemMethods(SystemMethods.NewGuid).

@stevemegson
Copy link
Contributor Author

It may be useful for me to recreate this with separate commits that make the individual changes clearer, and add comments as I go. I think I squashed it into one commit to get rid of various dead end changes which I ended up overwriting, but coming back to the code now it does make it hard to see why some of the changes are there, particularly with some blocks of code being reordered.

Use sp_rename for columns on SQL Server (mainly for umbracoPropertyData.textValue)

I don't see this in this PR?

It's the change to ReplaceColumn in MigrationBase_Extra.cs.

VariantsMigration: Check for edited property data before copying versions where "published=newest" - we know that the copied versions won't be edited

I think this is the updatedIds HashSet change? I'm just trying to follow along here. I know the original code was probably super painful to debug and work with so ideally if you can add detailed code comments about what is happening it will be most helpful for reviewers and anyone else that needs to dive into this code in the future.

It's just reordering the steps within that method, so that "need to add extra rows for where published=newest" comes after "set 'edited' to true whenever a 'non-published' property data is != a published one".

The HashSet change is to avoid repeatedly setting the edited flag for the same content if the newest version has changes to multiple properties.

VariantsMigration: Unpublished documents should have edited=1 (fixes warnings "Skip item id=x, both draft and published data are null.")

Is this the edited=~published change?

Yes - the code assumed that the existing version will become the published version, and we'll copy it to create the newest version later. If the content is unpublished then we just want to make the existing version the newest version.

@Shazwazza
Copy link
Contributor

Hi @stevemegson I agree about the individual PRs. It would certainly be easier to review and merge with smaller changes and comments. I've got some time booked out for reviewing this so just let me know here if/when you get other PRs ready and we can close this in favor of those. Thanks!

@stevemegson
Copy link
Contributor Author

I've broken out some of the changes into #7561, #7562, #7563, #7565, #7566.

There's more to do in VariantsMigration, but that's more to do with getting the behaviour of the published, current and edited flags right than speed improvements. I think it needs refactoring beyond what I've done here to make the behaviour clear, and that's probably easier to deal with after these PRs have been reviewed.

@stevemegson
Copy link
Contributor Author

Actually the other changes don't lead to quite as much merge misery as I expected, so I've added #7567, #7568, #7569.

The remaining change I made is to move the "set 'edited' to true whenever a 'non-published' property data is != a published one" step of MigrateVersions to appear before the "need to add extra rows for where published=newest" step. We know that none of the versions we add need to be marked as edited, so there's no point in reading back all the property data to check.

@Shazwazza Shazwazza added release/8.7.0 and removed state/in-sprint We've committed to work on this during the sprint indicated in the milestone labels Feb 5, 2020
@Shazwazza
Copy link
Contributor

amazing work @stevemegson ! 🎉 I've reviewed, and merged with a few fixes where necessary. great stuff! Ill close this PR for now but this is the one tagged for being due in 8.7 even though your smaller PRs are in there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants