
Fix moving large amounts of content/media #7907

Merged
merged 16 commits into from Apr 17, 2020

Conversation

Contributor

@Shazwazza Shazwazza commented Apr 7, 2020

It turns out, after investigating issue #7868 (which was originally reported as a NuCache issue), that it's actually a data issue.

NuCache did explode because the data in the umbracoNode table was corrupt, but how did that data get like that? Luckily a couple of community members were able to provide me with database snapshots from before and after they performed some move operations, which seemed to be the trigger for seeing errors. Even luckier, I was able to replicate the problem locally.

So it turns out we have a bug in the moving logic for both media and content, and this same logic is used to delete stuff (move to the bin). The bug only rears its head if there's a lot of stuff being moved in one operation (i.e. if there are a lot of descendant items). The problem is that a bunch of nodes don't end up getting moved at all. During the move operation we page over the descendants of a certain path with a normal paging loop, checking something like while (currentPage * pageSize < totalItems) until paging is complete. That is wrong in this case, because for each page we move the returned descendants to a different path, which means that on the next page there are fewer total items. So the loop exits far too early, before all descendants are moved. This only happens for large amounts, > 500 items (our page size). The result is a bunch of descendant nodes whose paths/levels should have been updated but weren't, leaving them inconsistent with the actual tree data structure. This will end up exploding NuCache, which can no longer traverse or find these items correctly.
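The broken loop can be sketched like this (a simplified Python illustration, not the actual C# repository code; `get_descendants_page`, `move_broken`, and `move_fixed` are made-up names for this sketch):

```python
PAGE_SIZE = 3  # the real page size is 500; 3 keeps the demo small

def get_descendants_page(db, path_prefix, page, page_size):
    """Return one page of nodes under path_prefix, plus the total match count."""
    matches = [n for n in db if n["path"].startswith(path_prefix)]
    return matches[page * page_size:(page + 1) * page_size], len(matches)

def move_broken(db, old_prefix, new_prefix):
    # BUG: each moved page shrinks totalItems on the next query, so
    # currentPage * pageSize overtakes totalItems and the loop exits early.
    page = 0
    _, total = get_descendants_page(db, old_prefix, page, PAGE_SIZE)
    while page * PAGE_SIZE < total:
        items, total = get_descendants_page(db, old_prefix, page, PAGE_SIZE)
        for n in items:
            n["path"] = new_prefix + n["path"][len(old_prefix):]
        page += 1

def move_fixed(db, old_prefix, new_prefix):
    # Fix: keep re-reading page 0 until nothing is left under the old path,
    # since moved items drop out of the query on their own.
    while True:
        items, _ = get_descendants_page(db, old_prefix, 0, PAGE_SIZE)
        if not items:
            break
        for n in items:
            n["path"] = new_prefix + n["path"][len(old_prefix):]

db = [{"path": f"-1,100,{i}"} for i in range(7)]
move_broken(db, "-1,100,", "-1,200,")
left_behind = [n for n in db if n["path"].startswith("-1,100,")]
print(f"broken move left {len(left_behind)} of 7 descendants behind")
# → broken move left 3 of 7 descendants behind
```

With 7 descendants and a page size of 3, the broken loop moves only 4 nodes before exiting; the remaining 3 keep their old path and level, which is exactly the inconsistency described above.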

After researching #7756, it also means that if there is corrupted data you will get YSOD errors when modifying document types.

Performance

The move logic incurs an extraordinary number of CPU cycles and SQL calls: moving a branch with 2700 descendants took more than 5 minutes on my computer. I've fixed this; it now takes 20 seconds.

That is a 96% improvement.

New Feature

I've created a new health check and health check category for checking data integrity. It validates that paths/levels are correct for content/media in the umbracoNode table, and has a button to attempt a fix.


Changes

  • Reduce allocations by using value tuples instead of class Tuples
  • Reduce allocations and CPU by not re-creating the same Query object for each page
  • Change the document/media repositories' PersistUpdatedItem method to check whether the item being updated is explicitly just being moved, which we can detect from the properties that have changed. In that case, if it's a descendant being moved, we don't have to do the majority of checks and updates that the repository normally does: no new versions, tag changes, relations, publish states, etc. We just need to update the umbracoNode table and raise events
  • Add some null checks in NuCache so it doesn't explode if there are data inconsistencies, and add warning logs
  • New service/repo methods for content/media for verifying and fixing data inconsistencies
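The "move-only" fast path in the third bullet can be sketched as follows (illustrative Python under assumed names such as `MOVE_ONLY_PROPS` and `is_move_only`; the real implementation is the C# PersistUpdatedItem method):

```python
# Properties a plain move would dirty; anything beyond these means a real edit.
MOVE_ONLY_PROPS = {"Path", "Level", "ParentId", "UpdateDate"}

calls = []  # records which persistence path was taken, for demonstration

def update_node_row(item):
    calls.append(("node-only", item["id"]))  # only touch the umbracoNode table

def full_persist(item):
    calls.append(("full", item["id"]))       # versions, tags, relations, publish state, ...

def is_move_only(dirty_props):
    """True when every changed property is one a plain move would touch."""
    return bool(dirty_props) and set(dirty_props) <= MOVE_ONLY_PROPS

def persist_updated_item(item):
    if is_move_only(item["dirty"]):
        update_node_row(item)  # cheap path: update node row, raise events
    else:
        full_persist(item)     # the normal, expensive persist

persist_updated_item({"id": 1, "dirty": {"Path", "Level"}})
persist_updated_item({"id": 2, "dirty": {"Path", "Name"}})
print(calls)
# → [('node-only', 1), ('full', 2)]
```

The point of the design is that a descendant caught up in a move has nothing changed except its tree position, so the repository can skip everything except the node-table update and events.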

Testing notes

You can test this all out by manually messing your database up. Here's how:

execute this SQL:

```sql
SELECT id, [level], [path], parentId FROM umbracoNode
WHERE nodeObjectType = 'B796F64C-1F99-4FFB-B886-4BF4BC011A9C'
ORDER BY [level], parentId, sortOrder
```

and for rows that are not at the root, you can muck with the data; for example, such changes can cause 4 different types of path problems:

(screenshot of example data changes)
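The rule these edits break is what the new health check verifies: every node's path must be its parent's path plus its own id, and its level must be the parent's level plus one. A rough sketch of that check (illustrative Python with a made-up `find_path_problems` helper, not the actual service code):

```python
def find_path_problems(nodes):
    """Report nodes whose path/level disagree with their parent's path/level."""
    by_id = {n["id"]: n for n in nodes}
    problems = []
    for n in nodes:
        if n["parentId"] == -1:
            # Root-level nodes: path is "-1,{id}" and level is 1.
            expected_path, expected_level = f"-1,{n['id']}", 1
        else:
            parent = by_id.get(n["parentId"])
            if parent is None:
                problems.append((n["id"], "missing parent"))
                continue
            expected_path = f"{parent['path']},{n['id']}"
            expected_level = parent["level"] + 1
        if n["path"] != expected_path:
            problems.append((n["id"], "bad path"))
        elif n["level"] != expected_level:
            problems.append((n["id"], "bad level"))
    return problems

tree = [
    {"id": 100, "parentId": -1, "path": "-1,100", "level": 1},
    {"id": 200, "parentId": 100, "path": "-1,100,200", "level": 2},
    {"id": 300, "parentId": 200, "path": "-1,100,300", "level": 3},  # corrupted
]
print(find_path_problems(tree))
# → [(300, 'bad path')]
```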

If you do that, restart your website, and then with this PR go to the Published Status dashboard and rebuild the database cache and then the memory cache:

(screenshot of the Published Status dashboard)

You will not get any YSODs, but you will see some warnings logged because the cache structure cannot be created from the corrupt data (this also depends on how much data and nested data you have).

Then go to the new health check and execute it; you will see that it fixes these problems and produces a report.

Then go back and execute the steps on the Published Status dashboard and you will no longer get warnings in your log.

@Shazwazza Shazwazza added type/bug release/8.7.0 category/performance Fixes for performance (generally cpu or memory) fixes labels Apr 7, 2020
@stefankip

stefankip commented Apr 7, 2020

@Shazwazza I'm afraid I'm running into the same corruption issue. I get 2 NuCache errors when copying a node with its descendants and then trying to publish the node after the copy (without descendants).
This is the error I get:

An error occured
Object reference not set to an instance of an object.

Exception details:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.ClearBranchLocked(ContentNode content) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\ContentStore.cs:line 878
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.SetBranchLocked(Int32 rootContentId, IEnumerable`1 kits) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\ContentStore.cs:line 802
   at Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.NotifyLocked(IEnumerable`1 payloads, Boolean& draftChanged, Boolean& publishedChanged) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:line 731
   at Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.Notify(JsonPayload[] payloads, Boolean& draftChanged, Boolean& publishedChanged) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:line 668
   at Umbraco.Web.Cache.ContentCacheRefresher.NotifyPublishedSnapshotService(IPublishedSnapshotService service, AppCaches appCaches, JsonPayload[] payloads) in D:\a\1\s\src\Umbraco.Web\Cache\ContentCacheRefresher.cs:line 132
   at Umbraco.Web.Cache.ContentCacheRefresher.Refresh(JsonPayload[] payloads) in D:\a\1\s\src\Umbraco.Web\Cache\ContentCacheRefresher.cs:line 104
etc...
```

The other one is:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.AddTreeNodeLocked(ContentNode content, LinkedNode`1 parentLink)
```

But I'm unable to reproduce that one.
Is it possible to test your improvements (data integrity fixes) on this solution?

@clausjensen
Contributor

@stefankip I'm a little unsure what you're reporting here. Are you reporting that you're still having issues or are you reporting that you're seeing the same (initial) issue as others were seeing?

@stefankip

I think I'm having the same issue, but I don't know how to tell. That's what I'm trying to achieve here; is there a way for me to check and maybe even fix the issue?

@jvtroyen

jvtroyen commented Apr 7, 2020

Same issue here. While we wait for the PR to be processed and released, how exactly can we identify these "broken" items in the database?

@Shazwazza
Contributor Author

I'm still finalizing this work. Once that is done, the best thing I'll be able to advise for your current installation will be to copy the code required for the new health check and re-create the health check in your installation. As there are quite a lot of code changes here, the fix would need to wait for a minor release. Stay tuned, I'll post more info soon.

@Shazwazza
Contributor Author

Shazwazza commented Apr 8, 2020

@stefankip the issue I'm resolving is specifically about moving nodes, and only move operations that contain more than 500 nodes (i.e. when there are more than 500 descendants of the node being moved). I haven't looked at the copy logic, and the null reference you've pointed out isn't something I've looked at either. If the copy logic had a similar issue, it would also only be affected with large amounts of descendants (more than 500). Is this the case for you? Or have you perhaps previously moved large amounts of data?
(I've just looked at the Copy logic and it wouldn't be affected by the same problem as Move, though I've added TODO notes about performance improvements for Copy.)

@Shazwazza
Contributor Author

Hi all,

I have created a public gist with an SQL script which will report and fix path issues. This is a workaround you can use for now if you have this issue. Back up your DB before you execute it, and use the script at your own risk; I have not tested it on production sites. As the notes say, you can do a dry run without changing data. I'd suggest you do that and then manually verify any nodes in the error report to validate that the data is inconsistent. Works on my machine :)

https://gist.github.com/Shazwazza/1a04dcd1c1b6f16d3b7f167874770a84

@Shazwazza Shazwazza marked this pull request as ready for review April 8, 2020 03:37
@jvtroyen

jvtroyen commented Apr 8, 2020

I have created a public GIST with an SQL script which will report and fix path issues.

https://gist.github.com/Shazwazza/1a04dcd1c1b6f16d3b7f167874770a84

Hi @Shazwazza ,
I was hoping the timezone difference would enable you to reply with an SQL tip about how to spot the issue, and it paid off! :-)
Couldn't sleep all night, kept waking up with code fragments flashing in front of my eyes. I have a day off, but I want to fix this or I'll go crazy.

TL;DR I think there's a way of corrupting items when a moved item is transferred via uSync.

Back to business : I ran your script for contentNodes on a copy of the database and found 106 corrupt items. Just the report kind, no actual fixing yet, but this confirms my theory that this issue applies to my situation.

Just FYI, I hope it helps if I explain my theory shedding some light on the corruption after moving an item. I suspect the following to be the case (so something other than moving 500+ items):

  • create items at e.g. lvl 4 on source env
  • save via usync
  • transfer to target env
  • import via usync
    THEN
  • move items 1 lvl deeper on source
  • save via usync
  • import via usync on target

I noticed the items are where they should be in the target content tree, but when re-exported to usync files, they still display level 4 instead of what should be level 5. Manually fixing the level in the usync-file and re-importing via usync does not alleviate the situation.

I was first alerted to a problem when selecting an item at lvl 5 in the content tree: after loading the item, the tree collapses back to lvl 4 (it should remain open and highlight the selected item).

For one item I identified this way, I could trigger a fix by moving it to a different parent and then back to where it needs to be.
I needed a query, however, to identify all items. You've gone further and provided a query that can also fix it.

I will try your fix on the database copy and swap umbraco temporarily to that copy to test validity of your fix. Will update when I know more.

@stefankip

@stefankip the issue I'm resolving is specifically about moving nodes, and only move operations that contain more than 500 nodes (i.e. when there are more than 500 descendants of the node being moved). I haven't looked at the copy logic, and the null reference you've pointed out isn't something I've looked at either. If the copy logic had a similar issue, it would also only be affected with large amounts of descendants (more than 500). Is this the case for you? Or have you perhaps previously moved large amounts of data?
(I've just looked at the Copy logic and it wouldn't be affected by the same problem as Move, though I've added TODO notes about performance improvements for Copy.)

I was hoping corrupted data was the cause of the errors I was having, maybe due to large copy actions from the past, but the SQL script didn't report any issues, so I'll have to look further for some other cause.

@Shazwazza
Contributor Author

@stefankip it does seem strange that you have a null ref in NuCache though; be sure you've done a database rebuild and then a memory rebuild in your Published Caches dashboard. I had a look at one of the null refs you reported, and in the code there doesn't seem to be a way it can be null; I've added a new null check and notes there though: https://github.com/umbraco/Umbraco-CMS/pull/7907/files#diff-1983717019d1fd1da9fa80f10ab00b38R885

@stefankip

Yeah, I'm running v8.6.0 and have also tried all the buttons inside the NuCache dashboard for rebuilding and cleaning.

@Shazwazza
Contributor Author

@stefankip if you can replicate the null reference problem now with your current DB, please take a backup, and if you wouldn't mind, send it to shannon@umbraco.com and I can figure out what is going on.

@stefankip

@Shazwazza that's a really nice gesture, thanks! I don't mind sending the DB. Do you only need the DB or also the solution?

@jvtroyen

jvtroyen commented Apr 8, 2020

Hi @Shazwazza ,
I've finished testing your SQL procedure and happy to report all problems seem to be solved afterwards.
I can accomplish the same fix by moving the "corrupt" items back and forth, but the procedure is faster of course.

After this, I can rebuild the database and memory cache without getting the aforementioned error:

```
Unhandled controller exception occurred for request '"https://wedpaststew2umb.azurewebsites.net/umbraco/BackOffice/Api/NuCacheStatus/ReloadCache"
System.NullReferenceException: Object reference not set to an instance of an object.
```

@Shazwazza
Contributor Author

@stefankip much appreciated, thanks for sharing your solution. I have managed to reproduce it with your solution and DB, though strangely I could not reproduce the problem with the Umbraco source and your DB 🤔 which is super weird. I copied the Umbraco build DLLs into your solution and the problem would still occur... that's a good thing, since it let me figure out the problem. Phew! I don't quite understand yet how this could not be replicated on a normal site; I'm going to set up a unit test for this specific thing. The underlying problem is that ClearBranchLocked recurses on itself and nulls out the value in a collection; after a recursion, the object we are working with in the loop is null because it references what was nulled in the collection, though the problem only occurs in a specific circumstance. Anyway, it's great we were able to see the problem. This is another one of those things that, without the community's help in reproducing the issue, would have been extremely hard to track down. I'll create a new PR once I can replicate this in a unit test so you can track it from there.
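As a loose analogy of that failure mode (Python standing in for the C# in ContentStore.cs; the node structure and function names here are assumptions for illustration, not the actual NuCache code): a recursive clear that nulls entries in a shared collection can leave the caller's loop holding a stale reference.

```python
# Shared id -> node map; an entry becomes None once its branch is cleared.
nodes = {}

def reset():
    nodes.clear()
    nodes.update({
        1: {"children": [2]},
        2: {"children": [3]},
        3: {"children": []},
    })

def clear_branch_buggy(node_id):
    node = nodes[node_id]
    for child_id in list(node["children"]):
        clear_branch_buggy(child_id)  # the recursion nulls nodes[child_id]
        child = nodes[child_id]       # stale re-read: now None
        child["children"] = []        # TypeError, like the C# NullReferenceException

    nodes[node_id] = None

def clear_branch_fixed(node_id):
    node = nodes.get(node_id)
    if node is None:
        return                        # defensive null check, as this PR adds
    for child_id in list(node["children"]):
        clear_branch_fixed(child_id)
        child = nodes[child_id]
        if child is not None:         # guard the stale re-read too
            child["children"] = []
    nodes[node_id] = None
```

Running `clear_branch_buggy(1)` on the three-node chain raises once the recursion has replaced a child entry with None; the guarded version clears the whole branch cleanly.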

@stefankip

@Shazwazza nice, much appreciated! Looking forward to the PR!

@Shazwazza
Contributor Author

@stefankip here's the PR #7961

Member

@bergmania bergmania left a comment


Only small comments.

The code changes look fine, my tests don't find anything broken, and I could fix the manually corrupted data using the health check.

But I did not get any errors logged when I rebuilt the cache, and the published cache status was always:

Database cache is ok. ...

```csharp
private HealthCheckStatus CheckMedia(bool fix)
{
    return CheckPaths(_fixMediaPaths, _fixMediaPathsTitle, Core.Constants.UdiEntityType.Media, fix,
        () => _mediaService.CheckDataIntegrity(new ContentDataIntegrityReportOptions {FixIssues = fix}));
}
```
Member


Is this safe without locking?

Contributor Author


This is just reading data, so there's no need for a concurrent lock; there's an SQL lock anyway.

```csharp
private HealthCheckStatus CheckDocuments(bool fix)
{
    return CheckPaths(_fixContentPaths, _fixContentPathsTitle, Core.Constants.UdiEntityType.Document, fix,
        () => _contentService.CheckDataIntegrity(new ContentDataIntegrityReportOptions {FixIssues = fix}));
}
```
Member


Is this safe without locking?

Contributor Author


There is an SQL write lock during the reading and fixing

@Shazwazza
Contributor Author

@bergmania can you clarify what you mean here:

Code changes looks fine, and my tests don't find anything broken, and I could fix the manuel corrupted data using the healthcheck.

So are you saying that you did manually change your database to have corrupted paths and when using the health check it did report problems that it fixed? Or are you saying no errors were reported at all even though you messed up the data in your database?

But I did not get any errors logged when I rebuild the cache and the published cache status was always:

Are you saying that when you messed up the data in your database, and before you fixed it with the health check, you didn't get errors? If so, that's because we log warnings instead of crashing, so for testing you might just need to check your log for warnings.

@bergmania
Member

So are you saying that you did manually change your database to have corrupted paths and when using the health check it did report problems that it fixed? [...]

Yes, errors were reported and fixed by the health check.

Are you saying that when you messed up the data in your database and before you fixed the data in the healthcheck that you didn't get errors when doing this? If so, that's because we log warnings instead of crashing so you might just need to check your log for warnings in this case for testing.

I can't find the logs, even as warnings, and the NuCache status is always "OK".

@Shazwazza
Contributor Author

@bergmania it looks like you fixed the data issues before rebuilding the cache, so by the time you rebuilt there were no problems left. Mess up your DB first, then try rebuilding the cache, then check for warnings, then use the health check and try again.

@Shazwazza
Contributor Author

@bergmania right, so to get the warnings logged you need to mess the data up pretty good :P The above examples of changing the data do corrupt it, but NuCache will still load it because it can still sort of determine its order. To get the warnings logged, take an item at [level] = 3, change its level to 2, and remove the middle part of its path.

This will cause a warning in your log when you rebuild the caches; then, after you run the health check fix and rebuild the caches, you won't see the warning.

@bergmania
Member

All good then; we can merge when the conflicts are resolved.

# Conflicts:
#	src/Umbraco.Web/PublishedCache/NuCache/ContentStore.cs
@Shazwazza
Contributor Author

sweet, all resolved, will merge 🎉

@Shazwazza Shazwazza merged commit 75655fd into v8/dev Apr 17, 2020
@clausjensen clausjensen deleted the v8/bugfix/7868-nucache-null-parent branch April 30, 2020 18:16
@rbottema
Contributor

rbottema commented May 15, 2020

@Shazwazza Thanks a lot for this fix and especially the SQL Script, this has helped us fix a data corruption issue as well.

I did notice one thing when I applied this: whenever the move action was to the recycle bin, the 'trashed' flag should be set on all nodes (at least, that is what I see for my 'normal' nodes), but it appears that may also have failed in this process. The current fix does not seem to check for and correct this as well. Is that something that needs to be added, or am I making the wrong assumption?

These nodes that are a descendant of the recycle bin, but do not have trashed flag set, give a warning in the logs when the cache is being rebuilt, because their parent was not found.

@Shazwazza
Contributor Author

@rbottema It could/should probably be another check as part of the fixup, any chance you can create a PR for that?

@Rizen-Kris

Was this new health check released? I don't see it in 8.6.2. Thanks!

@Shazwazza
Contributor Author

This is tagged for 8.7.0
