
Fix moving large amounts of content/media #7907

Merged
merged 16 commits into from Apr 17, 2020

Conversation

Contributor

@Shazwazza Shazwazza commented Apr 7, 2020

It turns out, after investigating issue #7868 (which was originally reported as a NuCache issue), that it's actually a data issue.

NuCache did explode because the data in the umbracoNode table was corrupt, but how did that data get like that? Luckily a couple of community members were able to provide me with database snapshots from before and after they performed some move operations, which seemed to be the trigger for seeing errors. Even luckier, I was able to replicate the problem locally.

So it turns out we have a bug in the moving logic for both media and content, and this same logic is used to delete stuff (move to the bin). The bug only rears its head if there's a lot of stuff being moved in one operation (i.e. if there are a lot of descendant items). The problem is that a bunch of nodes don't end up getting moved at all. During the move operation we page over the descendants of a certain path with a normal paging loop, checking something like while (currentPage * pageSize < totalItems) until paging is complete. That is wrong in this case, because for each page we move the returned descendants to a different path, which means that on the next page there are fewer total items. So the loop exits far too early, before all descendants are moved. This only happens for large amounts, > 500 items (our page size). The result is a bunch of descendant nodes whose paths/levels should have been updated but weren't, leaving them inconsistent with the actual tree data structure. This will end up exploding NuCache, which can no longer traverse or find these items correctly.
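The broken loop can be sketched like this (a simplified Python illustration, not the actual C# repository code; `get_descendants_page`, `move_broken`, and `move_fixed` are made-up names for this sketch):

```python
PAGE_SIZE = 3  # the real page size is 500; 3 keeps the demo small

def get_descendants_page(db, path_prefix, page, page_size):
    """Return one page of nodes under path_prefix, plus the total match count."""
    matches = [n for n in db if n["path"].startswith(path_prefix)]
    return matches[page * page_size:(page + 1) * page_size], len(matches)

def move_broken(db, old_prefix, new_prefix):
    # BUG: each moved page shrinks totalItems on the next query, so
    # currentPage * pageSize overtakes totalItems and the loop exits early.
    page = 0
    _, total = get_descendants_page(db, old_prefix, page, PAGE_SIZE)
    while page * PAGE_SIZE < total:
        items, total = get_descendants_page(db, old_prefix, page, PAGE_SIZE)
        for n in items:
            n["path"] = new_prefix + n["path"][len(old_prefix):]
        page += 1

def move_fixed(db, old_prefix, new_prefix):
    # Fix: keep re-reading page 0 until nothing is left under the old path,
    # since moved items drop out of the query on their own.
    while True:
        items, _ = get_descendants_page(db, old_prefix, 0, PAGE_SIZE)
        if not items:
            break
        for n in items:
            n["path"] = new_prefix + n["path"][len(old_prefix):]

db = [{"path": f"-1,100,{i}"} for i in range(7)]
move_broken(db, "-1,100,", "-1,200,")
left_behind = [n for n in db if n["path"].startswith("-1,100,")]
print(f"broken move left {len(left_behind)} of 7 descendants behind")
# → broken move left 3 of 7 descendants behind
```

With 7 descendants and a page size of 3, the broken loop moves only 4 nodes before exiting; the remaining 3 keep their old path and level, which is exactly the inconsistency described above.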

After researching #7756, it also means that if there is corrupted data you will get YSOD errors when modifying document types.

Performance

The move logic incurs an extraordinary number of CPU cycles and SQL calls: moving a branch with 2700 descendants took more than 5 minutes on my computer. I've fixed this; it now takes 20 seconds.

That is a 96% improvement.

New Feature

I've created a new health check and health check category for checking data integrity. It validates that paths/levels are correct for content/media in the umbracoNode table, and has a button to attempt a fix.


Changes

  • Reduce allocations by using value tuples instead of class Tuples
  • Reduce allocations and CPU by not re-creating the same Query object for each page
  • Change the document/media repositories' PersistUpdatedItem method to check whether the item being updated is explicitly just being moved, which we can detect from the properties that have changed. In that case, if it's a descendant being moved, we don't have to do the majority of checks and updates that the repository normally does: no new versions, tag changes, relations, publish states, etc. We just need to update the umbracoNode table and raise events
  • Add some null checks in NuCache so it doesn't explode if there are data inconsistencies, and add warning logs
  • New service/repo methods for content/media for verifying and fixing data inconsistencies
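The "move-only" fast path in the third bullet can be sketched as follows (illustrative Python under assumed names such as `MOVE_ONLY_PROPS` and `is_move_only`; the real implementation is the C# PersistUpdatedItem method):

```python
# Properties a plain move would dirty; anything beyond these means a real edit.
MOVE_ONLY_PROPS = {"Path", "Level", "ParentId", "UpdateDate"}

calls = []  # records which persistence path was taken, for demonstration

def update_node_row(item):
    calls.append(("node-only", item["id"]))  # only touch the umbracoNode table

def full_persist(item):
    calls.append(("full", item["id"]))       # versions, tags, relations, publish state, ...

def is_move_only(dirty_props):
    """True when every changed property is one a plain move would touch."""
    return bool(dirty_props) and set(dirty_props) <= MOVE_ONLY_PROPS

def persist_updated_item(item):
    if is_move_only(item["dirty"]):
        update_node_row(item)  # cheap path: update node row, raise events
    else:
        full_persist(item)     # the normal, expensive persist

persist_updated_item({"id": 1, "dirty": {"Path", "Level"}})
persist_updated_item({"id": 2, "dirty": {"Path", "Name"}})
print(calls)
# → [('node-only', 1), ('full', 2)]
```

The point of the design is that a descendant caught up in a move has nothing changed except its tree position, so the repository can skip everything except the node-table update and events.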

Testing notes

You can test this all out by manually messing your database up. Here's how:

execute this SQL:

```sql
SELECT id, [level], [path], parentId FROM umbracoNode
WHERE nodeObjectType = 'B796F64C-1F99-4FFB-B886-4BF4BC011A9C'
ORDER BY [level], parentId, sortOrder
```

and for rows that are not at the root, you can muck with the data; for example, such changes can cause 4 different types of path problems:

(screenshot of example data changes)
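The rule these edits break is what the new health check verifies: every node's path must be its parent's path plus its own id, and its level must be the parent's level plus one. A rough sketch of that check (illustrative Python with a made-up `find_path_problems` helper, not the actual service code):

```python
def find_path_problems(nodes):
    """Report nodes whose path/level disagree with their parent's path/level."""
    by_id = {n["id"]: n for n in nodes}
    problems = []
    for n in nodes:
        if n["parentId"] == -1:
            # Root-level nodes: path is "-1,{id}" and level is 1.
            expected_path, expected_level = f"-1,{n['id']}", 1
        else:
            parent = by_id.get(n["parentId"])
            if parent is None:
                problems.append((n["id"], "missing parent"))
                continue
            expected_path = f"{parent['path']},{n['id']}"
            expected_level = parent["level"] + 1
        if n["path"] != expected_path:
            problems.append((n["id"], "bad path"))
        elif n["level"] != expected_level:
            problems.append((n["id"], "bad level"))
    return problems

tree = [
    {"id": 100, "parentId": -1, "path": "-1,100", "level": 1},
    {"id": 200, "parentId": 100, "path": "-1,100,200", "level": 2},
    {"id": 300, "parentId": 200, "path": "-1,100,300", "level": 3},  # corrupted
]
print(find_path_problems(tree))
# → [(300, 'bad path')]
```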

If you do that, restart your website, and then with this PR go to the Published Status dashboard and rebuild the database cache and then the memory cache:

(screenshot of the Published Status dashboard)

You will not get any YSODs, but you will see some warnings logged because the cache structure cannot be created from the corrupt data (this also depends on how much data and nested data you have).

Then go to the new health check and execute it; you will see that it fixes these problems and produces a report.

Then go back and execute the steps on the Published Status dashboard and you will no longer get warnings in your log.

@Shazwazza Shazwazza added type/bug release/8.7.0 category/performance Fixes for performance (generally cpu or memory) fixes labels Apr 7, 2020
@stefankip

stefankip commented Apr 7, 2020

@Shazwazza I'm afraid I'm running into the same corruption issue. I get 2 NuCache errors when copying a node with its descendants and then trying to publish the node after the copy (without descendants).
This is the error I get:

An error occured
Object reference not set to an instance of an object.

Exception details:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.ClearBranchLocked(ContentNode content) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\ContentStore.cs:line 878
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.SetBranchLocked(Int32 rootContentId, IEnumerable`1 kits) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\ContentStore.cs:line 802
   at Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.NotifyLocked(IEnumerable`1 payloads, Boolean& draftChanged, Boolean& publishedChanged) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:line 731
   at Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService.Notify(JsonPayload[] payloads, Boolean& draftChanged, Boolean& publishedChanged) in D:\a\1\s\src\Umbraco.Web\PublishedCache\NuCache\PublishedSnapshotService.cs:line 668
   at Umbraco.Web.Cache.ContentCacheRefresher.NotifyPublishedSnapshotService(IPublishedSnapshotService service, AppCaches appCaches, JsonPayload[] payloads) in D:\a\1\s\src\Umbraco.Web\Cache\ContentCacheRefresher.cs:line 132
   at Umbraco.Web.Cache.ContentCacheRefresher.Refresh(JsonPayload[] payloads) in D:\a\1\s\src\Umbraco.Web\Cache\ContentCacheRefresher.cs:line 104
etc...
```

The other one is:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at Umbraco.Web.PublishedCache.NuCache.ContentStore.AddTreeNodeLocked(ContentNode content, LinkedNode`1 parentLink)
```

But I'm unable to reproduce that one.
Is it possible to test your improvements (data integrity fixes) on this solution?

@clausjensen
Contributor

@stefankip I'm a little unsure what you're reporting here. Are you reporting that you're still having issues or are you reporting that you're seeing the same (initial) issue as others were seeing?

@stefankip

I think I'm having the same issue, but I don't know how to tell. That's what I'm trying to achieve here; is there a way for me to check and maybe even fix the issue?

@jvtroyen

jvtroyen commented Apr 7, 2020

Same issue here. While we wait for the PR to be processed and released, how exactly can we identify these "broken" items in the database?

@Shazwazza
Contributor Author

I'm still finalizing this work. Once that is done, the best thing I'll be able to advise for your current installation will be to copy the code required for the new health check and re-create the health check in your installation. As there are quite a lot of code changes here, the fix would need to wait for a minor release. Stay tuned, I'll post more info soon.

@Shazwazza
Contributor Author

Shazwazza commented Apr 8, 2020

@stefankip the issue I'm resolving is specifically about moving nodes, and only move operations that contain more than 500 nodes (i.e. when there are more than 500 descendants of the node being moved). I haven't looked at the copy logic, and the null reference you've pointed out isn't something I've looked at either. If the copy logic had a similar issue, it would also only be affected with large amounts of descendants (more than 500). Is this the case for you? Or have you perhaps previously moved large amounts of data?
(I've just looked at the Copy logic and it wouldn't be affected by the same problem as Move, though I've added TODO notes about performance improvements for Copy.)

@Shazwazza
Contributor Author

Hi all,

I have created a public gist with an SQL script which will report and fix path issues. This is a workaround you can use for now if you have this issue. Back up your DB before you execute it, and use the script at your own risk; I have not tested it on production sites. As the notes say, you can do a dry run without changing data. I'd suggest you do that and then manually verify any nodes in the error report to validate that the data is inconsistent. Works on my machine :)

https://gist.github.com/Shazwazza/1a04dcd1c1b6f16d3b7f167874770a84

@Shazwazza Shazwazza marked this pull request as ready for review April 8, 2020 03:37
@jvtroyen

jvtroyen commented Apr 8, 2020

I have created a public GIST with an SQL script which will report and fix path issues.

https://gist.github.com/Shazwazza/1a04dcd1c1b6f16d3b7f167874770a84

Hi @Shazwazza ,
I was hoping the timezone difference would enable you to reply with an SQL tip about how to spot the issue, and it paid off! :-)
Couldn't sleep all night, kept waking up with code fragments flashing in front of my eyes. I have a day off, but I want to fix this or I'll go crazy.

TL;DR I think there's a way of corrupting items when a moved item is transferred via uSync.

Back to business : I ran your script for contentNodes on a copy of the database and found 106 corrupt items. Just the report kind, no actual fixing yet, but this confirms my theory that this issue applies to my situation.

Just FYI, I hope it helps if I explain my theory shedding some light on the corruption after moving an item. I suspect the following to be the case (so something other than moving 500+ items):

  • create items at e.g. lvl 4 on source env
  • save via usync
  • transfer to target env
  • import via usync
    THEN
  • move items 1 lvl deeper on source
  • save via usync
  • import via usync on target

I noticed the items are where they should be in the target content tree, but when re-exported to usync files, they still display level 4 instead of what should be level 5. Manually fixing the level in the usync-file and re-importing via usync does not alleviate the situation.

I was first alerted to a problem when selecting an item at lvl 5 in the content tree: after loading the item, the tree collapses back to lvl 4 (it should remain open and highlight the selected item).

For one item I identified this way, I could trigger a fix by moving it to a different parent and then back to where it needs to be.
I needed a query, however, to identify all items. You've gone further and provided a query that can also fix it.

I will try your fix on the database copy and swap umbraco temporarily to that copy to test validity of your fix. Will update when I know more.

@stefankip

@stefankip the issue I'm resolving is specifically about moving nodes, and only move operations that contain more than 500 nodes (i.e. when there are more than 500 descendants of the node being moved). I haven't looked at the copy logic, and the null reference you've pointed out isn't something I've looked at either. If the copy logic had a similar issue, it would also only be affected with large amounts of descendants (more than 500). Is this the case for you? Or have you perhaps previously moved large amounts of data?
(I've just looked at the Copy logic and it wouldn't be affected by the same problem as Move, though I've added TODO notes about performance improvements for Copy.)

I was hoping corrupted data was the cause of the errors I was having, maybe due to large copy actions from the past, but the SQL script didn't report any issues, so I'll have to look further for some other cause.

@Shazwazza
Contributor Author

@stefankip it does seem strange that you have a null ref in NuCache though; be sure you've done a database rebuild and then a memory rebuild in your Published Caches dashboard. I had a look at one of the null refs you reported, and in the code there doesn't seem to be a way it can be null; I've added a new null check and notes there though: https://github.com/umbraco/Umbraco-CMS/pull/7907/files#diff-1983717019d1fd1da9fa80f10ab00b38R885

@stefankip

Yeah, I'm running v8.6.0 and have also tried all the buttons inside the NuCache dashboard for rebuilding and cleaning.

@Shazwazza
Contributor Author

@stefankip if you can replicate the null reference problem now with your current DB, please take a backup, and if you wouldn't mind, send it to shannon@umbraco.com and I can figure out what is going on.

@stefankip

@Shazwazza that's a really nice gesture, thanks! I don't mind sending the DB. Do you only need the DB or also the solution?

@jvtroyen

jvtroyen commented Apr 8, 2020

Hi @Shazwazza ,
I've finished testing your SQL procedure and happy to report all problems seem to be solved afterwards.
I can accomplish the same fix by moving the "corrupt" items back and forth, but the procedure is faster of course.

After this, I can rebuild the database and memory cache without getting the aforementioned error:

```
Unhandled controller exception occurred for request '"https://wedpaststew2umb.azurewebsites.net/umbraco/BackOffice/Api/NuCacheStatus/ReloadCache"
System.NullReferenceException: Object reference not set to an instance of an object.
```

@Shazwazza
Contributor Author

@stefankip much appreciated, thanks for sharing your solution. I have managed to reproduce it with your solution and DB, though strangely I could not reproduce the problem with the Umbraco source and your DB 🤔 which is super weird. I copied the Umbraco build DLLs into your solution and the problem would still occur... that's a good thing, since it let me figure out the problem. Phew! I don't quite understand yet how this could not be replicated on a normal site; I'm going to set up a unit test for this specific thing. The underlying problem is that ClearBranchLocked recurses on itself and nulls out the value in a collection; after a recursion, the object we are working with in the loop is null because it references what was nulled in the collection, though the problem only occurs in a specific circumstance. Anyway, it's great we were able to see the problem. This is another one of those things that, without the community's help in reproducing the issue, would have been extremely hard to track down. I'll create a new PR once I can replicate this in a unit test so you can track it from there.
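As a loose analogy of that failure mode (Python standing in for the C# in ContentStore.cs; the node structure and function names here are assumptions for illustration, not the actual NuCache code): a recursive clear that nulls entries in a shared collection can leave the caller's loop holding a stale reference.

```python
# Shared id -> node map; an entry becomes None once its branch is cleared.
nodes = {}

def reset():
    nodes.clear()
    nodes.update({
        1: {"children": [2]},
        2: {"children": [3]},
        3: {"children": []},
    })

def clear_branch_buggy(node_id):
    node = nodes[node_id]
    for child_id in list(node["children"]):
        clear_branch_buggy(child_id)  # the recursion nulls nodes[child_id]
        child = nodes[child_id]       # stale re-read: now None
        child["children"] = []        # TypeError, like the C# NullReferenceException

    nodes[node_id] = None

def clear_branch_fixed(node_id):
    node = nodes.get(node_id)
    if node is None:
        return                        # defensive null check, as this PR adds
    for child_id in list(node["children"]):
        clear_branch_fixed(child_id)
        child = nodes[child_id]
        if child is not None:         # guard the stale re-read too
            child["children"] = []
    nodes[node_id] = None
```

Running `clear_branch_buggy(1)` on the three-node chain raises once the recursion has replaced a child entry with None; the guarded version clears the whole branch cleanly.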

@stefankip

@Shazwazza nice, much appreciated! Looking forward to the PR!

@Shazwazza
Contributor Author

@stefankip here's the PR #7961

Member

@bergmania bergmania left a comment


Only small comments.

The code changes look fine, my tests don't find anything broken, and I could fix the manually corrupted data using the health check.

But I did not get any errors logged when I rebuilt the cache, and the published cache status was always:

Database cache is ok. ...

```csharp
private HealthCheckStatus CheckMedia(bool fix)
{
    return CheckPaths(_fixMediaPaths, _fixMediaPathsTitle, Core.Constants.UdiEntityType.Media, fix,
        () => _mediaService.CheckDataIntegrity(new ContentDataIntegrityReportOptions {FixIssues = fix}));
}
```
Member


Is this safe without locking?

Contributor Author


This is just reading data, so there's no need for a concurrent lock; there's an SQL lock anyway.

```csharp
private HealthCheckStatus CheckDocuments(bool fix)
{
    return CheckPaths(_fixContentPaths, _fixContentPathsTitle, Core.Constants.UdiEntityType.Document, fix,
        () => _contentService.CheckDataIntegrity(new ContentDataIntegrityReportOptions {FixIssues = fix}));
}
```
Member


Is this safe without locking?

Contributor Author


There is an SQL write lock during the reading and fixing

@Shazwazza
Contributor Author

@bergmania can you clarify what you mean here:

Code changes looks fine, and my tests don't find anything broken, and I could fix the manuel corrupted data using the healthcheck.

So are you saying that you did manually change your database to have corrupted paths and when using the health check it did report problems that it fixed? Or are you saying no errors were reported at all even though you messed up the data in your database?

But I did not get any errors logged when I rebuild the cache and the published cache status was always:

Are you saying that when you messed up the data in your database, and before you fixed it with the health check, you didn't get errors? If so, that's because we log warnings instead of crashing, so for testing you might just need to check your log for warnings.

@bergmania
Member

So are you saying that you did manually change your database to have corrupted paths and when using the health check it did report problems that it fixed? [...]

Yes, errors were reported and fixed by the health check.

Are you saying that when you messed up the data in your database and before you fixed the data in the healthcheck that you didn't get errors when doing this? If so, that's because we log warnings instead of crashing so you might just need to check your log for warnings in this case for testing.

I can't find the logs, even as warnings, and the NuCache status is always "OK".

@Shazwazza
Contributor Author

@bergmania it looks like you fixed the data issues before rebuilding the cache, so by the time you rebuilt there were no problems left. Mess up your DB first, then try rebuilding the cache, then check for warnings, then use the health check and try again.

@Shazwazza
Contributor Author

@bergmania right, so to get the warnings logged you need to mess the data up pretty good :P The above examples of changing the data do corrupt it, but NuCache will still load it because it can still sort of determine its order. To get the warnings logged, take an item at [level] = 3, change its level to 2, and remove the middle part of its path.

This will cause a warning in your log when you rebuild the caches; then, after you run the health check fix and rebuild the caches, you won't see the warning.

@bergmania
Member

All good then; we can merge when the conflicts are resolved.

# Conflicts:
#	src/Umbraco.Web/PublishedCache/NuCache/ContentStore.cs
@Shazwazza
Contributor Author

sweet, all resolved, will merge 🎉

@Shazwazza Shazwazza merged commit 75655fd into v8/dev Apr 17, 2020
@clausjensen clausjensen deleted the v8/bugfix/7868-nucache-null-parent branch April 30, 2020 18:16
@rbottema
Contributor

rbottema commented May 15, 2020

@Shazwazza Thanks a lot for this fix and especially the SQL Script, this has helped us fix a data corruption issue as well.

I did notice one thing when I applied this: whenever the move action was to the recycle bin, the 'trashed' flag should be set on all nodes (at least, that is what I see for my 'normal' nodes), but it appears that may also have failed in this process. The current fix does not seem to check for and correct this as well. Is that something that needs to be added, or am I making the wrong assumption?

These nodes that are a descendant of the recycle bin, but do not have trashed flag set, give a warning in the logs when the cache is being rebuilt, because their parent was not found.

@Shazwazza
Contributor Author

@rbottema It could/should probably be another check as part of the fixup, any chance you can create a PR for that?

@Rizen-Kris

Was this new health check released? I don't see it in 8.6.2. Thanks!

@Shazwazza
Contributor Author

This is tagged for 8.7.0
