Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markus/internal cache #4619

Merged
merged 10 commits into from Oct 28, 2022

Conversation

markus2330
Copy link
Contributor

@markus2330 markus2330 commented Oct 27, 2022

Clarifications for @atmaxinger

Basics

  • Short descriptions of your changes are in the release notes
    (added as entry in doc/news/_preparation_next_release.md which
    contains _(my name)_)
    Please always add something to the release notes.
  • Details of what you changed are in commit messages
    (first line should have module: short statement syntax)
  • References to issues, e.g. close #X, are in the commit messages.
  • The buildservers are happy. If not, fix in this order:
    • add a line in doc/news/_preparation_next_release.md
    • reformat the code with scripts/dev/reformat-all
    • make all unit tests pass
    • fix all memleaks
  • The PR is rebased with current master.

Checklist

  • I added unit tests for my code
  • I fully described what my PR does in the documentation
    (not in the PR description)
  • I fixed all affected documentation (see Documentation Guidelines)
  • I added code comments, logging, and assertions as appropriate (see Coding Guidelines)
  • I updated all meta data (e.g. README.md of plugins and METADATA.ini)
  • I mentioned every code not directly written by me in reuse syntax

Review

Labels

  • Add the "work in progress" label if you do not want the PR to be reviewed yet.
  • Add the "ready to merge" label if the basics are fulfilled and no further pushes are planned by you.

@markus2330 markus2330 mentioned this pull request Oct 27, 2022
22 tasks
Copy link
Member

@kodebach kodebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two more options here:

  1. Change the API and remove KeySet * from kdbGet and kdbSet (also option 4 in [decisions] sequences of kdbGet and kdbSet operations #4574). If the keyset is owned by the KDB handle, it should not be as big surprise, if there is extra data in there. I certainly wouldn't try to asset anything on the contents of a KeySet * that I don't own directly, unless the condition is explicitly documented somwhere.
  2. Make all the keys returned by kdbGet completely read-only. To change the data you need to append an entirely new key to replace the existing one. Then we just need to keep a shallow copy internally.

doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
Comment on lines +60 to +72
### MMAP Cache with parent key

We make the mmap cache non-optional so that we always have a keyset of configuration data internally.
From this keyset, we use `ksBelow` to return the correct keyset.

**Cons:**

- invalidation of OPMPHM

### MMAP Cache without parent key

We make the mmap cache non-optional and only use a single cache, caching everything.
We remove the parent key of `kdbGet` and `kdbSet` and always return the keyset of the whole KDB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For using mmap this much, I think we use pointers too much. I'm sure @mpranj has done benchmarks for the pointer correction code in mmapstorage, but maybe not for huge KeySets. If we cache everything and use it a lot, the pointer correction code might become a bottleneck.

In any case, if we go down this route, we should compare doing the pointer correction with something similar to what Flatbuffers do. AFAIK they don't use pointers and store everything as offsets. The offset is resolved into a pointer when data is accessed. So a KeySet would always be in one large memory buffer and all keys only know the relative offset to their name. When you call keyName that offset is resolved and you get a char *. Would be a huge internal change, may be needed, if mmap the entire KDB (which could be very big).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we cache everything and use it a lot, the pointer correction code might become a bottleneck.

Might, but we don't know that for a fact. I assumed the same as you did, but in my benchmarks the pointer correction was never a bottleneck.

AFAIK they don't use pointers and store everything as offsets.

We also store everything as offsets, it's just that we resolve the offsets eagerly.
Definitely a bigger task, but at least it's quite clear what to do here ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the current solution works, it's probably best to leave it. It would be interesting, if resolving the offsets only on access would do be an improvement. But it's probably too hard to benchmark. It will depend on how frequently the data is actually accessed after the mmap cache is loaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will depend on how frequently the data is actually accessed after the mmap cache is loaded.

Indeed.

doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
assert (keyName(key) == keyName(key_dup)); // stays always valid
```

This is already implemented for the MMAP cache, so the implementation should be straightforward (do the same COW duplications as done for MMAP).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, the mmap data is already COW. So actually not "do the same", but more rename the MMAP flags to COW.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had a different flag in mind so that it doesn't interfere with the mmap cache. See 728fec1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got that. What I wanted to say is that mmap already does COW, so we can reuse the code and probably the flag. If there is some code that is only needed for mmap and not COW, we could make mmap set two flags one for mmap and one for the general COW code.

doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
Copy link
Contributor

@atmaxinger atmaxinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this write-up! It now makes the problem much clearer for me. I spend a lot of time today to look over the mmapstorage plugin, and some things in there would really make a lot of sense for general usage of in Elektra.

I think the In-Memory COW cache sounds brilliant, and with some modifications (storing of the original value) we can make this work for change tracking.

We also need to reach an agreement of what we do with the KeySets, as of now it seems that only Keys are COW which leads to some problems with metadata that @kodebach already pointed out.

doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
doc/decisions/internal_cache.md Outdated Show resolved Hide resolved

### In-Memory COW Cache

We keep a duplicated keyset in-memory and tag the keys as copy-on-write (COW).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: As far as I understand, this whole COW only concerns Keys, and not KeySets, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unspecified, depending on the solution with meta-data.

doc/decisions/internal_cache.md Outdated Show resolved Hide resolved
@kodebach
Copy link
Member

I think the In-Memory COW cache sounds brilliant, and with some modifications (storing of the original value) we can make this work for change tracking

AFAICT all the mmap options and the in-memory COW option could be used directly without modifications for change tracking. All these options result in a separate internal copy of the data originally loaded. Because of all them are COW (mmap is just a slightly different approach) when the caller changes the values, the copy inside KDB will remain untouched.

@markus2330
Copy link
Contributor Author

Thank you all for your comments! If there are no further questions about the problem, I would like to merge as draft and let @atmaxinger take over this decision.

@markus2330
Copy link
Contributor Author

@atmaxinger please press "Merge pull request" if you think this is ready for you to take over.

@atmaxinger atmaxinger merged commit 514319e into ElektraInitiative:master Oct 28, 2022
@mpranj mpranj modified the milestones: 0.9.*, 0.9.12 Jan 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants