opencache
Opencache is a cache server written in Rust that runs in single-node mode. It is an object-oriented store that provides two APIs, range write and range read, plus background auto eviction.
Cache cluster management is on the client side. Backsourcing is not supported; an application has to explicitly write objects into it.
Protocols
- http2 PUT GET
- (maybe) aws-s3
- (maybe) grpc stream
Storage model
CacheServer stores objects (a byte stream identified by a unique UTF-8 string key) in chunks. Chunks belonging to the same object have equal sizes.
- When reading an object, the first and last chunk may be partially returned if the specified range does not align to the chunk size.
- When writing an object, the first and last non-aligned bytes will be discarded, because opencache stores object content in fixed-size chunks.
Optionally, the chunk size can be specified when an object is added (the first write) and won't be changed afterwards.
Chunk
A Chunk is simply a byte slice.
Chunk size can be specified by an application (the first time an object is written) or decided by the opencache server. The default chunk size is calculated with:
default_chunk_size = max(min(file_size/64, 2M),64K)
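As a quick illustration, here is a minimal Rust sketch of that formula; the function name is made up and not part of the crate's API:

/// Sketch of: default_chunk_size = max(min(file_size/64, 2M), 64K)
fn default_chunk_size(file_size: u64) -> u64 {
    const KB: u64 = 1024;
    const MB: u64 = 1024 * KB;
    (file_size / 64).min(2 * MB).max(64 * KB)
}

For example, a 16MB file gets 16MB/64 = 256K chunks, while a 1GB file is capped at 2M chunks.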
Chunk group
Chunks with different sizes are stored in different chunk groups (ChunkGroup), e.g.:
(0, 4KB]
(4KB, 1MB]
(1MB, 8MB]
(8MB, +oo]
Opencache has several predefined ChunkGroups, and the actual chunk size for an object is the smallest predefined chunk size that is no less than the specified one.
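For illustration only, that rounding rule could look like the following sketch, assuming a non-empty, ascending list of predefined chunk sizes:

/// Pick the smallest predefined chunk size that is no less than `requested`.
/// `predefined` is assumed sorted ascending and non-empty (illustrative).
fn pick_chunk_size(requested: u64, predefined: &[u64]) -> u64 {
    predefined
        .iter()
        .copied()
        .find(|&size| size >= requested)
        .unwrap_or(*predefined.last().unwrap()) // fall back to the largest group
}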
Object
An object in opencache includes an object header and several chunks.
The header contains application metadata in JSON and internal metadata.
The internal metadata contains:
- time when the object is added,
- total size,
- chunk size,
- bitmap of stored chunks
- chunk-id of every stored chunk.
.------------.
| object |
'------------'
| |
| '-------------.
| header |
| |
v |
.-----------------. | .----------.
| json-meta | +------->| chunk-1 |
| size | | '----------'
| chunk-bitmap | |
| chunk-ids | | .----------.
'-----------------' '------->| chunk-2 |
'----------'
Opencache stores objects in a persistent map of <key, header>. This map is loaded into memory when the server starts up.
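A rough sketch of what that map could look like in Rust; all field and type names below are illustrative, not the actual opencache definitions:

use std::collections::HashMap;

// Illustrative header: internal metadata listed above plus application JSON metadata.
struct ObjectHeader {
    added_at: u64,         // time when the object is added
    total_size: u64,       // total size
    chunk_size: u64,       // chunk size
    chunk_bitmap: Vec<u8>, // bitmap of stored chunks
    chunk_ids: Vec<u64>,   // chunk-id of every stored chunk
    json_meta: String,     // application metadata in JSON
}

// The persistent <key, header> map, loaded into memory at startup.
type ObjectMap = HashMap<String, ObjectHeader>;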
Data types
Chunk group id
ChunkGroupId is 2 ASCII chars in a file name, or 1 byte in memory.
ChunkGroupId is an encoded representation of the size of the chunks in it, similar to a floating-point representation.
The first char (or 4 higher bits) is the power.
The second char (or 4 lower bits) is the significand.
        power          significand
ascii:  char 0         char 1
bits:   4 higher bits  4 lower bits
The chunk size is calculated as follows: significand * 16^power.
E.g., the chunk group id of 48K chunks is 3c: 12 * 16^3 = 48K (c = 12).
The max chunk group id is ff: 15 * 16^15 = 15E.
Several ChunkGroupId examples are listed below:
02: 2 * 16^0 = 2      08: 8 * 16^0 = 8
12: 2 * 16^1 = 32     18: 8 * 16^1 = 128
22: 2 * 16^2 = 512    28: 8 * 16^2 = 2K
32: 2 * 16^3 = 8K     38: 8 * 16^3 = 32K
42: 2 * 16^4 = 128K   48: 8 * 16^4 = 512K
52: 2 * 16^5 = 2M     58: 8 * 16^5 = 8M
62: 2 * 16^6 = 32M    68: 8 * 16^6 = 128M
72: 2 * 16^7 = 512M   78: 8 * 16^7 = 2G
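A hedged sketch of the decoding arithmetic for the 1-byte form (high nibble = power, low nibble = significand); the helper names are made up for illustration:

/// Chunk size encoded by a 1-byte ChunkGroupId: significand * 16^power.
fn chunk_size_of(group_id: u8) -> u64 {
    let power = (group_id >> 4) as u32;         // 4 higher bits
    let significand = (group_id & 0x0f) as u64; // 4 lower bits
    significand * 16u64.pow(power)
}

/// The 2-ASCII-char file-name form, e.g. 0x3c -> "3c" (48K chunks).
fn group_id_as_ascii(group_id: u8) -> String {
    format!("{:02x}", group_id)
}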
Chunk id
ChunkId is a u64, in which the most significant byte is the ChunkGroupId and the other 7 bytes are a monotonically increasing 56-bit index.
<ChunkGroupId> <index>
byte: [0] [1, 7]
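A minimal sketch of that packing, assuming the byte layout above (the helper names are illustrative):

/// Pack a ChunkId: <ChunkGroupId (1 byte)> <index (7 bytes)>.
fn make_chunk_id(group_id: u8, index: u64) -> u64 {
    debug_assert!(index < (1u64 << 56), "index must fit in 7 bytes");
    ((group_id as u64) << 56) | index
}

/// Split a ChunkId back into (ChunkGroupId, index).
fn split_chunk_id(chunk_id: u64) -> (u8, u64) {
    ((chunk_id >> 56) as u8, chunk_id & ((1u64 << 56) - 1))
}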
Key
Key is an arbitrary UTF-8 string.
Access entry
A cache replacement algorithm depends on the access log to decide which object/chunk should be evicted.
An access entry is either a Key or a ChunkId, for tracing object access or chunk access respectively.
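One way to model such an entry in Rust; this enum is illustrative, not a type exported by the crate:

// An access-log entry traces either an object access or a chunk access.
enum AccessEntry {
    Key(String), // object access, keyed by the object's UTF-8 key
    Chunk(u64),  // chunk access, keyed by the u64 ChunkId
}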
Storage layout
Opencache on-disk storage includes:
- Manifest file that tracks all other kinds of on-disk files.
- Key files to store object keys and object headers.
- Chunk files to store cached file data.
Manifest file
The Manifest is a snapshot of the current storage, i.e., a list of the files belonging to this snapshot.
data_ver: 0.1.0
key_meta_ver: 0.2.0
keys:
  key-wal-1, key-wal-2...
  key-compacted-1, key-compacted-2...
access:
  acc-1, acc-2...
chunk-groups:
  CG-1, CG-2...
chunks:
  chunk-id-1, chunk-id-2...
Object store
Object store is a collection of files that persist the object map. It includes one active WAL file and several compacted files.
The object header includes the following fields (sketched in code after this list):
- The timestamp when this record is added.
- A flag indicating whether it is an add or a remove operation.
- Total size of this object.
- User metadata in JSON.
- ChunkGroup(CG) this object is allocated in.
- ChunkId list of the chunks that are stored. The ChunkGroup does not need to be stored in this list, because every chunk in an object belongs to the same ChunkGroup.
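A sketch of the record described by this list; the field names and types are assumptions, not the crate's actual on-disk format:

// Illustrative record appended to the object store WAL / compacted files.
struct ObjectRecord {
    ts: u64,                 // timestamp when this record is added
    op: Op,                  // add or remove
    total_size: u64,         // total size of this object
    user_meta: String,       // user metadata in JSON
    chunk_group: u8,         // ChunkGroup (CG) this object is allocated in
    chunk_indexes: Vec<u64>, // stored chunk indexes; the group id needn't be repeated
}

enum Op { Add, Remove }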
Chunk store
Object content is stored in chunks. Chunks of the same size belong to the same ChunkGroup.
Every ChunkGroup consists of one active (writable) chunk file and several closed (read-only) chunk files. The active chunk file is append only.
Deleting is implemented with another per-chunk-file WAL file (ChunkIndex), in which chunk index entries are appended one by one (see the sketch after this list):
- For the active chunk file, ChunkIndex only contains removed chunk indexes: (chunk_id, REMOVE). Because the chunk file and the ChunkIndex file are not fsync-ed when being updated, on restart the present chunk ids are loaded from the chunk file, while the removed chunk ids are loaded from the ChunkIndex file.
- For closed chunk files, ChunkIndex contains both present and removed chunk ids (a chunk is only removed after its file has been closed/compacted): (chunk_id, ADD) or (chunk_id, REMOVE). When the server restarts, chunk indexes are loaded from the ChunkIndex file alone.
Access store
Opencache evicts objects or chunks by access pattern, thus the sequence of recent accesses has to be persisted. Otherwise, when a cache server restarts, cached data would be evicted in an unexpected way due to the lack of access information.
Access data is tracked in two categories: by key and by chunk id.
Access data is stored in several append-only files, containing keys and chunk ids respectively.
Old access data is removed periodically. Access data does not need compaction.
.----------.
| Manifest |
'----------'
Access Access Keys ChunkGroup-i ChunkGroup-j ..
Key Chunk
.----. .----. Active Active
|key1| |cid1| .---------. .----. Chunk ..
|key2| |cid2| | k1,meta | | c1 | Index
| .. | | .. | | k2,meta | .-|-c2 | .----.
|CKSM| |CKSM| | .. | | | .. |<-|cid5|
|key3| |cid3| | CKSM | | | .. | |cid6|
| .. | | .. | | k3,meta | | '----' | .. |
'----' '----' | .. | | .. '----'
.. .. '---------' |
.----. .----. .. |
|keyi| |cid1| |
|keyj| |cid2| Compacted Keys | Compacted Chunks
| .. | | .. | .---------. |
|CKSM| |CKSM| | k1,meta | | .----. Chunk
|keyk| |cid3| | k2,meta | | | c1 | Index
| .. | | .. | | .. | +-|-c2 | .----.
'----' '----' | CKSM | | | .. |<-|cid5|
| k3,meta | | | .. | |cid6|
| .. | | | '----' | .. |
'---------' | '----'
| '---.
v v
KeyMeta Chunk
.----------. .----.
| ts | |data|
| add|rm | |CHSM|
| size | '----'
| userdata |
| CG |
| chunkids |
'----------'
Operations
Write
A cache server does not need strict data durability, thus when new data is written (key, chunk, or access data), it does not have to be fsync-ed at once.
For the different kinds of data, the fsync policies are:
- Keys: Key files are fsync-ed every k MB (where k is a configurable parameter). A checksum record has to be written before the fsync, since on-disk data still has to be internally consistent. The checksum record contains the checksum of the k MB of data excluding itself. In other words, Key files are split into several k MB segments (as sketched below).
- Access data is handled similarly to Key files.
- Chunks are different: every chunk contains a checksum of itself, and chunk files are fsync-ed periodically.
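A simplified sketch of the "checksum then fsync every k MB" policy used for Key/Access files; the writer type and the checksum function are assumptions, not the crate's actual code:

use std::fs::File;
use std::io::{self, Write};

/// Appends records; every `k_bytes` it appends a checksum record and fsyncs,
/// so on-disk data stays internally consistent at segment boundaries.
struct SegmentWriter {
    file: File,
    pending: Vec<u8>, // bytes written since the last checksum record
    k_bytes: usize,
}

impl SegmentWriter {
    fn append(&mut self, record: &[u8]) -> io::Result<()> {
        self.file.write_all(record)?;
        self.pending.extend_from_slice(record);
        if self.pending.len() >= self.k_bytes {
            // Checksum covers the segment data, excluding the checksum record itself.
            let sum = checksum(&self.pending);
            self.file.write_all(&sum.to_le_bytes())?;
            self.file.sync_data()?; // the fsync happens after the checksum record
            self.pending.clear();
        }
        Ok(())
    }
}

/// Toy placeholder; a real implementation would use a proper CRC.
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}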
Update manifest
A write operation does not update the Manifest file.
The Manifest is only replaced when files (Key store files, Chunk store files, or Access store files) are added or removed, or the configuration changes.
E.g., when a new Key file is opened, or a compaction of Chunk files is done.
The Manifest file is named with monotonically increasing integers, e.g., manifest-000021.
Manifest files can only be added or removed. When restarting, opencache lists all of the manifest files and uses the last valid one (checked by its checksum).
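A hedged sketch of that restart step: list the manifest-* files, sort by the zero-padded numeric suffix, and walk backwards to the newest one whose checksum validates. The validation closure is left abstract here:

use std::fs;
use std::path::PathBuf;

/// Pick the newest manifest whose checksum is valid (illustrative).
fn pick_manifest(dir: &str, is_valid: impl Fn(&PathBuf) -> bool) -> Option<PathBuf> {
    let mut manifests: Vec<PathBuf> = fs::read_dir(dir)
        .ok()?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| {
            p.file_name()
                .and_then(|n| n.to_str())
                .map_or(false, |n| n.starts_with("manifest-"))
        })
        .collect();
    manifests.sort(); // zero-padded suffixes sort lexicographically in order
    manifests.into_iter().rev().find(|p| is_valid(p))
}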
Memory layout
Keys and access data are stored in memory for quick access. Chunks can optionally be cached in memory, but it should be OK if there is no in-memory chunk cache.
Keys in memory are just a HashMap: the ObjectMap.
Key access data is fed to the key-eviction algorithm; chunk access data is fed to the chunk-eviction algorithm.
When the cache server restarts, all of the keys are loaded into memory and all access data is replayed by the two eviction algorithm implementations.
Cache Replacement
Objects and chunks are evicted independently.
The rule is: if a chunk is accessed, the object it belongs to is accessed too. This way it is very likely that an object won't be evicted while there are still chunk accesses to it.
- It is possible that a chunk is evicted while the other chunks still reside in the CacheServer.
- It is possible that an object resides in the CacheServer but all its chunks are evicted.
- It is possible that an object is evicted but there are still orphan chunks. These chunks will eventually be evicted since there won't be any access to them.
The cache replacement policy is a modified and simplified version of LIRS.
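As a sketch of the linkage rule above, a chunk access can simply feed both trackers; the tracker trait and the chunk-to-key lookup are illustrative assumptions, and the real policy is the modified LIRS mentioned above:

/// Illustrative eviction tracker interface.
trait EvictionTracker<K> {
    fn on_access(&mut self, key: K);
}

/// When a chunk is accessed, also record an access on its owning object,
/// so an object is unlikely to be evicted while its chunks are still hot.
fn record_chunk_access(
    chunk_id: u64,
    owner_key: &str,
    chunk_tracker: &mut dyn EvictionTracker<u64>,
    object_tracker: &mut dyn EvictionTracker<String>,
) {
    chunk_tracker.on_access(chunk_id);
    object_tracker.on_access(owner_key.to_string());
}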
API
Chunk API
ChunkStorage provides chunk access:
trait ChunkStorage {
    // Parameter types are placeholders; chunk_id is the u64 ChunkId described above.
    fn add(&mut self, chunk_id: u64, bytes: Vec<u8>);
    fn remove(&mut self, chunk_id: u64);
    fn read(&mut self, chunk_id: u64) -> Vec<u8>; // the doc specifies a byte stream
}
Object API
ObjectStorage provides object access:
trait ObjectStorage {
    // Parameter types are placeholders; the doc leaves them unspecified.
    fn add(&mut self, key: String, size: u64, meta: String, preferred_chunk_size: u64);
    fn remove(&mut self, key: &str);
    fn access(&mut self, key: &str);
}
CacheServer API
Server level API
trait Cache {
    // Parameter and return types are placeholders; the doc leaves them unspecified.
    /// Write chunks to an object.
    /// A nonexistent object will be added implicitly.
    fn write(&mut self, key: String, chunks: Vec<Vec<u8>>);
    /// Read the specified `range` of bytes in an object.
    fn read(&mut self, key: &str, range: std::ops::Range<u64>) -> Vec<u8>;
}
Client
Client-side config versioning:
If a new cache node C is added to the cache cluster {A,B}, and assuming the cache client uses consistent hashing, then when the new cluster config {A,B,C} is published to every client, 1/3 of data accesses will encounter a miss.
To upgrade the cluster config smoothly, a client needs to access the cache cluster using two configs: read from the new cluster first and, if there is a miss, read from the old cluster, until data migration is done. During this period there are two active configs, {{A,B},ver=1} and {{A,B,C},ver=2}. It is also possible that there are more than two active configs.
Thus the cache client has to:
- be able to upgrade the cluster config on the fly,
- support a joint config,
- and be able to retry every config (a sketch of this fallback read follows the upgrade example below).
A cluster upgrade may look like this:
- Initially, every cache client has config [{{A,B},ver=1}];
- A new cache node is added, and the new config is published to every client: [{{A,B},ver=1}, {{A,B,C},ver=2}];
- After a while, the config with ver=1 is removed, and another new config is published to every client: [{{A,B,C},ver=2}].
Caveat
Consistency concern
A cache has a consistency issue with the backend storage: if an object is changed in the backend store, there is no way to provide a strongly consistent view on the cache side, unless a distributed consensus protocol runs over both the backend store and the cache (which is expensive).
Thus it is recommended to only store objects that won't change, i.e., an object in the backend will only be added and removed, never modified.
Space management
It's recommended to set the cache space limit to under 80% of the disk, since, in order to reduce IO, the space of removed chunks won't be reclaimed until the next compaction.