48 stable releases

2.1.4 Nov 27, 2024
2.1.0 Oct 29, 2024
2.0.3 Jul 20, 2024
1.18.26 Oct 11, 2024
0.0.0 Dec 1, 2023




Apache-2.0

475KB
9K SLoC

The task (transaction) scheduling code for the unified scheduler

High-level API and design

The most important type is SchedulingStateMachine. It takes new tasks (= transactions) and may immediately return them as runnable via ::schedule_task(), while maintaining the account readonly/writable lock rules. The returned runnable tasks are guaranteed to be safe to execute in parallel. Lastly, SchedulingStateMachine should be notified of the completion of each execution via ::deschedule_task(), so that conflicting tasks can be returned from ::schedule_next_unblocked_task() as newly-unblocked runnable ones.
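The intended call pattern can be sketched with a toy model. Note that `ToyStateMachine` and `Task` below are illustrative stand-ins, not this crate's actual types; the real SchedulingStateMachine decides runnability from per-account lock state rather than an explicit flag:

```rust
use std::collections::VecDeque;

// Illustrative placeholder for this crate's task type.
struct Task(&'static str);

// Toy stand-in for SchedulingStateMachine; conflict detection is faked
// with an explicit `conflicts` flag to keep the sketch short.
#[derive(Default)]
struct ToyStateMachine {
    blocked: VecDeque<Task>, // tasks deferred due to lock conflicts
}

impl ToyStateMachine {
    // Returns the task if it is immediately runnable; otherwise buffers it.
    fn schedule_task(&mut self, task: Task, conflicts: bool) -> Option<Task> {
        if conflicts {
            self.blocked.push_back(task);
            None
        } else {
            Some(task)
        }
    }

    // Called after a task finishes executing, releasing its locks.
    fn deschedule_task(&mut self, _finished: &Task) {}

    // Tasks unblocked by the most recent deschedule, in FIFO order.
    fn schedule_next_unblocked_task(&mut self) -> Option<Task> {
        self.blocked.pop_front()
    }
}

fn main() {
    let mut sm = ToyStateMachine::default();
    // Non-conflicting task: returned right away as runnable.
    let t1 = sm.schedule_task(Task("t1"), false).expect("runnable");
    // Conflicting task: held back for now.
    assert!(sm.schedule_task(Task("t2"), true).is_none());
    // t1 finishes; its locks are released, so t2 becomes runnable.
    sm.deschedule_task(&t1);
    let t2 = sm.schedule_next_unblocked_task().expect("unblocked");
    println!("{} then {}", t1.0, t2.0);
}
```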

The design principle of this crate (solana-unified-scheduler-logic) is simplicity for the sake of separation of concerns. It is driven by solana-unified-scheduler-pool through only a few public APIs. This crate knows nothing about banks, slots, solana-runtime, threads, or crossbeam-channel. Because of this, it's deterministic, easy to unit-test, and its perf footprint is well understood. It focuses on its single job: sorting transactions into an executable order.

Algorithm

The algorithm is based on per-address FIFO queues, which are updated both when a new task arrives (called scheduling) and when a runnable (= post-scheduling) task finishes (called descheduling).

For the non-conflicting scheduling case, the story is very simple: the state machine just remembers that each accessed address is write-locked or read-locked, along with the number of active (= currently-scheduled-and-not-yet-descheduled) tasks. Correspondingly, descheduling does the opposite bookkeeping, regardless of whether the finished task was conflicting or not.
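The non-conflicting bookkeeping amounts to per-address read/write lock counting. Here is a minimal self-contained sketch of that idea; the struct and method names are made up for illustration and are not this crate's API:

```rust
use std::collections::HashMap;

// Per-address lock state: either one writer, or any number of readers.
#[derive(Default)]
struct LockState {
    write_locked: bool,
    active_readers: u32,
}

// Hypothetical bookkeeping table keyed by account address.
#[derive(Default)]
struct Book(HashMap<&'static str, LockState>);

impl Book {
    // Scheduling side: returns true if the lock was taken without conflict.
    fn try_lock(&mut self, addr: &'static str, writable: bool) -> bool {
        let s = self.0.entry(addr).or_default();
        if writable {
            if s.write_locked || s.active_readers > 0 {
                return false;
            }
            s.write_locked = true;
        } else {
            if s.write_locked {
                return false;
            }
            s.active_readers += 1;
        }
        true
    }

    // Descheduling side: the opposite bookkeeping.
    fn unlock(&mut self, addr: &'static str, writable: bool) {
        let s = self.0.get_mut(addr).unwrap();
        if writable {
            s.write_locked = false;
        } else {
            s.active_readers -= 1;
        }
    }
}

fn main() {
    let mut book = Book::default();
    assert!(book.try_lock("A", false)); // reader 1 on A
    assert!(book.try_lock("A", false)); // reader 2: readers don't conflict
    assert!(!book.try_lock("A", true)); // a writer conflicts with readers
    book.unlock("A", false);
    book.unlock("A", false);
    assert!(book.try_lock("A", true)); // now the write lock succeeds
    println!("ok");
}
```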

For the conflicting scheduling case, each non-conflicting address is recorded just like in the non-conflicting case above. As for the conflicting addresses, the task is appended to the respective FIFO queues attached to those addresses. Importantly, the number of the task's conflicting addresses is also remembered.

The last missing piece is that the scheduler actually tries to reschedule previously blocked tasks while descheduling, in addition to the above-mentioned bookkeeping. Namely, when a given address becomes ready for fresh locking as a result of descheduling a task (i.e. its write lock is released, or its read-lock count has reached zero), the state machine pops the first element off the address's FIFO blocked-task queue. It then immediately marks the address as relocked, and decrements the popped-out task's count of conflicting addresses. As the final step, if that count reaches zero, the task has finished locking all of its addresses and is routed directly to the runnable set. Lastly, if the next head of the blocked-task queue is trying to read-lock the address just like the popped-out one, this rescheduling is repeated, as an optimization to increase the parallelism of task execution.
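The unblocking step above can be sketched for a single address. This is a simplified model, not the crate's implementation: `UsageQueue` and `pop_unblocked` are illustrative names, and `remaining[id]` stands in for each task's remembered count of conflicting addresses:

```rust
use std::collections::VecDeque;

// One address's lock state plus its FIFO queue of blocked tasks.
#[derive(Default)]
struct UsageQueue {
    write_locked: bool,
    active_readers: u32,
    // Blocked (task_id, wants_write) pairs, in arrival order.
    blocked: VecDeque<(usize, bool)>,
}

// After a lock on this address is released, pop newly unblockable tasks.
// `remaining[id]` counts the addresses task `id` is still blocked on; the
// task becomes runnable when it hits zero. Consecutive readers are popped
// together to increase parallelism; a writer ends the chain.
fn pop_unblocked(q: &mut UsageQueue, remaining: &mut [usize]) -> Vec<usize> {
    let mut runnable = Vec::new();
    while let Some(&(id, wants_write)) = q.blocked.front() {
        let conflict = if wants_write {
            q.write_locked || q.active_readers > 0
        } else {
            q.write_locked
        };
        if conflict {
            break;
        }
        q.blocked.pop_front();
        // Immediately relock the address on behalf of the popped-out task.
        if wants_write {
            q.write_locked = true;
        } else {
            q.active_readers += 1;
        }
        remaining[id] -= 1;
        if remaining[id] == 0 {
            runnable.push(id);
        }
        if wants_write {
            break;
        }
    }
    runnable
}

fn main() {
    // Task 0 holds the write lock; tasks 1 and 2 want to read, task 3 to write.
    let mut q = UsageQueue { write_locked: true, ..Default::default() };
    q.blocked.extend([(1, false), (2, false), (3, true)]);
    let mut remaining = [0, 1, 1, 1];

    // Deschedule task 0: release its write lock, then reschedule.
    q.write_locked = false;
    let runnable = pop_unblocked(&mut q, &mut remaining);

    // Both readers were unblocked together; the writer still waits.
    assert_eq!(runnable, vec![1, 2]);
    assert_eq!(q.blocked.front(), Some(&(3, true)));
    println!("unblocked: {:?}", runnable);
}
```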

Put differently, this algorithm tries to gradually lock all of a task's addresses at different timings, while not deviating from the original task ingestion order. This implies there are no locking retries in general, which are the primary source of non-linear perf degradation.

As a ballpark number from a synthesized micro benchmark on a typical CPU for mainnet-beta validators, it takes roughly 100ns to schedule and deschedule a transaction with 10 accounts, and 1us for a transaction with 100 accounts. Note that this entirely excludes crossbeam communication overhead. That said, it's not unrealistic to say the whole unified scheduler can attain 100k-1m tps overall, assuming those transaction executions aren't the bottleneck.
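As a quick sanity check on why those benchmark numbers leave headroom for the claimed throughput:

```rust
fn main() {
    // ~100ns to schedule + deschedule one 10-account transaction puts a
    // rough upper bound on a single scheduler thread's throughput.
    let ns_per_task = 100u64;
    let tasks_per_second = 1_000_000_000 / ns_per_task;
    assert_eq!(tasks_per_second, 10_000_000);
    // Even allowing 10-100x overhead for communication and execution, a
    // 10M tasks/s ceiling is consistent with the 100k-1m tps claim.
    println!("upper bound: {} tasks/s", tasks_per_second);
}
```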

Runtime performance characteristics and data structure arrangement

Its algorithm is very fast for high throughput and real-time for low latency. The whole unified-scheduler architecture is designed from the ground up to support the fastest possible execution of this scheduling code. To that end, the unified scheduler pre-loads the address-specific locking state data structures (called UsageQueues) for all of a transaction's accounts, in order to offload that work from the scheduler thread to other threads. This preloading is done inside create_task(). In this way, the computational complexity of task scheduling is basically reduced to several word-sized loads and stores on the scheduler thread (i.e. constant; no allocations nor syscalls), while being proportional to the number of addresses in a given transaction. Note that this statement holds true regardless of conflicts, because the preloading also pre-allocates a scratch-pad area (blocked_usages_from_tasks) to stash blocked tasks. So, a conflict only incurs a fixed number of additional memory stores, within the error margin of the constant complexity; any additional memory allocation for the scratchpad can be considered amortized, should such an unusual event occur.
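The preloading idea can be sketched as follows. `UsageQueueLoader` and its `load` method are hypothetical names for illustration; the point is that address-to-queue resolution (the hash-map lookups) happens at task creation, off the scheduler thread, so the task carries pre-resolved shared queues:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the crate's per-address UsageQueue; lock state would live here.
#[derive(Default)]
struct UsageQueue {}

// Hypothetical loader: resolves addresses to shared UsageQueues ahead of
// time, on a thread other than the scheduler thread.
#[derive(Default)]
struct UsageQueueLoader(HashMap<String, Arc<UsageQueue>>);

impl UsageQueueLoader {
    fn load(&mut self, address: &str) -> Arc<UsageQueue> {
        Arc::clone(self.0.entry(address.to_string()).or_default())
    }
}

// A task carries its pre-resolved queues, so scheduling it is just a walk
// over word-sized Arc pointers: no hash-map lookups, no allocations.
struct Task {
    usage_queues: Vec<Arc<UsageQueue>>,
}

fn create_task(loader: &mut UsageQueueLoader, addresses: &[&str]) -> Task {
    Task {
        usage_queues: addresses.iter().map(|a| loader.load(a)).collect(),
    }
}

fn main() {
    let mut loader = UsageQueueLoader::default();
    let t1 = create_task(&mut loader, &["alice", "bob"]);
    let t2 = create_task(&mut loader, &["bob"]);
    // Tasks touching the same account share one UsageQueue allocation.
    assert!(Arc::ptr_eq(&t1.usage_queues[1], &t2.usage_queues[0]));
    println!("distinct queues: {}", loader.0.len());
}
```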

[Arc] is used to implement this preloading mechanism, because UsageQueues are shared across tasks accessing the same account, and across threads due to the preloading. Interior mutability is also needed. However, SchedulingStateMachine doesn't use conventional locks like RwLock. Leveraging the fact that it runs on the only state-mutating, exclusive thread, it instead uses UnsafeCell, sugar-coated by a tailored wrapper called TokenCell. TokenCell imposes an overly restrictive aliasing rule via the Rust type system to maintain memory safety. By localizing all synchronization to the message passing, the scheduling code itself attains the fastest possible single-threaded execution without stalling CPU pipelines, constrained only by memory access latency, while efficiently keeping the L1-L3 CPU caches full of UsageQueues.
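The TokenCell idea can be illustrated with a stripped-down sketch (not the crate's actual implementation; the real TokenCell carries more safety machinery). The key point is that mutation requires `&mut` on a token the caller promises is unique, so exclusive access is proven by the borrow checker instead of a runtime lock:

```rust
use std::cell::UnsafeCell;
use std::marker::PhantomData;

// A token that is !Send + !Sync (via the raw-pointer PhantomData), so it
// cannot leak to another thread once created.
struct Token(PhantomData<*mut ()>);

impl Token {
    // SAFETY contract (simplified): the caller must create at most one
    // Token, so holding `&mut Token` proves exclusive access to every
    // cell gated by it.
    unsafe fn assume_exclusive() -> Self {
        Token(PhantomData)
    }
}

// Sugar-coated UnsafeCell: no atomics, no RwLock, no runtime checks.
struct TokenCell<T>(UnsafeCell<T>);

impl<T> TokenCell<T> {
    fn new(value: T) -> Self {
        Self(UnsafeCell::new(value))
    }

    // Access requires a unique &mut Token, so the borrow checker rules out
    // aliasing mutable access at compile time.
    fn with_borrow_mut<R>(&self, _token: &mut Token, f: impl FnOnce(&mut T) -> R) -> R {
        f(unsafe { &mut *self.0.get() })
    }
}

fn main() {
    let mut token = unsafe { Token::assume_exclusive() };
    let cell = TokenCell::new(0u64);
    cell.with_borrow_mut(&mut token, |v| *v += 1);
    cell.with_borrow_mut(&mut token, |v| *v += 1);
    assert_eq!(cell.with_borrow_mut(&mut token, |v| *v), 2);
    println!("count = 2");
}
```

To be shared across threads the way UsageQueues are, such a cell would additionally need an (unsafe) `Sync` impl justified by the single-mutating-thread invariant; that part is elided here.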

Buffer bloat insignificance

The scheduler code itself doesn't care about the buffer-bloat problem, which can occur in the unified scheduler where a run of heavily linearized and blocked tasks could be severely hampered by a very large number of runnable tasks interleaved alongside. The reason is, again, separation of concerns. This is acceptable because the scheduling code itself isn't susceptible to buffer bloat, as explained by the description above and validated by the mentioned benchmark. Thus, this problem should be solved elsewhere, specifically in the scheduler pool.

Dependencies

~13–21MB
~330K SLoC