1 unstable release
0.1.0 | May 27, 2019 |
---|
#151 in #progress
89KB
2K
SLoC
git-rs
Implementing git in rust for fun and education!
This is actually my second stab at it, so big blocks will land in place from my first attempt. I'm trying again this year after reading more of "Programming Rust" (Blandy, Orendorff).
TODO
- Read objects from loose store
- Read objects from pack store
- Read packfile indexes
- Read delta'd objects
- Fix interface so we don't need to run
open
for eachread()
- BUG: certain OFS deltas are misapplied.
- Isolate the error case
- Fix it
- Load refs off of disk
- Parse git signatures ("Identity"'s)
- Create iterator for walking commit graph
- Create iterator for walking trees
- Materialize trees to disk (post gitindex?)
- Create index from packfile
- Rename
Storage
trait toQueryable
- Rework object loading API from
<Type + Boxed reader>
to "we take a writable object"- Carry the rework out through
StorageSet
- Carry the rework out through
- Create the index
- Wrap it in a nice API
- Rename
- refs v2
- Load refs on demand
- Load packed-refs
-
.git/index
support- Read git index cache
- Write git index cache
- Create interface for writing new objects
- Add benchmarks
- Create packfile from list of objects (API TKTK)
- Network protocol
- receive-pack
- send-pack
- Try publishing to crates
- Write documentation
- Use crate in another project
PLAN
2019-02-08 Update
- It's been a minute!
- As you might have seen, figuring out packfile indexing has forced a lot of changes on the repo.
- There's now a
src/pack/read.rs
file that holds generic read implementations for anyBufRead + Seek
. - The signature of the
Storage
trait changed -- instead of returning a boxed read object, it now accepts aWrite
destination. - Further,
Storage
is nowQueryable
(a better name!).- Because we moved from returning a Box to accepting generic
Write
, we could no longer boxQueryable
s.- I didn't know this about Rust, so TIL!
StorageSet
objects had to be rethought as a result -- they could no longer contain Box'dStorage
objects.- Instead, we put the compiler to work -- because storage sets are known at compile time, I implemented
Queryable
for the unit type,()
, two types(S, T)
, and arrays of single typesVec<T>
. - This means that a
StorageSet
may hold a single, top-levelQueryable
, which might contain nested heterogenousQueryable
definitions.- It gives me warm, fuzzy feelings 💞
- Instead, we put the compiler to work -- because storage sets are known at compile time, I implemented
- Because we moved from returning a Box to accepting generic
- There's now a
- You might also note that we're not actually done indexing packfiles. 😱
- Here's the sitch: in order to create a packfile index, you have to run a CRC32 over the compressed bytes in the packfile.
- The
ZlibDecoder
will pull more bytes from the underlying stream than it needs, so you can't take the route of handing it aCrcReader
and get good results. - It's got to be a multi-pass deal.
- The current plan is: run one pass to get offsets, un-delta'd shas and types.
- Run a second pass to resolve CRCs and decompress deltas. This can be done in parallel.
2019-01-23 Update
- It's time to start indexing packfiles.
- This'll let us start talking to external servers and cloning things!
- However, it's kind of a pain.
- Packfiles (viewed as a store) aren't hugely useful until you have an index, so
I had designed them as an object that takes an optional index from outside.
- My thinking was that if an index was not given, we would build one in-memory.
- That just blew up in my face, a little bit. 💥
- In order to build an index from a packfile you have to iterate over all of the objects.
- For each object, you want to record the offset and the SHA1 id of the object at that offset.
- However, the object might be an offset or a reference delta.
- That means that in order to index a packfile, you've got to be able
to read delta'd objects at offsets within the packfile (implying you already
have the
Packfile
instance created) and outside of the packfile ( implying you have aStorageSet
.) - In other words: my assumptions about the program design are wrong.
- That means that in order to index a packfile, you've got to be able
to read delta'd objects at offsets within the packfile (implying you already
have the
- So, in the next day or so I'll be reversing course.
- It should be possible to produce a
Packfile
as a non-store object and iterate over it. - The "store" form of a packfile should be the combination of a
Packfile
and anIndex
(aPackfileStore
.)- This means I'll be splitting the logic of
src/stores/mmap_pack
into "sequential packfile reads" and "random access packfile reads (with an index.)"
- This means I'll be splitting the logic of
- It should be possible to produce a
- Packfiles (viewed as a store) aren't hugely useful until you have an index, so
I had designed them as an object that takes an optional index from outside.
- It's fun to be wrong 🎉
2019-01-21 Update
- Well, that was a fun bug. Let's walk through it, shall we?
- This occasionally showed up when a delta would decode another delta'd object.
- I found a hash that would reliably fail to load.
- We'd fail the read because the incoming base object would not match the 2nd delta's "base size". Here.
- Removing the check to see if I got the deltas wrong would cause the thread to panic -- the delta's base size wasn't a lie.
- First, I switched back to my old mmap-less packfile implementation, because I recently
touched that code. "Revert the thing you touched last" is a winning strategy in these
cases: doesn't cost expensive thinking, quickly puts bounds around the bug.
- Alas, the old packfile implementation also had this bug. No dice.
- I compared the output of this git implementation to my JS implementation.
- I confirmed that the output of the JS implementation worked by comparing its output for the hash of concern to vanilla git.
- After confirming that, I logged out the offsets being read from the file and the expected
sizes. I compared this to similar debugging output I added to
git-rs
.- The offsets are the bound values sent into the packfile. For the outermost
read ("Give me the object represented by
eff4c6de1bd15cb0d28581e4da17035dbf9e8596
"), the offsets come from the packfile index. - For
OFS_DELTA
("offset delta") types, the start offset is obtained by reading a varint from the packfile and subtracting it from the current start offset.
- The offsets are the bound values sent into the packfile. For the outermost
read ("Give me the object represented by
- The offsets and expected sizes matched!
- This meant that:
- I was reading the correct data from the packfile
- I was reading varints correctly
- The bug must be in the application of the delta to a base object
- This meant that:
- From there I added logging above these state transitions, noting the particulars of the
operation.
- I added the same logging to the JS implementation, and found that (aside from totally bailing out ahead of the 2nd delta application) the commands were the same.
- So it wasn't even that my delta code was wrong: it was my
Read
state machine.
- At this point, I was like: "This is a read state machine bug. I know this."
- So, one of the things this state machine does is carefully bail out if it
can't write all of the bytes for a command. ("If there remains an
extent
to write, record the next state and break the loop.") - However, at this point we've already consumed the last command. There are no more instructions.
- So if this function were to be called again, ...
- We would politely (but firmly) tell the caller to buzz off (
written == 0
, here.).
- We would politely (but firmly) tell the caller to buzz off (
- So, one of the things this state machine does is carefully bail out if it
can't write all of the bytes for a command. ("If there remains an
- The fix turned out to be simple, as these fixes usually are.
- (I need to write a test for this, I know. I know. Pile your shame on me.)
- This occasionally showed up when a delta would decode another delta'd object.
- So what did we learn?
- Always test your state machines, folks.
- (I've said it once, and I'm saying it again.) Malleable reference implementations will save your bacon.
- Make sure you can trust your reference implementation.
- Anyway. The tree reader works now! 🌲🌳🌲
2019-01-19 Update
- It's slightly faster! 🎉
- mmap sped things along nicely, shaving 20ms off of our runtime.
- We're still reliably slower than git, though. It might be because we load the refset immediately.
- I kept the immutable "file per read" packfile store around; I think it may come in handy in the future.
- It would be excellent to capture this as a benchmark instead of running it ad-hoc.
- I integrated the tree walking iterator and got a nice surprise:
- There's a bug in my OFS delta code!
- This is interesting, because it only appears for certain blobs in certain repositories.
- Otherwise other OFS deltas seem to resolve cleanly.
- Case in point: many of the commits I load as a test of the commit walk-er are OFS-delta'd.
- Also of note: I've split from
src/bin.rs
into dedicated binaries for tree walking and commit walking.
- Today's theme: isolate the bug in a test case.
- EOD Update: It's really helpful to have a reference implementation.
- I've confirmed that the reference implementation can read the object that breaks this project.
- We are reading the same offsets, as well (phew)
- I've further confirmed that swapping out the packfile implementation for the older, slower packfile doesn't affect anything.
- I suspect this means there's either a problem in my delta code (highly possible!), my varint decoding code (very possible), or the Read implementation for Deltas. Yay, narrowed down results!
2019-01-15 Update
- I added an (experimental)
git_rs::walk::tree
iterator to take a Tree and yield a path + a blob for each item.- It's probably slower than it should be: for each item it has to clone a
PathBuf
, because I couldn't work out the lifetimes. - If you know how to fix that, please open an issue and let me know 💞
- It's probably slower than it should be: for each item it has to clone a
- I took some time to clean up the warnings during builds.
- Oh! I also installed Clippy which warns about higher level antipatterns in Rust!
- I'm still noodling over the 2-3x slowdown between vanilla git and Our Git.
- I think I might create two packfile interfaces -- one "generic" and one "mmap"'d, to see if
one or the other makes up the difference in performance.
- This also has the virtue of being
unsafe
code, which is something I have not yet used in Rust!
- This also has the virtue of being
- I think I might create two packfile interfaces -- one "generic" and one "mmap"'d, to see if
one or the other makes up the difference in performance.
2019-01-06 Update
- I wrote an iterator for commits! The first cut kept a
Vec
of(Id, Commit)
around, so we could always pick the most recent "next" commit out of the graph (since commits may have many parents.)- But in finishing up the collections section of "Programming Rust" I noticed that
BinaryHeap
was available, which keeps entries in sorted order. You don't often get to choose the underlying storage mechanism of your collections in JS, so this hadn't occurred to me! - Anyway. I swapped out the
Vec
for aBinaryHeap
in this commit. Because this pushes the ordering into anOrd
impl for a type, this opens up the possibility of using the one iterator definition for multiple different orderings. Neat!
- But in finishing up the collections section of "Programming Rust" I noticed that
- Testing against a couple long-lived repo, the results coming out of
git_rs
are exactly the same asgit
!- However, it takes about twice the time: 60ms for
git_rs
wheregit
takes 30ms. - I think I have a lead on this, and it has to do with packfile stores: each read from a packfile
opens a new
File
instance.
- However, it takes about twice the time: 60ms for
- I've added a TODO section to keep track of what comes next!
2019-01-02 Update
- I implemented ref loading. It was a bit of a pain! Translating to and
from
Path
types took a bit of doing. - I've been trying to read up on Rust idioms -- I found a couple of resources:
- The Rust API Guidelines doc has been very helpful.
- @mre's idiomatic rust repo collects many interesting links.
- I've also been idly checking out videos from RustConf 2018
- As a result, I've implemented
FromStr
forId
, (hopefully) giving it a more idiomatic API --let id: Id = str.parse()?
2018-12-27 Update
- Rust is feeling more natural. This chain felt natural to write. I was even able to cross-index a list with only a minimum of fighting the borrow checker.
- I split the objects interface into Type + boxed read with a method for reifying the data into an Object. This feels good! It lets folks check to see, for example, if they're referring to a Blob without having to load the entire Blob into memory.
- The closure interface for the loose interface works pretty well, but pack interfaces
need to be able to ask the overarching set of stores for a reference due to REF_DELTA
objects. This is a bummer, because it really quickly turns into "fighting the borrow
checker." Right now I think the way forward is to build a StorageSet that holds a Vec
of heterogenous
Box<Storage>
objects, whereStorage
is a new trait that specifiesget(&Id, &StorageSet)
.- A sidenote re: the loose store: it feels kind of odd to have to produce a
git_rs::Error
instead of astd::io::Error
. Room for improvement!
- A sidenote re: the loose store: it feels kind of odd to have to produce a
- Oh! It was pretty easy to add a binary to this lib crate. And now we can
git log
other repos!
2018-12-21 Update
- Decided to focus on moving a bit slower and making sure I have tests for primitives this time around.
- Moved away from my original
Box<Write>
trait object design for object instance reading & storage format in favor of generics.
Dependencies
~11MB
~188K SLoC