app octosurfer

Search GitHub, clone matching repos, and search through the repos

2 unstable releases

0.2.0 Sep 10, 2024
0.1.0 Mar 11, 2023

MIT license

22KB
460 lines

octosurfer

Searches GitHub for repositories, then clones each repository and greps through it for a list of terms.

Installation

This project is published on crates.io. Make sure you have a working Rust toolchain installed, then install with cargo:

$ cargo install octosurfer

Usage

Before running octosurfer, generate a GitHub API key (a personal access token) and export it as GITHUB_TOKEN in your environment.

Usage: octosurfer -k <keywords> [-l <languages>] [-p <pushed>] [-s <stars>] [-t <topics>] -d <target-dir> -q <query-file> -o <out-file> [--rm] [-v <verbosity>]

Clone all GitHub repositories matching a query and search them

Options:
  -k, --keywords    keywords to use when searching for repos (comma-separated)
  -l, --languages   limit search to repos that use these languages
                    (comma-separated)
  -p, --pushed      limit search by date, e.g. ">1970-01-01" for repos updated
                    after Jan 1st, 1970
  -s, --stars       limit search by stars, e.g. ">100" for repos with more than
                    100 stars
  -t, --topics      limit search by these topics (comma-separated)
  -d, --target-dir  path to a directory into which repositories should be cloned
  -q, --query-file  file to read code queries from
  -o, --out-file    filename to write CSV results into
  --rm              remove repos after analysis is complete
  -v, --verbosity   sets the verbosity (off, error, warn, info, debug, or trace)
  --help            display usage information

Example

The following invocation will search GitHub for all repositories that:

  • include the keyword "mpi"
  • are written primarily in either C or C++
  • have been pushed to some time after Jan 1st, 2013
  • have more than two stars

It will then clone each repository into /tmp/octosurfer, search each repository for occurrences of the queries in the file my-queries.txt, and save the results in results.csv. Each cloned repository will be removed after it has been searched.

$ octosurfer -k mpi \
	-l c,c++ \
	-p ">2013-01-01" \
	-s ">2" \
	-d /tmp/octosurfer \
	-q my-queries.txt \
	-o results.csv \
	--rm
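
To make the mapping from flags to a search concrete, here is a minimal sketch of how the filters above could be joined into a single GitHub search string. It uses GitHub's documented `language:`, `pushed:`, `stars:` and `topic:` search qualifiers. The function name `build_search_query` and its signature are hypothetical, and octosurfer's own query construction may differ.

```rust
/// Hypothetical helper (not part of octosurfer): joins CLI-style filter
/// values into one GitHub search string.
fn build_search_query(
    keywords: &[&str],
    languages: &[&str],
    pushed: Option<&str>,
    stars: Option<&str>,
    topics: &[&str],
) -> String {
    // Plain keywords come first, exactly as given.
    let mut parts: Vec<String> = keywords.iter().map(|k| k.to_string()).collect();
    // One `language:` qualifier per requested language.
    parts.extend(languages.iter().map(|l| format!("language:{l}")));
    // Optional range filters are passed through unchanged, e.g. ">2013-01-01".
    if let Some(p) = pushed {
        parts.push(format!("pushed:{p}"));
    }
    if let Some(s) = stars {
        parts.push(format!("stars:{s}"));
    }
    // One `topic:` qualifier per requested topic.
    parts.extend(topics.iter().map(|t| format!("topic:{t}")));
    parts.join(" ")
}
```

The sketch only shows how the string is assembled. Sending it to the GitHub API and paging through the results are left out.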

Queries

Queries are listed in a text file, and the file name is given to octosurfer with the -q flag. There should be one query per line, and regex syntax may be used in a query. octosurfer searches files line by line, so there can be no multiline matches.
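
The line-by-line behaviour can be illustrated with a small sketch. To stay dependency-free it uses literal substring matching in place of a regex engine, so it is a simplification rather than octosurfer's actual matcher (which is built on the grep crate). The names `parse_queries` and `search_lines` are hypothetical, and skipping blank lines in the query file is an assumption.

```rust
/// Parses the contents of a query file: one query per line.
/// Skipping blank lines is an assumption of this sketch.
fn parse_queries(contents: &str) -> Vec<&str> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Returns (1-based line number, query) for every line of `text` that
/// contains a query. Literal `contains` stands in for regex matching.
fn search_lines<'a>(text: &str, queries: &[&'a str]) -> Vec<(usize, &'a str)> {
    let mut hits = Vec::new();
    // Each line is tested on its own, so a match can never span two lines.
    for (idx, line) in text.lines().enumerate() {
        for &query in queries {
            if line.contains(query) {
                hits.push((idx + 1, query));
            }
        }
    }
    hits
}
```

Because every line is tested in isolation, a query whose text is split across a line break in the source file produces no hit.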

Performance

octosurfer uses tokio and makes heavy use of async Rust: repositories are cloned and searched asynchronously. The search within a single repository, however, is single-threaded. The assumption is that several repositories are usually being searched at once, so an individual search does not need to be parallelized. In addition, heavily parallel searching during testing raised the OS error EMFILE, i.e. "too many open files".
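
One general way to keep the number of simultaneously open files bounded is to cap how many jobs run at once. The sketch below shows that idea with a fixed set of standard-library threads pulling from a shared queue. It is not octosurfer's implementation, which is async and built on tokio, and the name `run_bounded` is hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs `job` on every item using at most `max_workers` threads.
/// Each worker handles one item at a time, so no more than `max_workers`
/// jobs (and the files they hold open) are in flight together.
fn run_bounded<T, R, F>(items: Vec<T>, max_workers: usize, job: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(items)));
    let job = Arc::new(job);
    let handles: Vec<_> = (0..max_workers.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let job = Arc::clone(&job);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // Hold the lock only long enough to pop the next item.
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        Some(item) => results.push((*job)(item)),
                        None => break,
                    }
                }
                results
            })
        })
        .collect();
    // Results come back grouped by worker, not in input order.
    handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect()
}
```

In an async program the same cap is usually expressed with a semaphore or a bounded stream rather than dedicated threads.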

octosurfer uses the grep crate to search files. This crate is the library that powers ripgrep.

Disk usage

octosurfer clones each repository shallowly, i.e. with git clone --depth 1. However, because a GitHub search can return hundreds or thousands of repositories, the cumulative disk usage can become significant. If you are unsure how many repositories a search will yield, it is prudent to pass the --rm flag.

Note that because repositories are cloned asynchronously, more than one repository may exist on-disk at a time, even with the --rm flag given.

octosurfer clones each repository into the given target directory, under {repo owner's name}/{repo name}. If --rm is given, octosurfer removes the cloned repository as well as the directory named after the repository owner. It is therefore advisable to pass an empty directory to --target-dir, so that octosurfer does not accidentally remove files and directories you intended to keep.
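
The directory layout described above can be sketched as follows. The helper names `clone_dir` and `removed_with_rm` are hypothetical and only compute paths. They do not touch the file system and are not taken from octosurfer's source.

```rust
use std::path::{Path, PathBuf};

/// Directory a repository would be cloned into: {target}/{owner}/{repo}.
fn clone_dir(target: &Path, owner: &str, repo: &str) -> PathBuf {
    target.join(owner).join(repo)
}

/// Directories that --rm is described as deleting: the clone itself and
/// its parent directory, which is named after the repository owner.
fn removed_with_rm(target: &Path, owner: &str, repo: &str) -> Vec<PathBuf> {
    vec![clone_dir(target, owner, repo), target.join(owner)]
}
```

With a target of /tmp/octosurfer, a repository owned by a user named octocat would land in /tmp/octosurfer/octocat/{repo name}, and --rm would also remove /tmp/octosurfer/octocat. A pre-existing directory in the target that happens to share an owner's name is therefore at risk.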
