4 releases
0.2.4 | Feb 13, 2021 |
---|---|
0.2.3 | Mar 24, 2019 |
0.2.2 | Mar 22, 2019 |
0.2.0 | Mar 21, 2019 |
#1611 in Database interfaces
29KB
559 lines
CSV Fish
CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test.
Installation
If you don't want to build the program yourself (see instructions below for that), you can get it via the Cargo tool. First install Cargo and then run
cargo install csv_fish
Usage
To use the program, point to the data, group specifications, and desired output location:
csv-fish --data data.csv --groups groups.csv --output output.csv
See the example files for inspiration about the content of the files, and read the below input/output specification before using the program.
The Groups CSV file contains SQL queries executed in SQLite. Do not run the program on input files you cannot trust.
Inputs
-
All CSV inputs are expected to be separated by semicolons.
-
Data CSV - values of categorical variables for individual samples.
-
First row is a header - names of categorical variables in the data set. Each subsequent row represents one sample.
-
Each column represents a categorical variable.
-
Each cell is a value of a given categorical variable for a given sample.
-
Before running any queries on data, all cell values are trimmed of leading and trailing whitespace, to prevent accidental mismatches. (Don't use leading and trailing whitespace as intended differentiators in data.)
-
(More than two possible values per a categorical variable are supported.)
-
-
Groups CSV - SQL for selection of the groups being tested.
-
If you're not familiar with SQL, you can get a quick overview of how the where conditions work, and the logical operators you can use in them.
-
First row is a header, subsequent rows represent groups (conditions for selecting groups).
-
The conditions are either for selecting rows of the contingency table, or for selecting columns, or for selecting both rows and columns at the same time.
-
When using row-only and column-only conditions, the program then combines each of the row-only conditions with each of the column-only conditions (Cartesian product) to generate row-and-column conditions, which are in turn used for generating contingency tables.
-
When using row-and-column conditions, each such row in Groups CSV file corresponds to one contingency table.
-
It is possible to combine row-only and column-only conditions with row-and-column conditions in a single Groups CSV file.
-
See the example files, which might be better than a thousand words of description.
-
-
The header of Groups CSV must be:
condtype;filter;r1cond;r2cond;c1cond;c2cond
-
condtype
- required - the values can berow
,col
, orrowcol
. Therow
andcol
types are for specifying conditions for rows and columns separately, therowcol
type specifies them together. -
filter
- optional - initial SQL WHERE condition applied to all data before conducting the fisher exact test. Can be used to shrink the sample set before performing any further operations.- If your data contains samples where some categorical variables
are unknown (empty cell), you probably want to add filter
conditions so that queries working with that variable don't use
those samples. E.g.
my_var != ''
. See the groups.csv example. Asfilter
is SQL, you can use the operators (likeAND
) in the condition if you need to reference multiple colums.
- If your data contains samples where some categorical variables
are unknown (empty cell), you probably want to add filter
conditions so that queries working with that variable don't use
those samples. E.g.
-
r1cond
- required whencondtype
isrow
orrowcol
- SQL WHERE condition for selecting row 1. -
r2cond
- optional - SQL WHERE condition for selecting row 2. If empty, the complement ofr1cond
is used (the result is still limited byfilter
). -
c1cond
- required whencondtype
iscol
orrowcol
- SQL WHERE condition for selecting column 1. -
c2cond
- optional - SQL WHERE condition for selecting column 2. If empty, the complement ofc1cond
is used (the result is still limited byfilter
). -
To avoid accidental sample overlaps between
r1cond/r2cond
orc1cond/c2cond
, it is recommended thatr2cond
andc2cond
are not used empty, andfilter
is used to pre-select a narrower sample set when desired (e.g. skip samples with unknown value for given variables), and then onlyr1cond
andc1cond
are used for row/column selection, making the second row/column always the complement of the first one.
-
Output
-
Results CSV - Fisher's exact test results.
-
First row is a header, each subsequent row represents one Fisher's exact test.
-
The header is:
filter;r1cond;r2cond;c1cond;c2cond;r1c1;r1c1;r2c1;r2c2;fisher_l;fisher_r;fisher_2t
-
The values in the first 5 columns are the same as in the input Groups CSV.
-
The values in the columns
r1c1,r1c2,r2c1,r2c2
are counts of samples in the contingency table, which satisfy the respective conditions. -
The
fisher_l,fisher_r,fisher_2t
are left, right, and 2-tail p-values of Fischer's exact test.
-
Building
Only tested on Linux. To build the binary:
make
To run all tests:
make test
If you're feeling adventurous, you can also cross-compile the program
for Windows. This requires podman
. Make sure to read the script
before running it. Then run it:
bash tools/compile-for-windows.sh
License
GNU GPL v3+
Dependencies
~25MB
~454K SLoC