#nom #pdb #parser #bioinformatics #protein

nom-pdb

PDB parser implemented with nom

5 releases

0.0.9	Oct 21, 2020
0.0.8	Oct 20, 2020
0.0.6	Oct 19, 2020
0.0.2	Oct 18, 2020
0.0.1	Oct 17, 2020

#2824 in Parser implementations

Used in 2 crates

MIT license

4MB
2K SLoC

nom-pdb

PDB parser implemented in Rust using nom.

NOTE: This crate is in early development and the API has not yet been stabilized, so do not use this crate in production. If you have any suggestions, please don't hesitate to open an issue or make a PR!

Features

- Parses structural information and a subset of important metadata:
  - Primary structure
  - Secondary structure (sheets and helices)
  - Coordinates and bonding
- Handles non-standard residues (not yet mature)
- JSON serialization powered by serde

The parsed data is stored in a Structure, which is a struct provided by the protein-core crate.
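To make the non-standard residue handling more concrete, here is a minimal, std-only sketch of how such residues could be represented. The names `ResidueSketch`, `residue_from_code`, and `standard_parent` are hypothetical and chosen for illustration; the actual types live in protein-core and may look different. With serde's default enum representation, an enum shaped like this would serialize unit variants as plain strings and the `Custom` variant as a single-key object.

```rust
use std::collections::HashMap;

/// Hypothetical residue type for illustration only; the real definition
/// is provided by protein-core and may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ResidueSketch {
    // A handful of standard residues; a full version would list all twenty.
    Arg,
    Asp,
    Gln,
    Gly,
    Ile,
    Met,
    Pro,
    Ser,
    /// Anything that is not a recognised standard residue, kept by its code.
    Custom(String),
}

/// Maps a three-letter residue code to a variant, falling back to `Custom`.
pub fn residue_from_code(code: &str) -> ResidueSketch {
    match code.trim().to_ascii_uppercase().as_str() {
        "ARG" => ResidueSketch::Arg,
        "ASP" => ResidueSketch::Asp,
        "GLN" => ResidueSketch::Gln,
        "GLY" => ResidueSketch::Gly,
        "ILE" => ResidueSketch::Ile,
        "MET" => ResidueSketch::Met,
        "PRO" => ResidueSketch::Pro,
        "SER" => ResidueSketch::Ser,
        other => ResidueSketch::Custom(other.to_string()),
    }
}

/// Resolves a residue to its standard parent using a MODRES-style table
/// (modified residue code -> standard residue). Residues without an entry
/// are returned unchanged.
pub fn standard_parent(
    res: &ResidueSketch,
    modres: &HashMap<String, ResidueSketch>,
) -> ResidueSketch {
    match res {
        ResidueSketch::Custom(code) => modres.get(code).cloned().unwrap_or_else(|| res.clone()),
        other => other.clone(),
    }
}
```

Keeping the original code inside `Custom` means no information is lost when a residue is unknown, while a MODRES-style table still allows mapping it back to a standard residue when one is declared.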

Example

Read to JSON

cargo run --example read 1a8o

{
  "header": {
    "classification": "VIRAL PROTEIN",
    "deposition_date": "1998-03-27",
    "id_code": "1A8O"
  },
  "title": "HIV CAPSID C-TERMINAL DOMAIN",
  "authors": [
    "T.R.GAMBLE",
    "S.YOO",
    "F.F.VAJDOS",
    "U.K.VON SCHWEDLER",
    "D.K.WORTHYLAKE",
    "H.WANG",
    "J.P.MCCUTCHEON",
    "W.I.SUNDQUIST",
    "C.P.HILL"
  ],
  "experimental_techniques": [
    "XRayDiffraction"
  ],
  "cryst1": {
    "a": 41.98,
    "b": 41.98,
    "c": 88.92,
    "alpha": 90.0,
    "beta": 90.0,
    "gamma": 90.0,
    "lattice_type": "Primitive",
    "space_group": [
      [
        4,
        3
      ],
      [
        2,
        1
      ],
      [
        2,
        1
      ]
    ],
    "z": 8
  },
  "modres": {
    "MSE": {
      "standard_res": "Met",
      "description": "SELENOMETHIONINE",
      "occurence": [
        [
          "A",
          151
        ],
        [
          "A",
          185
        ],
        [
          "A",
          214
        ],
        [
          "A",
          215
        ]
      ]
    }
  },
  "seqres": [
    [
      "A",
      [
        {
          "Custom": "MSE"
        },
        "Asp",
        "Ile",
        "Arg",
        "Gln",
        "Gly",
        "Pro",
    // snip //
      ]
    ]
  ],
  "models": [
    {
      "atoms": [
        {
          "id": 1,
          "name": "N",
          "id1": " ",
          "residue": "Ser",
          "chain": "A",
          "sequence_number": 0,
          "insertion_code": " ",
          "x": -12.138,
          "y": 1.867,
          "z": 20.782,
          "occupancy": 1.0,
          "temperature_factor": 67.46,
          "element": "N",
          "charge": 0,
          "hetatom": false
        },
        // snip //
      ],
      "anisou": [
        // snip //
      ],
      "sheets": [
        {
          "id": "A",
          "strands": [
            {
              "start": [
                "A",
                34
              ],
              "end": [
                "A",
                38
              ],
              "sense": "Unknown"
            },
            // snip //
          ]
        },
        // snip //
      ],
      "helices": [
        // snip //
      ],
      "connect": [
        // snip //
      ]
    }
  ]
}
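The atom fields in the output above correspond to fixed columns of an ATOM or HETATM record in the PDB format. The following is a minimal sketch of that mapping using plain string slicing from the standard library rather than nom, so it is not how this crate is implemented. `AtomSketch`, `cols`, and `parse_atom_line` are hypothetical names, and only a subset of the fields is extracted.

```rust
/// Hypothetical atom record for illustration; the field names follow the
/// JSON output, but this is not the crate's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct AtomSketch {
    pub id: u32,
    pub name: String,
    pub residue: String,
    pub chain: char,
    pub sequence_number: i32,
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub occupancy: f32,
    pub temperature_factor: f32,
    pub element: String,
    pub hetatom: bool,
}

/// Returns the trimmed text in 1-based, inclusive columns `start..=end`,
/// or `None` if the line is too short.
fn cols(line: &str, start: usize, end: usize) -> Option<&str> {
    line.get(start - 1..end).map(str::trim)
}

/// Parses one ATOM/HETATM line by column position. Returns `None` for other
/// record types, for lines shorter than 78 columns, or for malformed numbers.
pub fn parse_atom_line(line: &str) -> Option<AtomSketch> {
    let hetatom = match cols(line, 1, 6)? {
        "ATOM" => false,
        "HETATM" => true,
        _ => return None,
    };
    Some(AtomSketch {
        id: cols(line, 7, 11)?.parse().ok()?,
        name: cols(line, 13, 16)?.to_string(),
        residue: cols(line, 18, 20)?.to_string(),
        // The chain identifier is a single character in column 22.
        chain: line.get(21..22)?.chars().next()?,
        sequence_number: cols(line, 23, 26)?.parse().ok()?,
        x: cols(line, 31, 38)?.parse().ok()?,
        y: cols(line, 39, 46)?.parse().ok()?,
        z: cols(line, 47, 54)?.parse().ok()?,
        occupancy: cols(line, 55, 60)?.parse().ok()?,
        temperature_factor: cols(line, 61, 66)?.parse().ok()?,
        element: cols(line, 77, 78)?.to_string(),
        hetatom,
    })
}
```

A parser-combinator implementation would express the same column layout as a sequence of fixed-width field parsers; the slicing version is shown only because it needs no dependencies.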

Notes

References

Roadmap

Note: Priority is, and ought to be, placed on parsing structural information rather than metadata, since the latter is largely unstructured free text and usually not of particular interest to users (and when it is, users can examine the PDB file directly).

Title Section

Primary Structure Section

Heterogen Section

Secondary Structure Section

Helix
Sheet

Connectivity Annotation Section

Miscellaneous Features Section

Site

Crystallographic and Coordinate Transformation Section

Coordinate Section

Connectivity Section

Conect

Bookkeeping Section

Master
End

Sample PDB Files

The files in assets/ were retrieved from RCSB's FTP server using the method described in my blog post. The selected PDB files stored in this directory have the following features:

1a8o: a simple X-ray structure
4f7i: lots of sheets
7znf: solution NMR; lots of models
3l1p: complex with DNA
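One quick way to check properties such as "lots of sheets" or "lots of models" is to tally record names, which occupy the first six columns of every line in a PDB file. The sketch below uses only the standard library and is independent of this crate; `count_records` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

/// Counts how often each record name (columns 1-6) occurs in PDB text.
/// Illustrative helper only; it is not part of nom-pdb.
pub fn count_records(pdb_text: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in pdb_text.lines() {
        // Lines shorter than six columns (for example "END") are taken whole.
        let name = line.get(..6).unwrap_or(line).trim();
        if !name.is_empty() {
            *counts.entry(name.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

Applied to the text of a multi-model NMR entry, the `MODEL` count gives the number of models, and the `SHEET` count gives the number of strand records.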

Dependencies

~5MB
~100K SLoC