# Data Protocols

## Unique Features

To calculate spectra at different fields, we implemented a spectrum simulation program in Python. The program works in the same way as other programs such as PERCH/Cosmic Truth (NMR Solutions), ANMR, GISSMO, and the NMRDB predictor/simulator. It takes chemical shifts, coupling constants, and a field strength as input, then performs the standard quantum mechanical calculations to simulate an NMR spectrum from them. The practical limit for spin matrix simulations is normally 10-15 spins, because the time and memory requirements of the calculation grow exponentially with the number of spins. However, many natural product compounds are significantly larger than this.
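To make the inputs and the scaling problem concrete, here is a minimal sketch, not the NP-MRD program itself. It uses the textbook closed-form solution for two coupled spin-1/2 nuclei (an AB system), which is what diagonalizing the two-spin Hamiltonian yields. The function names `dense_matrix_bytes` and `ab_spectrum` are hypothetical, and the 16 bytes per matrix element assume double-precision complex storage.

```python
import math


def dense_matrix_bytes(n_spins, bytes_per_element=16):
    """Memory for one dense 2**n x 2**n matrix (assumes complex128 elements)."""
    dim = 2 ** n_spins
    return dim * dim * bytes_per_element


def ab_spectrum(shift_a_ppm, shift_b_ppm, j_hz, field_mhz):
    """Line positions (ppm) and relative intensities for two coupled spins.

    Assumes the two shifts and J are not all zero (otherwise d == 0).
    """
    # Chemical shifts scale with the field; J (in Hz) does not.
    nu_a = shift_a_ppm * field_mhz
    nu_b = shift_b_ppm * field_mhz
    center = 0.5 * (nu_a + nu_b)
    d = math.hypot(nu_a - nu_b, j_hz)
    ratio = j_hz / d
    inner, outer = 1.0 + ratio, 1.0 - ratio  # "roof effect"
    lines_hz = [
        (center - 0.5 * (d + j_hz), outer),
        (center - 0.5 * (d - j_hz), inner),
        (center + 0.5 * (d - j_hz), inner),
        (center + 0.5 * (d + j_hz), outer),
    ]
    return [(freq / field_mhz, intensity) for freq, intensity in lines_hz]


# Same shifts and coupling, two different spectrometer fields.
low_field = ab_spectrum(7.20, 7.30, 8.0, field_mhz=60.0)
high_field = ab_spectrum(7.20, 7.30, 8.0, field_mhz=600.0)
```

At 60 MHz the shift difference (6 Hz) is comparable to the coupling (8 Hz), so the inner lines dominate. At 600 MHz the pattern approaches two simple doublets. `dense_matrix_bytes(15)` returns roughly 17 GB for a single matrix, which illustrates why 10-15 spins is a practical ceiling.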

Like the other programs mentioned, we split these large spin matrices into smaller submatrices that are more efficient to calculate. In particular, we adopt a method similar to PERCH/ANMR: we simulate the spectra individually for one or a small number of spins at a time, and include in each calculation the neighbouring spins that contribute to the coupling patterns (i.e. overlapping submatrices) (LINK).

## Data Dereplication

When we encounter replicate data in NP-MRD, we consolidate and compare the information. We check for duplicates at several levels, including InChI identity, SMILES identity, name identity, molecular weight identity, and formula identity. If identical compounds are identified, we manually consolidate the data to retain the most informative parts from each source. For instance, if one source provides detailed information about the origin and description of a compound, and another source has a more accurate structure, we combine these elements to ensure the best possible data quality. We also enhance the aggregated data with our own information, such as classifications, references, and experimental data. As a result, the data in NP-MRD may not look exactly the same as in other sources, but it is enriched and more comprehensive.
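The multi-level duplicate check can be sketched as follows. This is only an illustration under stated assumptions: the record field names (`inchi`, `smiles`, `name`, `formula`, `mol_weight`), the function `matching_levels`, the weight tolerance, and the toy records are all hypothetical and do not reflect the actual NP-MRD schema or tooling. The sketch only flags candidate duplicates; consolidation itself is manual, as described above.

```python
from math import isclose


def matching_levels(rec_a, rec_b, weight_tol=0.01):
    """Return the identity levels on which two compound records agree."""
    hits = []
    # Exact string comparison; SMILES are case-sensitive, so no case folding.
    for key in ("inchi", "smiles", "formula"):
        if rec_a.get(key) and rec_a.get(key) == rec_b.get(key):
            hits.append(key)
    # Names are compared case-insensitively, ignoring surrounding whitespace.
    name_a, name_b = rec_a.get("name"), rec_b.get("name")
    if name_a and name_b and name_a.strip().casefold() == name_b.strip().casefold():
        hits.append("name")
    # Molecular weights are compared within a small absolute tolerance.
    w_a, w_b = rec_a.get("mol_weight"), rec_b.get("mol_weight")
    if w_a is not None and w_b is not None and isclose(w_a, w_b, abs_tol=weight_tol):
        hits.append("mol_weight")
    return hits


# Toy records for the same compound from two hypothetical sources.
source_a = {
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "smiles": "CCO",
    "name": "Ethanol",
    "formula": "C2H6O",
    "mol_weight": 46.07,
}
source_b = {
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "smiles": "OCC",  # same structure, written in a different atom order
    "name": "ethanol",
    "formula": "C2H6O",
    "mol_weight": 46.069,
}

levels = matching_levels(source_a, source_b)
```

Here the raw SMILES strings differ even though the compound is the same, which is one reason to check several identity levels rather than rely on a single one.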

## Prediction of J couplings

When simulating predicted spectra, the coupling constants needed as input to the simulation are themselves predicted. We currently use a rule-based method to predict the coupling constant between a specified pair of atoms. The coupling prediction attempts to incorporate trends in coupling constants as defined by dihedral angles (Karplus-like equations) and additional additive effects due to neighboring functional groups. Some of these effects were approximated based on tables of experimental coupling constants (Hans Reich NMR Resources).
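As an illustration of a Karplus-like rule with additive corrections, the sketch below evaluates the general form J(φ) = A cos²φ + B cos φ + C and then adds per-substituent offsets. The coefficients `a`, `b`, `c` and the correction values are illustrative placeholders only, not the parameters used in NP-MRD, and the function name `karplus_3j` is hypothetical.

```python
import math


def karplus_3j(dihedral_deg, a=7.8, b=-1.0, c=1.4, substituent_corrections=()):
    """Estimate a vicinal coupling constant (Hz) from a dihedral angle.

    a, b, c are placeholder Karplus coefficients; substituent_corrections is
    an iterable of additive offsets (Hz) for neighboring functional groups.
    """
    phi = math.radians(dihedral_deg)
    base = a * math.cos(phi) ** 2 + b * math.cos(phi) + c
    return base + sum(substituent_corrections)


# Anti-periplanar, gauche, and orthogonal arrangements.
j_anti = karplus_3j(180.0)
j_gauche = karplus_3j(60.0)
j_ortho = karplus_3j(90.0)

# Same anti geometry with one hypothetical substituent correction of -0.5 Hz.
j_anti_corrected = karplus_3j(180.0, substituent_corrections=(-0.5,))
```

The curve has its largest values near 0° and 180° and its minimum near 90°, which is the characteristic shape of Karplus-type relationships.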

## Atom numbering in molecular structures

Atoms in molecular structures tended to be numbered non-sequentially, which made tasks such as the manual assignment of chemical shifts to such structures confusing and time-consuming. For this reason, we decided to adopt a simpler numbering scheme based on RDKit. In particular, we renumber structures using canonicalized SMILES strings generated by RDKit, which results in a more predictable, sequential numbering of the atoms.
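One simple way to obtain such a numbering is to round-trip a structure through RDKit's canonical SMILES, so that atoms are indexed in the order they appear in the canonical string. The sketch below assumes RDKit is installed and is not necessarily how NP-MRD implements the renumbering; the helper name `renumber_via_canonical_smiles` is hypothetical. Note that this round trip discards coordinates and explicit hydrogens.

```python
from rdkit import Chem


def renumber_via_canonical_smiles(smiles):
    """Return a molecule whose atom indices follow the canonical SMILES order."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    canonical = Chem.MolToSmiles(mol)  # canonical by default
    # Re-parsing the canonical string numbers atoms in order of appearance.
    return Chem.MolFromSmiles(canonical)


# Two different ways of writing acetic acid.
mol_1 = renumber_via_canonical_smiles("OC(=O)C")
mol_2 = renumber_via_canonical_smiles("CC(O)=O")

symbols_1 = [atom.GetSymbol() for atom in mol_1.GetAtoms()]
symbols_2 = [atom.GetSymbol() for atom in mol_2.GetAtoms()]
```

Because both inputs map to the same canonical string, the two molecules end up with the same atom order regardless of how the original structure was drawn or numbered.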