SDF files are usually supplied by compound vendors when you purchase a library, and can be generated by most standard cheminformatics tools. These files are properly formatted by default and do not require any additional formatting.
Documentation from Accelrys on the SD file format is attached at the bottom of this article.
CSV/XLSX files can be easily generated in Excel, and need to be formatted in a CDD-readable manner. Below are the rules for creating a correct CSV/XLSX file.
You will find a properly formatted file at the bottom of this article; you can use it as a starting point for preparing your own.
- Each type of data must be entered into a separate column.
For example, if Molecule ID and Batch ID are concatenated in your original data file, they need to be separated into two columns for import into CDD. Of course, if this is the first time you are registering this compound, you won't have a Molecule ID and Batch ID yet.
- The first row of the file must contain column headers (or titles). CDD will use the headers to connect (map) the data in the column to the appropriate fields in the database.
Columns with a blank first row will be ignored.
The headers do not need to match CDD fields exactly, as you will specifically instruct the database on how to parse your file during import.
For example, if your file happens to have two rows of headers, where the second row has sub-headers, then these must be combined into a single row.
- Each row of data must include a molecule identifier.
Recall that all data in CDD is pivoted around molecules, and we will not know how to associate the data unless a molecule ID is supplied.
For example, if you are importing batch attributes, you need to include both the molecule's name and the batch name.
A molecule may be uniquely identified by a combination of its primary ID and batch name, or its synonym and batch name, or a plate and well reference once the plate map has been imported. In registration systems, a Batch External ID will also uniquely identify a batch.
SMILES is usually the safest structure format: unlike a MOL block, it does not depend on formatting such as line breaks, and unlike an IUPAC name, it is not prone to errors.
- For replicate data, include each replicate on a separate row.
For example, if you are importing 3 replicates for a single compound, include one row for each replicate, repeating the same molecule identifier on each row.
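The rules above can be checked locally before you upload a file. The following is a minimal Python sketch, not part of CDD: the function names `combine_header_rows` and `check_rows` are hypothetical, and the identifier column name you pass in (for example "Molecule Name") is an assumption about your own file. It merges a two-row header into a single row, flags columns with a blank header, and flags data rows that lack a molecule identifier.

```python
import csv
import io
from itertools import zip_longest


def combine_header_rows(top, sub):
    """Merge a header row and a sub-header row into one row of titles.

    Assumption: a blank cell in the top row means the previous top-level
    header spans this column too (common when cells were merged in Excel).
    """
    combined = []
    current = ""
    for t, s in zip_longest(top, sub, fillvalue=""):
        t, s = t.strip(), s.strip()
        if t:
            current = t
        combined.append(f"{current} {s}".strip())
    return combined


def check_rows(csv_text, id_column):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    first = next(reader, None)
    if first is None:
        return ["file is empty"]

    headers = [h.strip() for h in first]
    problems = []

    # Columns whose first-row cell is blank would be ignored on import.
    for col_num, header in enumerate(headers, start=1):
        if not header:
            problems.append(f"column {col_num}: blank header, column will be ignored")

    if id_column not in headers:
        problems.append(f"missing identifier column {id_column!r}")
        return problems

    id_idx = headers.index(id_column)
    # Row numbers are 1-based and count the header row, as in a spreadsheet.
    for row_num, row in enumerate(reader, start=2):
        if not any(cell.strip() for cell in row):
            continue  # skip completely empty lines
        if id_idx >= len(row) or not row[id_idx].strip():
            problems.append(f"row {row_num}: no molecule identifier")
    return problems
```

Calling `check_rows(open("my_data.csv", newline="").read(), "Molecule Name")` returns an empty list when every column has a header and every row carries an identifier. Repeated identifiers are deliberately not reported, since replicate data is expected to repeat the same molecule identifier on several rows.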
01_TrainingFile_kitchen_sink.csv kitchen sink
SDF_file_formats.zip SDF file formats