CDD accepts three file formats for import: Comma Separated Values format (or CSV), Microsoft Excel (or XLSX) files, and Structure Data File format (or SDF).
SDF files are usually supplied by compound vendors when you purchase a library, and can be generated by most standard cheminformatics tools. These files are properly formatted by default and require no additional preparation.
Documentation from Accelrys on the SD file format is attached to this document.
CSV/XLSX files can be easily generated in Excel, and need to be formatted in a CDD-readable manner. Below are the rules for creating a correct CSV/XLSX file.
You will find a properly formatted file at the bottom of this article; you can use it as a starting point for preparing your own.
- Each type of data must be entered into a separate column.
For example, if in your original data file, Molecule ID and Batch ID are concatenated, they need to be separated into two columns for import into CDD. Of course, if this is the first time you are registering this compound, you won't have a Molecule ID and Batch ID yet.
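As a rough illustration of the separation described above, the following Python sketch splits a combined identifier into two columns. The column name `Compound`, the `-` separator, and the sample values are assumptions made for this example; adjust them to match your own file.

```python
import csv
import io


def split_molecule_batch(rows, combined_column="Compound", separator="-"):
    """Split a combined 'molecule-batch' value into two separate columns."""
    result = []
    for row in rows:
        new_row = dict(row)
        combined = new_row.pop(combined_column)
        if separator not in combined:
            raise ValueError(f"No {separator!r} found in {combined!r}")
        # Split on the LAST separator so that molecule IDs which themselves
        # contain the separator (e.g. "CMPD-0001") stay intact.
        molecule_id, _, batch_id = combined.rpartition(separator)
        new_row["Molecule ID"] = molecule_id
        new_row["Batch ID"] = batch_id
        result.append(new_row)
    return result


# Hypothetical input: one column holds the molecule and batch together.
source = "Compound,IC50 (uM)\nCMPD-0001-001,0.5\nCMPD-0001-002,0.7\n"
rows = list(csv.DictReader(io.StringIO(source)))
split_rows = split_molecule_batch(rows)

# Write the result back out with one column per type of data.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Molecule ID", "Batch ID", "IC50 (uM)"])
writer.writeheader()
writer.writerows(split_rows)
print(out.getvalue())
```

Splitting on the last separator is a design choice for identifiers shaped like the sample above; if your batch names can contain the separator, a different rule is needed.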
- The first row of the file must contain column headers (or titles). CDD will use the headers to connect (map) the data in the column to the appropriate fields in the database.
Columns with a blank first row will be ignored.
The headers do not need to match CDD fields exactly, as you will specifically instruct the database on how to parse your file during import.
The headers must occupy a single row. For example, if your file happens to have two rows of headers, where the second row has sub-headers, then these must be combined into a single row.
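One possible way to combine two header rows before import is sketched below in Python. The sample headers are invented for illustration, and the sketch assumes that a blank top-level cell above a non-blank sub-header belongs to the nearest top-level header on its left, as is typical when merged cells are exported from a spreadsheet.

```python
import csv
import io


def merge_header_rows(top, sub):
    """Combine a header row and a sub-header row into a single header row."""
    merged = []
    current_top = ""
    for top_cell, sub_cell in zip(top, sub):
        top_cell, sub_cell = top_cell.strip(), sub_cell.strip()
        if top_cell:
            current_top = top_cell
        if not sub_cell:
            # No sub-header: keep the top-level header as-is (possibly blank).
            merged.append(top_cell)
        else:
            # Prefix the sub-header with the most recent top-level header.
            merged.append(f"{current_top} {sub_cell}".strip())
    return merged


# Hypothetical file with a header row and a sub-header row.
source = (
    "Molecule Name,Batch Name,Kinase assay,\n"
    ",,IC50 (uM),Hill slope\n"
    "CMPD-0001,001,0.5,1.1\n"
)
table = list(csv.reader(io.StringIO(source)))
header = merge_header_rows(table[0], table[1])

# The cleaned table has exactly one header row followed by the data rows.
single_header_table = [header] + table[2:]
print(header)
```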
- Each row of data must include a batch identifier.
All data in CDD is pivoted around batches of molecules, and therefore a Batch ID must be specified so that CDD Vault will know which batch to associate the data with.
For example, if you are importing batch attributes, you need to include both the Molecule Name and the Batch Name.
A batch may be uniquely identified by one of the following attributes:
- Molecule Name and Batch Name
- Synonym and Batch Name
- Molecule-Batch-ID (this is a concatenation of the Molecule Name and Batch Name)
- Unique Batch ID
- Plate and Well Location, once the Plate Map is imported
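Before importing, it can help to check that every row carries one of these identifiers. The Python sketch below does this for a CSV file. The column names are assumptions chosen for this example (your headers may differ, since they are mapped during import), and the check is a local convenience, not part of CDD.

```python
import csv
import io

# Each tuple lists the columns that together identify a batch.
# These header names are illustrative; substitute the ones in your file.
IDENTIFIER_OPTIONS = [
    ("Molecule Name", "Batch Name"),
    ("Synonym", "Batch Name"),
    ("Molecule-Batch-ID",),
    ("Unique Batch ID",),
    ("Plate", "Well"),
]


def rows_missing_batch_identifier(rows):
    """Return spreadsheet row numbers (header is row 1) lacking an identifier."""
    missing = []
    for row_number, row in enumerate(rows, start=2):
        has_identifier = any(
            all((row.get(column) or "").strip() for column in option)
            for option in IDENTIFIER_OPTIONS
        )
        if not has_identifier:
            missing.append(row_number)
    return missing


# Hypothetical data: the second data row has no Batch Name.
source = (
    "Molecule Name,Batch Name,IC50 (uM)\n"
    "CMPD-0001,001,0.5\n"
    "CMPD-0002,,1.2\n"
)
rows = list(csv.DictReader(io.StringIO(source)))
missing = rows_missing_batch_identifier(rows)
print(missing)
```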
- Molecule structure must be in MOL or SMILES format.
SMILES is usually safest: it fits on a single line, so unlike MOL it does not depend on formatting such as line breaks, and it is less prone to errors.
- For replicate data, include each replicate on a separate row.
For example, if you are importing three replicates for a single compound, include one row for each replicate, repeating the same molecule identifier on each row.
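If your source file stores replicates side by side in separate columns, a small script can rearrange them into one row per replicate. The Python sketch below shows one way to do this. The column names (`IC50 rep1` and so on) and the sample values are hypothetical.

```python
import csv
import io


def replicates_to_rows(rows, id_columns, replicate_columns, value_column):
    """Turn one row with several replicate columns into one row per replicate."""
    long_rows = []
    for row in rows:
        for column in replicate_columns:
            value = (row.get(column) or "").strip()
            if not value:
                continue  # skip replicates that have no value
            # Repeat the identifier columns on every replicate row.
            new_row = {name: row[name] for name in id_columns}
            new_row[value_column] = value
            long_rows.append(new_row)
    return long_rows


# Hypothetical input: three replicate columns, the third one empty.
source = (
    "Molecule Name,Batch Name,IC50 rep1,IC50 rep2,IC50 rep3\n"
    "CMPD-0001,001,0.5,0.6,\n"
)
rows = list(csv.DictReader(io.StringIO(source)))
long_rows = replicates_to_rows(
    rows,
    id_columns=["Molecule Name", "Batch Name"],
    replicate_columns=["IC50 rep1", "IC50 rep2", "IC50 rep3"],
    value_column="IC50 (uM)",
)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Molecule Name", "Batch Name", "IC50 (uM)"])
writer.writeheader()
writer.writerows(long_rows)
print(out.getvalue())
```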
01_TrainingFile_kitchen_sink.csv kitchen sink
SDF_file_formats.zip SDF file formats