Data Sources

This page records notes specific to each data source, covering the ETL process applied when its data was integrated into MyChem.info.

Note

The structured metadata about all data sources can be accessed from the metadata endpoint. The detailed information about the integrated data is described in this data page.

AEOLUS

The value of the aeolus.outcomes field is a list of outcome objects, sorted by the aeolus.outcomes.case_count field in descending order. In rare cases the list is large (up to ~10K items), usually for common chemicals (e.g. aspirin, omeprazole). To reduce the total size of a single chemical object, we truncate the aeolus.outcomes list to at most 5000 items.

This truncation affects only 165 objects (as of 2018-11-28, full list here), compared with a total of 3,044 objects containing aeolus data (~5%).
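
The sort-then-truncate rule above can be sketched as follows. This is a minimal illustration, not the actual MyChem.info ETL code: the function name truncate_outcomes and the constant MAX_OUTCOMES are hypothetical, and the document is assumed to be a plain Python dict.

```python
MAX_OUTCOMES = 5000  # cap described above; the constant name is hypothetical


def truncate_outcomes(doc, limit=MAX_OUTCOMES):
    """Sort aeolus.outcomes by case_count (descending) and keep the first `limit` items."""
    aeolus = doc.get("aeolus")
    if not aeolus or not isinstance(aeolus.get("outcomes"), list):
        return doc  # nothing to truncate
    # Assumption: an outcome without a case_count is treated as having count 0.
    ordered = sorted(
        aeolus["outcomes"],
        key=lambda outcome: outcome.get("case_count", 0),
        reverse=True,
    )
    aeolus["outcomes"] = ordered[:limit]
    return doc
```

Because the list is sorted before it is cut, the outcomes that are dropped are always the ones with the lowest case counts.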

ChEBI

The following chebi.xrefs fields are subject to truncation:

chebi.xrefs.intenz
chebi.xrefs.rhea
chebi.xrefs.uniprot
chebi.xrefs.sabio_rk
chebi.xrefs.patent

The value of each field above is a list. In some cases the list can be very large (up to ~90K items), usually for common chemicals (e.g. water, ATP). To reduce the total size of a single chemical object, we remove any of the above fields whose value contains more than 1000 items.

This truncation affects only 143 objects (as of 2018-11-28, full list here), compared with a total of 98,511 objects containing chebi data (<0.15%).
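
Unlike AEOLUS, an oversized ChEBI field is dropped entirely rather than shortened. A minimal sketch of that rule, assuming the document is a plain Python dict (the function name drop_large_xrefs and the constant names are hypothetical, not the actual ETL code):

```python
# Fields listed above as subject to truncation.
CHEBI_XREF_FIELDS = ("intenz", "rhea", "uniprot", "sabio_rk", "patent")
MAX_XREF_ITEMS = 1000  # threshold described above; the constant name is hypothetical


def drop_large_xrefs(doc, limit=MAX_XREF_ITEMS):
    """Remove any listed chebi.xrefs field whose list holds more than `limit` items."""
    xrefs = doc.get("chebi", {}).get("xrefs", {})
    for field in CHEBI_XREF_FIELDS:
        value = xrefs.get(field)
        if isinstance(value, list) and len(value) > limit:
            del xrefs[field]  # the whole field is removed, not shortened
    return doc
```

Other chebi.xrefs fields are left untouched regardless of their size.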

ChEMBL

Data for ChEMBL is pulled from 6 online JSON sources:

  • Molecule, which serves as the root data source. Entries from the other sources are attached to molecule entries as new fields
  • Drug Indications, which are parsed and attached to molecule entries, e.g. molecule["drug_indications"] = list_of_drug_indications
  • Drug Mechanisms, which are parsed and attached to molecule entries, e.g. molecule["drug_mechanism"] = list_of_drug_mechanism
  • Drug, used to add the first_approval field to drug indication entries
  • Target, used to add the target_name and target_organism fields to drug mechanism entries
  • Binding Sites, used to add the binding_site_name field to drug mechanism entries

A dictionary is created for each chemical, keyed by its standardinchikey, in the following format:

{_id: "standardinchikey", "chembl": {"<drug_indications>":"<...>", "<drug_mechanisms>":"<...>",..}}
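
The attachment step can be sketched roughly as below. This is an illustration under stated assumptions, not the actual parser: the records are assumed to be already-parsed dicts, the join key molecule_chembl_id and the flat standardinchikey key are assumptions about the record layout, and the function name build_chembl_docs is hypothetical. The output key drug_mechanisms follows the format line above. The Drug, Target, and Binding Sites augmentations are omitted for brevity.

```python
from collections import defaultdict


def build_chembl_docs(molecules, indications, mechanisms):
    """Attach indication and mechanism records to molecules, keyed by InChIKey."""

    def group_by_molecule(records):
        # Assumption: every record carries the id of the molecule it belongs to.
        grouped = defaultdict(list)
        for record in records:
            grouped[record["molecule_chembl_id"]].append(record)
        return grouped

    indications_by_mol = group_by_molecule(indications)
    mechanisms_by_mol = group_by_molecule(mechanisms)

    for molecule in molecules:
        entry = dict(molecule)  # copy so the input records are not mutated
        chembl_id = entry["molecule_chembl_id"]
        if chembl_id in indications_by_mol:
            entry["drug_indications"] = indications_by_mol[chembl_id]
        if chembl_id in mechanisms_by_mol:
            entry["drug_mechanisms"] = mechanisms_by_mol[chembl_id]
        # Assumption: the InChIKey is available as a flat key on the molecule.
        yield {"_id": entry["standardinchikey"], "chembl": entry}
```

Grouping the secondary sources by molecule id first means each molecule is processed with a dictionary lookup instead of a scan over every indication and mechanism record.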

DrugBank

Due to licensing restrictions, we removed DrugBank data from MyChem.info on 09/08/2021.

FDA Orphan Drug Designations

This data source was added to MyChem.info on 09/08/2020. The data comes from a JSON file hosted here.

NDC

The value of the ndc field is a list. In rare cases the list is large (up to ~4K items). The entire ndc field is removed if the list contains more than 1000 items.

This truncation affects only 4 objects (as of 2018-11-28, full list here), compared with a total of 36,893 objects containing ndc data (~0.01%).

SIDER

The value of the sider field is a list of side-effect objects. The list is sorted by the value of the sider.side_effect.frequency field in descending order (e.g. “92.6%”, “65%”). Side-effect objects with no sider.side_effect.frequency value, or with a non-numeric value (e.g. “common”, “rare”, “post-marketing”), are kept at the top of the list.

In rare cases the list can be very large (up to ~5K items). We truncate such lists to at most 2000 items.

This truncation affects only 26 objects (as of 2018-11-28, full list here), compared with a total of 1,507 objects containing sider data (~1.7%).
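
The ordering and truncation described above can be sketched as follows. This is a minimal illustration, not the actual ETL code: the function and constant names are hypothetical, and frequencies are assumed to be strings that are either a percentage such as "65%" or a non-numeric label.

```python
MAX_SIDE_EFFECTS = 2000  # cap described above; the constant name is hypothetical


def frequency_sort_key(entry):
    """Order missing or non-numeric frequencies first, then numeric ones descending."""
    frequency = entry.get("side_effect", {}).get("frequency")
    try:
        value = float(str(frequency).rstrip("%"))
    except ValueError:
        return (0, 0.0)  # no frequency, or a label such as "common"
    return (1, -value)  # negated so that larger percentages come first


def sort_and_truncate(side_effects, limit=MAX_SIDE_EFFECTS):
    """Sort the sider list as described above and keep the first `limit` entries."""
    return sorted(side_effects, key=frequency_sort_key)[:limit]
```

Python's sorted is stable, so entries without a numeric frequency keep their original relative order at the top of the list.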

UniChem

Data for UniChem is pulled from 3 files, including:

  • UC_SOURCE.txt.gz, which (once decompressed) maps source ids (src_id) to source names
  • UC_STRUCTURE.txt.gz, which provides the UniChem entry identifiers (uci) as well as the standardinchikey (standardinchikey)
  • UC_XREF.txt.gz, which provides a source id (src_id), the identifier used by the given source (src_compound_id), and the uci

Using the above values from the 3 files, a dictionary is created for each chemical, keyed by its standardinchikey, in the following format:

{_id: "standardinchikey", "unichem": {"<source_name>":"<source_specific_id>", "<source_name>":"<source_specific_id>",..}}
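
The three-way join can be sketched as below. This is an illustration only: the real column layout of the dump files is not described here, so the inputs are assumed to be already-parsed tuples, and the function name build_unichem_docs is hypothetical.

```python
from collections import defaultdict


def build_unichem_docs(sources, structures, xrefs):
    """Join the three UniChem tables into one document per standardinchikey.

    Assumed (hypothetical) input shapes, after parsing the dump files:
      sources:    iterable of (src_id, source_name)
      structures: iterable of (uci, standardinchikey)
      xrefs:      iterable of (uci, src_id, src_compound_id)
    """
    name_by_src = dict(sources)
    inchikey_by_uci = dict(structures)

    merged = defaultdict(dict)
    for uci, src_id, src_compound_id in xrefs:
        inchikey = inchikey_by_uci.get(uci)
        source_name = name_by_src.get(src_id)
        if inchikey is None or source_name is None:
            continue  # skip cross-references that cannot be resolved
        # Simplification: a later id from the same source overwrites an earlier one.
        merged[inchikey][source_name] = src_compound_id

    return [{"_id": key, "unichem": ids} for key, ids in merged.items()]
```

For example, two cross-references that share a uci but come from different sources end up as two keys inside the same "unichem" dictionary.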

Directories containing file dumps can be found at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/oracleDumps/