Unit 4: Informatics & Databases in Drug Design

March 16, 2026

Semester 8
BP807T


Modern drug design is fundamentally data-driven. A computational chemist must navigate large, interconnected biological and chemical data repositories fluently. This unit introduces the core fields of Bioinformatics (managing protein and gene data) and Chemoinformatics (managing chemical structure data), and tours the essential public databases, from the Protein Data Bank (PDB), which holds more than 200,000 experimentally determined 3D structures, to PubChem, which holds more than 100 million chemical compounds.

Syllabus & Topics

  • 1. Introduction to Bioinformatics: The interdisciplinary science that develops computational methods and software tools to analyze large volumes of biological data, primarily gene sequences, protein sequences, and protein 3D structures. Applications in Drug Design: Sequence Alignment (using BLAST to find proteins with similar sequences, and therefore probably similar function, in different organisms), Homology Modeling (predicting the 3D structure of a protein that has no experimental structure from the known structure of a homologous protein), and Phylogenetic Analysis (understanding the evolutionary relationships between target proteins).
  • 2. Introduction to Chemoinformatics: The application of informatics methods to chemical problems, especially those related to drug discovery. Core Functions: Converting molecular structures into computer-readable formats (SMILES, InChI strings), managing large virtual compound libraries, calculating molecular descriptors (MW, LogP, HBD, HBA) for QSAR, and developing quantitative models that link chemical structure to biological activity.
  • 3. ADME Databases: Specialized databases and prediction tools that provide experimental or computationally predicted Absorption, Distribution, Metabolism, and Excretion data for thousands of drug molecules. Purpose: Before synthesizing a molecule, scientists can use these resources to estimate whether it is likely to be orally absorbed, whether it is likely to cross the blood-brain barrier, which CYP450 liver enzymes are likely to metabolize it, and how quickly the kidneys are likely to excrete it. Examples: ADMET Predictor, pkCSM, SwissADME.
  • 4. Chemical Databases: PubChem: The world's largest freely accessible chemical database (maintained by NCBI at the NIH/NLM). Contains over 100 million unique chemical structures with associated bioactivity data from more than a million biological assays. ChEMBL: A manually curated database of bioactive drug-like molecules with quantitative experimental pharmacological activity values (IC50, Ki, EC50) extracted from the literature. ZINC: A large free database of commercially available chemical compounds, formatted and ready for virtual screening and molecular docking.
  • 5. Biochemical and Pharmaceutical Databases: Protein Data Bank (PDB): The central database of structural biology. Contains over 200,000 experimentally determined 3D atomic coordinate structures of proteins, nucleic acids, and complex assemblies (determined by X-ray crystallography, NMR, or cryo-EM). This is where most docking targets come from. UniProt: The standard reference resource for protein sequence and functional annotation information. DrugBank: A comprehensive pharmaceutical database with detailed information about drugs and their targets (mechanisms, pharmacokinetics, interactions, approval status).
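
The descriptor-based pre-screening described in topics 2 and 3 can be sketched in a few lines. The example below is only an illustration: it uses one widely taught heuristic, Lipinski's rule of five, which these notes do not name, and the two candidates and their descriptor values are invented. Real tools such as SwissADME apply many more models than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """The four descriptors named in the syllabus (MW, LogP, HBD, HBA)."""
    name: str
    mol_weight: float       # MW in g/mol
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int      # HBD
    h_bond_acceptors: int   # HBA


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four limits the molecule exceeds."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_filter(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep molecules with at most `max_violations` violations (assumed cutoff)."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical candidates with made-up descriptor values, for illustration only.
candidates = [
    Descriptors("cand_A", 320.4, 2.1, 2, 5),
    Descriptors("cand_B", 712.9, 6.3, 6, 12),
]
kept = [c.name for c in candidates if passes_filter(c)]
print(kept)
```

Here `cand_A` is kept and `cand_B` is dropped, so only `cand_A` would go forward to synthesis or docking.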

Learning Objectives

Define Bioinformatics Scope: Explain the core role of Bioinformatics in computational drug design, specifically how ‘Homology Modeling’ provides 3D protein structures when experimental crystal structures are unavailable.
Navigate Chemical Databases: Differentiate the specific content and primary use-case of PubChem (massive, general chemical info), ChEMBL (curated bioactivity data), and ZINC (virtual screening-ready compounds).
Utilize PDB: Describe the specific molecular information (3D coordinates, resolution, ligand co-crystallization) a computational chemist extracts from a Protein Data Bank (PDB) entry before performing molecular docking.
Predict ADME Properties: Explain how a scientist uses an ADME prediction tool (like SwissADME) to computationally filter out drug candidates with predicted poor oral bioavailability before synthesizing them.
Apply Chemoinformatics: Describe how SMILES notation and molecular descriptor calculation enable computers to mathematically process and compare millions of chemical structures simultaneously.

Exam Prep Questions

Q1. Why is the Protein Data Bank (PDB) considered the most important database in CADD?

In Computer-Aided Drug Design (CADD), molecular docking and structure-based drug design require the three-dimensional structure of the target protein. Without knowing the exact arrangement of atoms in the protein’s binding site, it is impossible to simulate how a drug molecule might interact with it.

The Protein Data Bank (PDB) stores experimentally determined 3D structures of proteins, nucleic acids, and other biomolecules obtained through techniques such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Scientists download these structural files from the PDB and use them in docking or modeling software to study interactions between drugs and biological targets.
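
As a rough sketch of what "using a structural file" means in practice, the code below reads the three things a docking study usually checks first in a legacy PDB-format file: the reported resolution, the protein atom coordinates, and any co-crystallized ligand. The function names (`atom_record`, `summarize_pdb`), the ligand code `LIG`, and all coordinates are invented for illustration. Real projects would normally rely on an established parser instead.

```python
from typing import NamedTuple, Optional


class PdbSummary(NamedTuple):
    resolution: Optional[float]  # angstroms; None if not reported (e.g. NMR)
    protein_atoms: list          # (atom name, residue name, (x, y, z)) tuples
    ligand_names: set            # residue names of non-water HETATM records


def atom_record(record, serial, name, res, chain, res_seq, x, y, z, element):
    """Build one fixed-width ATOM/HETATM line so the sample is column-exact."""
    return (f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}          {element:>2}")


def summarize_pdb(text: str) -> PdbSummary:
    resolution = None
    protein_atoms = []
    ligand_names = set()
    for line in text.splitlines():
        record = line[0:6].strip()
        if record == "REMARK" and line[6:10].strip() == "2" and "RESOLUTION." in line:
            try:
                resolution = float(line.split()[3])
            except (IndexError, ValueError):
                pass  # e.g. "NOT APPLICABLE" for NMR structures
        elif record in ("ATOM", "HETATM"):
            # PDB coordinate records are fixed-column, so slice, do not split.
            atom_name = line[12:16].strip()
            res_name = line[17:20].strip()
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            if record == "ATOM":
                protein_atoms.append((atom_name, res_name, xyz))
            elif res_name != "HOH":  # skip crystallographic water
                ligand_names.add(res_name)
    return PdbSummary(resolution, protein_atoms, ligand_names)


# Made-up sample: two protein atoms, one hypothetical ligand atom, one water.
SAMPLE = "\n".join([
    "REMARK   2 RESOLUTION.    1.80 ANGSTROMS.",
    atom_record("ATOM", 1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504, "N"),
    atom_record("ATOM", 2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147, "C"),
    atom_record("HETATM", 3, "C1", "LIG", "A", 101, 14.250, 8.300, -2.100, "C"),
    atom_record("HETATM", 4, "O", "HOH", "A", 201, 20.000, 1.500, 3.250, "O"),
])

summary = summarize_pdb(SAMPLE)
print(summary.resolution, len(summary.protein_atoms), sorted(summary.ligand_names))
```

A co-crystallized ligand found this way marks the binding site, which is why entries that include one are usually preferred as docking targets.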

Q2. What is a SMILES string in chemoinformatics?

A SMILES (Simplified Molecular Input Line Entry System) string is a text-based representation of a chemical structure that allows computers to store and process molecular information efficiently.

Instead of using graphical molecular drawings, SMILES encodes the structure using characters and symbols that represent atoms and their connectivity. For example:

  • Ethanol → CCO

  • Benzene → c1ccccc1

This format allows chemoinformatics software to store, search, compare, and analyze millions of molecules quickly, making it essential for chemical databases and virtual screening.
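
To show why a text format makes molecules easy to process, here is a minimal sketch that counts heavy (non-hydrogen) atoms directly from a SMILES string. It is a deliberately simplified tokenizer written for this example, not a full SMILES parser: it ignores bonds, rings, charges, and stereochemistry, and the function name `heavy_atom_counts` is made up. Real work would use a chemoinformatics toolkit.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then the one-letter organic
# subset (uppercase = aliphatic, lowercase = aromatic).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a simple SMILES string."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic 'c' and aliphatic 'C' are both carbon
        if symbol != "H":
            counts[symbol] += 1
    return counts


print(heavy_atom_counts("CCO"))                    # ethanol
print(heavy_atom_counts("c1ccccc1"))               # benzene
print(heavy_atom_counts("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

For ethanol this gives two carbons and one oxygen, for benzene six carbons, and for aspirin nine carbons and four oxygens, matching the formulas C2H6O, C6H6, and C9H8O4 once hydrogens are added back.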

Q3. How does homology modeling help when no crystal structure exists for a target protein?

Many biologically important proteins do not yet have experimentally determined 3D structures because techniques such as crystallography or cryo-EM are complex, time-consuming, and expensive.

Homology modeling predicts the 3D structure of an unknown protein by using the known structure of a closely related protein (template) that shares a similar amino acid sequence. Because proteins with similar sequences often fold into similar shapes, scientists can build a predicted structural model of the target protein.

Although the resulting model may not be perfectly accurate, it is often sufficient for molecular docking and preliminary drug design studies when no experimental structure is available.
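
Template selection depends on how similar the target and template sequences are. The sketch below computes percent sequence identity from two sequences that have already been aligned (gaps written as `-`). The function name and the short sequences are invented for illustration, and the alignment itself would normally come from a tool such as BLAST.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments, for illustration only.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 6 of 8 positions match
print(percent_identity("MK-TAY", "MKQTAF"))      # gap column is skipped
```

The first pair gives 75.0 and the second gives 80.0. The higher the identity, the more likely the template's fold is a good basis for the model.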