How to Use KEGG: Pathways, Genes, and Database Tools

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of interconnected databases that maps genes, proteins, compounds, and diseases onto biological pathways. If you’re encountering it for the first time, the sheer number of tools and databases can feel overwhelming. Here’s a practical walkthrough of how to actually use it, from basic browsing to uploading your own data.

What KEGG Contains

KEGG is not a single database. It’s a suite of databases organized around one central idea: linking molecular-level information (individual genes, proteins, chemical compounds) to higher-level biological functions (metabolic pathways, signaling networks, disease mechanisms). The major pieces you’ll interact with most are:

  • PATHWAY: Manually drawn diagrams showing known biological networks, from metabolism to signal transduction. These are the iconic colored maps most people associate with KEGG.
  • GENES: A database of genes across thousands of organisms, pulled from GenBank, RefSeq, and organism-specific databases.
  • KO (KEGG Orthology): A system that groups genes from different organisms by shared function. Each functional group gets a “K number” identifier: the letter K followed by five digits, such as K00001.
  • DRUG and DISEASE: Curated collections linking approved drugs to their targets and diseases to their associated genes, pathogens, and environmental factors.
  • BRITE: Hierarchical classification files that organize genes, compounds, drugs, and diseases into functional categories.

Everything connects through the KO system. A K number assigned to a gene in one organism lets KEGG automatically map that gene onto pathways, modules, and disease entries. This cross-referencing is what makes the whole system powerful.

Browsing Pathway Maps

The pathway maps are where most people start. Go to the KEGG website (kegg.jp) and navigate to the PATHWAY database. You can browse by category (metabolism, genetic information processing, human diseases, etc.) or search for a specific pathway by name or ID number.

Each pathway exists in several versions, and the color coding tells you what you’re looking at. On a reference pathway, boxes are white and link out to general entries for enzymes, reactions, or orthology groups. On an organism-specific pathway, green boxes indicate that the organism actually has genes for that step, giving you an immediate visual picture of which parts of a pathway are present in a given genome. If you’re looking at a human disease pathway, you may see pink boxes for disease genes and light blue boxes for drug targets.

Global maps (numbered in the 01100s) and overview maps (01200s) work differently from regular pathway maps. Instead of boxes, they draw reactions, and the enzymes behind them, as lines that connect compounds, which appear as small circles, across the entire metabolic network. These maps are useful for seeing the big picture but can be dense. Zooming and panning help.
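
If you handle pathway IDs in a script, the numbering convention above can be checked with a small helper. This is a minimal sketch that relies only on the number ranges given here; the function name and the assumption that every ID is a short lowercase prefix (such as map, ko, or an organism code) followed by five digits are illustrative choices, not part of any KEGG tool.

```python
import re

# Assumed ID shape: optional "path:" prefix, a 2-4 letter lowercase code
# (e.g. "map", "ko", or an organism code such as "hsa"), then five digits.
_PATHWAY_ID = re.compile(r"(?:path:)?([a-z]{2,4})(\d{5})")


def map_kind(pathway_id: str) -> str:
    """Classify a pathway ID as 'global', 'overview', or 'regular'."""
    match = _PATHWAY_ID.fullmatch(pathway_id)
    if match is None:
        raise ValueError(f"not a recognizable pathway ID: {pathway_id!r}")
    number = match.group(2)
    if number.startswith("011"):
        return "global"      # numbered in the 01100s
    if number.startswith("012"):
        return "overview"    # numbered in the 01200s
    return "regular"
```

A script could use this to skip global and overview maps when it expects the box-based layout of a regular map.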

Searching for Genes and Proteins

You can search the GENES database in a few ways. The simplest is a keyword search: type in a gene name, accession number, or functional description. You can search across all organisms at once or narrow it to a specific species or group.

If you have a protein or DNA sequence rather than a name, KEGG offers sequence similarity searches. You provide your sequence, and the system finds genes with similar sequences above a threshold you set. Beyond standard similarity searches, KEGG also identifies “best-best neighbors,” which are pairs of genes in two organisms that are each other’s closest match. These reciprocal best hits tend to be more reliable indicators of shared function than one-directional matches.
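
The reciprocal test itself is simple enough to sketch. The following is a toy illustration, not KEGG’s own implementation: it assumes you already have similarity scores from searching organism A against organism B and B against A, and the gene names and scores are made up.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, target gene, similarity score)


def best_hit(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the top-scoring target for each query gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Keep only pairs where each gene is the other's best match."""
    forward = best_hit(a_vs_b)
    reverse = best_hit(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up example: a2's best hit is b2, but b2 prefers a1, so only
# (a1, b1) survives the reciprocal filter.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 150.0), ("b2", "a2", 80.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

The one-directional match from a2 to b2 is dropped, which is exactly the kind of weaker evidence the reciprocal requirement filters out.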

Understanding K Numbers

The KO system is central to how KEGG works, so understanding K numbers will save you time. A K number represents a specific molecular function, defined manually based on experimentally characterized genes. When KEGG annotates a genome, it assigns K numbers to individual genes rather than writing out text descriptions. This standardized labeling is what allows automatic pathway reconstruction: once your genes have K numbers, KEGG can instantly show which pathways, modules, and disease networks they participate in.

Most K numbers originate from well-studied genes in model organisms and are then extended to other species by sequence similarity. When you see a K number on a pathway map, clicking it shows you all the genes across all organisms that share that functional assignment.
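
To make the idea concrete, here is a small sketch of how K numbers act as a shared key across organisms. The gene identifiers and assignments are invented for illustration; the only convention taken as given is that a K number is the letter K followed by five digits.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

K_NUMBER = re.compile(r"K\d{5}")


def genes_by_k_number(
    assignments: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (gene_id, k_number) pairs so each K number lists its genes."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for gene_id, k_number in assignments:
        if not K_NUMBER.fullmatch(k_number):
            raise ValueError(f"not a K number: {k_number!r}")
        groups[k_number].append(gene_id)
    return dict(groups)


# Hypothetical genes from three organisms sharing two functions.
toy_assignments = [
    ("org1:geneA", "K00001"),
    ("org2:geneX", "K00001"),
    ("org3:gene7", "K00002"),
]
groups = genes_by_k_number(toy_assignments)
```

Once genes are keyed this way, any resource indexed by K number (a pathway, a module, a disease entry) can be looked up for all of them at once.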

Annotating Your Own Sequences

If you have genomic or metagenomic sequence data and want KEGG functional annotations, the built-in tools for this are BlastKOALA and GhostKOALA. GhostKOALA handles larger datasets (metagenomes), while BlastKOALA is suited for individual genomes. The process is straightforward:

Paste or upload your amino acid sequences in FASTA format, select the database you want to search against, then enter your email address and request confirmation. You’ll receive an email with a link that confirms and submits the job. The system queues your request, and when it finishes, you get a second email with a link to your results. Results are available for seven days before they’re deleted, so download them promptly.

The output assigns K numbers to your sequences, which you can then feed into KEGG Mapper to visualize which pathways your dataset covers.
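
Before passing the assignments on, you will usually want to separate annotated sequences from unannotated ones. The sketch below assumes the downloaded file is tab-separated, with a sequence ID in the first column and a K number in the second when one was assigned; check your actual download before relying on that layout. The function name and sample text are illustrative.

```python
from typing import Dict, List, Tuple


def parse_ko_assignments(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Split assignment text into {sequence_id: k_number} and unannotated IDs."""
    annotated: Dict[str, str] = {}
    unannotated: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        sequence_id = fields[0].strip()
        k_number = fields[1].strip() if len(fields) > 1 else ""
        if k_number:
            annotated[sequence_id] = k_number
        else:
            unannotated.append(sequence_id)
    return annotated, unannotated


# Made-up sample in the assumed layout.
sample = "seq1\tK00001\nseq2\nseq3\tK00002\nseq4\t\n"
annotated, unannotated = parse_ko_assignments(sample)
coverage = len(annotated) / (len(annotated) + len(unannotated))
```

The coverage figure (here, half of the toy sequences) is a quick sanity check on how much of your dataset received a functional assignment.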

Using KEGG Mapper Tools

KEGG Mapper is the toolset for projecting your own data onto KEGG’s maps and hierarchies. It has three core tools:

The Search tool takes a list of identifiers (genes, compounds, reactions, drugs) and highlights them in red on pathway maps, BRITE hierarchies, and module diagrams. This is the quickest way to see where your molecules of interest sit in the broader biological context.

The Color tool works the same way but lets you assign custom background and foreground colors to different identifiers. This is particularly useful for distinguishing categories in your data, like coloring upregulated genes in one color and downregulated genes in another. You enter your data as KEGG identifiers followed by color specifications, select a pathway map, and the tool renders your color-coded overlay.
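
Preparing that input by hand gets tedious for more than a few genes, so it helps to generate it. This sketch assumes the tool accepts one identifier per line followed by a space and a "background,foreground" color pair; confirm the exact syntax on the tool’s input page. The threshold, colors, and fold-change values are illustrative.

```python
from typing import Dict


def color_tool_input(log2_fold_changes: Dict[str, float],
                     threshold: float = 1.0,
                     up_bg: str = "red",
                     down_bg: str = "blue",
                     fg: str = "black") -> str:
    """Build Color tool input, skipping identifiers within the threshold."""
    lines = []
    for identifier, log2fc in log2_fold_changes.items():
        if log2fc >= threshold:
            lines.append(f"{identifier} {up_bg},{fg}")      # upregulated
        elif log2fc <= -threshold:
            lines.append(f"{identifier} {down_bg},{fg}")    # downregulated
    return "\n".join(lines)


# Made-up expression changes keyed by K number.
text = color_tool_input({"K00001": 2.3, "K00002": -1.5, "K00003": 0.2})
```

Identifiers that do not pass the threshold are left out entirely, so they keep the map’s default coloring.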

The Join tool serves a different purpose. It combines a BRITE hierarchy or table file with a binary relation file by matching KEGG identifiers, effectively adding a new data column to the hierarchy. This works for gene and protein hierarchies, as well as for compound, drug, disease, and organism classifications. A newer BRITE hierarchy viewer lets you perform both Search and Join operations directly in the browser.
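
Conceptually, the join is a lookup by identifier. The following sketch mimics it on in-memory data rather than real BRITE files: the hierarchy rows, the two-column relation text, and the function names are all hypothetical, and the actual file formats are richer than this.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_binary_relation(text: str) -> Dict[str, List[str]]:
    """Read 'identifier<TAB>value' lines into {identifier: [values]}."""
    relation: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        identifier, _, value = line.partition("\t")
        relation[identifier.strip()].append(value.strip())
    return dict(relation)


def join_column(rows: List[Tuple[str, str]],
                relation: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Append a new column to each (identifier, name) row by matching IDs."""
    return [
        (identifier, name, ",".join(relation.get(identifier, [])))
        for identifier, name in rows
    ]


# Toy hierarchy rows and a toy relation that tags entries with samples.
rows = [("K00001", "entry one"), ("K00002", "entry two")]
relation = parse_binary_relation("K00001\tsampleA\nK00001\tsampleB\n")
joined = join_column(rows, relation)
```

Rows without a match keep an empty value in the new column, so the hierarchy itself is never shortened by the join.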

Exploring Drug and Disease Data

KEGG’s DRUG database covers approved drugs with information on their targets, the enzymes that metabolize them, and the transporters involved. The DISEASE database catalogs diseases alongside their associated genes, carcinogens, pathogens, and environmental risk factors. Together, these databases connect to the pathway maps so you can see how a drug’s target fits into a signaling cascade, or which pathways are disrupted in a given disease.

KEGG MEDICUS integrates both databases with drug label information (package inserts). For a visual overview, organism-specific maps using the code “hsadd” display human pathways with disease genes highlighted in pink and drug targets in light blue. A dedicated “Search Disease” tool lets you map gene identifiers or K numbers directly to disease entries.

The API also includes a drug-drug interaction query (the “ddi” operation) that identifies adverse interactions between drugs in the database.
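
A request to this operation is just a URL containing the drug identifiers. The sketch below builds one under two assumptions that you should verify against the API manual: that multiple identifiers are joined with a plus sign, and that drug identifiers (D numbers) are the letter D followed by five digits. The D numbers shown are only examples, and the helper name is made up.

```python
import re
from typing import Iterable

DRUG_ID = re.compile(r"D\d{5}")


def ddi_url(drug_ids: Iterable[str]) -> str:
    """Build a drug-drug interaction query URL for one or more D numbers."""
    ids = list(drug_ids)
    if not ids:
        raise ValueError("at least one drug identifier is required")
    for drug_id in ids:
        if not DRUG_ID.fullmatch(drug_id):
            raise ValueError(f"not a D number: {drug_id!r}")
    return "https://rest.kegg.jp/ddi/" + "+".join(ids)


url = ddi_url(["D00564", "D00100"])
```

Validating identifiers before building the URL keeps malformed input from reaching the server.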

Accessing KEGG Programmatically

KEGG provides a REST API for retrieving data without using the web interface. Every call follows the same URL pattern: https://rest.kegg.jp/<operation>/<argument>. The available operations are:

  • info: Returns release information for a database.
  • list: Returns entry identifiers and names. For example, /list/pathway/hsa returns all human pathways.
  • find: Searches for entries matching a keyword.
  • get: Retrieves full entry data.
  • conv: Converts identifiers between KEGG and external databases.
  • link: Finds cross-references between databases. The target database comes first, followed by the source database or entry. For instance, /link/pathway/hsa returns all pathways linked to each human gene, while /link/hsa/hsa00010 lists every human gene in the glycolysis pathway.
  • ddi: Queries drug-drug interactions.

These calls return plain text, making them easy to parse in Python, R, or any scripting language. If you’re running enrichment analysis or building automated pipelines, the API is far more practical than clicking through the website. Several R packages (like KEGGREST) and Python libraries wrap these API calls for convenience.
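
If you would rather not add a dependency, the standard library is enough for simple calls. This is a minimal sketch using the URL pattern and operation names described above; the helper names are made up, the parser assumes tab-separated two-column output (as returned by list and link), and the sample text is illustrative rather than real output.

```python
from typing import List, Tuple
from urllib.request import urlopen

BASE_URL = "https://rest.kegg.jp"
OPERATIONS = {"info", "list", "find", "get", "conv", "link", "ddi"}


def kegg_url(operation: str, *arguments: str) -> str:
    """Build a request URL of the form BASE_URL/<operation>/<argument>..."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return "/".join([BASE_URL, operation, *arguments])


def parse_two_columns(text: str) -> List[Tuple[str, str]]:
    """Split tab-separated response lines into (left, right) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        left, _, right = line.partition("\t")
        pairs.append((left, right))
    return pairs


def fetch(operation: str, *arguments: str, timeout: float = 30.0) -> str:
    """Perform the request and return the plain-text body (needs network)."""
    with urlopen(kegg_url(operation, *arguments), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration with made-up response text.
url = kegg_url("link", "hsa", "hsa00010")
pairs = parse_two_columns("path:hsa00010\thsa:0001\npath:hsa00010\thsa:0002\n")
```

With network access, parse_two_columns(fetch("list", "pathway", "hsa")) would give you pathway IDs paired with their names. Keep request volume modest in automated pipelines and cache results where you can.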

Licensing and Access

KEGG is free to browse on the web for academic users. If you’re at a university or research institution, you can use kegg.jp and its GenomeNet mirror without restriction. However, if your academic work involves providing KEGG-based services to others (running a web tool that queries KEGG, for example), you need an academic service provider license, which comes with a KEGG FTP subscription.

Commercial users need a paid license. KEGG is not publicly funded and is explicit that it is not a public database. Commercial licensing is handled through Pathway Solutions, which also operates KEGG FTP download sites and a commercial mirror. If you’re in industry, contact them directly for pricing and terms before building KEGG into any product or service.