Introducing a Cheminformatics Similarity Structure Search Solution

Drug Discovery

About the Gemini® APU

The Gemini® Associative Processing Unit (APU) is GSI’s patented processing technology and a new breed of processor. It performs massively parallel data processing, computation, and search in place, directly in the memory array. The APU’s design eliminates bandwidth-costly data transfers between the memory and the processor. This gives the APU a performance edge in accelerating similarity search applications.

Gemini® APU delivers accurate search results while reducing search times on circular fingerprints in large databases, from many minutes to fractions of a second, and with very low power consumption.

The APU as a Driver of Similarity Search

Similarity search is the most general term used for a range of mechanisms which share the principle of searching (typically, very large) spaces of objects where the only available comparator is the similarity between any pair of objects. This is becoming increasingly important in an age of large information repositories where the objects contained do not possess any natural order.

Similarity search involves searching a database of vectors for similarity to a given query. The query itself must be of identical size to the database records (e.g., 512 or 1024 bits).

At GSI, we implement similarity search and top-k using parallel processing. Current processor technologies are not well suited to similarity search and top-k problems because of their high memory bandwidth demands. For similarity search, the APU’s advantages over the CPU are shorter processing time (the APU can process a similarity search in less than 15 microseconds), lower power consumption, and lower latency.
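As a point of reference, a brute-force top-k search over fixed-width bit vectors can be sketched in a few lines of Python. This is an illustrative CPU baseline, not GSI's implementation: the function name `top_k`, the use of Python integers as bit vectors, and the shared-bit count used as the similarity score are all assumptions made for the example.

```python
import heapq


def top_k(query, database, k, width=512):
    """Return the k best (score, index) pairs, highest score first.

    Fingerprints are held as Python ints used as bit vectors (an assumption
    for this sketch). The score is the number of set bits shared with the
    query, a deliberately simple placeholder similarity.
    """
    limit = 1 << width
    if not 0 <= query < limit:
        raise ValueError("query does not fit in %d bits" % width)
    scored = []
    for index, record in enumerate(database):
        # The query and every record must have the same width.
        if not 0 <= record < limit:
            raise ValueError("record %d does not fit in %d bits" % (index, width))
        shared = bin(query & record).count("1")
        scored.append((shared, index))
    # Highest score first; ties go to the lower index.
    return heapq.nlargest(k, scored, key=lambda pair: (pair[0], -pair[1]))


database = [0b1111, 0b0011, 0b1000, 0b0111]
print(top_k(0b0111, database, k=2, width=4))  # [(3, 0), (3, 3)]
```

Every record is read once per query, which is why memory bandwidth, rather than arithmetic, tends to dominate this kind of workload.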

Similarity Structure Search in Cheminformatics

Cheminformatics refers to the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used in pharmaceutical companies and academic settings in the process of drug discovery. They allow for the reduction of time and cost needed to develop new drugs. Cheminformatics is also used in the chemical and allied industries in various other forms.

Cheminformatics uses very large databases. To date, several databases have been developed for cataloging molecule data for drug development.

These databases consist of chemical fingerprints, which encode the features of chemical elements within a matrix and therefore define its unique portrait in comparison to similar matrices. Molecular fingerprints are widely used in drug discovery and virtual screening. The reasons for their popularity include ease of use (requiring little to no configuration), the speed at which substructure and similarity searches can be performed with them, and virtual screening performance that equals that of other, more complex methods.

GSI Joins Forces with the Weizmann Institute of Science & G-INCPM

Project Overview

As a proof of concept for the APU technology with similarity structure search applications, GSI Technology has teamed up with researchers at the Weizmann Institute of Science, in Rehovot, Israel, and The Nancy & Stephen Grand Israel National Center for Personalized Medicine (G-INCPM), which is based at and managed by the Weizmann Institute. G-INCPM’s multi-disciplinary teams of scientists and clinicians—working in collaboration with biomedical and pharmaceutical companies—conduct research involving genomics, protein profiling, bioinformatics, and drug discovery, translating those insights to the clinic for the benefit of human health.

As part of ongoing research on new molecules with pharmacological properties, and with the goal of obtaining the approval of the FDA and other agencies, a team at the Weizmann Institute and G-INCPM is searching an established database of 38 million molecular fingerprints with the goal of identifying molecules that are similar—or have similar properties—to their query molecule.

The Weizmann Institute of Science & G-INCPM’s Approach to Structure Search

Researchers at the Weizmann Institute and G-INCPM have been using BIOVIA Pipeline Pilot[1] data analysis software, Oracle Database with BIOVIA Direct chemical cartridge installed, and a CPU, to fetch, process, and analyze molecular structures data. The BIOVIA Direct chemistry data cartridge enables researchers to register, search, and retrieve molecules and reactions in a fully integrated, relational Oracle® environment.

The BIOVIA Direct chemical cartridge supports fingerprint generation using MDL keys and circular fingerprints such as ECFP-4/6 or FCFP-4/6. When it comes to finding, fetching, and performing similarity structure searches, BIOVIA Direct delivers faster results on MDL keys. Search response time for circular fingerprints is much slower.

Regardless of the search technology, MDL keys only deliver relevant results when the similarity threshold is 0.8 and above, while circular fingerprints deliver relevant results on any threshold value. Similarity thresholds of 0.8 and below are of particular interest to researchers.

Therefore, a new solution is required that will enable the researchers to perform searches quickly and efficiently using circular fingerprints.

A New Approach with Gemini® APU

To speed up the similarity structure search process, GSI is replacing the BIOVIA similarity search engine, which is based on the Oracle Direct chemical cartridge, with an APU-based similarity search engine. GSI has replaced the Pipeline Pilot “Search Fingerprint Column” component, which calls Oracle Database, with a new Python API that calls the Gemini® APU processor and the GSI Numeric Library (GNL). In the GNL, GSI has implemented similarity search based on the Tanimoto distance measure.

The following diagram provides an overview of GSI’s APU similarity search solution. The Python API connects Pipeline Pilot to GNL. It can equally be used by other third-party applications, such as ChemAxon and ChemSpider.

GSI's APU Similarity Search Solution

[1] BIOVIA Pipeline Pilot is a software program developed by the Accelrys® group.

The GSI Similarity Structure Search Process

  1. The database is loaded once into the system memory. It only needs to be loaded again if changes are made to it. This step represents a new capability in Pipeline Pilot, developed by GSI.
  2. Pipeline Pilot converts the researcher’s query molecule into a fingerprint (existing functionality).
  3. GSI replaces the existing Pipeline Pilot similarity search component with a GSI Search component. This component calls the GNL function that performs the following actions:
    1. The query fingerprint is loaded into the L1 memory block.
    2. The molecule database is divided into manageable chunks of data that can fit into L1 memory.
    3. The Tanimoto coefficient is used to perform the similarity search. It is considered an ideal method for fingerprint-based calculations of the similarity of molecular representations.
  4. GSI’s Python code calls the GNL Tanimoto distance measures, which find the values and indices of the k nearest entries.
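The steps above can be sketched in plain Python. For two bit-vector fingerprints A and B, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either, T(A, B) = |A AND B| / |A OR B|. The sketch below is a CPU-only illustration under stated assumptions: it does not call the GNL, the names `tanimoto`, `chunks`, and `knn_search` are hypothetical, fingerprints are held as Python integers, and the chunk size is an arbitrary stand-in for whatever fits into L1 memory.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two bit-vector fingerprints held as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention assumed here: two empty fingerprints match
    return bin(a & b).count("1") / union


def chunks(database, chunk_size):
    """Yield (start_index, chunk) pairs covering the whole database."""
    for start in range(0, len(database), chunk_size):
        yield start, database[start:start + chunk_size]


def knn_search(query, database, k, chunk_size=4):
    """Return (values, indices) of the k entries most similar to the query."""
    best = []  # min-heap of (score, -index) holding the k best seen so far
    for start, chunk in chunks(database, chunk_size):
        for offset, fingerprint in enumerate(chunk):
            item = (tanimoto(query, fingerprint), -(start + offset))
            if len(best) < k:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
    ranked = sorted(best, reverse=True)  # best score first, lower index on ties
    values = [score for score, _ in ranked]
    indices = [-negated for _, negated in ranked]
    return values, indices


database = [0b1100, 0b0110, 0b1111, 0b0001, 0b1110]
values, indices = knn_search(0b1110, database, k=3, chunk_size=2)
print(indices)  # [4, 2, 0]
```

Keeping only a k-sized heap between chunks means the running result stays small no matter how many chunks the database is split into.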

Some Screenshots From Pipeline Pilot

The following screenshots from BIOVIA’s Pipeline Pilot user interface show the similarity search pipeline and where GSI’s Python API has been embedded in it, connecting GSI’s APU and GNL resources.

GSI components imported into the Pipeline Pilot interface:

The imported GSI file is added to the file pane:

The GSI file becomes a Pipeline Pilot component:

The GSI component’s parameters:

A hyperlink from the BIOVIA Web Port search page is added to the GSI configuration UI.

A Configuration hyperlink added to the BIOVIA Web Port search page.

The link opens the GSI demo tool:

APU Performance: A Detailed Example

In a current typical Pipeline Pilot workflow, a scientist has to wait several minutes for just one molecule similarity search to be completed. The APU's unique hyper-scale computational search can speed up the search by many orders of magnitude.

Consider the following detailed example: we want to compute the Tanimoto similarity of a query molecule against a database of 38 million molecules represented by ECFP fingerprints, each 512 bits in size.
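To give a sense of scale, the raw size of such a database follows directly from the figures in this example. The short calculation below counts only the packed fingerprint bits; it assumes no per-record identifiers, padding, or metadata, which a real system may well store. The helper name `database_bytes` is hypothetical.

```python
NUM_MOLECULES = 38_000_000  # database size used in this example


def database_bytes(num_fingerprints, bits_per_fingerprint):
    """Raw size in bytes of tightly packed fingerprints (no metadata)."""
    return num_fingerprints * bits_per_fingerprint // 8


for bits in (512, 1024):
    size = database_bytes(NUM_MOLECULES, bits)
    print(f"{bits}-bit: {size:,} bytes ({size / 1e9:.2f} GB)")
# 512-bit: 2,432,000,000 bytes (2.43 GB)
# 1024-bit: 4,864,000,000 bytes (4.86 GB)
```

Under these assumptions, every single query has to touch roughly 2.4 GB of fingerprint data at 512 bits, or about twice that at 1024 bits.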

The APU processes chunks of fingerprints while determining Tanimoto similarities to the query. While processing occurs, the APU simultaneously loads the next chunk of fingerprints. This kind of pipelining maximizes the throughput for Tanimoto search. Thanks to parallel processing, the APU can process multiple queries in one cycle, at speeds that far exceed current CPU capabilities.
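The overlap of loading and processing described here is a form of double buffering. The sketch below shows only the control flow, using a background thread as a stand-in for the data transfer; it is not how the APU hardware is programmed, and in standard Python it will not produce a real speed-up. The names `load_chunk` and `pipelined_scores` are hypothetical, and fingerprints are again assumed to be Python integers.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto coefficient of two bit-vector fingerprints held as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention assumed here: two empty fingerprints match
    return bin(a & b).count("1") / union


def load_chunk(database, start, chunk_size):
    """Stand-in for a transfer into fast memory: copy one chunk."""
    return list(database[start:start + chunk_size])


def pipelined_scores(query, database, chunk_size):
    """Score every fingerprint, loading chunk i+1 while scoring chunk i."""
    starts = list(range(0, len(database), chunk_size))
    scores = []
    if not starts:
        return scores
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_chunk, database, starts[0], chunk_size)
        for next_start in starts[1:] + [None]:
            chunk = pending.result()  # wait for the current chunk to arrive
            if next_start is not None:
                # Kick off the next load before scoring the current chunk.
                pending = loader.submit(load_chunk, database, next_start, chunk_size)
            scores.extend(tanimoto(query, fp) for fp in chunk)
    return scores


database = [0b1100, 0b0110, 0b1111, 0b0001, 0b1110]
print(pipelined_scores(0b1110, database, chunk_size=2)[-1])  # 1.0
```

The point of the structure is that the processor never sits idle waiting for data, as long as loading a chunk takes no longer than scoring one.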


The following table shows Gemini® APU query processing time for KNN similarity search, with 512-bit and 1024-bit vectors, and the number of queries ranging from 1 to 100.

The table below shows search time performance for Gemini® query processing on 38 million compounds using the Tanimoto threshold. The threshold range is 1 – 0.01, where 1 means an exact match.
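A threshold search differs from a k-nearest-neighbor search in that it returns every entry whose similarity reaches the threshold, so the result size is not known in advance. A minimal CPU sketch follows, with the hypothetical names `tanimoto` and `threshold_search` and fingerprints assumed to be Python integers. Note that a score of 1 here means the two fingerprints are identical bit for bit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two bit-vector fingerprints held as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention assumed here: two empty fingerprints match
    return bin(a & b).count("1") / union


def threshold_search(query, database, threshold):
    """Return (index, score) for every entry with score >= threshold."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in the range (0, 1]")
    hits = []
    for index, fingerprint in enumerate(database):
        score = tanimoto(query, fingerprint)
        if score >= threshold:
            hits.append((index, score))
    return hits


database = [0b1100, 0b0110, 0b1111, 0b0001, 0b1110]
print(threshold_search(0b1110, database, 1.0))  # [(4, 1.0)]
print(threshold_search(0b1110, database, 0.7))  # [(2, 0.75), (4, 1.0)]
```

Lowering the threshold can only grow the hit list, so searches near the low end of the range return far more candidates than searches near 1.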



Power Consumption

The APU reduces search power consumption by a factor of 3.5.

Typical Power Consumption, APU vs. CPU

Future Implementations

The APU can be easily integrated with other similarity search applications, using a Python API. Future implementations include:

  • Support for a single query on an APU chip against the Enamine REAL database (680M compounds).
  • Calculate more descriptive fingerprints by using longer folded fingerprints (bit sizes of 256, 512, 1024, and up to 8192).
  • Load multiple databases into the APU system memory and select one database for a specific query.
  • Submit a batch of compounds (hundreds or thousands) simultaneously.
  • A protocol that enables interactive searching (i.e., in real time).
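For readers unfamiliar with folded fingerprints, one common way to shorten a fingerprint is to fold it: the upper half is OR-ed onto the lower half, repeatedly, until the target width is reached. Shorter fingerprints are cheaper to store and compare, but more features collide on the same bit, which is why longer folded fingerprints are more descriptive. The sketch below assumes this OR-folding scheme and Python integers as bit vectors; whether the fingerprints in this project are folded exactly this way is not stated here, and the name `fold` is hypothetical.

```python
def fold(fingerprint, width, target_width):
    """Fold a width-bit fingerprint down to target_width bits by OR-ing halves.

    width must equal target_width times a power of two.
    """
    ratio, remainder = divmod(width, target_width)
    if remainder != 0 or ratio < 1 or ratio & (ratio - 1) != 0:
        raise ValueError("width must be target_width times a power of two")
    if not 0 <= fingerprint < (1 << width):
        raise ValueError("fingerprint does not fit in %d bits" % width)
    while width > target_width:
        width //= 2
        mask = (1 << width) - 1
        # OR the upper half onto the lower half.
        fingerprint = (fingerprint & mask) | (fingerprint >> width)
    return fingerprint


print(bin(fold(0b10010010, 8, 4)))  # 0b1011
print(bin(fold(0b10010010, 8, 2)))  # 0b11
```

In the 8-to-2 example three distinct set bits collapse into two, a small-scale picture of the information lost at shorter widths.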


Joining the Fight Against Covid-19

Joining the fight against Covid-19, researchers at the Weizmann Institute are searching for molecules that are structurally similar to others that have a known effect on the virus.

A central premise of medicinal chemistry is that structurally similar molecules exhibit similar biological activities. The coronavirus (SARS-CoV-2) is related to SARS-CoV, which was responsible for the SARS outbreak of the early 2000s. Remdesivir is an antiviral drug that has been shown in laboratory tests to be effective against SARS-CoV.

With Gemini® APU, an in-house database of 40 million compounds, and with Remdesivir as their query molecule, researchers are performing very fast similarity structure searches on their database to identify structurally similar compounds. They hope to identify active compounds that can be used to target the coronavirus.