The Protein Data Bank archive, which contains more than 160,000 3D structures for proteins, DNA, and RNA, this month released a new coronavirus protease structure following the recent coronavirus outbreak, an ongoing viral epidemic primarily affecting mainland China that now threatens to spread to populations in other parts of the world.
The structure, the topic of the PDB’s current ‘Molecule of the Month’ feature, is a high-resolution crystal structure of 2019-nCoV coronavirus 3CL hydrolase (Mpro) as determined by Zihe Rao and Haitao Yang’s research team at ShanghaiTech University. Rapid public release of this structure of the main protease of the virus, known within the archive as PDB 6lu7, will enable research on this newly-recognized human pathogen. More details are available from the PDB.
The PDB archive is jointly managed by the Worldwide Protein Data Bank partnership, involving data centers in the United States, Europe and Asia. U.S. operations are led by the RCSB Protein Data Bank at Rutgers, the San Diego Supercomputer Center (SDSC) at UC San Diego, and UC San Francisco. PDB data provide a starting point for structure-guided drug discovery.
Such viruses have increasingly become a danger to world health, given the increase in global travel, according to the PDB release. Particularly virulent forms have emerged from their natural animal hosts and pose a threat to human communities. In 2003, the Severe Acute Respiratory Syndrome (SARS) virus emerged in China from bat populations, moving to civets and finally to humans. Ten years later, the MERS virus also emerged from bats, transferring in the Middle East to dromedary camels and then to humans.
While the latest entry is currently the only public-domain 3D structure from this specific coronavirus, the PDB archive also contains structures of the corresponding enzyme from other coronaviruses. The 2003 outbreak of the closely-related SARS virus led to the first 3D structures, and today there are more than 200 PDB structures of SARS proteins.
“Function follows form in biology,” said Stephen K. Burley, physician-scientist and director of the RCSB Protein Data Bank and faculty member at Rutgers University and UC San Diego-SDSC. “Open access to PDB data ensures that rapid access to rigorously validated and expertly curated 3D structure information contributes broadly to research and education in fundamental biology, biomedicine, bioenergy, and biotechnology.”
The coronavirus 3CL hydrolase (Mpro) enzyme, also known as the main protease, is essential for proteolytic maturation of the virus. It is thought to be a promising target for discovery of small-molecule drugs that would inhibit cleavage of the viral polyprotein and prevent spread of the infection.
Comparison of the protein sequence of the 2019-nCoV coronavirus 3CL hydrolase (Mpro) against the PDB archive identified 95 PDB proteins with at least 90% sequence identity.
Furthermore, these related protein structures contain approximately 30 distinct small molecule inhibitors, which could guide discovery of new drugs. Of particular significance for drug discovery is the very high amino acid sequence identity (96%) between the 2019-nCoV coronavirus 3CL hydrolase (Mpro) and the SARS virus main protease (PDB 1q2w). Summary data about these closely-related PDB structures are available (CSV) to help researchers more easily find this information.
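The sequence-identity figures quoted above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned by an external tool (gaps written as "-") and uses one common convention: identical residues divided by the number of aligned columns with no gap in either sequence. The function names, the toy sequences, and the hit labels are hypothetical. They are not the PDB's search method, and other conventions give slightly different percentages.

```python
from typing import Dict, List, Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment.

    Assumption: both strings come from the same alignment, so they have
    equal length, and '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def filter_hits(
    alignments: Dict[str, Tuple[str, str]], threshold: float = 90.0
) -> List[Tuple[str, float]]:
    """Keep hits whose identity to the query is at least `threshold` percent.

    `alignments` maps a hit label to (aligned_query, aligned_hit).
    Results are sorted from highest to lowest identity.
    """
    scored = [
        (label, percent_identity(query, hit))
        for label, (query, hit) in alignments.items()
    ]
    kept = [(label, pid) for label, pid in scored if pid >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy ten-residue fragments, invented for illustration only.
    toy = {
        "hit_close": ("ACDEFGHIKL", "ACDEFGHIKV"),
        "hit_far": ("ACDEFGHIKL", "ACDWWWWIKV"),
    }
    for label, pid in filter_hits(toy, threshold=90.0):
        print(f"{label}: {pid:.1f}% identity")
```

In practice the alignments would come from a dedicated sequence-search tool. The sketch only shows how a cutoff such as "at least 90% sequence identity" can be applied once alignments are in hand.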
In addition, the PDB houses 3D structure data for more than 20 unique SARS proteins represented in more than 200 PDB structures, including a second viral protease, the RNA polymerase, the viral spike protein, a viral RNA, and other proteins (CSV).
“Coronavirus main proteases represent attractive targets for drug discovery and development. 3D structure information freely available from the PDB includes small chemicals bound tightly to the enzyme active site (the business end of the main protease), confirming that they are druggable targets,” explained Burley. “Some of these structures provide starting points for structure-guided drug discovery of protease inhibitors with drug-like properties suitable for preclinical testing. We hope that this new structure, and those from SARS and MERS, will help researchers and clinicians address the 2019-nCoV coronavirus global public health emergency.”
Reproducible and Scalable Structural Bioinformatics
Scientists face time-consuming barriers when applying structural bioinformatics analysis, including complex software setups, non-interoperable data formats, and lack of documentation, all of which make it difficult to reproduce results and reuse software pipelines. A further challenge is the ever-growing size of datasets that need to be analyzed.
To address these challenges, SDSC’s Structural Bioinformatics Laboratory, directed by Peter Rose, is developing a suite of reusable, scalable software components called MMTF-PySpark. The suite builds on three key technologies, among them a parallel distributed processing framework that provides scalable computing and the MacroMolecular Transmission Format (MMTF), a new compressed binary representation of macromolecular structures that enables high-performance processing of PDB structures.
“The use of MMTF-PySpark could easily shave off a year of a graduate student’s or postdoc’s work in structural bioinformatics,” said Rose. “We bank on contributions from the community to develop and share an ecosystem of interoperable tools.”