Materials discovery acceleration by using condition generative methodology

Caiyuan Ye Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Yuzhi Wang Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Xintian Xie Shanghai Key Laboratory of Molecular Catalysis and Innovative Materials, Department of Chemistry, Key Laboratory of Computational Physical Science (Ministry of Education), Fudan University, Shanghai 200433, China
   Tiannian Zhu Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Jiaxuan Liu Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Yuqing He Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
   Lili Zhang Chongqing Bitmap Information Technology Co. Ltd, Chongqing 402760, China
   Junwei Zhang Chongqing Bitmap Information Technology Co. Ltd, Chongqing 402760, China
   Zhong Fang Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Lei Wang Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
   Zhipan Liu zpliu@fudan.edu.cn Shanghai Key Laboratory of Molecular Catalysis and Innovative Materials, Department of Chemistry, Key Laboratory of Computational Physical Science (Ministry of Education), Fudan University, Shanghai 200433, China
   Hongming Weng hmweng@iphy.ac.cn Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
   Quansheng Wu quansheng.wu@iphy.ac.cn Beijing National Laboratory for Condensed Matter Physics and Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

With the rapid advancement of AI technologies, generative models have been increasingly employed in the exploration of novel materials. By integrating traditional computational approaches such as density functional theory (DFT) and molecular dynamics (MD), existing generative models, including diffusion models and autoregressive models, have demonstrated remarkable potential in the discovery of novel materials. However, their efficiency in goal-directed materials design remains suboptimal. In this work, we developed a highly transferable, efficient and robust conditional generation framework, PODGen, by integrating a general generative model with multiple property prediction models. Based on PODGen, we designed a workflow for high-throughput conditional generation of crystals and used it to search for new topological insulators (TIs). Our results show that the success rate of generating TIs using our framework is 5.3 times higher than that of the unconstrained approach. More importantly, while general methods rarely produce gapped TIs, our framework succeeds consistently, an effectively infinite improvement. This demonstrates that conditional generation significantly enhances the efficiency of targeted material discovery. Using this method, we generated tens of thousands of new topological materials and conducted further first-principles calculations on those with promising application potential.
Furthermore, we identified promising, synthesizable topological (crystalline) insulators such as $\mathrm{CsHgSb}$, $\mathrm{NaLaB_{12}}$, $\mathrm{Bi_{4}Sb_{2}Se_{3}}$, $\mathrm{Be_{3}Ta_{2}Si}$ and $\mathrm{Be_{2}W}$.

I Introduction

New materials play a crucial role in industrial and technological fieldszhang2021dealing , offering unique properties that drive more efficient and sustainable solutionshu2021research . Crystalline materials, with their highly ordered structures, excel in areas such as electronics, optoelectronics, and medicine, providing significant support for technological advancementsdisa2021engineering ; butt2021recent ; chaudhary2022advances . However, traditional experimental and theoretical computation methods are increasingly unable to meet growing demandsagrawal2016perspective .

With the rapid advancement of AI technology over the past decade, new research paradigms have been introduced into the discovery of novel crystal materials, offering the potential to overcome the limitations of traditional methodsparadigms ; review . Well-developed predictive machine learning models have already demonstrated their ability to facilitate the rapid and accurate screening of crystal structurescgcnn ; megnet ; alignn ; dimenet ; gemnet ; coNGN ; liang2023material , assisting, accelerating, and even replacing first-principles calculationsm3gnet ; chgnet ; gnome ; liang2024cluster ; yang2024mattersim ; omat24 ; AlphaNet ; hamgnn ; hamgnn2 ; deeph2 ; deephe3 ; dpmoire . In recent years, various generative machine learning models have been applied to the exploration of new crystal structures. Examples include diffusion-based modelscdvae ; diffcsp ; diffcsp++ ; joshi2025all such as CDVAE, autoregressive modelsschnet ; cifllm ; crystalformer such as CrystalFormer, flow-based modelsmiller2024flowmm ; luo2024crystalflow such as FlowMM, and several other modelsqiu2024vqcrystal ; sriram2024flowllm . These general generative machine learning models primarily focus on learning the distribution of crystal structures from training datasets, enabling the sampling of novel structures. Alongside these general generative models, conditional generative models have been developed to generate crystal structures tailored to specific target properties. Examples include MatterGenzeni2023mattergen , MatterGPTchen2024mattergpt , Con-CDVAEcon-cdvae , and Cond-CDVAEcond-cdvae . It is also noteworthy that recent efforts have employed reinforcement learningCFRL ; RLTI or active learningactive1 ; active2 to achieve conditional generation of crystal structures.

While some studies have explored the application of generative models in crystal structure discoverycdvae_super ; cdvae_2d ; cdvae_topo ; wyck_gen , materials with desirable physical properties and practical applications often constitute only a small fraction of known structures. In such cases, conditional generative models offer a more efficient approach than general generative models by guiding the search toward structures that meet specific criteria.

In this paper, we propose a conditional generation framework named PODGen, which stands for using Predictive models to Optimize the Distribution of the Generative model for conditional generation. It can be applied to various generative and predictive models, effectively improving the success rate of generation. Additionally, we have designed a workflow for high-throughput generation of crystal structures, including structure optimization, property verification, and structure deduplication. We demonstrate its application in generating topological insulators, which are crystalline materials with special electronic band structures that enable the formation of protected surface states, exhibiting unique electrical and spin-related propertiestqcwang . In total, 19324 topological insulators and topological crystalline insulators have been generated. Further first-principles calculations on promising materials with potential practical applications identified 12 new dynamically stable (no imaginary phonon modes) crystal structures with desirable properties, among which 5 are located at the bottom of the potential energy surface (PES).

II Method

Figure 1: The fundamental steps of the crystal structure conditional generation framework. This is one step of the Markov Chain Monte Carlo (MCMC) process, where $\pi(\cdot)$ represents the target distribution we aim to sample from. $\mathrm{C}_{t}$ denotes the state (crystal structure) at the $t$-th Markov step, and $\mathrm{C}'$ is the proposed updated state. $P(\cdot)$ is the crystal probability given by the generative model, and $P(y|\cdot)$ is the conditional probability provided by the prediction model, where $y$ represents the target property we specify. The target distribution $\pi(\cdot)$ is determined by $P(\cdot)$ and $P(y|\cdot)$.

II.1 Conditional generation framework

II.1.1 Basic composition

Most widely used general generative models in the field of crystal structure generation, such as autoregressive models, diffusion models, and flow-based models, are probabilistic generative models. Fundamentally, these models generate new crystal structures by learning the distribution of crystal structures present in the training dataset. Generating structures with these models can be understood as sampling from the distribution $P(\mathrm{C})$, where $\mathrm{C}$ represents a crystal structure and $P(\mathrm{C})$ is the learned distribution approximating the true distribution $P^{*}(\mathrm{C})$ observed in the training data. However, when aiming to generate crystals within a specific domain, the objective shifts to sampling from the conditional distribution $P^{*}(\mathrm{C}|y)$, where $y$ denotes the target properties characterizing the materials within that domain. By Bayes' theorem, $P^{*}(\mathrm{C}|y) = P^{*}(\mathrm{C}) P^{*}(y|\mathrm{C}) / P^{*}(y)$. Here, $P^{*}(y)$ does not depend on $\mathrm{C}$, so it acts as a normalization constant that can be ignored.
Therefore, sampling from the distribution $P^{*}(\mathrm{C}|y)$ can be reformulated as sampling from the distribution $\pi^{*}(\mathrm{C}) = P^{*}(\mathrm{C}) P^{*}(y|\mathrm{C})$.
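The reformulation above can be checked numerically on a toy discrete space. The sketch below is illustrative only: the four "structures" and all probability values are made-up numbers, not taken from any model or dataset in this work.

```python
# Toy check: normalizing P(C) * P(y|C) over all structures gives P(C|y).
# The four "structures" and every probability below are made-up numbers.
prior = {"C1": 0.4, "C2": 0.3, "C3": 0.2, "C4": 0.1}            # P(C)
likelihood = {"C1": 0.05, "C2": 0.10, "C3": 0.60, "C4": 0.90}   # P(y|C)

# Unnormalized target pi(C) = P(C) * P(y|C).
pi = {c: prior[c] * likelihood[c] for c in prior}

# P(y) is the sum of pi(C) over all structures, i.e. the normalization constant.
p_y = sum(pi.values())

# Bayes' theorem: P(C|y) = P(C) P(y|C) / P(y).
posterior = {c: pi[c] / p_y for c in pi}

# Ratios of pi and ratios of the posterior agree, so a sampler that only
# uses probability ratios never needs to evaluate P(y).
ratio_pi = pi["C3"] / pi["C1"]
ratio_posterior = posterior["C3"] / posterior["C1"]
```

Because only ratios of the target enter the acceptance rule used later, the unnormalized product is sufficient for sampling.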

Building upon the above analysis, we developed PODGen, a highly transferable and robust conditional generation framework. This framework consists of three key components: (1) a general generative model that provides $P(\mathrm{C})$ to approximate $P^{*}(\mathrm{C})$, (2) multiple predictive models that provide $P(y|\mathrm{C})$ to approximate $P^{*}(y|\mathrm{C})$, and (3) an efficient sampling method, Markov Chain Monte Carlo (MCMC) sampling. The fundamental steps of this framework are illustrated in FIG. 1.

In this framework, the generative model only needs to provide probabilistic estimates of crystal structures, without restrictions on the specific model type. Additionally, most widely used predictive models can be seamlessly integrated into this framework. Classification-based predictive models, which commonly employ cross-entropy loss, inherently yield probability estimates for each structural class. Regression-based predictive models commonly use the mean squared error (MSE) as their loss function, which corresponds to a probabilistic model assuming that the observed data follow a Gaussian distribution centered on the predicted value with a fixed variance. Therefore, most predictive models can provide the probability $P(y|\mathrm{C})$.
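As a minimal sketch of how both kinds of predictor could supply $\log P(y|\mathrm{C})$, the two helpers below convert classifier logits and a regression prediction into log-probabilities. The function names and the choice of `sigma` are assumptions made for illustration; they are not taken from the PODGen code.

```python
import math

def classifier_log_prob(logits, target_index):
    """log P(y = target | C) from raw class logits via a stable log-softmax."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[target_index] - log_norm

def regressor_log_prob(predicted, target, sigma):
    """Log density of `target` under a Gaussian N(predicted, sigma^2).

    `sigma` is the fixed standard deviation implied by an MSE loss; its
    value must be chosen by the user (assumption).
    """
    z = (target - predicted) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
```

Working in log space keeps products of several conditional probabilities numerically stable when they are combined into a target distribution.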

II.1.2 Crystal generation

MCMC sampling is an efficient method for sampling from complex high-dimensional distributions. Similar methods have been used with language models to generate sentences that satisfy certain conditionsmiao2019cgmh ; zhang2020language , or to optimize the resultsong2025llmfeynmanleveraginglargelanguage . MCMC generates a sequence of correlated samples by iteratively transitioning from one state to another based on the transition matrix of a Markov chain. In the context of crystal structure generation, each state corresponds to a specific crystal structurecrystalformer . The Metropolis-Hastings (MH) algorithm enables efficient computation of transition probabilities in a Markov chainHM_M ; HM_H . This algorithm proposes potential new states based on a designed update strategy and then accepts the transition with a probability given by Eqs. 1 and 2. The detailed balance condition established in this way ensures that the samples obtained through MCMC conform to the target distribution.

$$A\left(\mathrm{C}'|\mathrm{C}_{t-1}\right) = \min\left\{1, A^{*}\left(\mathrm{C}'|\mathrm{C}_{t-1}\right)\right\}, \tag{1}$$
$$A^{*}\left(\mathrm{C}'|\mathrm{C}_{t-1}\right) = \frac{\pi\left(\mathrm{C}'\right) P\left(\mathrm{C}_{t-1}|\mathrm{C}'\right)}{\pi\left(\mathrm{C}_{t-1}\right) P\left(\mathrm{C}'|\mathrm{C}_{t-1}\right)}. \tag{2}$$

Here, $\pi(\cdot)$ represents the target probability distribution, $P(\mathrm{C}'|\mathrm{C}_{t-1})$ denotes the probability of proposing a transition from crystal structure $\mathrm{C}_{t-1}$ to crystal structure $\mathrm{C}'$, and $A(\mathrm{C}'|\mathrm{C}_{t-1})$ is the acceptance probability for this proposed transition. If the proposal is accepted, then $\mathrm{C}_{t} = \mathrm{C}'$; otherwise, $\mathrm{C}_{t} = \mathrm{C}_{t-1}$. In this study, we applied this conditional generation framework to the generation of topological insulators, where $\pi(\mathrm{C})$ is specifically defined as:

$$\pi\left(\mathrm{C}\right) = P\left(\mathrm{C}\right)^{k} P\left(\mathrm{TI}|\mathrm{C}\right) P\left(\mathrm{NMet}|\mathrm{C}\right) P\left(\mathrm{NMag}|\mathrm{C}\right), \tag{3}$$
$$k = \begin{cases} e^{0.5}, & \text{if } \exists\, e \in E,\ e \in \mathrm{C} \\ 1, & \text{else} \end{cases}. \tag{4}$$

In this paper, we train a CrystalFormercrystalformer as the general generative model to provide $P(\mathrm{C})$, and three classification modelsdimenet to provide $P(y|\mathrm{C})$, where TI stands for topological insulator, NMet stands for non-metal, and NMag stands for non-magnetic. Since existing rapid identification toolshe2019symtopo ; tqcmethod for crystal topological properties are all symmetry-based, we selected CrystalFormer, which inherently encodes space group and Wyckoff position information. In contrast, the predictive models were not meticulously curated or extensively trained, further demonstrating the robustness of our framework. For more information about these models, please refer to Supplementary Section S1. Through analysis of the databaseszhang2019catalogue ; tqc , we find that topological insulators are more likely to be found in crystals containing elements from the set $E = \{\mathrm{Al, P, S, Ga, Ge, As, Se, In, Sn, Sb, Te, Pb, Bi}\}$. Therefore, we introduce $k$ to modify the crystal probability $P(\mathrm{C})$, aiming to generate crystals containing these elements with a higher probability.
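A minimal sketch of evaluating the target distribution of Eqs. 3 and 4 in log form is given below. It reads Eq. 3 literally, with $k$ as an exponent on $P(\mathrm{C})$. The function names and the way the model outputs are passed in (a log-probability from the generative model and three class probabilities from the classifiers) are assumptions for illustration, not the interface of the actual implementation.

```python
import math

# Element set E from the text.
E = {"Al", "P", "S", "Ga", "Ge", "As", "Se", "In", "Sn", "Sb", "Te", "Pb", "Bi"}

def exponent_k(elements):
    """Eq. 4: k = e^0.5 if the crystal contains any element of E, else 1."""
    return math.exp(0.5) if E & set(elements) else 1.0

def log_target(elements, log_p_crystal, p_ti, p_nmet, p_nmag):
    """Eq. 3 in log form.

    log pi(C) = k * log P(C) + log P(TI|C) + log P(NMet|C) + log P(NMag|C).
    The four model outputs are supplied by the caller (hypothetical interface).
    """
    return (exponent_k(elements) * log_p_crystal
            + math.log(p_ti) + math.log(p_nmet) + math.log(p_nmag))
```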

We employ three types of proposals, each corresponding to modifications in atomic species, atomic coordinates, and lattice constants. At each step of the Markov chain, one of these proposals is randomly selected with probabilities of 0.2, 0.4, and 0.4, respectively. When modifying atomic species, we first select a Wyckoff position from the current configuration with equal probability and then replace the atomic species at that position with a randomly chosen element. For atomic coordinate modifications, we apply Gaussian noise to the fractional coordinates of all atoms with degrees of freedom, while respecting Wyckoff position constraints, which may prevent certain atomic coordinates from being altered. Similarly, when modifying lattice constants, Gaussian noise is added to all adjustable lattice parameters, subject to space group constraints, which may restrict changes to certain lattice constants.
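The proposal scheme and the acceptance rule of Eqs. 1 and 2 can be sketched on a toy state as follows. This is an illustration under simplifying assumptions: the state is a plain dictionary, Wyckoff and space-group constraints are ignored, the element list and noise width `sigma` are arbitrary, and `toy_log_pi` is a hypothetical stand-in for the real target. Since each toy proposal is symmetric, the proposal probabilities in Eq. 2 cancel and $A^{*}$ reduces to $\pi(\mathrm{C}')/\pi(\mathrm{C}_{t-1})$.

```python
import math
import random

PROPOSAL_TYPES = ("species", "coordinates", "lattice")
PROPOSAL_WEIGHTS = (0.2, 0.4, 0.4)  # selection probabilities from the text

def propose(state, rng, elements=("Bi", "Se", "Te", "Sb"), sigma=0.05):
    """Return a modified copy of `state` using one randomly chosen move type."""
    new = {key: list(value) for key, value in state.items()}
    move = rng.choices(PROPOSAL_TYPES, weights=PROPOSAL_WEIGHTS)[0]
    if move == "species":
        site = rng.randrange(len(new["species"]))    # uniform over sites
        new["species"][site] = rng.choice(elements)  # uniform over elements
    elif move == "coordinates":
        # Gaussian noise on fractional coordinates, wrapped back into the cell.
        new["coords"] = [(x + rng.gauss(0.0, sigma)) % 1.0 for x in new["coords"]]
    else:
        new["lattice"] = [a + rng.gauss(0.0, sigma) for a in new["lattice"]]
    return new

def mh_step(state, log_pi, rng):
    """One Metropolis-Hastings step with symmetric proposals."""
    candidate = propose(state, rng)
    log_ratio = log_pi(candidate) - log_pi(state)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate, True
    return state, False

def toy_log_pi(state):
    """Hypothetical target: favors Bi sites and lattice parameters near 4.0."""
    n_bi = sum(1 for s in state["species"] if s == "Bi")
    return n_bi - sum((a - 4.0) ** 2 for a in state["lattice"])

rng = random.Random(0)
state = {"species": ["Se", "Se"], "coords": [0.1, 0.6], "lattice": [5.0]}
accepted = 0
for _ in range(2000):
    state, ok = mh_step(state, toy_log_pi, rng)
    accepted += ok
```

In a real setting the acceptance test would call the generative and predictive models through the target distribution, and a proposal that is not symmetric would require the full ratio of Eq. 2.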

Figure 2: One of the annealing MCMC convergence curves is presented, where the horizontal axis represents the Markov step count, and the vertical axis denotes the logarithmic probability of the target distribution at the corresponding temperature. (a) The highest temperature T = 10. (b) The lowest temperature T = 1.

To prevent the generated structures from being confined to known regions of configuration space, we incorporate a simulated annealing approach with ten temperature levels ranging from T = 10 to T = 1. The process begins by randomly selecting a crystal structure from the Alexand20schmidt2022dataset ; schmidt2022large (refer to Supplementary Section S1) as the initial state. Starting from T = 10, we allow the system to equilibrate at each temperature before gradually cooling to the next level, continuing until convergence is reached at T = 1, at which point sampling is performed. Convergence at each temperature is determined based on the rolling window mean and standard deviation of $\log \pi_{T}(\mathrm{C})$, with a window size of 200 steps and a tolerance of 1e-3. Fig. 2 presents the convergence curves at T = 10 and T = 1. During the sampling phase, we record a sample every 100 Markov steps.
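The annealing schedule and convergence test can be sketched as below. Several details are assumptions, since the text does not specify them: the temperature levels are taken to be evenly spaced, the tempered target is taken to be $\pi_{T}(\mathrm{C}) = \pi(\mathrm{C})^{1/T}$, and convergence is taken to mean that the rolling mean and standard deviation change by less than the tolerance (relative to the magnitude of the latest mean) between two consecutive windows.

```python
from statistics import mean, pstdev

TEMPERATURES = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]  # ten levels; even spacing is an assumption
WINDOW = 200        # rolling-window size from the text
TOLERANCE = 1e-3    # tolerance from the text
SAMPLE_EVERY = 100  # Markov steps between recorded samples, from the text

def tempered_log_pi(log_pi_value, temperature):
    """Assumed tempering: pi_T(C) = pi(C)^(1/T), i.e. log pi_T = log pi / T."""
    return log_pi_value / temperature

def is_converged(trace, window=WINDOW, tol=TOLERANCE):
    """Assumed criterion: mean and std of the last two windows agree within tol."""
    if len(trace) < 2 * window:
        return False
    prev, last = trace[-2 * window:-window], trace[-window:]
    scale = max(abs(mean(last)), 1.0)
    mean_shift = abs(mean(last) - mean(prev)) / scale
    std_shift = abs(pstdev(last) - pstdev(prev)) / scale
    return mean_shift < tol and std_shift < tol

def thin(chain, every=SAMPLE_EVERY):
    """Keep one state out of every `every` Markov steps."""
    return chain[::every]
```

A driver loop would append $\log \pi_{T}$ to `trace` after each Markov step, move to the next temperature once `is_converged` returns true, and apply `thin` to the chain collected at T = 1.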

II.2 Crystal generation workflow

Figure 3: The core workflow of this study involves utilizing a conditional generation framework, which incorporates a general generative model, to generate crystal structures. Subsequently, various tools are employed for structural optimization, screening, and evaluation. Finally, materials with promising application potential undergo further first-principles calculations for validation.

We have designed a workflow for high-throughput generation of crystal structures in specific domains, as illustrated in FIG. 3. This workflow encompasses conditional crystal structure generation, machine learning force field relaxation, crystal property evaluation, and structure deduplication. We applied this workflow to the generation of topological insulators and further validated the most promising candidate materials through first-principles calculations.

The workflow integrates our conditional generation framework PODGen, a general machine learning force field (MLFF) OpenLAMpeng2025openlam , a symmetry-based topological classification tool SymTopohe2019symtopo , and first-principles calculation tools such as VASPkresse1996efficiency , along with software packages including pymatgenong2013python , ASElarsen2017atomic , VASPkitwang2021vaspkit , and Phonopyphonopy-phono3py-JPCM ; phonopy-phono3py-JPSJ . This workflow is transferable to the exploration of other condition-dependent crystal materials by simply replacing the corresponding prediction models and crystal property evaluation tools.

In this high-throughput generation workflow, we first use an MLFF model to relax each generated structure. The MLFF relaxation closely approximates first-principles results while being approximately three orders of magnitude more efficient. We employed the OpenLAM model released in October 2024peng2025openlam . Then we use SymTopohe2019symtopo to quickly verify the topological properties of the new crystals. SymTopo is an automated tool for calculating the topological properties of nonmagnetic crystalline materials. Finally, we use the StructureMatcher module from pymatgenong2013python to determine whether two structures are similar. The scope of duplicate checking is the combined datasetheyu of Materiaezhang2019catalogue and TQCtqc , as well as the newly generated crystals. For more details, please refer to Supplementary Section S2.
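The scope of the duplicate check, against both a reference dataset and the crystals already kept, can be sketched with a greedy filter. To stay self-contained, the block uses a hypothetical stand-in predicate `toy_match` on made-up records; in the workflow described above, the similarity test is performed by pymatgen's StructureMatcher instead.

```python
def deduplicate(generated, reference, same_structure):
    """Keep generated entries that match neither the reference set nor an earlier kept entry."""
    unique = []
    for candidate in generated:
        known = any(same_structure(candidate, ref) for ref in reference)
        repeated = any(same_structure(candidate, kept) for kept in unique)
        if not (known or repeated):
            unique.append(candidate)
    return unique

def toy_match(a, b):
    """Stand-in predicate: toy records match if formula and space group agree."""
    return a["formula"] == b["formula"] and a["spacegroup"] == b["spacegroup"]

# Made-up records, used only to exercise the filter.
reference = [{"formula": "A2B3", "spacegroup": 1}]
generated = [
    {"formula": "A2B3", "spacegroup": 1},  # already in the reference set
    {"formula": "ABC", "spacegroup": 2},
    {"formula": "ABC", "spacegroup": 2},   # repeat of the previous entry
    {"formula": "ABC", "spacegroup": 3},
]
unique = deduplicate(generated, reference, toy_match)
```

The pairwise comparison is quadratic in the number of kept structures, so a large run would likely pre-group candidates (for example by composition) before matching; that grouping is a design suggestion, not something stated in the text.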

II.3 Promising Crystals Validation

For structures with a direct band gap that are classified as topological insulators or topological crystalline insulators after MLFF relaxation, we further verify their properties and stability using first-principles calculations. This process includes DFT relaxation, reconfirmation of topological classification using SymTopo, and phonon spectrum analysis. To ensure computational efficiency and reproducibility, we primarily use pymatgen and VASPkit to generate input files for VASP, with all calculations performed using VASP 5.4.4.

We divide the relaxation process into two steps. In the first step, we use VASP input files generated by pymatgen for relaxation. However, since the default convergence criteria in pymatgen are not stringent enough (EDIFF is typically set on the order of 1e-3), we refine the relaxation in a second step. Once the first relaxation succeeds, we modify EDIFF to 1e-5 and EDIFFG to -1e-3 in the second step, ensuring that the atomic forces are reduced to below 1e-3 eV/Å.
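The second-step settings can be expressed as a small set of overrides applied to the first-step input. In the sketch below, only the EDIFF and EDIFFG values come from the text; the `step1` dictionary is a hypothetical placeholder for the generator defaults, and the helper names are illustrative.

```python
# Step-2 overrides stated in the text. A negative EDIFFG makes VASP stop when
# all forces fall below its absolute value (in eV/Angstrom).
STEP2_OVERRIDES = {"EDIFF": 1e-5, "EDIFFG": -1e-3}

def apply_overrides(incar, overrides):
    """Return a copy of an INCAR-like dict with the given tags replaced."""
    updated = dict(incar)
    updated.update(overrides)
    return updated

def render_incar(incar):
    """Format a dict as 'TAG = value' lines, one per tag, sorted by tag name."""
    return "\n".join(f"{tag} = {value}" for tag, value in sorted(incar.items()))

# Hypothetical step-1 settings standing in for the generated defaults.
step1 = {"EDIFF": 1e-3, "IBRION": 2, "ISIF": 3, "NSW": 99}
step2 = apply_overrides(step1, STEP2_OVERRIDES)
```

Keeping the overrides separate from the defaults makes it easy to rerun only the tighter second step from the structure produced by the first.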

If both relaxation steps converge, we use SymTopo to reassess the topological properties and band gap of the relaxed structure. For structures that remain topological insulators with a direct band gap, we further compute their electronic band structure and phonon spectrum. For band structure calculations, we directly use the CHGCAR and Fermi energy obtained from SymTopo’s SCF calculation, along with the high-symmetry path generated by VASPkit, to plot the band structure. For phonon spectrum calculations, we employ a 2×2×2 supercell and the DFPT method. Except for setting ENCUT to 1.3 times ENMAX, all other parameters are generated by VASPkit. Most DFT calculations in this paper use the PBE exchange-correlation functionalPBE , and some of the results obtained using the HSE06 functionalhse06 can be found in the Supplementary Information.

To further evaluate the synthesizability of these structures, we employed the Stochastic Surface Walking (SSW) methodssw ; sswcrystal to explore their potential energy surfaces. This not only allowed us to determine the locations of the generated crystal structures on the PES, but also enabled the discovery of more stable configurations.

III Result

III.1 Topological Material Condition Generation

Figure 4: (a) The elemental occurrence frequency distribution among the generated TI and TCI materials, and (b) The formation energy distribution predicted by OpenLAM. (c) Further calculations and screening performed on 104 TI and TCI materials with direct band gaps.
Table 1: The proportions of topological insulators and topological crystalline insulators, as well as their ratio, in both general generation and conditional generation using topological insulators as constraint.
Method                  TI      TCI    TI:TCI
General generation      2.85%   2.45%  1.16:1
Conditional generation  15.25%  9.93%  1.62:1

Using the method described above, we generated 84726 crystals, 78110 of which were successfully relaxed by OpenLAM, with the maximum atomic force falling below 0.02 eV/Å and a predicted formation energy smaller than 1.0 eV/atom. SymTopo then returned a topological classification for 78575 of them. After removing duplicate structures, 11914 unique crystals are classified as TI and 7336 unique crystals are classified as TCI, corresponding to proportions of 15.25% and 9.93%. Here, 68 TIs are considered to have direct band gaps, among which 63 are also regarded as having indirect band gaps. Among the 36 TCIs considered to have direct band gaps, 34 are also regarded as having indirect band gaps.

We also explored the direct generation of crystal structures using CrystalFormer for topological material screening. Among the 2000 generated materials, only 57 were identified as TI and 49 as TCI, corresponding to probabilities of 2.85% and 2.45%, respectively. This generation efficiency is significantly lower than that achieved through conditional generation. More importantly, no gapped TIs were found among them. Furthermore, we observed that in the absence of conditional generation, the ratio of TI to TCI was 1.16:1. However, when generating materials conditioned on TI (excluding TCI), this ratio increased to 1.62:1. As shown in Supplementary Table S1, the topological classification model we employed is a relatively basic one, and TI and TCI are known to be categories that are prone to misclassification by predictive models. Nevertheless, our conditional generation framework significantly improves both the success rate and the proportion of materials with the desired topological properties, demonstrating its robustness. We believe that employing a state-of-the-art (SOTA) predictive model with refined training will further enhance generation efficiency.
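The baseline rates, the TI:TCI ratios, and the enhancement factor quoted above follow from simple arithmetic on the reported numbers. The short check below uses only counts and percentages stated in the text; the variable names are illustrative.

```python
# Baseline: unconditional CrystalFormer sampling (counts from the text).
n_general = 2000
ti_general, tci_general = 57, 49

ti_rate_general = ti_general / n_general      # fraction of TIs in the baseline
tci_rate_general = tci_general / n_general    # fraction of TCIs in the baseline
ratio_general = ti_general / tci_general      # TI:TCI ratio in the baseline

# Conditional generation: unique TI and TCI counts and the TI rate from the text.
ti_cond, tci_cond = 11914, 7336
ti_rate_cond = 0.1525
ratio_cond = ti_cond / tci_cond               # TI:TCI ratio under conditioning

# Enhancement of the TI success rate relative to the baseline.
enhancement = ti_rate_cond / ti_rate_general
```

Evaluating these expressions gives baseline rates of 2.85% and 2.45%, ratios of about 1.16 and 1.62, and an enhancement slightly above 5.3.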

We conducted a statistical analysis of the 19,250 generated TI and TCI materials. FIG. 4(a) shows the occurrence frequency of each element in these materials. Compared to the CrystalFormer training set (Supplementary Fig. S2), the elemental distribution of the generated crystals has been substantially altered, resembling more closely the distribution of topological insulators in the existing databaseheyu (Supplementary Fig. S4). This indicates that our conditional generation framework effectively adjusts the baseline distribution of CrystalFormer toward the target distribution characteristic of topological materials. Further analysis reveals that, although elements such as B and Ge maintain high occurrence frequencies similar to those in the existing database, the most prevalent element shifted from O in the original database to H in the newly generated materials. This shift demonstrates that our framework not only aligns with the existing distribution but also explores new compositional spaces beyond the limitations of the original datasets, leading to the discovery of novel crystal structures.

FIG. 4(b) presents the formation energy distribution of the 19,250 generated TI and TCI materials, with formation energies predicted concurrently during relaxation using OpenLAM. Although the distribution does not fully align with the ideal scenario where all formation energies are negative, it closely resembles the formation energy distribution of the CrystalFormer training set (Supplementary Fig. S3). This observation suggests that for properties not explicitly constrained by our conditional generation framework—such as formation energy, which has little direct correlation with topological properties—the generated crystal structures largely adhere to the inherent distribution of the base model.

III.2 Promising Materials Validation

Figure 5: Four crystal structures that remain topological (crystalline) insulators with a direct band gap after further relaxation via DFT and exhibit phonon spectra without imaginary frequencies as calculated by DFPT. The topological classification and symmetry-based indicators for these structures are provided by SymTopohe2019symtopo . Ehull represents the energy above hull, which is obtained from DFT calculations.

Although materials without a band gap can still be identified as TIs through symmetry-based topological classification methods [he2019symtopo; tqcmethod], gapped TIs are generally considered more promising for practical applications. Therefore, from the generated TI and TCI materials, we selected 104 candidates with a direct band gap for further validation through first-principles calculations. The verification process is outlined in FIG. 4(c). Among these materials, 88 were successfully relaxed, and 50 retained their direct band gap as TI or TCI after relaxation. Notably, 12 of these materials exhibited phonon spectra without imaginary frequencies. The crystal structures, SymTopo classification results, topological indices, band structures, and phonon spectra of these 12 materials are presented in FIG. 5 and Supplementary Fig. S6.
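The attrition along this validation funnel can be made explicit with a short calculation. The Python sketch below uses only the four counts quoted above (104, 88, 50, 12); the stage labels and the function name are illustrative. It gives a retention of roughly 85% at relaxation, 57% at the gapped TI/TCI check, and 24% at the phonon check, for an overall yield of about 11.5% of the selected candidates.

```python
# Candidate counts at each validation stage, as quoted in the text.
STAGES = [
    ("selected (direct band gap)", 104),
    ("successfully relaxed", 88),
    ("still gapped TI/TCI after relaxation", 50),
    ("no imaginary phonon frequencies", 12),
]


def funnel(stages):
    """Return (name, count, fraction of previous stage, fraction of first stage)."""
    rows = []
    first = stages[0][1]
    previous = first
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


for name, count, step_fraction, overall_fraction in funnel(STAGES):
    print(f"{name:40s} {count:4d}  step {step_fraction:6.1%}  overall {overall_fraction:6.1%}")
```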

Figure 6: PES contour plots of five newly generated materials obtained from SSW, where the vertical axis represents the relative energy and the horizontal axis denotes the structural descriptor. ‘DOS’ indicates the number of times each structure was found. The red crosses mark the positions of these five materials on their respective potential energy surfaces.

We applied the SSW method to explore the PES of these 12 materials in order to further evaluate their experimental synthesizability. As shown in FIG. 6, five of these materials are located at the bottom of their PES and exhibit negative energy above hull (as shown in FIG. 5 and Supplementary Fig. S6), indicating a higher likelihood of experimental realization. The PES landscapes of the remaining materials are shown in Supplementary Fig. S7.

III.3 WannierTools confirmation

Figure 7: Boundary-state spectra for four materials, each plotted along the indicated high-symmetry lines. (a) $\mathrm{BaMo_{6}As_{2}Se_{6}}$ with an open boundary on the (010) surface, (b) $\mathrm{RbSrBi}$ with an open boundary on the (100) surface, (c) $\mathrm{CsHgSb}$ with an open boundary on the (100) surface, and (d) $\mathrm{Ca_{2}AgAs}$ with an open boundary on the (010) surface. The red curves highlight the in-gap boundary modes.

To further validate our results, we selected several materials and constructed Wannier tight-binding models [tight1; tight2; tight3] using Wannier90 [wannier90]. We then used the WannierTools [wu2018wanniertools] package to calculate surface states and the Wilson loops [z21; z22; yu2011equivalent] based on these Wannier models. In our analysis, open boundary conditions were imposed along different crystallographic directions: FIG. 7(a) presents the boundary-state spectrum of $\mathrm{BaMo_{6}As_{2}Se_{6}}$ with an open boundary on the (010) surface, whereas FIG. 7(b) and FIG. 7(c) show the spectra of $\mathrm{RbSrBi}$ and $\mathrm{CsHgSb}$, respectively, with an open boundary on the (100) surface, and FIG. 7(d) displays $\mathrm{Ca_{2}AgAs}$ with an open boundary on the (010) surface. As illustrated in the figures, pronounced in-gap boundary states are clearly observed, indicating that these materials are topological (crystalline) insulators. Detailed Wilson loop calculations can be found in Supplementary Fig. S5.

IV Conclusion

Crystal structure generation models are powerful tools for discovering novel crystalline materials. However, when searching for structures with specific properties, conditional generation methods can significantly enhance efficiency. In this study, we developed PODGen, a highly transferable and robust conditional generation framework that integrates a general crystal generation model with multiple property prediction models. Any generative model capable of providing crystal structure probabilities, together with most existing predictive models, can be seamlessly incorporated into this framework, which imposes minimal requirements on their predictive capabilities. Moreover, once the base model is trained, conditional generation can be performed simply by training an appropriate predictive model for any specific domain. This significantly reduces the dependence on large domain-specific datasets and lowers training costs.

For properties explicitly conditioned by a predictive model (or those with strong correlations), our approach effectively guides the base model—originally trained to follow the distribution of the training dataset—toward generating structures that conform to a desired target distribution. Conversely, for properties without an associated predictive model (or those with weak correlations), the generated structures continue to follow the original training distribution.
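The reweighting described here can be illustrated with a toy Metropolis-Hastings sampler. The sketch below is not the PODGen implementation: it assumes, for illustration only, that the target weight of a state is the product of a base-model probability and a predictor probability, and it uses four abstract states with made-up probabilities in place of crystal structures. All names and numbers are hypothetical.

```python
import random

# Hypothetical toy state space standing in for crystal structures.
STATES = ["A", "B", "C", "D"]
# Hypothetical base-model probabilities p_base(x).
P_BASE = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
# Hypothetical predictor probabilities p(property | x).
P_PROP = {"A": 0.05, "B": 0.10, "C": 0.60, "D": 0.90}


def target_weight(x):
    """Unnormalized target: base probability times predictor probability (assumed form)."""
    return P_BASE[x] * P_PROP[x]


def metropolis_hastings(n_steps, seed=0):
    """Sample from the target with a symmetric, uniform proposal over STATES."""
    rng = random.Random(seed)
    x = rng.choice(STATES)
    samples = []
    for _ in range(n_steps):
        proposal = rng.choice(STATES)
        # With a symmetric proposal the acceptance ratio reduces to the weight ratio.
        accept = min(1.0, target_weight(proposal) / target_weight(x))
        if rng.random() < accept:
            x = proposal
        samples.append(x)
    return samples


def empirical(samples):
    """Empirical frequency of each state in the chain."""
    return {s: samples.count(s) / len(samples) for s in STATES}


freq = empirical(metropolis_hastings(50_000, seed=1))
print({s: round(f, 3) for s, f in freq.items()})
```

In this toy setting the base model favours state A, while the chain spends most of its time in C and D, where the predictor probability is high. A property that is uncorrelated with the predictor would leave the relative base weights unchanged, which is the behaviour described above for unconstrained properties.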

We applied this framework to the conditional generation of topological insulator materials, achieving a success rate 5.35 times higher than that of conventional generation models. More importantly, the stricter the property constraints on the generated crystals, the greater the advantage of our framework over general generative models. For example, in generating gapped TIs, our framework achieves success where general methods almost entirely fail, which amounts to an effectively infinite improvement. Using this method, we generated over 80,000 structures, nearly 20,000 of which were identified as TI or TCI. Further first-principles calculations were performed on the subset with direct band gaps, leading to the identification of 12 materials with promising application potential. Five of these structures are located near the global minima of the PES, suggesting a higher likelihood of experimental synthesis. Finally, we used WannierTools to verify the topological character of selected materials.

Certainly, there remains room for improvement in our framework. We adopted CrystalFormer as the base model; however, during the MCMC state updates, we only modified atomic species, atomic positions, and lattice constants, while leaving Wyckoff positions and space groups unchanged. This limitation arises because, in the string-based crystal structure representation used by CrystalFormer, Wyckoff positions are interdependent, requiring a more sophisticated update strategy. Additionally, modifications to the space group would fundamentally alter the entire structural representation of CrystalFormer.
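The restriction on the update moves can be pictured with a small proposal function. The following Python sketch is purely illustrative and does not reproduce the PODGen or CrystalFormer code: the dictionary layout, the field names, the element pool, and the step size are all hypothetical. It perturbs one atomic species, one fractional coordinate triple, or one lattice constant, and never touches the space group or the Wyckoff labels. A real implementation would also have to respect the constraints that the Wyckoff site and the crystal system place on coordinates and lattice parameters, which this sketch ignores.

```python
import copy
import random


def propose(structure, rng, elements=("Bi", "Sb", "Se", "Te"), step=0.05):
    """Return a perturbed copy; space group and Wyckoff labels are left unchanged."""
    new = copy.deepcopy(structure)
    move = rng.choice(["species", "position", "lattice"])
    site = rng.randrange(len(new["species"]))
    if move == "species":
        # Swap the element on one site for one drawn from a hypothetical pool.
        new["species"][site] = rng.choice(elements)
    elif move == "position":
        # Shift the fractional coordinates slightly and wrap them back into [0, 1).
        new["frac_coords"][site] = [
            (c + rng.uniform(-step, step)) % 1.0 for c in new["frac_coords"][site]
        ]
    else:
        # Rescale one lattice constant by a small relative amount.
        key = rng.choice(["a", "b", "c"])
        new["lattice"][key] *= 1.0 + rng.uniform(-step, step)
    return new


# Hypothetical example structure; the values are placeholders, not a real material.
example = {
    "space_group": 166,
    "wyckoff": ["3a", "6c"],
    "species": ["Bi", "Se"],
    "frac_coords": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.4]],
    "lattice": {"a": 4.1, "b": 4.1, "c": 28.6},
}
candidate = propose(example, random.Random(0))
print(candidate["space_group"], candidate["wyckoff"])
```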

V Code and Data

Our code is available at http://github.com/cyye001/PODGen. The dataset of generated crystals will be hosted on the website of the Condensed Matter Physics Data Center of the Chinese Academy of Sciences (http://cmpdc.iphy.ac.cn/materialsgalaxy/#/services/materials) and can be downloaded from the Electronic Laboratory for Material Science (http://in.iphy.ac.cn/eln/link.html#/113/G9f5).

References

  • (1) Zhang, D. et al. Dealing with the foreign-body response to implanted biomaterials: strategies and applications of new materials. Advanced Functional Materials 31, 2007226 (2021).
  • (2) Hu, X., Deng, Z., Lin, X., Xie, Y. & Teodorescu, R. Research directions for next-generation battery management solutions in automotive applications. Renewable and Sustainable Energy Reviews 152, 111695 (2021).
  • (3) Disa, A. S., Nova, T. F. & Cavalleri, A. Engineering crystal structures with light. Nature Physics 17, 1087–1092 (2021).
  • (4) Butt, M., Khonina, S. N. & Kazanskiy, N. Recent advances in photonic crystal optical devices: A review. Optics & laser technology 142, 107265 (2021).
  • (5) Chaudhary, V. S., Kumar, D., Pandey, B. P. & Kumar, S. Advances in photonic crystal fiber-based sensor for detection of physical and biochemical parameters—a review. IEEE sensors journal 23, 1012–1023 (2022).
  • (6) Agrawal, A. & Choudhary, A. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. Apl Materials 4 (2016).
  • (7) Fang, J. et al. Machine learning accelerates the materials discovery. Materials Today Communications 33, 104900 (2022).
  • (8) Wang, Z., Hua, H., Lin, W., Yang, M. & Tan, K. C. Crystalline material discovery in the era of artificial intelligence (2025). URL http://arxiv.org/abs/2408.08044. eprint 2408.08044.
  • (9) Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters 120, 145301 (2018).
  • (10) Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph networks as a universal machine learning framework for molecules and crystals. Chemistry of Materials 31, 3564–3572 (2019).
  • (11) Choudhary, K. & DeCost, B. Atomistic line graph neural network for improved materials property predictions. npj Computational Materials 7, 185 (2021).
  • (12) Gasteiger, J., Giri, S., Margraf, J. T. & Günnemann, S. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. arXiv preprint arXiv:2011.14115 (2020).
  • (13) Gasteiger, J., Becker, F. & Günnemann, S. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems 34, 6790–6802 (2021).
  • (14) Ruff, R., Reiser, P., Stühmer, J. & Friederich, P. Connectivity optimized nested line graph networks for crystal structures. Digital Discovery 3, 594–601 (2024).
  • (15) Liang, C. et al. Material symmetry recognition and property prediction accomplished by crystal capsule representation. Nature Communications 14, 5198 (2023).
  • (16) Chen, C. & Ong, S. P. A universal graph deep learning interatomic potential for the periodic table. Nature Computational Science 2, 718–728 (2022).
  • (17) Deng, B. et al. Chgnet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nature Machine Intelligence 5, 1031–1041 (2023).
  • (18) Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).
  • (19) Liang, C. et al. A cluster-based deep learning model perceiving series correlation for accurate prediction of phonon spectrum. Advanced Science 11, 2406183 (2024).
  • (20) Yang, H. et al. Mattersim: A deep learning atomistic model across elements, temperatures and pressures. arXiv preprint arXiv:2405.04967 (2024).
  • (21) Barroso-Luque, L. et al. Open materials 2024 (omat24) inorganic materials dataset and models. arXiv preprint arXiv:2410.12771 (2024).
  • (22) Yin, B. et al. Alphanet: Scaling up local frame-based atomistic foundation model. arXiv preprint arXiv:2501.07155 (2025).
  • (23) Zhong, Y., Yu, H., Su, M., Gong, X. & Xiang, H. Transferable equivariant graph neural networks for the hamiltonians of molecules and solids. npj Computational Materials 9, 182 (2023).
  • (24) Zhong, Y. et al. Universal machine learning kohn–sham hamiltonian for materials. Chinese Physics Letters 41, 077103 (2024).
  • (25) Wang, Y. et al. Deeph-2: enhancing deep-learning electronic structure via an equivariant local-coordinate transformer. arXiv preprint arXiv:2401.17015 (2024).
  • (26) Gong, X. et al. General framework for e (3)-equivariant neural network representation of density functional theory hamiltonian. Nature Communications 14, 2848 (2023).
  • (27) Liu, J., Fang, Z., Weng, H. & Wu, Q. Dpmoire: A tool for constructing accurate machine learning force fields in moiré systems (2025). URL http://arxiv.org/abs/2412.19333. eprint 2412.19333.
  • (28) Xie, T., Fu, X., Ganea, O.-E., Barzilay, R. & Jaakkola, T. Crystal diffusion variational autoencoder for periodic material generation. arXiv preprint arXiv:2110.06197 (2021).
  • (29) Jiao, R. et al. Crystal structure prediction by joint equivariant diffusion. Advances in Neural Information Processing Systems 36, 17464–17497 (2023).
  • (30) Jiao, R., Huang, W., Liu, Y., Zhao, D. & Liu, Y. Space group constrained crystal generation. arXiv preprint arXiv:2402.03992 (2024).
  • (31) Joshi, C. K. et al. All-atom diffusion transformers: Unified generative modelling of molecules and materials. arXiv preprint arXiv:2503.03965 (2025).
  • (32) Gebauer, N., Gastegger, M. & Schütt, K. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. Advances in neural information processing systems 32 (2019).
  • (33) Flam-Shepherd, D. & Aspuru-Guzik, A. Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files. arXiv preprint arXiv:2305.05708 (2023).
  • (34) Cao, Z., Luo, X., Lv, J. & Wang, L. Space group informed transformer for crystalline materials generation. arXiv preprint arXiv:2403.15734 (2024).
  • (35) Miller, B. K., Chen, R. T., Sriram, A. & Wood, B. M. Flowmm: Generating materials with riemannian flow matching. In Forty-first International Conference on Machine Learning (2024).
  • (36) Luo, X. et al. Crystalflow: A flow-based generative model for crystalline materials. arXiv preprint arXiv:2412.11693 (2024).
  • (37) Qiu, Z. et al. Vqcrystal: Leveraging vector quantization for discovery of stable crystal structures. arXiv preprint arXiv:2409.06191 (2024).
  • (38) Sriram, A., Miller, B., Chen, R. T. & Wood, B. Flowllm: Flow matching for material generation with large language models as base distributions. Advances in Neural Information Processing Systems 37, 46025–46046 (2024).
  • (39) Zeni, C. et al. A generative model for inorganic materials design. Nature 1–3 (2025).
  • (40) Chen, Y. et al. Mattergpt: A generative transformer for multi-property inverse design of solid-state materials. arXiv preprint arXiv:2408.07608 (2024).
  • (41) Ye, C.-Y., Weng, H.-M. & Wu, Q.-S. Con-cdvae: A method for the conditional generation of crystal structures. Computational Materials Today 1, 100003 (2024).
  • (42) Luo, X. et al. Deep learning generative model for crystal structure prediction. npj Computational Materials 10, 254 (2024).
  • (43) Cao, Z. & Wang, L. Crystalformer-rl: Reinforcement fine-tuning for materials design (2025). URL http://arxiv.org/abs/2504.02367. eprint 2504.02367.
  • (44) Xu, H., Qian, D., Liu, Z., Jiang, Y. & Wang, J. Design topological materials by reinforcement fine-tuned generative model (2025). URL http://arxiv.org/abs/2504.13048. eprint 2504.13048.
  • (45) Li, Z., Liu, S., Ye, B., Srolovitz, D. J. & Wen, T. Active learning for conditional inverse design with crystal generation and foundation atomic models (2025). URL http://arxiv.org/abs/2502.16984. eprint 2502.16984.
  • (46) Han, X.-Q. et al. Invdesflow: An ai-driven materials inverse design workflow to explore possible high-temperature superconductors. Chinese Physics Letters (2025). URL http://iopscience.iop.org/article/10.1088/0256-307X/42/4/047301.
  • (47) Choudhary, K. & Garrity, K. Designing high-tc superconductors with bcs-inspired screening, density functional theory, and deep-learning. npj Computational Materials 8, 244 (2022).
  • (48) Lyngby, P. & Thygesen, K. S. Data-driven discovery of 2d materials by deep generative models. npj Computational Materials 8, 232 (2022).
  • (49) Hong, T. et al. Discovery of new topological insulators and semimetals using deep generative models. npj Quantum Materials 10, 12 (2025).
  • (50) Yamazaki, S. et al. Multi-property directed generative design of inorganic materials through wyckoff-augmented transfer learning (2025). URL http://arxiv.org/abs/2503.16784. eprint 2503.16784.
  • (51) Bradlyn, B. et al. Topological quantum chemistry. Nature 547, 298–305 (2017).
  • (52) Miao, N., Zhou, H., Mou, L., Yan, R. & Li, L. Cgmh: Constrained sentence generation by metropolis-hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 6834–6842 (2019).
  • (53) Zhang, M., Jiang, N., Li, L. & Xue, Y. Language generation via combinatorial constraint satisfaction: A tree search enhanced monte-carlo approach. arXiv preprint arXiv:2011.12334 (2020).
  • (54) Song, Z. et al. Llm-feynman: Leveraging large language models for universal scientific formula and theory discovery (2025). URL http://arxiv.org/abs/2503.06512. eprint 2503.06512.
  • (55) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equation of state calculations by fast computing machines. The journal of chemical physics 21, 1087–1092 (1953).
  • (56) Hastings, W. K. Monte carlo sampling methods using markov chains and their applications. Biometrika 57, 97–109 (1970). URL http://doi.org/10.1093/biomet/57.1.97.
  • (57) He, Y. et al. Symtopo: An automatic tool for calculating topological properties of nonmagnetic crystalline materials. Chinese Physics B 28, 087102 (2019).
  • (58) Vergniory, M. et al. A complete catalogue of high-quality topological materials. Nature 566, 480–485 (2019).
  • (59) Zhang, T. et al. Catalogue of topological electronic materials. Nature 566, 475–479 (2019).
  • (60) Vergniory, M. G. et al. All topological bands of all nonmagnetic stoichiometric materials. Science 376, eabg9094 (2022).
  • (61) Schmidt, J., Wang, H.-C., Cerqueira, T. F., Botti, S. & Marques, M. A. A dataset of 175k stable and metastable materials calculated with the pbesol and scan functionals. Scientific Data 9, 64 (2022).
  • (62) Schmidt, J. et al. Large-scale machine-learning-assisted exploration of the whole materials space. arXiv preprint arXiv:2210.00579 (2022).
  • (63) Peng, A., Liu, X., Guo, M.-Y., Zhang, L. & Wang, H. The openlam challenges. arXiv preprint arXiv:2501.16358 (2025).
  • (64) Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational materials science 6, 15–50 (1996).
  • (65) Ong, S. P. et al. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science 68, 314–319 (2013).
  • (66) Larsen, A. H. et al. The atomic simulation environment—a python library for working with atoms. Journal of Physics: Condensed Matter 29, 273002 (2017).
  • (67) Wang, V., Xu, N., Liu, J.-C., Tang, G. & Geng, W.-T. Vaspkit: A user-friendly interface facilitating high-throughput computing and analysis using vasp code. Computer Physics Communications 267, 108033 (2021).
  • (68) Togo, A., Chaput, L., Tadano, T. & Tanaka, I. Implementation strategies in phonopy and phono3py. J. Phys. Condens. Matter 35, 353001 (2023).
  • (69) Togo, A. First-principles phonon calculations with phonopy and phono3py. J. Phys. Soc. Jpn. 92, 012001 (2023).
  • (70) He, Y. Machine Learning topological characteristics from multiple electronic materials databases. Ph.D. thesis, UCL-Université Catholique de Louvain (2023).
  • (71) Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Physical review letters 77, 3865 (1996).
  • (72) Peralta, J. E., Heyd, J., Scuseria, G. E. & Martin, R. L. Spin-orbit splittings and energy band gaps calculated with the heyd-scuseria-ernzerhof screened hybrid functional. Physical Review B—Condensed Matter and Materials Physics 74, 073101 (2006).
  • (73) Shang, C. & Liu, Z.-P. Stochastic surface walking method for structure prediction and pathway searching. Journal of Chemical Theory and Computation 9, 1838–1845 (2013).
  • (74) Shang, C., Zhang, X.-J. & Liu, Z.-P. Stochastic surface walking method for crystal structure and phase transition pathway prediction. Physical Chemistry Chemical Physics 16, 17845–17856 (2014).
  • (75) Marzari, N. & Vanderbilt, D. Maximally localized generalized wannier functions for composite energy bands. Physical review B 56, 12847 (1997).
  • (76) Souza, I., Marzari, N. & Vanderbilt, D. Maximally localized wannier functions for entangled energy bands. Physical Review B 65, 035109 (2001).
  • (77) Marzari, N., Mostofi, A. A., Yates, J. R., Souza, I. & Vanderbilt, D. Maximally localized wannier functions: Theory and applications. Reviews of Modern Physics 84, 1419–1475 (2012).
  • (78) Mostofi, A. A. et al. An updated version of wannier90: A tool for obtaining maximally-localised wannier functions. Computer Physics Communications 185, 2309–2310 (2014).
  • (79) Wu, Q., Zhang, S., Song, H.-F., Troyer, M. & Soluyanov, A. A. Wanniertools: An open-source software package for novel topological materials. Computer Physics Communications 224, 405–416 (2018).
  • (80) Fu, L. & Kane, C. L. Topological insulators with inversion symmetry. Physical Review B—Condensed Matter and Materials Physics 76, 045302 (2007).
  • (81) Fu, L., Kane, C. L. & Mele, E. J. Topological insulators in three dimensions. Physical review letters 98, 106803 (2007).
  • (82) Yu, R., Qi, X. L., Bernevig, A., Fang, Z. & Dai, X. Equivalent expression of z 2 topological invariant for band insulators using the non-abelian berry connection. Physical Review B—Condensed Matter and Materials Physics 84, 075119 (2011).

Acknowledgements

We thank Shigang Ou, Ruihan Zhang, Jingyu Yao, Yue Xie, Yi Yan, and Yuanchen Shen for useful discussions. This work was supported by the Science Center of the National Natural Science Foundation of China (Grant No. 12188101), the National Key Research and Development Program of China (Grant Nos. 2023YFA1607400 and 2022YFA1403800), and the National Natural Science Foundation of China (Grant Nos. 12274436 and 11921004). H.W. acknowledges support from the New Cornerstone Science Foundation through the XPLORER PRIZE.

Author contributions

C.Y., H.W. and Q.W. conceived the idea and performed the analysis. C.Y. developed and implemented the PODGen framework, designed and executed the crystal generation workflow, and conducted the DFT, SymTopo, and DFPT calculations. Y.W. performed the WannierTools calculations. X.X. and Z.L. carried out the SSW calculations. T.Z. presented the results on the website. Y.H. provided the topological materials dataset and performed data cleaning. All authors contributed to the interpretation of the results and the writing of the manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information

Correspondence and requests for materials should be addressed to Quansheng Wu.