Interpretable structural-semantic decoding reveals language-like organisation of regulatory information in DNA

Li Yang, Dongbo Wang

Published: 2025/3/30

Abstract

Decoding how linear DNA encodes regulatory information remains a central challenge. Existing decoding approaches lack interpretability and struggle to reveal the underlying coding principles. Here, we present the interpretability-first, structural artificial intelligence (AI) framework for DNA (ISAF4DNA), which uses state-aware symbolic encoding and couples structural unit discovery with semantic validation to form a closed-loop structural-semantic decoder. When applied to N6-methyladenine (6mA) datasets from 63 species, ISAF4DNA reveals a language-like organization of regulatory information: (i) a conserved motif-derivation pathway AT -> GAT/ATC -> GATC; (ii) two forms of redundant syntax: anchor-type structures with a conserved core and selective flanks, and fuzzy-type clusters composed of distributed units with positional tolerance; and (iii) differential deployment trends between prokaryotes and multicellular eukaryotes. Together, these observations motivate the development of a testable framework, EpigenoLinguistics, that treats motifs as lexical units, redundancy as syntax, and deployment as pragmatics. This framework advances the "DNA as language" concept from a metaphor to a falsifiable framework with supporting evidence, thereby bridging biology and computational linguistics. ISAF4DNA advances the application of AI techniques in biology from black-box predictions to mechanism-level signals, augments database annotations, and guides regulatory-element design, with principles extensible to other modifications.
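
The motif-derivation pathway AT -> GAT/ATC -> GATC named in the abstract can be read as a chain in which each motif extends a shorter one by a single base on one side. The Python sketch below only illustrates that nesting; it is not the ISAF4DNA method, and the function name `derivation_steps` and the stage-list representation are assumptions introduced here for illustration.

```python
# Illustrative sketch only: checks that each motif in a stage extends a motif
# from the previous stage by exactly one base. Not the ISAF4DNA algorithm.

def derivation_steps(pathway):
    """Return (parent, child, side) for every single-base extension found.

    `pathway` is a list of stages; each stage is a list of motif strings.
    `side` is "left" if the new base is prepended, "right" if it is appended.
    """
    steps = []
    for parents, children in zip(pathway, pathway[1:]):
        for child in children:
            for parent in parents:
                if len(child) != len(parent) + 1:
                    continue
                if child.endswith(parent):
                    steps.append((parent, child, "left"))
                elif child.startswith(parent):
                    steps.append((parent, child, "right"))
    return steps


# Stages taken from the abstract: AT -> GAT/ATC -> GATC
PATHWAY = [["AT"], ["GAT", "ATC"], ["GATC"]]

steps = derivation_steps(PATHWAY)
for parent, child, side in steps:
    print(f"{parent} -> {child} (base added on the {side})")
```

Under this reading, GATC is reachable from AT along two routes, one through GAT and one through ATC, which matches the branched form of the pathway as written in the abstract.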

Read Full Paper (arXiv.org)