2 Introduction

2.1 Context

Proteins, among the fundamental building blocks of life, can be classified into hierarchical categories based on their structural and functional similarities. This classification helps scientists understand protein evolution, function, and relationships. The concept of protein family was established in the 1970s, when few protein sequences and structures were known and most of them were small and consisted of a single domain. Since then, the massive increase in available protein sequences and 3D structures has led to finer-grained definitions, such as the superfamily and subfamily levels.

Superfamily: The superfamily is the broadest level in the protein classification hierarchy. Members of a superfamily share distant evolutionary ancestry and often have a common structural fold or domain, but they can perform a wide range of functions. A superfamily may therefore include proteins with vastly different biological roles that still exhibit similarities in their overall structure or in specific domains.
Family: Within a superfamily, proteins are further organized into families. Members of a family are more closely related in sequence, structure, and function than members of the superfamily as a whole, and they typically share a common evolutionary origin and structural features. They often perform similar functions, although subtle variations in sequence or structure can still produce some functional diversity within a family.
Subfamily: The subfamily is the most specific level of classification. Subfamilies are groups of proteins within a family that are even more closely related in sequence, structure, and function. Members of a subfamily often perform very similar or identical functions and may have diverged more recently from a common ancestor.

This hierarchy introduces granularity into the protein family concept: each level provides a scale of analysis at which the regions or residues that distinguish one group from another can be identified.

2.2 Ordalie

Ordalie (ORDered ALignment Information Explorer) is an interactive tool for exploring the information content of a Multiple Sequence Alignment (MSA) in a hierarchical manner and within different contexts, such as phylogeny or 3D structure.

**Figure 1:** Diagram of the Ordalie philosophy

The Ordalie philosophy (see Figure 1) rests on its capacity to perform a simultaneous multi-scale analysis along three axes: the amino acid sequence axis, the taxa axis, and the contexts axis.
The information distributed along the amino acid sequence (represented by the horizontal axis in Figure 1) can be analyzed across several scales:

Large-scale features: overall domain architecture, multi-domain organization, and broad evolutionarily conserved regions.
Intermediate-scale features: secondary structure elements (helices and strands), low-complexity regions, functional motifs (such as SLiMs), and molecular recognition patches.
Local or residue-level features: specific post-translational modification (PTM) sites, catalytic residues, individual mutation positions, and site-specific conservation scores.

The taxonomic depth of the study constitutes another key axis of analysis. The MSA can be exploited at three main scales:

Global scale: To define family-wide hallmarks and conserved motifs.
Intermediate scale: To distinguish subfamilies by identifying residues that are specifically conserved within a subgroup but differ from the rest of the family.
Individual scale: To pinpoint mutations or unique variations specific to a particular taxon.
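
The intermediate scale lends itself to a short illustration. The Python sketch below is not Ordalie's implementation: it assumes a toy alignment stored as a dictionary, a hypothetical subfamily labelling, and a deliberately strict rule (a residue conserved inside the target group and absent from every other sequence at that column). The function names `conserved_residue` and `subfamily_specific_positions` are illustrative only.

```python
from collections import Counter


def conserved_residue(column, threshold=1.0):
    """Return the dominant residue of a column if its frequency among
    non-gap characters reaches `threshold`, otherwise None."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= threshold else None


def subfamily_specific_positions(alignment, groups, target, threshold=1.0):
    """List (position, residue) pairs where `residue` is conserved within
    the `target` group and does not occur in any other sequence.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    groups:    dict mapping sequence name -> group label (assumed given)
    """
    inside = [s for name, s in alignment.items() if groups[name] == target]
    outside = [s for name, s in alignment.items() if groups[name] != target]
    length = len(next(iter(alignment.values())))

    hits = []
    for pos in range(length):
        residue = conserved_residue([s[pos] for s in inside], threshold)
        if residue is not None and all(s[pos] != residue for s in outside):
            hits.append((pos, residue))
    return hits


# Toy data, invented for the example.
alignment = {
    "seqA1": "MKTAYIAK",
    "seqA2": "MKTAYLAK",
    "seqB1": "MKSAYIGK",
    "seqB2": "MKSAY-GK",
}
groups = {"seqA1": "A", "seqA2": "A", "seqB1": "B", "seqB2": "B"}

print(subfamily_specific_positions(alignment, groups, "A"))
# -> [(2, 'T'), (6, 'A')]
```

Positions are 0-based column indices. Lowering `threshold` below 1.0 would tolerate a few exceptions inside the target group, which is usually needed on real alignments.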

The third analytical axis focuses on the diverse computational contexts integrated within Ordalie, ranging from residue conservation analysis and phylogenetic tree rendering to the mapping of external features and 3D structure visualization. A key strength of the platform is that all these analyses are unified within a structural framework, allowing sequence-based features to be spatially mapped and compared directly onto the 3D structures included in the alignment.

In summary, Ordalie provides an integrated framework in which data from different analytical dimensions can be cross-compared directly. Whether adjusting the taxonomic scale to explore broad evolutionary patterns or zooming in on taxon-specific features, this integrative approach is essential for unraveling the multi-faceted relationships governing the sequence-structure-function-evolution paradigm.

2.3 Database and Snapshot Management

At the core of Ordalie lies a dedicated SQLite database. This architecture provides efficient data handling and persistent storage for both system-wide configurations (colors, thresholds, and default values) and alignment-specific data (sequences, annotations, and biological features).

One of Ordalie's most powerful features is its ability to manage multiple analytical iterations through a versioning system:

The Database:

The database acts as the central repository, ensuring data integrity across sessions. The native .ord file format bundles all project metadata into a single, portable relational database.

Snapshots:

A snapshot represents a discrete state of the analysis at a given time. It captures a specific clustering configuration or a particular alignment variation. This allows users to:

Record and compare different analytical hypotheses.
Navigate through the history of the project without data loss.
Annotate specific results via the Annotation Tool (see section ).

Snapshots are managed via the dedicated Snapshot Bar (see section 3.5.2), providing a seamless way to switch between different views of the same biological dataset.
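
To illustrate how several snapshots can coexist with the alignment data in a single SQLite database, the sketch below uses Python's standard `sqlite3` module. The schema (tables `sequence`, `snapshot`, and `cluster_assignment`) and the helpers `save_snapshot` and `load_snapshot` are hypothetical and chosen for the example; the actual layout of an .ord file is not described in this section.

```python
import sqlite3

# Hypothetical schema: one table of sequences shared by all snapshots,
# and one clustering per snapshot.
SCHEMA = """
CREATE TABLE sequence (
    name    TEXT PRIMARY KEY,
    aligned TEXT NOT NULL
);
CREATE TABLE snapshot (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE cluster_assignment (
    snapshot_id INTEGER NOT NULL REFERENCES snapshot(id),
    seq_name    TEXT    NOT NULL REFERENCES sequence(name),
    cluster     TEXT    NOT NULL,
    PRIMARY KEY (snapshot_id, seq_name)
);
"""


def save_snapshot(conn, label, clustering):
    """Store a clustering (dict: sequence name -> cluster) under `label`."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO snapshot (label) VALUES (?)", (label,))
        snapshot_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO cluster_assignment VALUES (?, ?, ?)",
            [(snapshot_id, name, c) for name, c in clustering.items()],
        )
    return snapshot_id


def load_snapshot(conn, label):
    """Return the clustering recorded under `label` as a dict."""
    rows = conn.execute(
        "SELECT a.seq_name, a.cluster "
        "FROM cluster_assignment AS a "
        "JOIN snapshot AS s ON s.id = a.snapshot_id "
        "WHERE s.label = ? ORDER BY a.seq_name",
        (label,),
    )
    return dict(rows)


# An in-memory database keeps the example self-contained; passing a file
# path to sqlite3.connect() would make the data persistent instead.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?)",
        [("seqA1", "MKTAYIAK"), ("seqB1", "MKSAYIGK")],
    )

save_snapshot(conn, "two clusters", {"seqA1": "A", "seqB1": "B"})
save_snapshot(conn, "single cluster", {"seqA1": "all", "seqB1": "all"})

print(load_snapshot(conn, "two clusters"))
# -> {'seqA1': 'A', 'seqB1': 'B'}
```

Because each snapshot only adds rows keyed by its own identifier, earlier states remain untouched, which is one simple way to compare hypotheses or move through the project history without data loss.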