Aims & methods
To meet our objectives – understanding the nature of pagan Tibetan religion by reconstructing the pantheon, the ritual practices, and the worldview of early Tibet – we have formulated ten Work Packages, spread over three successive Phases.
Phase 1 (WP1–7): Data collection, curation and selection
Phase 2 (WP8–9): Annotation and analysis
Phase 3 (WP10): Synthesis: reconstructing Tibetan Pagan religion
The Work Packages
WP1. Data selection and fieldwork preparation
The more experienced project members will make a selection of the current Leyu-Bai raw materials by excluding texts that belong to the Buddhicised Yungdrung Bön tradition: these texts are “impostors”, insofar as they have been imposed on the Leyu priests by missionary Bönpo monks and do not belong to a local transmission.
WP2. Curation of Leyu-Bai written data
Most of the Leyu-Bai collections are currently available only as digitised images of the manuscripts. Team members will create Handwritten Text Recognition (HTR) models, which facilitate computer-assisted transcription of large amounts of historical handwritten data. No HTR models currently exist to aid automatic transcription of the unusual form of the “headless” (ume) Tibetan script in which much of the Leyu-Bai and Black Water collections are written. We will therefore develop and optimise ume transcription models (building on existing experience from the ERC grant TibSchol, CoG 101001002); a selection of these transcriptions will be manually corrected by experienced monks from Triten Norbutse monastery in Kathmandu to enlarge our Ground Truth dataset, which will in turn improve our models’ accuracy. Collaboration with Esukhia and the Buddhist Digital Resource Center (BDRC) means we can draw on their tools and expertise, which will not only optimise our workflow but also guarantee that our project outputs (digital images and transcriptions) remain accessible and maintained on their servers after the project has ended. Although all materials are written in some form of Tibetan, some are extremely obscure, with complex features not found in regular Classical Tibetan texts. These include non-standard letter shapes, contracted forms of multiple syllables, abbreviations and other graphs representing divinities, cardinal numbers and so forth, many of which are not found in other Tibetan manuscript traditions. These codicological peculiarities will themselves be an object of study within the project, leading to a “glossary” and database of such features.
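The accuracy gain from manually corrected Ground Truth described above is typically measured as Character Error Rate (CER): the edit distance between the model's output and the corrected reference, normalised by reference length. A minimal sketch (the Wylie-transliterated strings are invented examples, not project data):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    return levenshtein(ground_truth, hypothesis) / max(len(ground_truth), 1)

# Toy example: one substituted character in a nine-character reference.
gt, hyp = "bdud rtsi", "bdud rtse"
print(round(cer(gt, hyp), 3))  # -> 0.111
```

A falling CER on held-out pages is the usual signal that the enlarged Ground Truth is actually improving the model.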
WP3. Audiovisual fieldwork data collection
Parts of the data we intend to collect currently exist only in oral form. Priests (leyu and lhabön, among others) of these two traditions have agreed to recite the narratives and to perform the accompanying rituals, permitting high-quality audiovisual recordings of these unique oral traditions. We will train Automatic Speech Recognition (ASR) models to facilitate computer-assisted transcription of the recordings.
WP4. Nepal fieldwork for training
The PI and other team members will make at least three trips to Nepal to collaborate with Bön monks there and to ensure that everyone on the team receives the necessary specialist codicological transcription training.
WP5. Curation of audiovisual fieldwork data
Video footage of rituals will be edited to produce short and long versions of the performances. Dialogue will be transcribed with the help of ASR models, and the text translated into English and Chinese. Selections from these transcriptions will be used for subtitles and for narrative descriptions of the ritual procedures. The work package will result in:
- annotated videos of rituals;
- Tibetan and Chinese ASR models and training data, which will be made available for use by colleagues;
- descriptions of ritual performances and translations of oral and written liturgies.
WP6. Preprocessing of all materials
In order to annotate the texts, the raw transcriptions will need further preprocessing in the form of normalisation and segmentation before they can be analysed as searchable eTexts. Normalisation includes expanding contracted forms, recognising and converting non-Tibetan characters, decoding abbreviations, identifying dialect terms and converting archaic features into more familiar forms.
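The normalisation and segmentation steps described above amount to a pipeline of lookup tables and token-level rewriting. A minimal sketch, in which the contraction entry is an invented placeholder (the archaic spellings myi/myed for mi/med are genuine Old Tibetan forms, but real entries would come from the project's glossary):

```python
import re

# Hypothetical lookup tables; real entries would be drawn from the
# project's glossary of contractions, abbreviations and archaic forms.
CONTRACTIONS = {"bkrashis": "bkra shis"}     # expand contracted syllables
ARCHAIC = {"myi": "mi", "myed": "med"}       # archaic -> Classical spelling

def normalise(text: str) -> str:
    """Expand contractions, then modernise archaic syllables."""
    for contracted, expanded in CONTRACTIONS.items():
        text = text.replace(contracted, expanded)
    # Swap archaic forms only at whole-syllable boundaries.
    def swap(match: re.Match) -> str:
        syllable = match.group(0)
        return ARCHAIC.get(syllable, syllable)
    return re.sub(r"\S+", swap, text)

def segment(text: str) -> list[str]:
    """Naive whitespace segmentation into syllables for eText search."""
    return text.split()

print(normalise("myi bkrashis"))             # -> mi bkra shis
print(segment(normalise("myi bkrashis")))    # -> ['mi', 'bkra', 'shis']
```

Only after this pass do string-based operations (indexing, search, annotation) become reliable on the transcriptions.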
WP7. Text classification and selection
In order to identify the relevant and most interesting texts in our vast data collections, we need to create a catalogue of their content. Recent advances in computational humanities employ deep-learning methods that automatically detect topics, patterns and other content features, extracting meaningful information that will help us organise and classify the texts in our data. Text classification is the procedure of assigning pre-defined labels to textual units (sentences, paragraphs, documents). It is an essential task in many Natural Language Processing (NLP) applications, such as Topic Modelling (TM), Named-Entity Recognition (NER) and Information Retrieval (IR).
Topic modelling is an unsupervised approach to recognising topics in large amounts of text by detecting recurring patterns. Based on the topics, the text collections are divided into different parts and given topic labels. These labelled parts can then serve as preliminary catalogues that display the content of these huge text collections.
Information Retrieval will apply Semantic Textual Similarity (STS) techniques to large datasets based on advanced language models (contextual word embeddings and transformers). These will allow us to refine our topic-based catalogues, as they facilitate content-based queries of all our collections, even when the exact words, spellings or phrasing differ from our search input. Such STS techniques are crucial because the non-standard orthography, contractions and abbreviations in our data defeat simple (or even fuzzy) string searches. This type of automatic cataloguing based on meaningful content will enable us to find the relevant non-Buddhist parts of these collections for further study.
Currently, only a fraction of the contents of the Leyu and Baima collections is known, and one of the important outputs of this work package will be a catalogue of these collections. This comprehensive catalogue will enable us to choose the most interesting texts to be studied in detail in Phases 2 and 3 of the project.
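The content-based retrieval described above can be sketched end to end with TF-IDF vectors and cosine similarity; the toy catalogue entries are invented, and in the real pipeline the vectors would come from contextual transformer embeddings rather than TF-IDF, so that queries match on meaning even across different wording:

```python
import math
from collections import Counter

# Toy corpus of hypothetical catalogue entries (not actual Leyu-Bai content).
docs = [
    "ransom ritual for the sky god",
    "funeral liturgy for the deceased",
    "ritual offering to the mountain god",
]

def tfidf_vectors(corpus: list[str]) -> list[dict]:
    """One sparse term->weight vector per document."""
    n = len(corpus)
    tokenised = [doc.split() for doc in corpus]
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(docs)
# A query phrased differently from any catalogue entry still ranks
# the semantically closest document first.
query = tfidf_vectors(docs + ["offering to a god of the mountain"])[-1]
ranked = sorted(range(len(docs)), key=lambda i: cosine(vecs[i], query),
                reverse=True)
print(docs[ranked[0]])  # -> ritual offering to the mountain god
```

The same ranking machinery, applied with transformer embeddings, is what makes content queries robust to the spelling and phrasing variation in the collections.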
WP8. Text annotation and project database
This work package involves the identification of religious variables and the development of annotation schemes. The variables comprise the names and identities of divinities and other entities in the Pagan pantheon, salient features of ritual practice, and the broader worldview that can be derived from these elements. Computer-assisted annotation will be applied, using NLP classification and identification techniques. In addition to a database of all the variables and features in our oral and written data, the work package will generate a number of tools, including:
- Named Entity Recognition (NER) models, which will identify entities such as heroes, divinities, demons and places and classify them into predetermined categories.
- Coreference Resolution: our texts are often ambiguous about the protagonists of certain mythic episodes, since names are often omitted after their first appearance. This tool will help us determine the identity of unnamed actors.
- Semantic Role Labelling (SRL): the tendency of our texts to omit case markers or to confuse their functions – genitives and agentives are often used interchangeably, as are locatives and ablatives – compounds the uncertainty of their meaning. SRL facilitates the assignment of semantic roles to words and phrases and clarifies narrative structure.
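At its simplest, the entity tagging described above maps surface forms onto predetermined categories. A minimal gazetteer-based sketch (the project's NER models would be trained statistically rather than written by hand; lha, gnyan and bdud are genuine Tibetan classes of divinities and demons, but the category labels are our own placeholders):

```python
# Hypothetical gazetteer: surface form -> predetermined category.
GAZETTEER = {
    "lha": "DIVINITY",
    "gnyan": "DIVINITY",
    "bdud": "DEMON",
}

def tag_entities(tokens: list[str]) -> list[tuple[str, str]]:
    """Label each token with its entity category, or 'O' (outside)."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

print(tag_entities(["bdud", "btul", "gnyan"]))
# -> [('bdud', 'DEMON'), ('btul', 'O'), ('gnyan', 'DIVINITY')]
```

Trained models generalise beyond such a fixed list, but their output has the same shape: token spans labelled with categories, ready to be loaded into the project database.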
WP9. In-depth individual text analyses of selected texts
The procedures carried out in the foregoing steps will greatly facilitate our comprehension of the selected texts, preparing the ground for this work package: producing summaries and annotations of these texts, as well as annotated paraphrases or full translations of a more limited selection.
WP10. Synthesis: reconstructing Tibetan Pagan religion
The third phase of the project comprises a single work package, devoted to the production of several monographs and syntheses. Each of the project members will produce a study of one or more selected texts, or of a particular theme, based on an examination of textual and, in some cases, oral sources. A synthetic monograph will bring together all the in-depth text analyses to address the core objectives of reconstructing the Pagan pantheon, ritual practices and worldview of early Tibet.
We will continue our computer-assisted approach by making optimal use of statistical methods of reconstruction. There is a long-standing tradition in biology of using phylogenetic algorithms to analyse large sets of genetic data. More recently, these quantitative phylogenetic techniques have been transferred, as “phylomemetics”, to a growing number of subjects in the humanities: to reconstruct proto-languages in linguistics, to build text-version genealogies of manuscript witnesses in philology, and to trace lineages of cultural transmission in anthropology. These methods are based on the principle that if we know the actual evolutionary distance between all variables, we can reconstruct their evolutionary history. This final phase of the project will employ such phylomemetic techniques to trace the evolutionary history of the texts in the Leyu-Bai collection, using philologically focused phylomemetic distance methods that can reveal how different manuscripts relate to one another. The synthetic monograph will address the core objectives and the main research question, providing a complete overview of Tibet’s Pagan religion. In addition, it will present the project’s groundbreaking computer-assisted methodology for reconstructing any religion on the basis of large collections of difficult, multi-modal data that have not yet been digitised.
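The phylomemetic idea can be illustrated in miniature: compute pairwise distances between variant readings of the same line in several manuscript witnesses, then cluster the witnesses so that the merge order sketches their relations. The readings below are invented, and single-linkage clustering over syllable overlap is a crude stand-in for the philologically weighted distance methods and proper phylogenetic algorithms the project would use:

```python
from itertools import combinations

# Hypothetical variant readings of one line in four manuscript witnesses.
witnesses = {
    "A": "lha la bdud rtsi phul",
    "B": "lha la bdud rtsi 'bul",
    "C": "lha ru bdud rtsi phul",
    "D": "gnyan la gser skyems phul",
}

def jaccard_distance(a: str, b: str) -> float:
    """Share of syllables NOT common to both readings."""
    sa, sb = set(a.split()), set(b.split())
    return 1 - len(sa & sb) / len(sa | sb)

# Pairwise distance matrix over all witnesses.
dist = {frozenset((x, y)): jaccard_distance(witnesses[x], witnesses[y])
        for x, y in combinations(witnesses, 2)}

def linkage(c1: frozenset, c2: frozenset) -> float:
    """Single-linkage distance between two clusters of witnesses."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

# Naive agglomerative clustering: repeatedly merge the two closest
# clusters; the merge order yields a crude tree of manuscript relations.
clusters = [frozenset([w]) for w in witnesses]
while len(clusters) > 1:
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
    print(sorted(clusters[i]), "+", sorted(clusters[j]))
    clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                + [clusters[i] | clusters[j]])
# Merge order: A+B first, then C joins {A, B}, then D joins last,
# reflecting that D's reading diverges most from the other three.
```

In the project itself, distances would encode philologically meaningful variation (shared errors, spelling strata, formulaic substitutions) across many lines, and the resulting trees would be built with established phylogenetic methods rather than this toy linkage.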