From Metadata to FAIR Principles: Make Your Data Better Now –  Love Your Data Blog Series Part 3

The Max Planck Institute for Legal History and Legal Theory (mpilhlt) has recently approved its Research Data Policy. In accordance with this document, our institute commits to systematic management of research data in line with established standards and best practices, thus assuring the quality of research, satisfying legal and ethical requirements and contributing to the responsible handling of resources.

The Love Your Data blog series explains the main concepts of research data management (RDM) and their practical application to legal historical research. In the previous blog posts we discussed what RDM is, what it isn’t, and what advantages it can bring to your research project. We have also talked about the research data life cycle and how RDM comes into play at each stage. This post will explain the importance of metadata and FAIR principles with examples from legal history. Let’s dive in!

Read Part 1 “What is Research Data Management and Why it Matters” here

Read Part 2 “A Researcher’s Guide to the Data Life Cycle” here

What is Metadata and Why is it Important?

My previous blog post explained in detail what counts as research data in legal history, so hopefully you already have an idea of what amounts to research data in your project. Another important type of data is metadata. An easy way to understand what metadata is, and why we need it, is to think of metadata as data about your data: information that describes other data. Metadata is a formalised and standardised description of what your data contains, what properties it has, how and where it was collected, and how it can be (re)used by others. It matters not only for human scholars but also for machines, which rely on it when we search for scholarly data online.

How detailed your metadata is depends on you, but it is generally recommended to make it as extensive, accurate and clear as possible so that it is understandable for others. It is a good idea to include the following categories in your metadata (a minimal example follows the list):

  • a description of your dataset such as the dataset’s creator, publisher, creation and publication date, title, summary, keywords, etc;
  • unique persistent identifiers such as DOI (or others);
  • content description (what documents your dataset contains, from which time periods and places, which archive it comes from, and so on);
  • access rights, stating who may access the data and how it may be used. It is always a good idea to use one of the Creative Commons licences (CC BY 4.0 is a recommended open access licence). If the data is not under copyright, you should indicate that it is in the public domain; and
  • any information on how your data is related to other digital objects (Is there a publication based on the dataset? Is there another more updated version of the dataset? Is your data related to another dataset?).
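
To make this concrete, here is a minimal sketch of such a record in Dublin Core XML (a general-purpose metadata standard discussed further below). Every value in it, including the title, the names and the DOIs, is invented for illustration:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical Dublin Core record; all values below are invented. -->
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Notarial Records from the Imaginary City Archive, 1600-1650</dc:title>
      <dc:creator>Example, Researcher</dc:creator>
      <dc:publisher>Max Planck Institute for Legal History and Legal Theory</dc:publisher>
      <dc:date>2025-06-12</dc:date>
      <dc:description>Photographs of 120 previously undigitised notarial contracts.</dc:description>
      <dc:subject>legal history</dc:subject>
      <!-- unique persistent identifier -->
      <dc:identifier>https://doi.org/10.1234/example-dataset</dc:identifier>
      <!-- access rights / licence -->
      <dc:rights>CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)</dc:rights>
      <!-- relation to other digital objects, e.g. a publication based on the dataset -->
      <dc:relation>https://doi.org/10.1234/related-publication</dc:relation>
    </metadata>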

Why is it crucial to always accompany your dataset with rich and clear metadata? Imagine the following situation: you went to an archive and took pictures of what felt like thousands of very important historical documents that had never been digitised before. You are excited to publish them as open data and share them with other scholars in your field. Unfortunately, because you were unaware of how crucial metadata is, you did not record any at the time. In that case, publishing your dataset and sharing it with others would be very problematic, if not impossible. First of all, nobody would be able to find out what your data contains: which documents, from where, from which period, and so on. The only way to establish these important characteristics would be to open your dataset and go through each file one by one, reading each document individually. In this scenario it would be hopeless to try to get a quick overview of the data or to search the dataset for something in particular. It would also be very hard for others to find your data online: search engines rely on indexing to make scholarly data discoverable, and indexing in turn relies on rich metadata.

Even if you cannot share your data with others, it is still recommended that you publish your dataset’s metadata online (Wilkinson et al., 2016), as rich metadata will allow others to understand what your sources are about even without access to the data itself. The idea may sound strange: publishing metadata without any actual data? But, again, imagine yourself in quite a common situation: you collect some documents in an archive and, unfortunately, the archive gives you permission neither to publish these documents nor to share them internally with your colleagues. You make a spreadsheet meticulously documenting the most important metadata about your data. For each document in your collection, you write down its title, the year it was produced, who produced it and where, its genre and topic, which archive it was found in, under which inventory number it is held, how many pages it contains and other important characteristics. Publishing this metadata will, firstly, help other scholars to understand what archival material your research is based on (even without access to the material itself): for example, which period it covers, how large your dataset is, what types of documents it comprises, what geographical areas are included, and so on. Secondly, it can be very handy for other scholars working in your field to learn which materials are held in which archive; it is not always possible to find this out online beforehand. Even where such information is available in advance, your metadata will save others a lot of time and effort and give them a good sense of where to start, especially when sources are scattered across different places.
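
As a sketch, such a spreadsheet, saved for instance as a CSV file, might look like this; the column names and the sample rows are invented for illustration:

    title,year,producer,place,genre,topic,archive,inventory_number,pages
    "Contract of sale",1624,"unknown notary","Frankfurt am Main","notarial deed","property transfer","Imaginary City Archive","INV-042",3
    "Court judgment",1631,"municipal court","Frankfurt am Main","judgment","inheritance dispute","Imaginary City Archive","INV-117",12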

In the end, taking care of your metadata may turn out to be useful for you too. If, in five years’ time, you decide to reuse the data you collected for another project, you will thank yourself for investing in your metadata, as it will help you get an overview of your sources and understand them better.

Metadata is a love note to the future [Digital Image]. (2012) © CC BY 2.0 cea+

That is why creating rich metadata for your data is very important.

Furthermore, to make your metadata discoverable and searchable by machines, it should follow some degree of standardisation and formalisation, sticking to one of the community-endorsed metadata standards such as Dublin Core, TEI or DDI. Which standard you choose depends on the type of data and on what is normally used for this type of data in your scholarly community. For example, Dublin Core is a general-purpose standard that you can use in most cases, while TEI is common for textual data encoded in XML. You can find more information about existing standards here. When you publish your dataset in a repository, the repository will prompt you to fill in some metadata, and a good repository will most likely have a metadata standard in place. However, it will usually require only a minimum of obligatory metadata elements, so how much detail you go into beyond that is up to you.

It is also recommended to use controlled vocabularies (standardised sets of keywords or phrases) for your metadata, as they enhance formalisation. For example, when you tag your field of study, what do you use? Is it law? Is it legal history? Is it jurisprudence? Maybe legal studies? Should you go into more detail and say moral theology or canon law instead? The term you choose will influence how easy or difficult it is for others to find your data online. Controlled vocabularies prevent such ambiguity and make sure that we all use the same terms to talk about the same concepts. They are domain-specific and come in the form of ontologies, taxonomies and so on. Find out whether a controlled vocabulary exists for your domain: you can start your search from generic vocabulary registries like BARTOC or FAIRsharing, or ask your colleagues and library what they use in their field. At our institute, several projects have already created controlled vocabularies and ontologies for their research phenomena, for example, the ‘Non-State Law of Economy’ and RHONDA projects as well as the ongoing OrDi project.
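
In practice, using a controlled vocabulary often means recording not just a free-text keyword but also the identifier of the corresponding concept in a published vocabulary, so that humans and machines alike know exactly which concept is meant. A minimal sketch in Dublin Core XML; the vocabulary URI below is invented:

    <!-- free-text keyword: ambiguous on its own (law? legal history? jurisprudence?) -->
    <dc:subject>legal history</dc:subject>
    <!-- the same concept pinned to a (hypothetical) controlled-vocabulary URI -->
    <dc:subject>https://vocab.example.org/disciplines/legal-history</dc:subject>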

The FAIR Data Principles for Research Data

The FAIR Principles, first published in 2016, are an established standard and serve as guidelines that help improve the handling of research data. There are 15 principles, which aim to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Please note that the FAIR principles apply to both data and metadata (as we have just discussed, the most valuable data is useless without (FAIR) metadata). They can also be handy for other digital non-data objects such as code and software, workflows, thesauri, ontologies, taxonomies, corpora, digital editions, annotations, audio and video material and many more. We can also use the term FAIR when speaking about organisations and archives: an institution is FAIR-enabling if it sticks to these principles. On an even bigger scale, we can talk about FAIR ecosystems, in which FAIR-enabling organisations build FAIR infrastructures and use FAIR workflows to produce FAIR (meta)data.

Another important point about data FAIRness is that it is aimed not only at human researchers but also at machines. In our research, we rely heavily on the assistance of computers, whether searching for information online or working with large amounts of data. That is why it is essential that our research (meta)data is discoverable and actionable for computers as well.

Now let’s look at the FAIR principles and try to understand each of them in practice. You can find the full version of all 15 principles with detailed explanations here. I will cover only the most important aspects of FAIR below.

FAIR Qualities. (2025). © CC BY 4.0 CESSDA ERIC

Findability

The findability principle basically means that it should be possible to (automatically) discover your dataset. To achieve this, you need to make sure that your dataset has a globally unique and persistent identifier (for example a DOI, but there are others as well), has rich metadata describing it (remember what we talked about before: the absence of good metadata can make the most precious data useless), and is registered in a searchable resource such as a data repository or catalogue (otherwise it will not be discoverable).

Accessibility

For the accessibility principle, it is important to note that this is not the same as open access, which is a common misconception. Rather, it means that the terms of access should be clearly stated for human as well as computational users (through a licence, for example), and any limitations should be indicated. In cases where it is not possible to publish your data openly, it is highly recommended to publish metadata, as it will allow others to get an impression of your dataset even if it is not openly accessible. Metadata should stay accessible over time.

Interoperability

The interoperability principle means that your data can easily be integrated with other data and/or does not require special (proprietary) software to be opened. You can achieve this, firstly, by using open, standard, non-proprietary data formats and, secondly, by using controlled vocabularies as well as standards established in your community. For example, when we work on digital editions in the ‘School of Salamanca’ or ‘Non-State Law of Economy’ projects, we use the XML format for the transcriptions of our texts that will be published online. This format is standard for encoding digital editions. It is both human- and machine-readable, it is software-independent (you can open a document in any text editor), it can easily be transformed into a PDF or an HTML file, and it allows for the representation of a text’s structural parts (such as headings, paragraphs and chapters) without mixing content and layout. Because the format is flexible, anyone can create their own tags (a heading could be tagged as <head>, <title>, <heading>, <titel>, etc.), so it is important to use community standards for encoding text; otherwise everyone will use different tags for the same things. That is why we follow the guidelines of the Text Encoding Initiative (TEI), which make recommendations on how to tag structural elements in different text genres and document types.
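
To give an impression, here is a minimal sketch of a TEI-encoded text. The structural elements (<div>, <head>, <p>) follow the TEI Guidelines; the heading and the sample sentence (a famous line from Justinian’s Institutes) merely stand in for a real transcription:

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <!-- bibliographic metadata about the source and the transcription (fileDesc etc.) -->
      </teiHeader>
      <text>
        <body>
          <div type="chapter">
            <!-- the chapter heading, marked up as structure, not as layout -->
            <head>De iustitia et iure</head>
            <p>Iustitia est constans et perpetua voluntas ius suum cuique tribuendi.</p>
          </div>
        </body>
      </text>
    </TEI>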

Reusability

Finally, reusability is the ultimate goal of the FAIR principles: enabling others to reuse data in their research. To make data reuse possible, you need to make sure that your (meta)data is sufficiently described and documented, so that others can easily understand what the data is about, whether it is suitable for them, and how it can be reused. This again highlights the importance of licences, of rich metadata including data provenance information (Where does the data come from? How and by whom was it collected? How should it be cited? How was it processed?) and of community standards (in data formats, metadata and vocabulary).
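
Provenance can itself be captured in metadata. DCMI Metadata Terms, the extended companion to Dublin Core, includes elements for exactly this; in the hypothetical sketch below, all values are invented:

    <metadata xmlns:dcterms="http://purl.org/dc/terms/">
      <!-- where the data comes from and how it was processed -->
      <dcterms:provenance>Photographed by the author at the Imaginary City Archive in 2024;
        transcriptions produced manually and checked by a second reader.</dcterms:provenance>
      <dcterms:source>Imaginary City Archive, inventory no. INV-042</dcterms:source>
      <!-- how the dataset should be cited -->
      <dcterms:bibliographicCitation>Example, R. (2025). Notarial Records Dataset.
        https://doi.org/10.1234/example-dataset</dcterms:bibliographicCitation>
      <dcterms:license>https://creativecommons.org/licenses/by/4.0/</dcterms:license>
    </metadata>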

Make Your Data Better Now

In conclusion, the FAIR principles matter not only for sharing your data with others but also for keeping your data accessible and usable in the long run. As with metadata, thinking about these RDM aspects, including data formats, licences and standards, from the very beginning will prolong your data’s life cycle and make sure that your dataset is still (re)usable in ten years’ time.

Coming up next in Part 4: Data’s Legal and Ethical Aspects.

References:

CESSDA Training Team. (2025). CESSDA Data Archiving Guide version 4.0. Bergen, Norway: CESSDA ERIC.

Deutsche Forschungsgemeinschaft. (2025). Guidelines for safeguarding good research practice. Code of conduct.

DANS. (n.d.). FAIR Aware. https://fairaware.dans.knaw.nl/

GO FAIR. (n.d.). FAIR Principles.

Max Planck Gesellschaft. (2021). Responsible acting in science: Rules of conduct for good scientific practice – How to handle scientific misconduct.

Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., & Gonzalez-Beltran, A. (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18

Feature image: Server at mpilhlt © Christiane Birr


Cite as: Solonets, Polina: From Metadata to FAIR Principles: Make Your Data Better Now – Love Your Data Blog Series Part 3, legalhistoryinsights.com, 12.06.2025, https://doi.org/10.17176/20250701-132952-0
