The Problem of Digital Dating, Part II: Some Further Principles
In the previous instalment of this discussion, I raised some questions about dating manuscripts in a digital context. I am happy to say that this has provoked some fairly active discussion, both publicly and privately, particularly at the meeting of the Advisory Board which we held the day after the Second DigiPal Symposium, but we are still a long way from a resolution. Before coming to some possibilities, here are some more thoughts on the issue, focusing particularly on some of the underlying principles, differences in which I think are partly the source of the problem.
Accuracy and Precision
One underlying issue here is that of accuracy and precision. According to my engineering training, 'accuracy' is essentially the degree to which our measurements correspond to the 'ground truth': how 'right' or 'correct' a number is, or at least our degree of confidence in that correctness. 'Precision', on the other hand, is how closely the measurement is specified entirely independently of its 'correctness'; how small the range of possible values is. (Note that this definition of precision is different from some that I have seen, for example on Wikipedia, but it is the one I was taught and is also more relevant for this discussion.) To take a trivial example, to say that the age of the British Prime Minister, David Cameron, is somewhere between zero and five hundred is perfectly accurate – nobody would argue that it is wrong – but it is so imprecise as to be useless. Conversely, to say that his age is fourteen years, three months, two days, six hours, eleven minutes and twenty-six seconds is extremely precise, but it is also not at all accurate (at least not as I write this in 2012). The point is that accuracy and precision should correspond. If you know that your measurement is not very accurate, for example because it is a rough approximation, then the figure you give should have a relatively low precision. Conversely, if you know that you have a very accurate measurement, then again you should give as much precision as you are able. This is a basic, intuitive principle that we follow all the time in our lives: if someone asked me to guess how old David Cameron is then I would say 'in his 40's' – about as precise as I think my guess can be – but if I look up his date of birth then I can be more precise because I have more confidence in the accuracy of my answer: namely, as of today, his age is (apparently) 46 years, one month and one day – though note that I still don't know how many hours, minutes or seconds.
What has this to do with dating manuscripts? The problem is that our traditional conventions are deliberately imprecise because we know that they are not very accurate: the point raised by Peter Kidd in the comments to Part I of this series. If we say that a manuscript was written early in the eleventh century then the point is that we do not know exactly what year that was – if we did then we would say so – and so we are giving the greatest precision we can based on our perceived accuracy. Similarly, if I search for manuscripts dated 's. xi1/4', I do not really mean 'anything written after midnight on the first of January in the year 1000 (or 1001?) and before midnight on the 31st of December in the year 1025 (or is it 1026?)'. This is clearly absurd. What I really mean is probably something more like 'anything dated to some time around about the period from the start of the century through to about the 1020s or so'. However, the computer is pressuring us to give a much greater precision to these dates than our accuracy should allow. Indeed, this tension is evident in the MASTER and TEI P5 Guidelines for dates: MASTER specified that 'values should contain a date in the ANSI (yyyy-mm-dd) format' (see here), whereas TEI P5 allows not only 'from' and 'to' attributes, but also 'not before', 'not after', 'when', or even a general named 'period' (see §13.1.3 of the TEI Guidelines, and note that this was essentially retained in the ENRICH standard for manuscript description). Still, these are restricted by the W3C standard for XML, which insists on specifying a start-year and end-year for any given range, but the ISO standard this is based on is apparently more general and allows other possibilities, including simply '13' for the thirteenth century (see §18.104.22.168 of the TEI Guidelines; the full ISO standard must be purchased but is available here). What does this mean in practical terms, then? 
Namely that the international body that decides how dates should be represented on the Internet has decreed that any time-period must be specified at least to the year, if not to the month, day, or more. (Incidentally, the rules also state that the year must be specified in the Gregorian calendar, a point that is often missed.) Rather more flexible and (arguably) appropriate for this material is the Extended Date Time Format, a different adaptation of the ISO standard which has been developed by the Library of Congress and which allows one to specify uncertainty in the date with a high level of precision.
In short, then, the computer can allow for more flexible and uncertain dating, but it is not easy to do so, and it usually requires going against international standards for manuscript description which in turn has consequences for interoperability, exchange and preservation of data, and users' understanding. If we are to do this then we have to be very sure that we understand the consequences and that it is really what we want.
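To make the pressure towards false precision concrete, here is a minimal sketch in Python of what happens when a conventional label such as 's. xi1/4' has to be stored as a pair of year bounds, loosely in the spirit of the 'not before' and 'not after' attributes mentioned above. The names DateRange, to_roman and quarter_century are hypothetical and are not taken from TEI, DigiPal or any other standard or project; the starts_at_01 flag is simply my way of exposing the '1000 or 1001?' question as an explicit choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateRange:
    """Inclusive year bounds for a conventional dating label (hypothetical type)."""
    label: str
    not_before: int
    not_after: int


def to_roman(n: int) -> str:
    """Lower-case Roman numeral; adequate for century numbers below 40."""
    out = ""
    for value, symbol in [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]:
        while n >= value:
            out += symbol
            n -= value
    return out


def quarter_century(century: int, quarter: int, starts_at_01: bool = True) -> DateRange:
    """Turn e.g. (11, 1) into hard year bounds for 's. xi1/4'.

    starts_at_01 decides whether the century is taken to begin in 1001
    or in 1000: the computer forces us to pick one.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    first_year = (century - 1) * 100 + (1 if starts_at_01 else 0)
    start = first_year + (quarter - 1) * 25
    return DateRange(f"s. {to_roman(century)}{quarter}/4", start, start + 24)


# The same label yields different bounds under the two conventions.
print(quarter_century(11, 1))
print(quarter_century(11, 1, starts_at_01=False))
```

Whichever convention is chosen, the stored bounds look far more exact than the palaeographical judgement behind them, which is precisely the mismatch between precision and accuracy described above.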
The Importance of Research Questions (or, What's it All For Anyway?)
These are some of the problems, then. On the one hand, the simple answer is to decide which of these systems one will follow and to go with it: as long as we are clear then one might think that it does not matter much which is followed. Alternatively, we may choose deliberately not to represent any single year, but rather to keep the dating periods as plain text without any numbers attached. However, questions then remain about how to represent these uncertain dates on a timeline, in search results and so on: in other words, even if we can store uncertain dates in a standard machine-readable way, this does not answer those questions raised above about how we want those dates to appear in our interface. The question still is what do we want, and the answer, as always, must surely be that it depends on what we are doing. If we are searching for material that we then want to look at in more detail – what could loosely be termed a 'close reading' of the manuscripts – then we probably usually want to err on the side of inclusiveness: my experience in focus groups on the ASCluster Project was that people tend not to trust the computer and so prefer to be given extraneous material which they can filter themselves, rather than missing information that was not returned in the search. In this case, we would presumably want a book dated 1425 to be listed as both 'saec. xv1/4' and 'saec. xv2/4', as that way we would be sure to find it whichever one we searched. In this case I agree, it makes sense, but I wonder if this is always so: if you search for manuscripts from saec. xi2/4, would you really want all manuscripts dated saec. xi in. to appear as well because this earlier period might also extend into the second quarter-century? Perhaps for eleventh-century English material yes, because it is a relatively small corpus, but what about fifteenth-century Italian, where a query like this could result in thousands of extraneous hits? 
Admittedly if you were really going to look at every item then you would not search for all material produced in Italy in saec. xv1/4 to begin with, but the question still remains how far this principle of inclusiveness should be pushed.
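The inclusive approach can be sketched as a simple overlap test: an item is returned whenever its possible range touches the range searched for. What follows is only an illustration in Python. The Dated type, the inclusive_search function, the three sample items and above all the year bounds attached to 'saec. xi in.' and 'saec. xi ex.' are assumptions of mine for the sake of the example, not agreed values, and the quarter-centuries are deliberately given a shared boundary year so that a book dated 1425 falls into both.

```python
from typing import List, NamedTuple


class Dated(NamedTuple):
    """A catalogue item with inclusive year bounds (hypothetical structure)."""
    shelfmark: str
    not_before: int
    not_after: int


def inclusive_search(items: List[Dated], start: int, end: int, margin: int = 0) -> List[Dated]:
    """Return every item whose range overlaps [start - margin, end + margin]."""
    lo, hi = start - margin, end + margin
    return [it for it in items if it.not_before <= hi and it.not_after >= lo]


# Sample data: the bounds for 'in.' and 'ex.' are assumed for illustration only.
catalogue = [
    Dated("MS A (dated 1425)", 1425, 1425),
    Dated("MS B (saec. xi in.)", 1000, 1030),
    Dated("MS C (saec. xi ex.)", 1070, 1100),
]

# With quarters sharing their boundary year, MS A is found under both searches.
xv_first = inclusive_search(catalogue, 1400, 1425)
xv_second = inclusive_search(catalogue, 1425, 1450)

# Searching saec. xi2/4 also returns the 'xi in.' book, because its assumed
# range reaches a few years into the second quarter.
xi_second = inclusive_search(catalogue, 1025, 1050)
print([it.shelfmark for it in xi_second])
```

The margin parameter shows how easily the principle can be pushed further: each extra year of tolerance returns more material, which may be welcome for a small corpus and unmanageable for a large one.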
An alternative scenario would be if you are planning a more statistical approach – more of a 'distant reading' – such as comparing the relative number of manuscripts surviving which were produced in Italy versus France, or produced in the first quarter of the fifteenth century versus the second. In this case you would certainly not want a book dated to 1425 to appear as both 'saec. xv1/4' and 'saec. xv2/4', as this would make counting extremely difficult. Instead, one would almost certainly want the categories to be contiguous but not overlapping: in other words, I think we would want every year to appear in exactly one time-period. Here the clearest system is probably that of quarter-centuries, and one can then define whether (for example) the century starts in 1400 or 1401. The problem is that some datings will inevitably overlap categories: 'saec. x/xi', 'saec. xi med.' and so on. One might be tempted to create a standard that disallows these, but they will inevitably be needed in practice and so must be catered for.
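For counting, then, the requirement is that every year falls into exactly one bin. A minimal sketch of this in Python, under the assumption that quarter-centuries are the bins, might look as follows. The function names quarter_bin and bin_for_range are hypothetical, the sample ranges are invented, and returning None for a dating that straddles two quarters is only one possible way of catering for cases like 'saec. x/xi'; one could equally give them categories of their own.

```python
from collections import Counter
from typing import Optional, Tuple


def quarter_bin(year: int, starts_at_01: bool = False) -> Tuple[int, int]:
    """Map a year to exactly one (century, quarter) pair.

    starts_at_01 selects whether a century runs 1401-1500 or 1400-1499.
    """
    shifted = year - (1 if starts_at_01 else 0)
    century = shifted // 100 + 1
    quarter = (shifted % 100) // 25 + 1
    return century, quarter


def bin_for_range(not_before: int, not_after: int,
                  starts_at_01: bool = False) -> Optional[Tuple[int, int]]:
    """Return the single bin containing the whole range, or None if it straddles two."""
    first = quarter_bin(not_before, starts_at_01)
    last = quarter_bin(not_after, starts_at_01)
    return first if first == last else None


# Invented sample ranges: a dated book, a first-quarter book, and a 'xiv/xv' book.
ranges = [(1425, 1425), (1401, 1420), (1390, 1410)]
counts = Counter(bin_for_range(nb, na) for nb, na in ranges)
print(counts)
```

Note that the book dated 1425 moves from the second quarter to the first if the century is taken to start in 1401, so the convention has to be fixed before any counts are compared.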
A third approach is also tempting but probably not very usable in practice: that is, to tailor the time-periods for the corpus under consideration. For the eleventh-century English vernacular material, for example, my work suggests that the script can loosely be divided into one period starting in about the 990s and extending through to about the 1030s or '40s; a second period running to the late 1060s or early 1070s; and a third period from then until into the twelfth century. This invites dividing the century approximately into thirds, or something like 'saec. x/xi–xi in.', 'saec. xi med.' and 'saec. xi ex.–xi/xii', if 'in.', 'med.' and 'ex.' are understood as one-third-centuries. In practice I am not advocating this at all – the result would be extremely chaotic and would exclude almost any possibility of data exchange – but the point rather is to emphasise that each corpus, each period and script, all have their own periods which are meaningful, and this is in part why agreeing on any one will always be so difficult.
In the next instalment I hope (finally!) to present some possible responses to this question, but in the meantime, why do you search for manuscripts by date? What information would be useful for you? Would you prefer an inclusive or a narrow search? Please let us know.