Accuracy of AI system outputs and performance measures

Continuing our AI auditing framework blog series, Reuben Binns, our Research Fellow in Artificial Intelligence (AI), and Valeria Gallo, Technology Policy Adviser, explore how the data protection principle of accuracy applies to
AI systems, and propose some steps organisations should take to ensure compliance.

This blog forms part of our ongoing consultation on developing the ICO framework for auditing AI. We are keen to hear your views in the comments below or via email.


Accuracy is one of the key principles of data protection. It requires organisations to take all reasonable steps to make sure the personal data they process is not “incorrect or misleading as to any matter of fact” and, where necessary, is corrected or deleted without undue delay.

Accuracy is especially important when organisations use AI to process personal data and profile individuals. If AI systems use or generate inaccurate personal data, this may lead to the incorrect or unjust treatment of a data subject.

Discussions about accuracy in an AI system often focus on the accuracy of the input data, ie the personal data about a specific individual that an AI system uses to make decisions or predictions about them.

However, it is important to understand that accuracy requirements also apply to AI outputs, both in terms of the accuracy of decisions or predictions about a specific person and across a wider population.

In this blog, we take a closer look at what accuracy of AI outputs means in practice and why selecting the appropriate accuracy performance measures is critical to ensuring compliance and protecting data subjects.

Accuracy of AI outputs

If the output of an AI system is personal data, any inaccuracies as to “any matter of fact” can be challenged by data subjects. For example, if a marketing AI application predicted a particular individual was a parent when they in fact have no children, its output would be inaccurate as to a matter of fact. The individual concerned would have the right to ask the controller to rectify the AI output, under article 16 of the General Data Protection Regulation (GDPR).

Often, however, AI outputs may generate personal data where there is currently no matter of fact. For example, an AI system could predict that someone is likely to become a parent in the next three years. This kind of prediction cannot be accurate or inaccurate in relation to ‘any matter of fact’. However, the AI system may be more or less accurate as a matter of statistics, measured in terms of how many of its predictions turn out to be correct for the population they are applied to over time. The European Data Protection Board’s (EDPB) guidance says that in these cases individuals still have the right to challenge the accuracy of predictions made about them, on the basis of the input data and/or the model(s) used. The GDPR also provides a right for the data subject to complement such personal data with additional information.

In addition, accuracy requirements are more stringent in the case of solely automated AI systems, if the AI outputs have a legal or similar effect on data subjects (article 22 of the GDPR). In such cases, recital 71 of the GDPR states that organisations should put in place “appropriate mathematical and statistical procedures” for the profiling of data subjects as part of their technical measures. They should ensure any factor that may result in inaccuracies in personal data is corrected and the risk of errors is minimised.

While it is not the role of the ICO to determine the way AI systems should be built, it is our role to understand how accurate they are and the impact on data subjects. Organisations should therefore understand and adopt appropriate accuracy measures when building and deploying AI systems, as these measures will have important data protection implications.
Accuracy as a performance measure: the impact on data protection compliance

Statistical accuracy is about how closely an AI system’s predictions match the truth. For example, if an AI system is used to classify emails as spam, a simple measure of accuracy would be the number of emails that were correctly classified (whether as spam or as genuine) as a proportion of all the emails that were analysed.

However, such a measure could be misleading. For instance, if 90% of all emails are spam, then you could create a 90% accurate classifier by simply labelling everything as spam. For this reason, alternative measures are usually used to assess how good a system is, which reflect the balance between two different kinds of errors:
  • A false positive or ‘type I’ error: these are cases that the AI system incorrectly labels as positive (eg emails classified as spam, when they are genuine).
  • A false negative or ‘type II’ error: these are cases that the AI system incorrectly labels as negative when they are actually positive (eg emails classified as genuine, when they are actually spam).
The balance between these two types of errors can be captured through various measures, including:

Precision: the percentage of cases identified as positive that are in fact positive (also called ‘positive predictive value’). For instance, if 9 out of 10 emails that are classified as spam are actually spam, the precision of the AI system is 90%.

Recall (or sensitivity): the percentage of all cases that are in fact positive that are identified as such. For instance, if 10 out of 100 emails are actually spam, but the AI system only identifies seven of them, then its recall is 70%.
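To make these measures concrete, here is a minimal Python sketch, using made-up counts for a hypothetical spam classifier, that computes accuracy, precision and recall from the four outcome types. It also shows why the “label everything as spam” baseline described above can look 90% accurate.

```python
# Illustrative only: toy confusion-matrix counts for a hypothetical spam classifier.
# tp = spam correctly flagged, fp = genuine email flagged as spam,
# fn = spam that slipped through, tn = genuine email correctly passed.

def accuracy(tp, fp, fn, tn):
    """Proportion of all emails classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Of the emails flagged as spam, how many really were spam."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the emails that really were spam, how many were flagged."""
    return tp / (tp + fn)

# A classifier that labels everything as spam, on a set where 90 of 100 emails are spam:
tp, fp, fn, tn = 90, 10, 0, 0
print(accuracy(tp, fp, fn, tn))   # 0.9 -> looks "90% accurate"
print(precision(tp, fp))          # 0.9
print(recall(tp, fn))             # 1.0 -> but it also flags every genuine email

# The worked examples from the text: precision of 90% (9 of 10 flagged emails are spam)
# and recall of 70% (7 of the 10 actual spam emails are caught).
print(precision(tp=9, fp=1))      # 0.9
print(recall(tp=7, fn=3))         # 0.7
```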



There are trade-offs between precision and recall. If you place more importance on finding as many of the positive cases as possible (maximising recall), this may come at the cost of some false positives (lowering precision).
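One way to see this trade-off is to vary the score threshold at which a classifier labels an email as spam. The sketch below uses entirely made-up scores and labels for illustration: as the threshold is lowered, recall rises (more spam is caught) while precision falls (more genuine email is flagged).

```python
# Illustrative only: made-up spam scores (0-1) and true labels (1 = spam, 0 = genuine).
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

def precision_recall_at(threshold):
    """Label everything at or above the threshold as spam, then score the result."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.9, 0.7, 0.5, 0.3):
    prec, rec = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={rec:.2f}")
```

On this toy data, precision falls from 1.00 to about 0.63 as the threshold drops from 0.9 to 0.3, while recall rises from 0.40 to 1.00.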

In addition, there may be important differences between the consequences of false positives and false negatives for data subjects. For example, if a CV filtering system selecting qualified candidates for interview produces a false positive, an unqualified candidate will be invited to interview, wasting both the employer’s and the applicant’s time. If it produces a false negative, a qualified candidate will miss an employment opportunity and the organisation will miss a good candidate. Organisations may therefore wish to prioritise avoiding certain kinds of error based on the severity and nature of the risks.
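Where one kind of error is judged more harmful than the other, a weighted summary measure can make that priority explicit. One widely used option, not specific to this blog, is the F-beta score, which counts recall as beta times more important than precision. The figures below are purely illustrative.

```python
# Illustrative only: F-beta weights recall beta times as heavily as precision.
# beta > 1 emphasises avoiding false negatives (eg missed qualified candidates);
# beta < 1 emphasises avoiding false positives (eg unqualified candidates invited).

def f_beta(precision, recall, beta):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 0.80, 0.60   # made-up figures for a hypothetical CV-filtering system

print(f_beta(precision, recall, beta=1.0))  # balanced F1 score   ~0.69
print(f_beta(precision, recall, beta=2.0))  # recall-weighted     ~0.63
print(f_beta(precision, recall, beta=0.5))  # precision-weighted  ~0.75
```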

In general, accuracy as a measure depends on it being possible to compare the performance of a system’s outputs to some “ground truth”, ie checking the results of the AI system against the real world. For instance, a medical diagnostic tool designed to detect malignant tumours could be evaluated against high quality test data containing known patient outcomes. In other areas, a ground truth may be unattainable. This could be because no high quality test data exists, or because what you are trying to predict or classify is subjective (eg offence) or socially constructed (eg gender).

Similarly, in many cases AI outputs will be more like an opinion than a matter of fact, so accuracy may not be the right way to assess the acceptability of an AI decision. In addition, since accuracy is only relative to test data, if the latter isn’t representative of the population you will be using your system on, then not only may the outputs be inaccurate, but they may also lead to bias and discrimination. These will be the subject of future blogs, where we will explore how, in such cases, organisations may need to consider other principles like fairness and the impact on fundamental rights, instead of (or as well as) accuracy.
Finally, accuracy is not a static measure, and while it is usually measured on static test data, in real life situations systems will be applied to new and changing populations. Just because a system is accurate with respect to an existing population (eg customers in the last year), that does not mean it will continue to perform well if the characteristics of the future population change. People’s behaviours may change, either of their own accord or because they are adapting in response to the system, and therefore the AI system may become less accurate over time. This phenomenon is referred to in machine learning as ‘concept drift’, and various methods exist for detecting it. For instance, you can measure the estimated distance between classification errors over time.

Further explanation of concept drift can be found on the Cornell University website.
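The idea of monitoring the distance between classification errors underpins published drift detectors such as the Early Drift Detection Method (EDDM). The sketch below is a heavily simplified, illustrative version of that idea rather than the published algorithm; the window size, threshold and simulated data stream are all assumptions.

```python
# Minimal illustrative sketch of the "distance between errors" idea for spotting
# concept drift: if errors start arriving much closer together than they used to,
# the model may no longer fit the population it is being applied to.
# The window size and drop_ratio below are arbitrary choices for illustration.

from collections import deque

class ErrorDistanceMonitor:
    def __init__(self, window=50, drop_ratio=0.5):
        self.distances = deque(maxlen=window)  # recent gaps (in samples) between errors
        self.since_last_error = 0
        self.best_mean_gap = 0.0
        self.drop_ratio = drop_ratio

    def update(self, prediction_correct: bool) -> bool:
        """Feed one labelled outcome; returns True if drift is suspected."""
        self.since_last_error += 1
        if prediction_correct:
            return False
        # An error occurred: record how many samples have passed since the previous error.
        self.distances.append(self.since_last_error)
        self.since_last_error = 0
        mean_gap = sum(self.distances) / len(self.distances)
        self.best_mean_gap = max(self.best_mean_gap, mean_gap)
        # Errors arriving much closer together than at the model's best suggests drift.
        return mean_gap < self.drop_ratio * self.best_mean_gap

# Example: feed the monitor a stream of outcomes (True = prediction was correct).
monitor = ErrorDistanceMonitor()
stream = ([True] * 9 + [False]) * 20      # ~10% error rate at first
stream += [True, False] * 50              # errors then become much more frequent
for i, correct in enumerate(stream):
    if monitor.update(correct):
        print(f"possible concept drift around sample {i}")
        break
```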

What should organisations do?

Organisations should always think carefully from the start about whether it is appropriate to automate any prediction or decision-making process. This should include assessing whether acceptable levels of accuracy can be achieved.
If an AI system is intended to complement or replace human decision-making, then any assessment should compare human and algorithmic accuracy to understand the relative advantages, if any, that various AI systems might bring. Any potential accuracy risk should be considered and addressed as part of any Data Protection Impact Assessment.
While accuracy is just one of multiple considerations when determining whether and how to adopt AI, it should be a key element of the decision-making process. This is particularly true if the subject matter is, for example, subjective or socially contestable. Organisations also need to consider whether high quality test data can be obtained on an ongoing basis to establish a “ground truth”. Senior leaders should be aware that, left to their own devices, data scientists may not distinguish between data labels that are objective and those that are subjective, but this may be an important distinction in relation to the data protection accuracy principle.
If organisations decide to adopt an AI system, then they should:
  • ensure that all functions and individuals responsible for its development, testing, validation, deployment, and monitoring are adequately trained to understand the associated accuracy requirements and measures; and
  • adopt an official common terminology that staff can use to discuss accuracy performance measures, including their limitations and any adverse impact on data subjects.
Accuracy and the appropriate measures to evaluate it should be considered from the design phase, and should also be tested throughout the AI lifecycle. After deployment, monitoring should take place at a frequency proportional to the impact an incorrect output may have on data subjects: the higher the impact, the more frequent the monitoring. Accuracy measures should also be regularly reviewed to mitigate the risk of concept drift, and change policies and procedures should take this into account from the outset.
Accuracy is also an important consideration if organisations outsource the development of an AI system to a third party (either fully or partially) or purchase an AI solution from an external vendor. In these cases, any accuracy claim made by third parties needs to be examined and tested as part of the procurement process. Similarly, it may be necessary to agree regular updates and reviews of accuracy to guard against changing population data and concept drift.
Finally, the vast quantity of personal data organisations will need to hold and process as part of their AI systems is likely to put pressure on any pre-AI processes to identify and, if necessary, rectify or delete inaccurate personal data, whether it is used as input data or as training/test data. Organisations will therefore need to review their data governance practices and systems to ensure they remain fit for purpose.

Your feedback

We are keen to hear your thoughts on this topic and welcome any feedback on our current thinking. In particular, we would appreciate your views on the following two questions:

1) Are there any additional compliance challenges in relation to accuracy of AI system outputs and performance measures we have not considered?

2) What other technical and organisational controls or best practice do you think organisations should adopt to comply with accuracy requirements?




Please share your views by leaving a comment below or by emailing us at AIAuditingFramework@ico.org.uk.