All posts by Richard Bridgen

Introduction to text mining and machine learning in systematic reviews

By Tom Roper, Clinical Librarian, Royal Sussex County Hospital

A group of librarians from NICE, Public Health England, universities and NHS Library and Knowledge services were privileged to attend a workshop on Text Mining and Machine Learning in Systematic Reviews, led by [James Thomas](http://iris.ucl.ac.uk/iris/browse/profile?upi=JTHOA32), Professor of Social Research and Policy at the EPPI-Centre. James designed [EPPI-Reviewer](https://eppi.ioe.ac.uk/CMS/Default.aspx?alias=eppi.ioe.ac.uk/cms/er4), software to manage all types of literature review, including systematic reviews, meta-analyses, ‘narrative’ reviews and meta-ethnographies, and leads Cochrane’s [Project Transform](https://community.cochrane.org/help/tools-and-software/project-transform).

James outlined the problem: we systematically lose research, and then spend a great deal of effort and money on trying to find it again. We need to use correct methods, and, moreover, need to be seen to be correct. There are quantitative issues as well: Cochrane reviewers screen more than 2 million citations a year.  Can this considerable human effort be made more manageable by the judicious use of text mining and machine learning? While tools are being developed to help this task, their development is uneven, as is their adoption.

James distinguished between three types of machine learning: rules-based (unfashionable in computer science circles, he warned), unsupervised, and supervised. He gave us opportunities to try out tools based on these approaches using our own devices.

Rules-based approaches are accurate, but fragile – they either work, or fail completely. Unsupervised approaches work by leaving a machine to identify patterns in the data, for example by clustering documents. One such tool is [LDAVis](http://eppi.ioe.ac.uk/ldavis/index.html#topic=6&lambda=0.63&term=), based, you don’t need me to tell you, on Latent Dirichlet Allocation.
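To make the "accurate, but fragile" point concrete, here is a minimal sketch of a rules-based check. The rule, the function name `looks_like_rct` and the example titles are all invented for illustration; they are not taken from the workshop or from any of the tools mentioned.

```python
import re

# Hypothetical hand-written rule: flag a record as an RCT only if its title
# contains this exact phrase (British spelling, no intervening words).
RCT_RULE = re.compile(r"\brandomised controlled trial\b", re.IGNORECASE)

def looks_like_rct(title: str) -> bool:
    """Return True if the title matches the hand-written rule."""
    return RCT_RULE.search(title) is not None

# Made-up titles: only the first is worded exactly as the rule expects.
titles = [
    "Aspirin for headache: a randomised controlled trial",
    "Aspirin for headache: a randomized controlled trial",  # US spelling
    "Aspirin for headache: a randomised, placebo-controlled study",
]
results = [looks_like_rct(t) for t in titles]
```

The rule is right whenever it fires, but a change of spelling or word order is enough for it to miss a record entirely.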

Supervised approaches require a human or humans to give the machine training data; after a while, from a 280,000-row spreadsheet in an example James quoted, a statistical model can be constructed which can then be used with new material to determine whether or not a study is a randomised controlled trial (RCT). Training data comes from people, including data generated for other purposes, data created for the project itself and crowd-sourced data, as in the case of [Cochrane Crowd](http://crowd.cochrane.org/index.html), which mobilises Cochrane Citizen Scientists to decide whether or not the subject of a database record is an RCT.
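As a rough illustration of the supervised idea, the sketch below trains a very small naive Bayes text classifier on hand-labelled records and applies it to a new one. This is an assumption-laden toy: the six training sentences and the labels `rct` and `not_rct` are made up, and naive Bayes is simply a convenient stand-in, not necessarily the model used by Cochrane or EPPI-Reviewer.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """Build word counts per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training records
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size  # add-one (Laplace) smoothing
        score = math.log(n / total)
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented, human-labelled training data (a real set would be far larger).
training = [
    ("patients were randomly assigned to drug or placebo", "rct"),
    ("a double blind randomised placebo controlled trial", "rct"),
    ("participants were randomised to two treatment arms", "rct"),
    ("a retrospective cohort study of hospital records", "not_rct"),
    ("qualitative interviews with nursing staff", "not_rct"),
    ("a narrative review of the literature", "not_rct"),
]
model = train(training)
prediction = predict(model, "children were randomised to vaccine or placebo")
```

With only six examples the model is of no practical use; the point is the workflow, in which people supply labels, a statistical model is fitted, and the model is then applied to unseen records.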

In systematic reviews, these approaches may be used to identify studies by citation screening or classification, to map research activity, and to automate data extraction, including performing Risk of Bias assessment and extraction of statistical data. Readers may be familiar with tools that take a known set of citations, and use word frequency counts, or analysis of phrases and adjacent terms, to create word or phrase lists or visualisations. Similarly, term extraction and automatic clustering can be used to do statistical and linguistic analysis on text; a human then reviews the results and, if they are deemed useful, modifies the initial search strategy. [Voyant Tools](https://voyant-tools.org/) is one example, as are [Bibexcel](https://homepage.univie.ac.at/juan.gorraiz/bibexcel/), [Termine](http://www.nactem.ac.uk/software/termine/) and even the use of EndNote’s subject bibliography feature to generate lists of keywords.
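The word frequency and adjacent-term analysis described above can be sketched in a few lines. The titles, the stopword list and the function names here are assumptions chosen for illustration; tools such as Voyant Tools or Termine do considerably more than this.

```python
import re
from collections import Counter

# Small illustrative stopword list; real tools use much longer ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def words_of(text):
    return re.findall(r"[a-z]+", text.lower())

def term_and_phrase_counts(titles):
    """Count single terms and adjacent two-word phrases across titles."""
    terms, phrases = Counter(), Counter()
    for title in titles:
        toks = words_of(title)
        terms.update(t for t in toks if t not in STOPWORDS)
        for pair in zip(toks, toks[1:]):
            # Keep only adjacent pairs in which neither word is a stopword.
            if not (set(pair) & STOPWORDS):
                phrases[" ".join(pair)] += 1
    return terms, phrases

# Made-up titles standing in for a known set of relevant citations.
known_relevant = [
    "Text mining for study identification in systematic reviews",
    "Machine learning to reduce screening workload in systematic reviews",
    "Text mining and machine learning for citation screening",
]
terms, phrases = term_and_phrase_counts(known_relevant)
repeated_phrases = sorted(p for p, n in phrases.items() if n > 1)
```

Phrases that recur across the known set, such as those collected in `repeated_phrases`, are candidates for a searcher to consider adding to a strategy.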

Citation networks can be used for supplementary searching – will this change, James asked, if or when all bibliographic data becomes open? Useful tools here, apart from traditional ones such as Web of Science, include [VOSviewer](http://www.vosviewer.com/). We also spent some time playing with [EPPI-Reviewer](https://eppi.ioe.ac.uk/eppireviewer-web/home), the EPPI-Centre’s own tool for systematic reviewers, and with [Carrot2 Search](http://search.carrot2.org/stable/search).

Looking to the future, James suggested that there is a great deal of interest in a “surveillance” approach to finding evidence, which can automatically identify if a review or some guidance needs updating. Cochrane are developing the [Cochrane Evidence Pipeline](https://community.cochrane.org/help/tools-and-software/evidence-pipeline), in which citations found by machine or crowd-sourced methods can either be triaged by the relevant Cochrane Review Group or assessed using machine learning.

While the workshop focussed on systematic reviews, for a jobbing librarian like me in a clinical setting, searches to support systematic reviews will make up only a small part of the workload. Nevertheless, searches still need to be conducted soundly and rigorously. Can artificial intelligence and machine learning help? Certainly some of the tools James showed are useful when formulating search strategies. A group within London and Kent, Surrey and Sussex NHS Libraries is developing a search protocol for the region. We may well find ourselves referencing some of these tools. It is always stimulating to hear a world leader in a field talk, and I’m sure all the workshop participants would join me in thanking both Professor Thomas, for giving up his time, and Health Education England, for organising the workshop.

The tools James described, and more, may be found on the [EPPI-Centre website](http://eppi.ioe.ac.uk/cms/Default.aspx?tabid=3677). See also the National Centre for Text Mining’s page of [software tools](http://www.nactem.ac.uk/software.php).

For a systematic review on the subject see:

O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015 Jan 14;4:5. doi: 10.1186/2046-4053-4-5.

For a more recent overview, I would recommend Julie Glanville’s chapter on Text Mining for Information Specialists in Paul Levay and Jenny Craven’s new book on systematic searching:

Glanville J. Text mining for information specialists. In: Craven J, Levay P, editors. Systematic searching: practical ideas for improving results. London: Facet Publishing; 2018. p. 147-169.

Evidence Standards for Digital Health published

NHS England has been working with NICE, MedCity, Public Health England, NHS Digital and DigitalHealth.London on a project aimed at helping digital health innovators, commissioners, investors and grant funders to understand what a ‘good’ level of evidence for digital health technologies looks like.

NICE has published the ‘Evidence Standards Framework for Digital Health Technologies’, which details evidence of effectiveness for intended use and evidence of economic impact – and which will be key to supporting the speed and uptake of digital health tools.

You can view the new standards at www.nice.org.uk/digital-evidence-standards. There’s also:

  • an article published today in the Lancet
  • a short YouTube film which explains the new standards
  • a blog from Indra Joshi

If you have any questions at all or would like any further information, please let me know.

Nicola Fulton | Communications and Engagement Manager
nicola.fulton1@nhs.net
Empower the Person, Digital Transformation Portfolio, NHS England

Avoiding the toaster at the CILIP Employers Forum: artificial intelligence and libraries

On 20th November 2018 I attended the CILIP Employers Forum. One of the talks was by Terry Corby on “Avoiding the Toaster! Meeting the challenge of disruptive innovation”. The toaster in the title alludes to the idea that if we fail to deal with disruptive innovation, we will become “toast”.

Terry argued that automation is already here:

  • “60% of occupations could have 30% or more of their activities automated with current technology”
  • 20% of a CEO’s activities could be automated now
  • The cost benefits are between three and ten times the investment. Only human factors prevent it happening.
  • AI solutions tend to work best when they have a human element as well.

Examples he gave of good disruption were:

Many companies foresaw future disruption but failed to capitalise:

  • Kodak invented digital photography
  • Xerox invented the Graphical User Interface and the computer mouse.

Among Terry’s suggestions for how to operate in this environment were:

  • Seek out stakeholders who will insist on innovation.
  • Find out what your customer really wants and values.
  • Work on many innovations, expecting that most will fail, but some may greatly succeed.
  • Create a culture that encourages innovation and learning.
  • Completely master new skills if you can, or recognise when you can’t.
  • Be an outsider in new areas, not just an insider in your own.

Established companies are often at a disadvantage because they don’t recognise the threat and fear cannibalising their business.

The challenge Terry laid down to librarians was this: we had allowed search engines to roll over us, so would we do the same for artificial intelligence? He doesn’t know our field and so had no answers, but he did call on us to think these issues through for ourselves, so that we avoid someone “eating our breakfast”.

Now over to you: what do you think? Leave a comment below.

Stephen Ayre