2021 marks the second year of The Innovation Sandbox and will see us embark on several new projects and continue some existing ones, with new students, new clients and new partnerships.
The Innovation Sandbox is a collaborative initiative at the confluence of industry and academia. It enables both academic research on real-world datasets and the transfer of cutting-edge machine learning research into industry applications.
The first iteration this year will focus on two new projects: 1) Machine Learning and Natural Language Processing (NLP) for E-Commerce product description personalisation and 2) Predicting urinary tract infection (UTI) risks in care home residents for early detection. These will run alongside our continued efforts in sailing automation in partnership with Jack Trigger Racing.
Machine Learning and NLP for E-Commerce Personalisation
The first project, Machine Learning and NLP for E-Commerce Personalisation, will focus on developing a state-of-the-art machine learning architecture that can generate personalised product descriptions through Natural Language Processing (NLP).
Natural Language Processing, or NLP, is a subfield of Artificial Intelligence concerned with equipping machines with the ability to read and understand language. NLP is a rapidly advancing field and, when built into other applications, it creates powerful and intelligent tools.
For example, by applying machine learning to NLP, now most commonly through deep learning models, we can learn from previous interactions and automatically deliver more accurate responses and descriptions that improve the user experience.
In this instance, the generated product description will be tailored to the needs of the customer viewing the product. By leveraging existing content together with customer attributes, the model will be able to select and emphasise the information that matters to that customer. The NLP-generated description will then contain the most relevant information for a particular customer type or segment.
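One common way to condition a text generator on the viewer is to prepend control tokens describing the customer segment to the model input. The sketch below assumes that approach purely for illustration: the `build_model_input` helper, the token format and the example attributes are hypothetical and not taken from the project.

```python
def build_model_input(product_title, product_attributes, customer_segment):
    """Serialise a customer segment and product facts into one input string.

    The "<segment=...>" and "<attr ...>" token format is a hypothetical
    convention chosen for this sketch, not the project's actual scheme.
    """
    parts = [f"<segment={customer_segment}>"]
    # Sort attributes so the same product always yields the same string.
    for name, value in sorted(product_attributes.items()):
        parts.append(f"<attr {name}={value}>")
    parts.append(product_title)
    return " ".join(parts)


# Illustrative product and segment (made-up values).
example_input = build_model_input(
    "Trail running shoe",
    {"weight": "240g", "drop": "6mm"},
    "marathon_runner",
)
print(example_input)
```

A sequence-to-sequence model fine-tuned on inputs of this shape could then learn to emphasise different attributes for different segments.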
This project is set to revolutionise the retail and e-commerce industries. It will allow retail platforms, amongst others, to provide the most appropriate product description possible to the customer and enhance their brand experience through hyper-personalisation.
Goals of the project
This project has several objectives. The first is to replicate research by Alibaba’s research team in the English language – you can read the existing research here: Towards Knowledge-Based Personalized Product Description Generation in E-commerce.
Secondly, there will be an opportunity to build on the research we’ve conducted in Natural Language Processing (Text Summarization from The Innovation Sandbox) and apply recent advances in the field.
Furthermore, the project aims to extend previous research by trying to solve the text generation problem when customer attributes are sparse or absent. This addresses the so-called cold start problem in personalisation and will help further our research and developments in Natural Language Processing.
Finally, by developing with limited compute resources, the project aims to achieve high-quality model training results while making machine learning development and research more environmentally sustainable through reduced energy consumption and a smaller carbon footprint.
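A simple baseline for the cold start problem is to back off from customer-level attributes to segment-level defaults, and finally to a global default, as information gets sparser. The sketch below is one possible fallback rule, not the approach the project has settled on; the function name, the `min_known` threshold and the attribute names are all assumptions.

```python
def resolve_conditioning(customer_attributes, segment_defaults, global_default, min_known=2):
    """Choose which attributes to condition generation on.

    Returns a (level, attributes) pair where level is "customer",
    "segment" or "global". `min_known` is an arbitrary illustrative
    threshold for "enough" customer information.
    """
    # Drop attributes whose value is missing.
    known = {k: v for k, v in (customer_attributes or {}).items() if v is not None}
    if len(known) >= min_known:
        return "customer", known
    segment = known.get("segment")
    if segment in segment_defaults:
        # Keep what is known and fill the gaps from the segment profile.
        return "segment", {**segment_defaults[segment], **known}
    return "global", dict(global_default)


# Hypothetical profiles for illustration only.
SEGMENT_DEFAULTS = {"runner": {"segment": "runner", "focus": "performance"}}
GLOBAL_DEFAULT = {"focus": "general"}

print(resolve_conditioning({"segment": "runner", "budget": None}, SEGMENT_DEFAULTS, GLOBAL_DEFAULT))
print(resolve_conditioning(None, SEGMENT_DEFAULTS, GLOBAL_DEFAULT))
```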
Project & NLP challenges
The model will be trained using a large open-source data set from Amazon consisting of approximately 18 million products and around 250 million product descriptions. This will pose certain challenges in terms of processing the data and training the models.
Specifically, it will require fine-tuning pre-trained models and carrying out an extensive hyperparameter search, which can be prohibitive in terms of compute resources.
This makes for interesting research opportunities, as we can build on previous text summarisation research on LED encoder-decoder architectures where we achieved best-in-class summarisation performance with a fraction of the compute used by previous initiatives. Read more here
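One way to keep a hyperparameter search within a fixed compute budget is random search with a capped number of trials. The sketch below assumes that strategy for illustration; `toy_objective` is a stand-in for "fine-tune the model and return a validation score", and the search space values are made up.

```python
import random


def random_search(objective, search_space, n_trials, seed=0):
    """Evaluate `n_trials` random configurations and return the best one.

    `n_trials` is the compute budget: each trial stands for one
    (expensive) fine-tuning run.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical stand-in for a validation metric (higher is better).
    lr_penalty = abs(params["learning_rate"] - 3e-5) * 1e5
    batch_penalty = abs(params["batch_size"] - 16) / 16
    return -(lr_penalty + batch_penalty)


# Illustrative search space; the values are assumptions.
SEARCH_SPACE = {"learning_rate": [1e-5, 3e-5, 1e-4], "batch_size": [8, 16, 32]}
best_params, best_score = random_search(toy_objective, SEARCH_SPACE, n_trials=5)
print(best_params, best_score)
```

Capping the number of trials trades search coverage for a predictable cost, which is the relevant constraint when each trial is a fine-tuning run.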
The project also faces additional technical challenges. Our team will have to deal with imbalanced distributions of reviews within different product segments and, since the project will rely on open-source data, there will be no data containing explicit information about customer attributes or buying histories. The project team will therefore have to think creatively about how to group customers and characterise them using only information from their reviews.
What’s the impact of NLP applications in industry?
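A standard first response to imbalanced review counts is to weight each example inversely to the frequency of its segment, so that rare segments are not drowned out during training. The sketch below shows that heuristic (the same formula as scikit-learn's "balanced" class weights); the segment names and counts are made up for illustration.

```python
from collections import Counter


def inverse_frequency_weights(segment_labels):
    """Return a weight per segment: n_total / (n_segments * segment_count)."""
    counts = Counter(segment_labels)
    n_total = len(segment_labels)
    n_segments = len(counts)
    return {segment: n_total / (n_segments * count) for segment, count in counts.items()}


# Illustrative, made-up distribution of reviews across product segments.
labels = ["electronics"] * 6 + ["books"] * 3 + ["garden"] * 1
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, each segment contributes the same total weight to the training loss regardless of how many reviews it has.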
The applicability of the technology developed during this project is very broad and can benefit a wide range of applications and industries.
For instance, in the healthcare industry, we could apply this type of architecture to support healthcare professionals. Information generated from long and complex medical notes and patient histories can be tailored to the needs of the user, so that doctors, nurses or surgeons each see different results according to their requirements.
Another application could be in legal services, where it could generate summaries of case information tailored to the different legal professionals or parties involved in a case.
Similarly, in the recruitment industry, CVs or job adverts can be tailored and optimised for different roles, industries, companies or platforms.
What is certain is that the field of NLP is rapidly advancing. New research and developments continue to push the boundaries for projects and applications, and we’re excited to be leading the charge.
Predicting UTI risks with time-to-event prediction modelling
The second project – Predicting UTI risks in Care Homes for Early Detection – will focus on developing a machine learning architecture that can predict and provide early detection of Urinary Tract Infections (UTIs) in care homes. The recommendations generated by the model will assist doctors and carers in preventing UTIs amongst the residents and improving quality of care.
UTIs are generally difficult to detect. Comorbidities, meaning any co-existing physical or psychological conditions outside the realm of normal well-being, can overshadow the symptoms of other conditions such as UTIs.
By leveraging existing clinical and behavioural data collected through Person Centred Software (PCS), the model will be able to combine data points such as blood pressure or glucose levels with actions and events recorded for each resident to predict, ahead of time, the risk of that resident contracting a UTI.
Goals of this project
The main goal of this project is for the model to be able to continuously test, predict and generate insights ahead of time on the risk of a resident having contracted a UTI.
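As a point of reference for time-to-event modelling, the simplest baseline is the Kaplan-Meier estimator, which estimates the probability of remaining UTI-free over time while accounting for residents whose follow-up ended without an infection (censoring). It ignores clinical and behavioural covariates, so it is only a sketch of the problem framing and not the model the project will build; the data below is made up.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each time with an event.

    durations: follow-up time per resident (e.g. in days).
    event_observed: True if a UTI was recorded at that time,
                    False if follow-up ended without one (censored).
    """
    records = sorted(zip(durations, event_observed))
    n_at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0
        removed = 0
        # Group all residents whose follow-up ends at the same time t.
        while i < len(records) and records[i][0] == t:
            events += int(records[i][1])
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Illustrative, made-up follow-up data for five residents.
curve = kaplan_meier([5, 8, 8, 12, 20], [True, True, False, True, False])
for day, prob in curve:
    print(f"day {day}: estimated UTI-free probability {prob:.2f}")
```

The estimated risk of a UTI by a given day is one minus the UTI-free probability at that day.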
Moreover, the aim is to offer early detection methods for changes in patient behaviour. Rather than looking solely at clinical data to detect infections in residents, the project aims to factor in behavioural and event data as well, giving carers a more accurate view of each resident’s clinical status.
Ultimately, as comorbidities and verbal impairments in residents can interfere with the early detection of infections, the project aims to support the carer by providing an accurate picture of the wellbeing of the resident.
Project challenges to improving patient wellbeing
The model will be trained using the data collected by PCS that collectively covers 200 care homes in the U.K. Each data point will be manually uploaded by the doctor or the carer in the application and will represent the clinical and behavioural status of the resident.
Cleaning the data and then processing it represent key challenges in the project. Since data is entered manually through a user interface by each carer, variation in data input, measurement error and misrecorded events are all possible sources of error. These could reduce the reliability of the data and make the predictions inaccurate.
Data processing, however, will represent the biggest volume of work. Importantly, discrete event data needs to be processed into temporal abstractions so that it can be used in conjunction with more traditional time-series data and with time-series techniques that require uniformly sampled data. Depending on the algorithm type, it may be necessary to process the data further by vectorising or ‘flattening’ it over some lookback window, which removes the time dimension and makes the data consumable by the model.
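The two processing steps described above can be sketched as follows: discrete events are first binned into uniformly sampled daily counts, and a lookback window of those daily rows is then flattened into a single feature vector. The daily granularity, the function names and the event types are assumptions made for illustration.

```python
from collections import Counter


def events_to_daily_counts(events, event_types, n_days):
    """Turn (day_index, event_type) pairs into one row of counts per day.

    Every day gets a row, even with no events, so the result is
    uniformly sampled. Columns follow the order of `event_types`.
    """
    counts = Counter((day, etype) for day, etype in events if 0 <= day < n_days)
    return [[counts[(day, etype)] for etype in event_types] for day in range(n_days)]


def flatten_lookback(daily_rows, end_day, window):
    """Concatenate the `window` daily rows ending at `end_day` (inclusive)."""
    start = end_day - window + 1
    if start < 0:
        raise ValueError("not enough history for the requested window")
    return [value for row in daily_rows[start:end_day + 1] for value in row]


# Hypothetical event types and made-up events for one resident.
EVENT_TYPES = ["fall", "fluid_intake"]
events = [(0, "fall"), (1, "fluid_intake"), (1, "fluid_intake"), (3, "fall")]

daily = events_to_daily_counts(events, EVENT_TYPES, n_days=4)
features = flatten_lookback(daily, end_day=3, window=3)
print(daily)
print(features)
```

The flattened vector has `window * len(event_types)` entries and can be fed to models that do not handle a time dimension natively.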
Another challenge of this project is dealing with bias, particularly related to gender. This variable may affect what the model outputs, as female patients have a higher probability of developing a UTI, largely for anatomical reasons. It will be important to understand how any model might learn such biases and seek to understand their clinical impact.
Finally, processing data from residents while maintaining anonymity and patient trust will represent an additional challenge in terms of privacy. To overcome this, we plan to use a privacy-enhancing technology (PET) called differential privacy. By adding carefully calibrated noise to the data, it becomes much harder to identify an individual resident from the anonymised data, while general patterns across residents remain usable, which preserves the privacy of the patient.
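A standard building block of differential privacy is the Laplace mechanism, which adds noise scaled to `sensitivity / epsilon` to the result of a query. The sketch below applies it to a simple count (for example, how many residents had a recorded UTI) as an illustration of the idea; it is not the project's actual privacy design, and the function name and parameter values are assumptions.

```python
import random


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one resident changes a count by at most 1,
    hence the default sensitivity of 1. Smaller epsilon means more
    noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed only to make the demo repeatable
noisy = private_count(42, epsilon=1.0, rng=rng)
print(f"noisy count: {noisy:.2f}")
```

Individual noisy answers differ from the true count, but averages over many residents or queries stay close to it, which is what keeps aggregate patterns usable.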
The benefits across healthcare
The applicability of the architecture and the outcome developed during this project is similarly very broad.
Firstly, the model has the potential to re-use some of the data to retrain itself to detect other problems and other infections. As the model will be based on clinical, behavioural and event data, it could be replicated not only to predict UTIs in care home residents but in virtually any healthcare environment where data is collected.
Due to its heavy reliance on data, it carries the potential to help expand the understanding of Machine Learning applications for clinical and health management purposes.
Furthermore, if successful and validated, the model could be used to help monitor up to 50,000 care home residents, approximately 10% of whom (around 5,000 residents) are likely to get a UTI at some point in a 3-month period (based on our data).
Ultimately, the potential clinical benefits of such a system are huge. For instance, the model could help account for comorbidities in residents, which are known to interfere with the detection and diagnosis of conditions such as infections. By doing so, it could increase the accuracy of the insights and predictions provided to carers.
We’ll be providing updates on each of the projects and sharing new research in the respective fields exclusively with our subscribers. Subscribe to our newsletter to keep up to date with the projects and get a first look at our latest blog posts and podcasts.