Simple Natural Language Processing Projects for Health Sciences: How to Extract and Summarize Medication Package Inserts Using Python

Wuraola Oyewusi
3 min read · Jul 13, 2019


This tutorial covers basic text extraction from PDF (using the Tika Python library) and text summarization using algorithms such as TextRank (Gensim Python library implementation), LexRank (Sumy Python library implementation), and Luhn (Sumy Python library implementation).

Colaboratory notebook here

Anyone can learn from it, but the target audience is people with a background in the health sciences who are learning about data science, machine learning, programming, or natural language processing and would prefer to learn with a familiar data type.

A package insert is a document included in the package of a medication that provides information about that drug and its use. For prescription medications, the insert is technical and provides information for medical professionals about how to prescribe the drug (Wikipedia, 2019).

Making sense of medication package inserts was a major part of my training as a pharmacist. So the aim of this tutorial is to see how text summarization algorithms perform on this kind of use case and to share my thoughts based on the output.

Download or upload a sample document.

The Tika Python library did an amazing text extraction job. It was fast, tidy, and easy to use.

There were about 673 sentences in the insert after text extraction. We will see whether the outcome makes sense when summarized to about 70 sentences. For me, a good summary is one that covers route of administration, dosing, contraindications, and adverse drug reactions.
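To get a rough sentence count like the one above, a naive split on sentence-ending punctuation is enough. This is a sketch; a proper sentence tokenizer (such as NLTK's) would be more robust on real insert text:

```python
import re

def count_sentences(text):
    # Split on '.', '!' or '?' followed by whitespace; discard empty pieces.
    pieces = re.split(r"[.!?]\s+", text)
    return len([p for p in pieces if p.strip()])

sample = "Give one dose. Repeat after 30 days. Do not exceed three doses."
print(count_sentences(sample))  # → 3
```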

The Gensim keywords algorithm did a good job capturing the most important keywords. It's obvious the document (a vaccine insert) is about 'vaccination', 'engerix', 'hepatic', 'patient', 'antibody', and 'immunization'.

You can explore further by tuning the keyword extraction parameters.

Gensim's summarize is based on the popular TextRank summarization algorithm. I set the ratio parameter to 0.3 (out of 1), so roughly 30% of the original text is kept in the summary. This can be tuned further.

Sumy's LexRankSummarizer is based on the LexRank algorithm.

The PlaintextParser can load text either with .from_string or .from_file. The sentence count can be tuned depending on need; I used 70 sentences here. I suggest that readers try different values.

Sumy’s LuhnSummarizer is based on the Luhn Summarization algorithm.

The sentence count can also be tuned toward an optimal value here.

The summary that made the most sense at about 70 sentences was Sumy's Luhn, followed by Gensim's TextRank, and then Sumy's LexRank. For this task I didn't do any data cleaning except stripping trailing white spaces.

I tried to use scispacy's tokenization method, since its model is trained on biomedical data, and it did well. But I found out Sumy has its own tokenizer parameter, so I simply removed scispacy's tokenization step.

I was really impressed with Tika's performance at text extraction from PDF, and I wonder why it's not as popular as other Python text extraction libraries.

I hope you practice this and do interesting things. Maybe I will try this same task using deep learning.



Data Scientist | AI Researcher | Technical Instructor wuraolaoyewusi.com
