How to use scispaCy Entity Linkers for Biomedical Named Entities
This is a sequel to my previous tutorial, prompted by an update to the scispaCy library.
Link to project github
Link to Notebook with uncleared output: use this to check that you are getting the right output
Link to Notebook with cleared output
Now that you have extracted biomedical named entities, how can you get more out of them? You can link them to one or more knowledge bases. This is exactly what the scispaCy entity linkers do.
The choice of knowledge base depends on the nature of the extracted named entities and the question the user is trying to answer. It is worth reading further about each knowledge base to get a good grasp of what is possible with it. The name of each knowledge base gives an idea of the kind of information accessible through it.
Previous versions supported only one knowledge base, but according to the library documentation for v0.2.5, five knowledge bases are now supported: umls (Unified Medical Language System), mesh (Medical Subject Headings), rxnorm (RxNorm), go (Gene Ontology), and hpo (Human Phenotype Ontology).
In this tutorial, before linking entities to the available knowledge bases, biomedical entities will be extracted from the sample text using the 4 available scispaCy NER models. To read about the specificity of each NER model, click this link.
All identified entities will then be passed through the different knowledge bases.
Download the required scispaCy models
Import all libraries
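A minimal sketch of the loading step, assuming the four NER models have already been installed (typically with pip, using the download URLs listed in the scispaCy README for your scispaCy version). The model names are the ones published by scispaCy; the `NER_MODELS` list and the `load_models` helper are names made up for this illustration.

```python
# The four scispaCy NER models used in this tutorial.
NER_MODELS = [
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]


def load_models(names=NER_MODELS):
    """Load each model once and return a {model name: pipeline} dict.

    Hypothetical helper: spacy.load raises an error for any model
    that has not been installed.
    """
    # spaCy is imported here so the list above can be used on its own.
    import spacy

    return {name: spacy.load(name) for name in names}
```

Loading every model up front keeps the later steps simple, at the cost of holding four pipelines in memory at once.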
Link to sample text: https://www.ncbi.nlm.nih.gov/books/NBK92477/.
Any text sample can be used; the data can also be a series of text files.
This function displays a tuple of entities extracted by the scispaCy NER models, and the set function is used to prevent duplication of entities in our dataset.
displacy also helps to visualize the recognized entities.
The images below show the code and output for named entity extraction using different models. As expected, the type of entities recognized and extracted depends on the type of model.
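The extraction step can be sketched roughly as follows. This is an illustration rather than the exact function from the notebook: `extract_entities` and `show_entities` are hypothetical names, and the choice of (text, label, model name) as the tuple layout is an assumption. `nlp` is any loaded spaCy or scispaCy pipeline.

```python
def extract_entities(nlp, text, model_name):
    """Return the unique (entity text, label, model name) tuples found in `text`."""
    doc = nlp(text)
    # A set drops repeated mentions of the same entity with the same label.
    unique = {(ent.text, ent.label_, model_name) for ent in doc.ents}
    # Sort so the output order is stable from run to run.
    return sorted(unique)


def show_entities(nlp, text):
    """Highlight the recognized entities inline (intended for a Jupyter notebook)."""
    from spacy import displacy

    displacy.render(nlp(text), style="ent", jupyter=True)
```

Keeping the model name in each tuple makes it possible to tell later which model produced which entity once the results are combined.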
A total of 422 biomedical named entities were extracted from the sample corpus using 4 NER models from scispaCy.
Concatenate all the extracted entities and save the data for future use.
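One way to do this, sketched under the assumption that each model produced a list of (entity, label, model) tuples. The helper names, the column names and the default file name `entities.csv` are all made up for this example.

```python
from itertools import chain


def combine_entities(per_model_rows):
    """Flatten a list of per-model entity lists into one list of rows."""
    return list(chain.from_iterable(per_model_rows))


def save_entities(rows, path="entities.csv"):
    """Store the combined rows as a CSV file and return them as a DataFrame."""
    # pandas is imported here so combine_entities works without it.
    import pandas as pd

    df = pd.DataFrame(rows, columns=["entity", "label", "model"])
    df.to_csv(path, index=False)
    return df
```

Saving to disk means the extraction does not have to be repeated every time the linking step is rerun.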
The function below is a general function to link biomedical entities to the scispaCy knowledge bases.
One of the goals of this tutorial is to show how different knowledge bases can return different entity links, depending on the type of data each knowledge base is designed for. Here, the same entity is passed through four knowledge bases, and they return different concepts, matching scores and definitions.
The entity_linker function was tested with 4 scispaCy knowledge bases: "umls", "mesh", "go" and "hpo". The function returns 2 linked entities and their scores from the chosen knowledge base. At the time of writing, the "rxnorm" knowledge base returned a KeyError; I will investigate further why this is happening.
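A rough sketch of such a general linker is given here; it is not the exact entity_linker from the notebook. `build_linker_pipeline` and `top_links` are hypothetical names. The builder follows the scispaCy 0.2.x (spaCy 2) style of adding the linker; as far as I know, newer releases register it by name instead, with `nlp.add_pipe("scispacy_linker", config=...)`. The attributes `ent._.kb_ents` and `linker.kb.cui_to_entity` come from scispaCy.

```python
def build_linker_pipeline(kb_name, model="en_core_sci_sm"):
    """Return (nlp, linker) for one knowledge base, e.g. "umls" or "mesh".

    Assumes scispaCy 0.2.x on spaCy 2; the first call for a knowledge
    base downloads its index, which can take a while.
    """
    import spacy
    from scispacy.linking import EntityLinker

    nlp = spacy.load(model)
    linker = EntityLinker(name=kb_name)
    nlp.add_pipe(linker)
    return nlp, linker


def top_links(doc, linker, top_k=2):
    """Return up to `top_k` knowledge-base candidates per entity in `doc`.

    Assumes the linker has already run on `doc`, so every entity has
    `ent._.kb_ents`, a list of (concept_id, score) pairs that is
    expected to be ordered best match first.
    """
    results = []
    for ent in doc.ents:
        for concept_id, score in ent._.kb_ents[:top_k]:
            concept = linker.kb.cui_to_entity[concept_id]
            results.append({
                "entity": ent.text,
                "concept_id": concept_id,
                "canonical_name": concept.canonical_name,
                "definition": concept.definition,
                "score": round(score, 3),
            })
    return results
```

With the pipeline built, `top_links(nlp("some entity text"), linker)` would give one dictionary per candidate concept, which is convenient to turn into DataFrame rows.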
The images below show the general entity linker in action.
The next step is to apply the entity linker to all entities in the pandas DataFrame.
Readers have probably come across the concept of optimizing code. The code below shows some of the lines of code being moved out of the entity_linker function and adjusted to link to a single knowledge base.
Readers who wish to can compare the difference between using the code with certain lines outside the function and using the general entity linker above. (It is an interesting difference that speaks to the question of when a function should stay general and when it should be tweaked.)
This is the view of the resulting DataFrame showing what each entity links to in the available knowledge base(s). Some entities had definitions to link to in the 4 scispaCy knowledge bases that were used.
This tutorial showed how to extract biomedical and clinical entities and link them to medical knowledge base(s) using the scispaCy Python library.
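The idea behind this optimization is that loading a model and a knowledge base is the expensive part, so it should happen once rather than once per entity. A small sketch of that pattern is shown here, using caching instead of moving lines by hand. All names are hypothetical: `builder` is any callable that takes a knowledge base name and returns an (nlp, linker) pair, and `link_one` is any callable that takes a processed doc and the linker.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_pipeline(kb_name, builder):
    """Build the pipeline for `kb_name` on first use and reuse it afterwards."""
    return builder(kb_name)


def link_all(entities, kb_name, builder, link_one):
    """Link every entity string against one knowledge base."""
    # The expensive load happens at most once per knowledge base.
    nlp, linker = get_pipeline(kb_name, builder)
    return [link_one(nlp(entity), linker) for entity in entities]
```

With a DataFrame, the same effect comes from building the pipeline before calling `df["entity"].apply(...)`, or `df["entity"].swifter.apply(...)` when swifter is installed, so that the function applied to each row only does the linking.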
I hope this answers some of your questions and you have a great time exploring.
Did you also learn a couple of things about pandas and swifter?
You can find more about scispaCy here