Health Related Headlines Datasets for Natural Language Processing (NLP)

Wuraola Oyewusi
Jun 29, 2019

I have always wanted to work with more datasets that are related to health and useful for Natural Language Processing. Then it occurred to me that I could scrape the web for headlines and teasers of news articles and titles of journals (something like the popular Reuters News Dataset, but with health-related categories). I have also long wanted to give back to the web for all the things I have learnt just by searching.

It definitely took a lot of hours to make the data tidy, but I ‘low key’ enjoy data wrangling.

I have tried to present the main dataset of 39,387 rows not just as a whole but also in chunks and in different file formats, so users can experiment according to their needs. I hope people can dive in and do some Topic Modelling, Sentiment Analysis, Data Classification, Sequence Prediction, Data Preprocessing, trend finding, maybe some Data Visualization, and many other tasks that will probably not cross my mind.

I attached samples showing different ways to load the files into a notebook:

Download the .gz file from GitHub to your computer and load it into the notebook
Download the .csv file from GitHub into the notebook using ‘!wget’
Load the file link directly into pandas

The files are hosted on my GitHub account:

https://github.com/WuraolaOyewusi/Health-Related-Headlines-Datasets-for-Natural-Language-Processing
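
For readers who want a starting point, here is a minimal sketch of the three loading routes listed above, using pandas. The helper names `load_headlines` and `summarize` are my own for illustration, and the file names and URLs in the comments are placeholders, not actual names from the repository; swap in the file you pick from GitHub.

```python
import pandas as pd


def load_headlines(source, compressed=False):
    """Read a CSV file into a DataFrame.

    `source` may be a local path or an http(s) URL; pandas accepts both.
    Set `compressed=True` for a gzip file whose name does not end in .gz
    (with a .gz extension, pandas infers the compression on its own).
    """
    compression = "gzip" if compressed else "infer"
    return pd.read_csv(source, compression=compression)


def summarize(df):
    """Return (row count, column names) as a quick sanity check after loading."""
    return len(df), list(df.columns)


# Placeholder usage; replace the names below with a real file from the repository.
# 1. .gz file already downloaded to your computer:
#    df = load_headlines("<downloaded file name>.csv.gz")
# 2. In a notebook cell, fetch the .csv first, then read the local copy:
#    !wget <raw file URL from the repository>
#    df = load_headlines("<downloaded file name>.csv")
# 3. Pass the raw file URL straight to pandas:
#    df = load_headlines("<raw file URL from the repository>")
```

Whichever route you take, calling `summarize(df)` right after loading is a quick way to check that the row count and columns look the way you expect before starting any NLP work.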
