word2vec: Skip-gram and CBOW

For PubMed SB and 14M, we experimented with word vector sizes Vdim = {50, 150, 300, 500} and window sizes Wr = {2, 5, 10}, reproducing the parameter settings of De Vine et al. The threshold for occurrence of words (-sample), above which frequent words are randomly down-sampled, was set to 1e-5. The open-source C code was compiled under the name w2v and then executed in a terminal.

./w2v -train PARM_InputFile.txt \
    -output PARM_OutputFile.bin \
    -size PARM_Vdim \
    -window PARM_Wr \
    -sample 1e-5 \
    -hs PARM_HierarchicalSoftmax \
    -binary 1 \
    -cbow PARM_model
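The parameter grid described above can be enumerated programmatically. The following is a minimal sketch in Python, not the script used for the experiments. The output file naming scheme and the default hs=0 are assumptions made for illustration only; -cbow 0 selects Skip-gram and -cbow 1 selects CBOW, following the convention of the original C tool.

```python
from itertools import product

# Values taken from the text above.
VDIMS = [50, 150, 300, 500]
WINDOWS = [2, 5, 10]
CBOW_FLAGS = [0, 1]  # word2vec convention: 0 = Skip-gram, 1 = CBOW


def build_commands(input_file, hs=0):
    """Return one w2v argument list per (Vdim, Wr, model) combination.

    hs=0 is an assumed default; the source leaves this value as a placeholder.
    """
    commands = []
    for vdim, wr, cbow in product(VDIMS, WINDOWS, CBOW_FLAGS):
        # Hypothetical naming scheme for the output files.
        output_file = f"vectors_cbow{cbow}_size{vdim}_win{wr}.bin"
        commands.append([
            "./w2v", "-train", input_file,
            "-output", output_file,
            "-size", str(vdim),
            "-window", str(wr),
            "-sample", "1e-5",
            "-hs", str(hs),
            "-binary", "1",
            "-cbow", str(cbow),
        ])
    return commands


commands = build_commands("PARM_InputFile.txt")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run to launch the corresponding training run; here the commands are only printed so they can be inspected first.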

gensim: Latent Semantic Analysis

For LSA using PubMed SB, the number of topics was set to 300, taking into account an empirical study by Bradford.

from gensim import models
lsi = models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=300, chunksize=1000)

gensim: Latent Dirichlet Allocation

For LDA using PubMed SB, the number of topics was also set to 300. To guarantee replicability, numpy.random.seed(0) is used in place of numpy.random (see below).

import numpy
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=300, update_every=1, chunksize=1000, passes=1, random_state=numpy.random.seed(0))
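The seeding behaviour that the call above relies on can be illustrated with NumPy alone. This is a small sketch, not part of the original pipeline: numpy.random.seed(0) resets NumPy's global generator and returns None, so the replicability comes from that side effect. It is an assumption here that gensim falls back to the global generator when random_state is None.

```python
import numpy

# numpy.random.seed returns None; its effect is to reset the global generator.
result = numpy.random.seed(0)
first = numpy.random.rand(3)

# Re-seeding with the same value reproduces the same sequence of draws.
numpy.random.seed(0)
second = numpy.random.rand(3)

print(result is None)
print(numpy.array_equal(first, second))
```

LdaModel also accepts an integer or a numpy.random.RandomState instance for random_state, which makes the seed explicit in the call itself.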

SAVSNET dataset

By the end of 2017, SAVSNET had collected a dataset of 2.5M de-identified free-text clinical veterinary narratives from approximately 500 veterinary clinics across the UK.

PubMed datasets

To replicate the datasets, download the MEDLINE/PubMed baseline files for 2015 and the update files up to 8th June 2016.

  • SB dataset

    A subset of 301,201 PubMed publications (titles and abstracts) obtained by applying the PubMed Systematic Reviews filter and restricting the date of publication to the period from 2000 to 2016. Get IDs.

  • 14M dataset

    A subset of 14,056,761 PubMed publications (titles and abstracts) with date of publication from 2000 to 2016. Get IDs.


  • Protégé

    An open-source ontology editor, downloadable from the Stanford website.

  • OWL API

    A Java-based API for creating, manipulating and serialising OWL ontologies, currently hosted on GitHub.

  • ARQ

    ARQ is a SPARQL 1.1 compliant query engine for Apache Jena.

  • gensim

    A free Python library, downloadable from the Python Package Index (PyPI).

  • word2vec

    Initially released as open-source code in C; it is now also available as Python code under TensorFlow.