# Lab: Tomotopy

[![tomoto](figs/tomoto.png)](https://github.com/bab2min/tomotopy)

Package tomotopy {cite}`tomotopy`


## What is tomotopy?

`tomotopy` is a Python extension of `tomoto` (Topic Modeling Tool), a Gibbs-sampling-based topic model library written in C++.
The current version of `tomoto` supports several major topic models, including:

- Latent Dirichlet Allocation (`tomotopy.LDAModel`)
- Labeled LDA (`tomotopy.LLDAModel`)
- Partially Labeled LDA (`tomotopy.PLDAModel`)
- Supervised LDA (`tomotopy.SLDAModel`)
- Dirichlet Multinomial Regression (`tomotopy.DMRModel`)
- Generalized Dirichlet Multinomial Regression (`tomotopy.GDMRModel`)
- Hierarchical Dirichlet Process (`tomotopy.HDPModel`)
- Hierarchical LDA (`tomotopy.HLDAModel`)
- Multi Grain LDA (`tomotopy.MGLDAModel`)
- Pachinko Allocation (`tomotopy.PAModel`)
- Hierarchical PA (`tomotopy.HPAModel`)
- Correlated Topic Model (`tomotopy.CTModel`)
- Dynamic Topic Model (`tomotopy.DTModel`)
- Pseudo-document based Topic Model (`tomotopy.PTModel`)
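
To make "Gibbs-sampling based" more concrete, the following is a toy collapsed Gibbs sampler for LDA in pure Python. It is only a sketch of the general technique and not `tomoto`'s implementation: the function name `gibbs_lda`, the toy documents, and the default hyperparameters are assumptions chosen for illustration.

```python
import random


def gibbs_lda(docs, k, alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not tomoto's code)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens of document d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # tokens of word w assigned to topic t
    topic_total = [0] * k                     # all tokens assigned to topic t
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Take the current token out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + eta)
                    / (topic_total[t] + v * eta)
                    for t in range(k)
                ]
                # Resample the topic and put the token back.
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


toy_docs = [
    "stock market price stock".split(),
    "oil energy price oil".split(),
    "market stock trade".split(),
    "energy oil barrel".split(),
]
doc_topic, topic_word, vocab = gibbs_lda(toy_docs, k=2)
for t, counts in enumerate(topic_word):
    print("topic", t, {w: c for w, c in zip(vocab, counts) if c})
```

Each sweep removes one token from the counts, resamples its topic from the conditional distribution, and adds it back; the models listed above extend this basic loop with additional structure.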


## Getting Started

You can install `tomotopy` from [PyPI](https://pypi.org/project/tomotopy/) using pip.

```bash
pip install --upgrade pip
pip install tomotopy
```

After installing, you can start using `tomotopy` by importing it.

```python
import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'
```


Here is sample code for simple LDA training on texts from the file `sample.txt`.

```python
import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

mdl.summary()
```


## Model Save and Load

`tomotopy` provides `save` and `load` methods for each topic model class,
so you can save a model to a file at any time and reload it from that file later.

```python
import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it with LDAModel raises an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')
```

When you load a model from a file, the model type stored in the file must match the class whose `load` method you call.
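
When the type of a saved model is not known in advance, one option is to try several classes in turn. The helper below is a minimal sketch under that assumption: `load_with_matching_class` is a hypothetical name, not part of `tomotopy`, and it relies only on the behaviour described above, namely that a mismatched `load` raises an exception (the exact exception type is not assumed).

```python
def load_with_matching_class(path, candidate_classes):
    """Return the first model that loads; hypothetical helper, not a tomotopy API."""
    tried = []
    for cls in candidate_classes:
        try:
            return cls.load(path)
        except Exception:  # the concrete exception type is deliberately not assumed
            tried.append(cls.__name__)
    raise ValueError(
        "{!r} could not be loaded as any of: {}".format(path, ", ".join(tried))
    )


# Intended usage with tomotopy (not executed here):
#   import tomotopy as tp
#   mdl = load_with_matching_class('sample_hdp_model.bin', [tp.LDAModel, tp.HDPModel])
```

Catching a broad `Exception` also hides unrelated failures such as a missing file, so in practice it is simpler to record the model class next to the file when saving.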


## Documents in the Model and out of the Model

Topic models serve two major purposes.
The basic one is to discover topics from a set of documents by training a model,
and the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for the former purpose (model training) a **document in the model**,
and a document used for the latter purpose (unseen during training) a **document out of the model**.

In `tomotopy`, these two kinds of documents are created differently.
A **document in the model** is created by the `tomotopy.LDAModel.add_doc` method.
`add_doc` can be called only before `tomotopy.LDAModel.train` starts.
In other words, once `train` has been called, `add_doc` can no longer add a document to the model because the set of training documents has become fixed.


To acquire the instance of the created document, use `tomotopy.LDAModel.docs`:

```python
mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document
```


A **document out of the model** is created by the `tomotopy.LDAModel.make_doc` method. `make_doc` should be called only after `train` starts.
If you call `make_doc` before the set of training documents has become fixed, you may get wrong results.
Since `make_doc` returns the instance directly, you can use its return value for further operations.

```python
mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document
```


## Inference for Unseen Documents

If a new document is created by `tomotopy.LDAModel.make_doc`, its topic distribution can be inferred by the model.
Inference for unseen documents should be performed using the `tomotopy.LDAModel.infer` method.

```python
mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)
```

The `infer` method accepts either a single instance of `tomotopy.Document` or a `list` of `tomotopy.Document` instances.
See `tomotopy.LDAModel.infer` for details.


## Corpus and transform

Every topic model in `tomotopy` has its own internal document type.
A document of the type suitable for a model can be created and added through that model's `add_doc` method.
However, adding the same list of documents to several models this way is inconvenient,
because `add_doc` must be called for every document on each model.
For this reason, `tomotopy` provides the `tomotopy.utils.Corpus` class, which holds a list of documents.
A `tomotopy.utils.Corpus` can be inserted into any model by passing it as the `corpus` argument to the model's `__init__` or `add_corpus` method.
Inserting a `tomotopy.utils.Corpus` has the same effect as inserting the documents it holds.

Some topic models require additional data for their documents.
For example, `tomotopy.DMRModel` requires a `metadata` argument of type `str`,
while `tomotopy.PLDAModel` requires a `labels` argument of type `List[str]`.
Since `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model,
the data stored in a corpus may not match what a given topic model requires when the corpus is added to it.
In this case, the miscellaneous data can be converted to fit the target topic model using the `transform` argument.


See more details in the following code:

```python
from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus)
# The `a_data` field of `corpus` is lost,
# and the `metadata` that `DMRModel` requires is filled with the default value, an empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}
# this function transforms `a_data` to `metadata`

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
```
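
The same pattern should carry over to models that expect other fields. As a sketch, the function below maps a hypothetical `tags` field (an assumption, it does not appear in the example above) onto the `labels` argument that `tomotopy.PLDAModel` is said to require as a `List[str]`. The function itself is plain Python, so it can be checked without building a model.

```python
def transform_tags_to_labels(misc: dict) -> dict:
    """Map a hypothetical `tags` field onto the `labels` argument."""
    tags = misc.get('tags') or []
    if isinstance(tags, str):  # tolerate a single tag given as a plain string
        tags = [tags]
    return {'labels': [str(t) for t in tags]}


print(transform_tags_to_labels({'tags': ['sports', 7]}))
print(transform_tags_to_labels({}))

# Intended usage, mirroring the DMR example (not executed here):
#   corpus.add_doc("a b c d e".split(), tags=['sports'])
#   model.add_corpus(corpus, transform=transform_tags_to_labels)
```

Converting every tag with `str` keeps the result a list of strings even when the corpus stores numbers or other types in that field.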


## Pinning Topics using Word Priors

Since version 0.6.0, the method `tomotopy.LDAModel.set_word_prior` has been available. It lets you control the word prior for each topic.
For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the remaining topics.
This means that the prior probability of 'church' being assigned to topic 0 is 10 times higher than that of it being assigned to any other single topic.
Therefore, most occurrences of 'church' are assigned to topic 0, so topic 0 ends up containing many words related to 'church'.
This makes it possible to pin particular topics to specific topic numbers.

```python
import tomotopy as tp
mdl = tp.LDAModel(k=20)

# add documents into `mdl`

# setting word prior
mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])
```
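
The "10 times" figure can be checked with a little arithmetic on the prior weights alone. This sketch looks only at the weights passed to `set_word_prior` and ignores the counts that the sampler adds during training, so it describes the starting bias and not the final assignment probabilities.

```python
k = 20
prior = [1.0 if t == 0 else 0.1 for t in range(k)]

# Relative weight of topic 0 against any other single topic.
ratio = prior[0] / prior[1]

# Share of the total prior mass that each topic receives.
total = sum(prior)
share = [p / total for p in prior]

print("ratio topic 0 vs. another topic:", ratio)
print("share of topic 0: {:.3f}".format(share[0]))
print("combined share of the other 19 topics: {:.3f}".format(sum(share[1:])))
```

With 19 competing topics, topic 0 holds roughly a third of the prior mass even though it is ten times heavier than each individual alternative, which is why a much smaller weight for the other topics pins a word more firmly.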


## Examples


### Install or upgrade ekorpkit

```{note}
Install the ekorpkit package first.

Set the logging level to WARNING if you don't want to see verbose logging.

If you run this notebook in Colab, set Hardware accelerator to GPU.
```

```{toggle}
!pip install -U --pre ekorpkit[topic]

exit()
```


In [2]:
from ekorpkit import eKonf

eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())

data_dir = "../data/topic_models"

version: 0.1.39.post0.dev7
is notebook? True
is colab? False
environment variables:
{'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'NUM_WORKERS': 230}


In [2]:
corpus_cfg = eKonf.compose("corpus", overrides=["project=esgml"])
corpus_cfg.name = "us_equities_news"
corpus_cfg.verbose = False
# corpus = eKonf.instantiate(corpus_cfg)
cfg = eKonf.compose("pipeline", overrides=["project=esgml"])
cfg.data.corpus = corpus_cfg
cfg._pipeline_ = ["sampling"]
cfg.sampling.sample_size_per_group = 0.1
cfg.sampling.output_dir = data_dir
cfg.sampling.output_file = "us_equities_news_sampled.parquet"
data = eKonf.instantiate(cfg)


INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env
INFO:ekorpkit.base:setting environment variable CACHED_PATH_CACHE_ROOT to /workspace/.cache/cached_path
INFO:ekorpkit.base:setting environment variable KMP_DUPLICATE_LIB_OK to TRUE
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.pipe.pipeline...
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.data.Data...
INFO:ekorpkit.base:Applying pipe: functools.partial(<function sampling at 0x7f4a38cd5790>)


       id                                               text  split
0  251155  Investing com Asian stock markets were broadly...  train
1  270611  Solid execution product diversity and strong b...  train
2  237917  Chip name Micron Technology Inc NASDAQ MU is h...  train
3  406989  June is typically a boring month for gold and ...  train
4  231535  A prudent investment decision involves buying ...  train


### Load a dataset


In [3]:
cfg = eKonf.compose("path")
cfg.cache.uri = "https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/us_equities_news_sampled.zip"
data = eKonf.load_data("us_equities_news_sampled.parquet", cfg.cached_path)
data.text[0]


'Investing com Asian stock markets were broadly lower for a second day on Thursday as weak U S data on durable goods orders added to concerns over the global growth outlook while concerns over declining corporate profits also weighed During late Asian trade Hong Kong s Hang Seng Index tumbled 1 55 Australia s ASX 200 Index dipped 0 1 while Japan s Nikkei 225 Index shed 0 7 The Nikkei came further off a one year closing high hit earlier in the week as investors cashed in ahead of the Japanese fiscal year end March is the final month of Japan s fiscal year and market participants have expected many funds to lock in profits from a meteoric 19 rally in the January to March period after shedding more than 13 in April to December Exporters which have gained sharply in the first quarter on the back a weakening yen declined Automakers Toyota and Nissan slumped 1 65 and 1 8 respectively while consumer electronics giant Sony retreated 1 5 On the upside Sharp saw shares jump 6 7 extending the pre

### LDA Basics

The `tomotopy.LDAModel` class provides the Latent Dirichlet Allocation (LDA) topic model. Its implementation is based on the following papers:

- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
- Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.


In [4]:
import tomotopy as tp

save_path = eKonf.join_path(data_dir, "lda_basic.mdl")
mdl = tp.LDAModel(tw=tp.TermWeight.ONE, min_cf=10, rm_top=10, k=20)
for n, line in enumerate(data["text"][:100]):
    ch = line.strip().split()
    mdl.add_doc(ch)
mdl.burn_in = 100
mdl.train(0)
print(
    "Num docs:",
    len(mdl.docs),
    ", Vocab size:",
    len(mdl.used_vocabs),
    ", Num words:",
    mdl.num_words,
)
print("Removed top words:", mdl.removed_top_words)
for i in range(0, 100, 10):
    mdl.train(10)
    print("Iteration: {}\tLog-likelihood: {}".format(i, mdl.ll_per_word))

mdl.save(save_path, full=True)


Num docs: 100 , Vocab size: 972 , Num words: 33695
Removed top words: ['the', 'to', 'of', 'and', 'in', 'a', 'is', 'for', 's', 'on']
Iteration: 0	Log-likelihood: -7.3065742522854
Iteration: 10	Log-likelihood: -7.055851222156576
Iteration: 20	Log-likelihood: -6.9786549585511946
Iteration: 30	Log-likelihood: -6.918256985219584
Iteration: 40	Log-likelihood: -6.906922808788813
Iteration: 50	Log-likelihood: -6.875851555821591
Iteration: 60	Log-likelihood: -6.860287052386768
Iteration: 70	Log-likelihood: -6.837860183047866
Iteration: 80	Log-likelihood: -6.833746569962225
Iteration: 90	Log-likelihood: -6.823284888087085
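
The per-word log-likelihoods printed above are easier to compare once converted to perplexity. The sketch below assumes that `ll_per_word` is a natural-log value, so that perplexity is `exp(-ll_per_word)`; `tomotopy` models also expose a `perplexity` attribute, which should be preferred over this manual conversion when available.

```python
import math

# First and last values from the training log above.
ll_per_word = {0: -7.3065742522854, 90: -6.823284888087085}

# Perplexity under the natural-log assumption: exp of the negated value.
perplexity = {it: math.exp(-ll) for it, ll in ll_per_word.items()}

for it, ppl in perplexity.items():
    print("Iteration: {}\tPerplexity: {:.1f}".format(it, ppl))
```

A lower perplexity means the model is, on average, less surprised by each word, so the decrease between the first and last values mirrors the rising log-likelihood.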


In [5]:
mdl.summary()

<Basic Info>
| LDAModel (current version: 0.12.3)
| 100 docs, 33695 words
| Total Vocabs: 8855, Used Vocabs: 972
| Entropy of words: 6.35945
| Entropy of term-weighted words: 6.35945
| Removed Vocabs: the to of and in a is for s on
|
<Training Info>
| Iterations: 100, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -6.82328
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 10 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 10 (the number of top words to be removed)
| k: 20 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3080440490 (random seed)
| trained in version 0.12.3
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic d

In [6]:
for k in range(mdl.k):
    print("Topic #{}".format(k))
    for word, prob in mdl.get_topic_words(k):
        print("\t", word, prob, sep="\t")


Topic #0
		has	0.043887220323085785
		have	0.041293028742074966
		be	0.040644481778144836
		an	0.03415900468826294
		it	0.03286190703511238
		are	0.03156481310725212
		will	0.031132448464632034
		at	0.02745734713971615
		or	0.027241162955760956
		than	0.023782242089509964
Topic #1
		trade	0.047546323388814926
		market	0.04248927906155586
		that	0.03844364359974861
		from	0.03439800813794136
		growth	0.03439800813794136
		8	0.030352376401424408
		4	0.029340967535972595
		quarter	0.02731814980506897
		25	0.02428392320871353
		results	0.022261105477809906
Topic #2
		he	0.05730922520160675
		He	0.04045604541897774
		Trump	0.037085410207509995
		his	0.03624275326728821
		said	0.03455743566155434
		but	0.031186800450086594
		was	0.028658824041485786
		by	0.026973506435751915
		with	0.026973506435751915
		will	0.025288190692663193
Topic #3
		as	0.05886959284543991
		that	0.05130905658006668
		The	0.046988748013973236
		oil	0.03240770846605301
		from	0.03186766803264618
		two	0.028627438470721

### LDA Visualization

This example shows how to perform Latent Dirichlet Allocation using tomotopy and visualize the result.


In [7]:
import tomotopy as tp
import nltk
import re
import numpy as np
import pyLDAvis
from nltk.corpus import stopwords

In [8]:
porter_stemmer = nltk.PorterStemmer().stem
english_stops = set(porter_stemmer(w) for w in stopwords.words("english"))
pat = re.compile("^[a-z]{2,}$")
corpus = tp.utils.Corpus(
    tokenizer=tp.utils.SimpleTokenizer(porter_stemmer),
    stopwords=lambda x: x in english_stops or not pat.match(x),
)

corpus.process(d.lower() for d in data["text"][:100])
# save preprocessed corpus for reuse
save_path = eKonf.join_path(data_dir, "preprocessed_corpus.cps")
corpus.save(save_path)
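
The `stopwords` argument above is a predicate: a token is dropped when the function returns a truthy value. The sketch below reproduces that predicate in isolation, with a tiny stand-in stop list (an assumption replacing the stemmed NLTK list) so that it runs without NLTK or tomotopy.

```python
import re

toy_stops = {"the", "of", "and"}  # stand-in for the stemmed NLTK stop list
pat = re.compile("^[a-z]{2,}$")   # same pattern as above: two or more lowercase letters


def is_stopword(token):
    """True when the token is a stop word or does not match the pattern."""
    return token in toy_stops or not pat.match(token)


tokens = ["the", "market", "q3", "s", "nasdaq", "2019"]
kept = [t for t in tokens if not is_stopword(t)]
print(kept)
```

Besides the stop list, the pattern removes single letters such as the stray `s` left over from possessives and any token containing digits.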


In [9]:
mdl = tp.LDAModel(min_df=5, rm_top=20, k=30, corpus=corpus)
mdl.train(0)

print(
    "Num docs:{}, Num Vocabs:{}, Total Words:{}".format(
        len(mdl.docs), len(mdl.used_vocabs), mdl.num_words
    )
)
print("Removed Top words: ", *mdl.removed_top_words)


Num docs:100, Num Vocabs:1074, Total Words:22386
Removed Top words:  year compani stock market earn said zack price percent share quarter expect growth report billion revenu trade investor nyse million


In [10]:
# Let's train the model
for i in range(0, 100, 10):
    print("Iteration: {:04}, LL per word: {:.4}".format(i, mdl.ll_per_word))
    mdl.train(10)
print("Iteration: {:04}, LL per word: {:.4}".format(100, mdl.ll_per_word))

mdl.summary()


Iteration: 0000, LL per word: -12.02
Iteration: 0010, LL per word: -7.564
Iteration: 0020, LL per word: -7.357
Iteration: 0030, LL per word: -7.263
Iteration: 0040, LL per word: -7.193
Iteration: 0050, LL per word: -7.167
Iteration: 0060, LL per word: -7.159
Iteration: 0070, LL per word: -7.144
Iteration: 0080, LL per word: -7.138
Iteration: 0090, LL per word: -7.128
Iteration: 0100, LL per word: -7.122
<Basic Info>
| LDAModel (current version: 0.12.3)
| 100 docs, 22386 words
| Total Vocabs: 5139, Used Vocabs: 1074
| Entropy of words: 6.62188
| Entropy of term-weighted words: 6.62188
| Removed Vocabs: year compani stock market earn said zack price percent share quarter expect growth report billion revenu trade investor nyse million
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -7.12162
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 5 (minimum document frequency of w

In [11]:
topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)
doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
vocab = list(mdl.used_vocabs)
term_frequency = mdl.used_vocab_freq

prepared_data = pyLDAvis.prepare(
    topic_term_dists,
    doc_topic_dists,
    doc_lengths,
    vocab,
    term_frequency,
    start_index=0,  # tomotopy starts topic ids with 0, pyLDAvis with 1
    sort_topics=False,  # IMPORTANT: otherwise the topic_ids between pyLDAvis and tomotopy are not matching!
)




In [12]:
pyLDAvis.display(prepared_data)

In [13]:
save_dir = "../../../assets/extra"
filename = "lda_basic.html"
save_path = eKonf.join_path(save_dir, filename)
pyLDAvis.save_html(prepared_data, save_path)

In [14]:
from IPython.display import display, HTML

display(HTML(f"<a href='{save_path}' target='_blank'>{filename}</a>"))

### LDA coherence

This example shows how to perform Latent Dirichlet Allocation and calculate the coherence of the results.
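
To give an intuition for what a coherence score measures, here is a toy UMass-style calculation in pure Python. It is a sketch only: the function name `u_mass`, the smoothing constant `eps`, the averaging over word pairs, and the toy documents are assumptions, and `tomotopy.coherence` may differ in these details.

```python
import math


def u_mass(top_words, docs, eps=1e-12):
    """Average log conditional probability of each top word given a higher-ranked one."""
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)

    def doc_prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            p_j = doc_prob(top_words[j])
            if p_j == 0:
                continue  # skip conditioning words that never occur
            p_ij = doc_prob(top_words[i], top_words[j])
            scores.append(math.log((p_ij + eps) / p_j))
    return sum(scores) / len(scores) if scores else float("nan")


toy_docs = [
    ["oil", "energy", "price"],
    ["oil", "energy"],
    ["stock", "market"],
    ["stock", "price"],
]
print("oil/energy:", u_mass(["oil", "energy"], toy_docs))
print("oil/stock: ", u_mass(["oil", "stock"], toy_docs))
```

Because a joint document frequency can never exceed the marginal one, scores of this kind are at most about zero, and values closer to zero indicate top words that tend to appear in the same documents.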


In [15]:
# calculate coherence using preset
for preset in ("u_mass", "c_uci", "c_npmi", "c_v"):
    coh = tp.coherence.Coherence(mdl, coherence=preset)
    average_coherence = coh.get_score()
    coherence_per_topic = [coh.get_score(topic_id=k) for k in range(mdl.k)]
    print("==== Coherence : {} ====".format(preset))
    print("Average:", average_coherence, "\nPer Topic:", coherence_per_topic)
    print()


==== Coherence : u_mass ====
Average: -1.2919665389931534 
Per Topic: [-1.17386418557316, -0.8680202687611864, -1.9011274512865044, -1.047667736559903, -0.8366067935289703, -1.0824157815953828, -1.0848464209397464, -1.042256352533198, -2.040907684969454, -0.9648597265277666, -0.6370809895484615, -1.0602086759498055, -0.9300058656023976, -0.8755223027370869, -3.8756595693460896, -0.9561534903362163, -1.0182683509407169, -1.0452619935102012, -0.8683999688876635, -1.6542543949045063, -0.9976812325252717, -1.3517324017228247, -1.4340397599847348, -1.1336003900525302, -3.3037643601664413, -0.9452589069885053, -1.3000326329631926, -1.1188280272579738, -1.1003419572424118, -1.1103284968522937]

==== Coherence : c_uci ====
Average: -4.878004424663513 
Per Topic: [-1.4440664765834577, -2.9780619010239433, -8.708752137868833, -3.0050339171614286, -3.9756801762639395, -5.2244161333617765, -6.687473406288814, -4.228681125865242, -7.003535776093569, -3.3012711354385487, -0.9000521475110517, -3.7900

In [16]:
import itertools

# calculate coherence using custom combination
for seg, cm, im in itertools.product(
    tp.coherence.Segmentation, tp.coherence.ConfirmMeasure, tp.coherence.IndirectMeasure
):
    coh = tp.coherence.Coherence(
        mdl, coherence=(tp.coherence.ProbEstimation.DOCUMENT, seg, cm, im)
    )
    average_coherence = coh.get_score()
    coherence_per_topic = [coh.get_score(topic_id=k) for k in range(mdl.k)]
    print("==== Coherence : {}, {}, {} ====".format(repr(seg), repr(cm), repr(im)))
    print("Average:", average_coherence, "\nPer Topic:", coherence_per_topic)
    print()


==== Coherence : <Segmentation.ONE_ONE: 1>, <ConfirmMeasure.DIFFERENCE: 1>, <IndirectMeasure.NONE: 0> ====
Average: 0.1351874610643829 
Per Topic: [0.11101613159001103, 0.14175311097344043, 0.16218523657111322, 0.18403090427713625, 0.13375353018989322, 0.11100999661663936, 0.11403467499327492, 0.19470537423940926, 0.09631062974767758, 0.16263582197968346, 0.13661738585142094, 0.11390028382623481, 0.11258613327854435, 0.1902236282755808, 0.1258461538435896, 0.1319174081261988, 0.11790130382631658, 0.101183821270789, 0.13940305217600987, 0.16222103513433575, 0.11832280151689759, 0.09788931234235808, 0.1759731325287265, 0.1419625983771921, 0.12283456255415204, 0.11983944773489327, 0.1397846799755621, 0.17328171317695196, 0.12248752611724914, 0.10001244082020401]

==== Coherence : <Segmentation.ONE_ONE: 1>, <ConfirmMeasure.DIFFERENCE: 1>, <IndirectMeasure.COSINE: 1> ====
Average: 0.351438074447317 
Per Topic: [0.33813833362526363, 0.430223838157124, 0.2877488086620967, 0.459748269058764,

### Pinning Topics using Word Priors


In [23]:
# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
# The word 'nasdaq' is assigned to Topic 0 with a weight of 1.0 and to the remaining topics with a weight of 0.0001.
# Therefore, a topic related to 'nasdaq' can be fixed at Topic 0.
mdl.set_word_prior("nasdaq", [1.0 if k == 0 else 0.0001 for k in range(20)])
# Topic 1 for a topic related to 'bank'
mdl.set_word_prior("bank", [1.0 if k == 1 else 0.0001 for k in range(20)])
# Topic 2 for a topic related to 'oil'
mdl.set_word_prior("oil", [1.0 if k == 2 else 0.0001 for k in range(20)])
mdl.train(0)
print(
    "Num docs:",
    len(mdl.docs),
    ", Vocab size:",
    len(mdl.used_vocabs),
    ", Num words:",
    mdl.num_words,
)
print("Removed top words:", mdl.removed_top_words)
for i in range(0, 100, 10):
    mdl.train(10)
    print("Iteration: {}\tLog-likelihood: {}".format(i, mdl.ll_per_word))


Num docs: 100 , Vocab size: 726 , Num words: 23802
Removed top words: []
Iteration: 0	Log-likelihood: -6.9982102600171
Iteration: 10	Log-likelihood: -6.78172128538039
Iteration: 20	Log-likelihood: -6.707927669262908
Iteration: 30	Log-likelihood: -6.669206157186698
Iteration: 40	Log-likelihood: -6.661336047859807
Iteration: 50	Log-likelihood: -6.63225589900849
Iteration: 60	Log-likelihood: -6.619234496642365
Iteration: 70	Log-likelihood: -6.6113331038327985
Iteration: 80	Log-likelihood: -6.589239579596395
Iteration: 90	Log-likelihood: -6.604038265559883


In [24]:
mdl.summary()

<Basic Info>
| LDAModel (current version: 0.12.3)
| 100 docs, 23802 words
| Total Vocabs: 5139, Used Vocabs: 726
| Entropy of words: 6.20784
| Entropy of term-weighted words: 6.20784
| Removed Vocabs: <NA>
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -6.60404
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 10 (minimum collection frequency of words)
| min_df: 5 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| k: 20 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 2936119906 (random seed)
| trained in version 0.12.3
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.91635215 

In [25]:
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep="\t")
    print()


== Topic #0 ==
nasdaq	0.07707776129245758
inc	0.0574597492814064
gain	0.04695010185241699
global	0.04624945670366287
demand	0.04134495183825493
recent	0.039243023842573166
world	0.03223659098148346
develop	0.028032733127474785
industri	0.028032733127474785
system	0.025230159983038902

== Topic #1 ==
bank	0.0991838201880455
financi	0.06384652107954025
goldman	0.047887738794088364
could	0.04332808777689934
increas	0.04218817502260208
invest	0.041048262268304825
fed	0.03876843675971031
firm	0.0364886112511158
rise	0.034208785742521286
interest	0.03306887298822403

== Topic #2 ==
oil	0.07005497813224792
data	0.05201324075460434
countri	0.043523017317056656
product	0.03821662440896034
sinc	0.03821662440896034
low	0.03715534880757332
well	0.03715534880757332
declin	0.029726402834057808
energi	0.027603846043348312
rose	0.02654256857931614

== Topic #3 ==
investor	0.12054447084665298
fund	0.08358242362737656
cash	0.05304856225848198
manag	0.046620383858680725
activist	0.04019220173358917
score