ETL pipeline in Python

Pre-Processing steps in NLP:

Normalisation.
Remove stop words, punctuation and HTML.
Tokenisation.
Lemmatisation.
TF-IDF.
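
The first four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the stop-word list is a tiny hypothetical sample, HTML is stripped with a simple regex rather than a real parser, and toy_lemmatise is a crude stand-in for a proper lemmatiser (such as those shipped with NLTK or spaCy). All function names are made up for illustration. Stop words are filtered after tokenisation because the filter works on individual tokens.

```python
import html
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "on", "the", "to"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_html(text):
    """Drop tags with a simple regex, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def normalise(text):
    """Normalisation here means lower-casing only."""
    return text.lower()


def remove_punctuation(text):
    """Replace punctuation with spaces so adjacent words stay separated."""
    return text.translate(PUNCT_TABLE)


def tokenise(text):
    """Split on whitespace."""
    return text.split()


def toy_lemmatise(token):
    """Crude stand-in for a lemmatiser: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Run the steps in order and return a list of cleaned tokens."""
    text = remove_punctuation(normalise(strip_html(text)))
    tokens = [t for t in tokenise(text) if t not in STOP_WORDS]
    return [toy_lemmatise(t) for t in tokens]
```

For example, preprocess("<p>The cats are sitting on the mats.</p>") returns ['cat', 'sitting', 'mat'].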

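TF-IDF (term frequency-inverse document frequency) is a widely used technique for quantifying what a document is about: a term scores highly when it is frequent in one document but rare across the rest of the collection. The resulting weights tend to be used as input to algorithms such as Gaussian Mixture Models (GMM), K-means or Latent Dirichlet Allocation (LDA).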
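
To make the TF-IDF weighting concrete, here is a small sketch using only the standard library. It assumes one common textbook variant (tf = term count / document length, idf = ln(N / document frequency)); libraries often use smoothed or normalised variants, so the numbers will not necessarily match theirs. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / len(doc), idf = ln(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already pre-processed documents.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "chased", "dog"]]
weights = tf_idf(docs)
```

In this sample, "mat" appears in only one document and so gets a higher weight in the first document than "sat", which appears in two. A term that appears in every document gets a weight of zero under this variant.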

This is billed as ETL, but in reality it is just some simple processing.