Pre-processing steps in NLP:
- Normalisation.
- Remove stop words, punctuation and HTML.
- Tokenisation.
- Lemmatisation.
- TF-IDF.
TF-IDF is a widely used technique for quantifying what a document is about: it weights each term by how often it appears in the document and how rare it is across the corpus. It tends to be used with algorithms such as Gaussian Mixture Models (GMM), K-means or Latent Dirichlet Allocation (LDA).
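The steps listed above can be sketched end to end with only the Python standard library. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set and the `LEMMAS` lookup table are tiny hypothetical stand-ins for the resources a real library would supply, the function names `preprocess` and `tf_idf` are made up for this example, and the weighting used here (term frequency times `log(N / df)`) is only one of several common TF-IDF variants.

```python
import html
import math
import re
from collections import Counter

# Tiny illustrative resources (assumptions); real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "and", "of", "in", "to"}
LEMMAS = {"cats": "cat", "dogs": "dog", "sat": "sit", "chased": "chase"}


def preprocess(text):
    """Return a list of cleaned tokens for one raw document."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = html.unescape(text).lower()         # decode entities, normalise case
    text = re.sub(r"[^\w\s]", " ", text)       # replace punctuation with spaces
    tokens = text.split()                      # whitespace tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]  # lookup-table lemmatisation


def tf_idf(docs):
    """Weight each term by (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


raw_docs = [
    "<p>The cats sat on the mat.</p>",
    "The dogs chased the cats!",
    "Dogs are loyal &amp; friendly.",
]
tokenised = [preprocess(d) for d in raw_docs]
scores = tf_idf(tokenised)

# Print the highest-weighted terms of each document.
for doc_scores in scores:
    print(sorted(doc_scores, key=doc_scores.get, reverse=True)[:2])
```

In this toy corpus, terms that occur in only one document (such as "mat") receive a higher weight than terms shared across documents (such as "cat"). The resulting per-document weight dictionaries would typically be converted to vectors before being passed to a clustering or topic model.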
It is billed as ETL, but in practice it is just some simple processing.