note 2022-04-05 Nlp

NLP: Feature Extraction

  • feature_extraction.text.CountVectorizer
  • feature_extraction.text.HashingVectorizer
  • feature_extraction.text.TfidfTransformer
  • feature_extraction.text.TfidfVectorizer

Bag of Words

Tools provided by scikit-learn

  • tokenizing
  • counting
  • normalizing
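
The three steps can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: the regex mirrors the default token_pattern of CountVectorizer, while the function names (tokenize, count, normalize) and the sample sentence are made up for this note.

```python
import math
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern:
# words of two or more alphanumeric characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def count(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)

def normalize(counts):
    """Scale the counts so the vector has unit Euclidean (L2) length."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}

tokens = tokenize("This is the first document. Is this a document?")
counts = count(tokens)
weights = normalize(counts)
```

Note that the single-character word "a" is dropped by the pattern, which is also what the default vectorizer does.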

Sparsity
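
Each document uses only a small part of the corpus vocabulary, so the document-term matrix is mostly zeros; this is why scikit-learn's vectorizers return scipy.sparse matrices. A rough sketch of the idea in plain Python, with a toy corpus invented for this note, storing only the nonzero cells:

```python
from collections import Counter

docs = ["the cat sat", "the dog barked", "a bird sang loudly"]  # toy corpus

# Vocabulary: every distinct word, mapped to a column index.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

# Sparse storage: keep only the (row, column) cells that are nonzero.
nonzero = {}
for row, doc in enumerate(docs):
    for word, n in Counter(doc.split()).items():
        nonzero[(row, vocab[word])] = n

total_cells = len(docs) * len(vocab)
density = len(nonzero) / total_cells  # fraction of cells that hold a count
```

With a realistic vocabulary of tens of thousands of terms, the density drops far below what this three-document toy shows.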

Common Vectorizer usage

TF-IDF term weighting
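
For reference, the weighting that TfidfTransformer and TfidfVectorizer apply with their default settings (smooth_idf=True, norm='l2') can be written as follows. The symbols are chosen for this note: n is the number of documents, df(t) the number of documents containing term t, and v_d the vector of tf-idf values for document d.

```latex
\begin{aligned}
\operatorname{tf\text{-}idf}(t, d) &= \operatorname{tf}(t, d) \cdot \operatorname{idf}(t) \\
\operatorname{idf}(t) &= \ln\frac{1 + n}{1 + \operatorname{df}(t)} + 1 \\
v_d^{\text{norm}} &= \frac{v_d}{\lVert v_d \rVert_2}
\end{aligned}
```

For example, with n = 4 documents and a term that appears in 3 of them, idf = ln(5/4) + 1 ≈ 1.223. A term that appears in every document gets idf = 1, so it is down-weighted relative to rarer terms but not removed.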

Decoding text files
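
Text files are bytes on disk and must be decoded before tokenizing. The scikit-learn vectorizers take an encoding parameter (default 'utf-8') and a decode_error parameter ('strict', 'ignore' or 'replace'). The sketch below uses plain bytes.decode to show what those three modes mean; the sample bytes are made up.

```python
raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8

# 'strict' raises on bytes that do not fit the encoding.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD instead
correct = raw.decode("latin-1")                   # right codec, right text
```

Passing the correct encoding is preferable when it is known; 'ignore' and 'replace' avoid the exception but lose information.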