From: Enriching feature engineering for short text samples by language time series analysis
Process | Module | Parameters | Values |
---|---|---|---|
Preprocessing | None | N/A | |
Preprocessing | Raw (light preprocessing) | Remove words not appearing in all authors’ writings | |
Preprocessing | Preprocess (heavy preprocessing) | Remove words not appearing in all authors’ writings and with a relative frequency less than 0.05 percent | |
Representation | Word n-grams | N-gram range | Start = 1–End = 2 |
Representation | Word n-grams | Minimum term frequency | 1 (use all terms) |
Representation | Word n-grams | Maximum term frequency | 1.0 (no limit) |
Vectorize | Count vectorizer | All set to default | |
Classifier | Nearest Shrunken Centroids (NSC) | Shrink threshold | Tuned by GridSearchCV using a 10-fold cross-validation |
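
The light preprocessing and word n-gram rows of the table can be illustrated with a small plain-Python sketch. It is not the authors' implementation: the whitespace tokenization, the `docs_by_author` layout, the toy corpus, and the helper names `light_preprocess` and `word_ngrams` are all assumptions made for illustration. A library count vectorizer may tokenize and normalize text differently.

```python
from collections import Counter


def light_preprocess(docs_by_author):
    """Keep only words that occur in every author's writings.

    Assumes docs_by_author is a non-empty dict mapping an author name
    to a list of documents, and that words are separated by whitespace.
    """
    vocab_per_author = [
        {word for doc in docs for word in doc.split()}
        for docs in docs_by_author.values()
    ]
    shared = set.intersection(*vocab_per_author)
    return {
        author: [" ".join(w for w in doc.split() if w in shared) for doc in docs]
        for author, docs in docs_by_author.items()
    }


def word_ngrams(text, start=1, end=2):
    """Count word n-grams for every n from start to end (inclusive)."""
    tokens = text.split()
    counts = Counter()
    for n in range(start, end + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy corpus: two authors, one short text each.
corpus = {
    "author_a": ["the cat sat on the mat"],
    "author_b": ["the dog sat on the rug"],
}

cleaned = light_preprocess(corpus)
features = word_ngrams(cleaned["author_a"][0], start=1, end=2)
```

Here only "the", "sat", and "on" occur in both authors' texts, so every other word is dropped before the unigram and bigram counts are taken. The heavy preprocessing variant would add a relative-frequency cutoff on top of this filter, and the shrink threshold of the classifier would then be tuned on the resulting count vectors.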