
Table 7. Parametrization of the NSC model. Optimal settings are typeset in bold.

From: Enriching feature engineering for short text samples by language time series analysis

| Process | Module | Parameters | Values |
|---|---|---|---|
| Preprocessing | None | N/A | |
| | Raw (light preprocessing) | Remove words not appearing in all authors’ writings | |
| | Preprocess (heavy preprocessing) | Remove words not appearing in all authors’ writings and with a relative frequency less than 0.05 percent | |
| Representation | Word n-grams | N-gram range | Start = 1, End = 2 |
| | | Minimum term frequency | 1 (use all terms) |
| | | Maximum term frequency | 1.0 (no limit) |
| Vectorize | Count vectorizer | All set to default | |
| Classifier | Nearest Shrunken Centroids (NSC) | Shrink threshold | Tuned by GridSearchCV using a 10-fold cross-validation |
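
The configuration in the table can be sketched roughly as a scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the authors' code: the table names "Count vectorizer" and "GridSearchCV", which suggests scikit-learn, so the sketch uses `CountVectorizer`, `NearestCentroid` (whose `shrink_threshold` parameter gives nearest shrunken centroids), and `GridSearchCV`. Mapping "minimum/maximum term frequency" to `min_df`/`max_df` is an assumption, as are the toy corpus, the author labels, the candidate threshold grid, and the `to_dense` helper (added because `NearestCentroid` does not support threshold shrinking on sparse input).

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert the sparse count matrix to a dense array (hypothetical helper)."""
    return X.toarray()


# Toy two-author corpus, invented purely for illustration.
rng = random.Random(0)
shared = ["the", "and", "of", "in", "to"]
vocab_a = ["storm", "river", "stone", "winter", "forest"]
vocab_b = ["market", "price", "trade", "profit", "stock"]


def make_doc(vocab):
    # Each toy document mixes author-specific and shared words.
    return " ".join(rng.choice(vocab + shared) for _ in range(12))


texts = [make_doc(vocab_a) for _ in range(20)] + [make_doc(vocab_b) for _ in range(20)]
labels = ["author_a"] * 20 + ["author_b"] * 20

pipeline = Pipeline([
    # Word uni- and bigrams; min_df=1 and max_df=1.0 keep every term.
    ("counts", CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0)),
    # Shrinkage in NearestCentroid requires dense input.
    ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("nsc", NearestCentroid()),
])

# Candidate thresholds are assumptions; the table does not list the grid.
param_grid = {"nsc__shrink_threshold": [0.1, 0.5, 1.0, 2.0]}

# 10-fold cross-validation, as stated in the table.
search = GridSearchCV(pipeline, param_grid, cv=10)
search.fit(texts, labels)

best_threshold = search.best_params_["nsc__shrink_threshold"]
prediction = search.predict(["the river and the winter storm"])[0]
print("best shrink threshold:", best_threshold)
print("predicted author:", prediction)
```

Because the vectorizer sits inside the pipeline, its vocabulary is refit on each training fold, so no terms from the held-out fold leak into the threshold selection.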