Table 7 Parametrization of NSC Model. Optimal settings are typeset in bold

From: Enriching feature engineering for short text samples by language time series analysis

| Process | Module | Parameters | Values |
|---|---|---|---|
| Preprocessing | None | N/A | |
| | Raw (light preprocessing) | Remove words not appearing in all authors’ writings | |
| | Preprocess (heavy preprocessing) | Remove words not appearing in all authors’ writings and with a relative frequency less than 0.05 percent | |
| Representation | Word n-grams | N-gram range | Start = 1–End = 2 |
| | | Minimum term frequency | 1 (use all terms) |
| | | Maximum term frequency | 1.0 (no limit) |
| Vectorize | Count vectorizer | All set to default | |
| Classifier | Nearest Shrunken Centroids (NSC) | Shrink threshold | Tuned by GridSearchCV using a 10-fold cross-validation |
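
The representation, vectorization, and classifier settings listed above could be wired together roughly as follows. This is a minimal sketch assuming scikit-learn, not the authors' code: mapping the n-gram range and the minimum/maximum term frequency onto the `ngram_range`, `min_df`, and `max_df` arguments of `CountVectorizer` is an assumption, the helper names `to_dense` and `build_nsc_search` are hypothetical, and the candidate shrink thresholds are supplied by the caller because the table does not list them. The preprocessing step is left out. The dense conversion is included because, as far as we know, `NearestCentroid` does not apply threshold shrinkage to sparse input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert the sparse count matrix to a dense array (hypothetical helper)."""
    return X.toarray()


def build_nsc_search(thresholds):
    """Build a grid search over the NSC shrink threshold (hypothetical helper).

    `thresholds` is a caller-supplied list of candidate values; the table
    only states that the threshold is tuned, not which values are tried.
    """
    pipeline = Pipeline([
        # Word uni- and bigrams; min_df=1 and max_df=1.0 keep every term.
        ("counts", CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0)),
        # Assumption: densify so that shrinkage can be applied.
        ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
        # Nearest Shrunken Centroids classifier.
        ("nsc", NearestCentroid()),
    ])
    param_grid = {"nsc__shrink_threshold": list(thresholds)}
    # 10-fold cross-validation, as stated in the table.
    return GridSearchCV(pipeline, param_grid, cv=10)
```

Calling `build_nsc_search([0.1, 0.5, 1.0]).fit(texts, authors)` on a list of text samples and their author labels would then select the threshold with the best cross-validated score; the three values shown here are placeholders, not the grid used in the paper.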