Top 30 most frequent colour words in the English language.

The most frequent colour terms were purple, pink, blue and green, while nonbasic terms such as turquoise, lilac, violet and magenta also occurred in the top ten. The most common use of multiple-word descriptions involved light blue, light green, dark green and dark blue, revealing a preference for modifiers over secondary terms in the blue and green regions.

One colour sample was repeated twice in each session, to explore the consistency of unconstrained colour naming responses. Participants were not informed about the repeated colour samples, and the gap between each repetition was more than 10 colour choices. The most consistent BCTs were blue, green and purple. In addition, lilac, ochre, teal and turquoise were the most consistent nonbasic colour terms. The least consistent BCTs were white and orange. To establish the size of each colour category according to the distribution of its exemplars in perceptual colour spaces, we used a parametric model to label all 1729 samples of the Munsell Renotation Dataset located within the sRGB gamut.

Sergios Theodoridis, in Machine Learning, 2015

11.15 A Case Study: Authorship Identification

Text mining is the part of data mining that analyzes textual data to detect, extract, represent, and evaluate patterns that appear in texts and can be transformed into real-world knowledge. A few examples of text mining applications include spam e-mail detection, topic-based classification of texts, sentiment analysis in texts, text authorship identification, text indexing, and text summarization. In other words, the goal can be to detect basic morphological idiosyncrasies in a text that identify its author (authorship identification), to identify complexly expressed emotions (sentiment analysis), or to condense redundant information from multiple texts into a concise summary (summarization). In this section, our focus will be on authorship identification, which is a special case of text classification. The task is to determine the author of a given text, given a set of training texts labeled with their corresponding authors. To fulfill this task, one needs to represent and act upon textual data. Thus, the first decision related to text mining is how one represents the data. It is very common to represent texts in a vector space, following the vector space model (VSM). According to the VSM, a text document, T, is represented by a set of terms w_i, 0 ≤ i ≤ k, k ∈ ℕ, each mapped to a dimension of the vector t = T ∈ ℝ^k. The elements, t_i, in each dimension indicate the importance of the corresponding term w_i when describing the document. For example, each w_i can be one word in the vocabulary of a language. Typical values of k are of the order of 10^5. Thus, in practical applications, k can be very large, as large as the number of words that are considered sufficient for the specific task of interest. Widely used approaches to assign values to t_i are as follows:
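As a minimal sketch of the VSM representation described above, the following Python code maps each document of a toy corpus to a vector t ∈ ℝ^k with one dimension per vocabulary term w_i. The specific weighting schemes the text refers to are not listed here, so this example assumes one common choice, the term-frequency–inverse-document-frequency (TF-IDF) weight, purely for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a vector t in R^k (k = vocabulary size),
    where t_i weights the importance of term w_i for that document."""
    # Vocabulary: one dimension per distinct term w_i across the corpus.
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Document frequency of each term, needed for the IDF factor.
    df = Counter(w for doc in docs for w in set(doc.split()))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())       # raw term frequencies in this document
        t = [0.0] * len(vocab)
        for w, f in tf.items():
            # t_i = TF * IDF: large when the term is frequent in this
            # document but rare in the corpus as a whole.
            t[index[w]] = f * math.log(n / df[w])
        vectors.append(t)
    return vocab, vectors

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vecs = tfidf_vectors(corpus)
```

Note that a term occurring in every document (here, "the") receives weight 0 in every vector, since its IDF factor is log(n/n) = 0; in a realistic setting the vocabulary, and hence k, would be on the order of 10^5, and the vectors would be stored sparsely.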