Text Mining

Our studies use recent text mining (text analytics) techniques, which allow textual information to be processed to a high standard. Text mining tools identify structured patterns in text by applying natural language processing (NLP) and machine learning techniques.

The following example applies text mining techniques to investigate the attitudes of 21,104 workers towards improving collaboration in the workplace. To extract as much relevant information as possible, we typically use the following techniques:

1. Word frequency analysis: it lists the most relevant words, excluding articles, prepositions and conjunctions.


The most frequent words when the topic is “collaborative culture” revolve around people, employees, work, team and culture.
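The analyses reported here were run in R; as an illustration only, the counting step can be sketched in Python. This is a minimal sketch and not the authors' implementation: the stop-word list, the function names and the two sample responses are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (articles, prepositions,
# conjunctions and a few auxiliaries); a real study would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "but", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(responses, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all responses."""
    counts = Counter()
    for response in responses:
        counts.update(t for t in tokenize(response) if t not in stop_words)
    return counts


# Invented responses, for illustration only.
responses = [
    "People and teams need a culture of trust.",
    "The team culture is built on people.",
]
freq = word_frequencies(responses)
print(freq.most_common(3))
```

Note that this sketch treats "team" and "teams" as different words; stemming or lemmatisation would be a separate preprocessing choice.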

2. Analysis of co-occurrence of words (single sentence): combinations of relevant words are explored in sequence.


The main words that are repeated in the sentences above are employees and people, which suggests that the development of a collaborative culture permeates the people of the organization and their group.
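One way to picture sentence-level co-occurrence is to count, for every pair of relevant words, the number of sentences in which both appear. The Python sketch below does this under stated assumptions: sentences are split on ".", "!" and "?", pairs are unordered, and the stop-word list and sample text are invented. It is not the R procedure used in the study.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "or"}


def sentence_cooccurrences(text, stop_words=STOP_WORDS):
    """Count unordered word pairs that occur together in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # A set, so each pair is counted at most once per sentence.
        words = {w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in stop_words}
        pairs.update(combinations(sorted(words), 2))
    return pairs


# Invented text, for illustration only.
text = "Employees value people. People support employees and teams."
pairs = sentence_cooccurrences(text)
print(pairs.most_common(3))
```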

3. Analysis of co-occurrence of words (distance of up to three words): combinations of relevant words with a distance of up to three words are explored in sequence.


The analysis of words up to three words apart points to two main aspects in developing a collaborative culture: the need for better communication in the work environment, and the involvement of working groups from different areas as well as from the same area.
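The windowed variant can be sketched by pairing each word only with the words that follow it within a fixed distance. The Python example below is a hedged illustration: it assumes the distance is measured on the token list passed in (here already stripped of stop words), and the parameter name `max_distance` and the sample tokens are invented.

```python
from collections import Counter


def window_cooccurrences(tokens, max_distance=3):
    """Count unordered pairs of distinct words at most `max_distance` tokens apart."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so every pair of positions is visited once.
        for right in tokens[i + 1 : i + 1 + max_distance]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Invented token sequence, for illustration only.
tokens = ["better", "communication", "across", "different", "areas"]
pairs = window_cooccurrences(tokens)
print(sorted(pairs))
```

In this sample, "better" and "areas" are four tokens apart and therefore do not form a pair, while "communication" and "areas" are exactly three apart and do.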

4. Word networks: based on graph theory, it builds a network of words from the complex relationships and connections between the most significant words.


In addition to the elements listed above, the word network highlights important additional points such as the need for greater job security, the creation of common goals within teams and the application of new ideas.
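A word network of this kind can be represented as a weighted graph in which words are nodes and co-occurrence counts are edge weights. The following Python sketch is illustrative only: the pair counts are invented, the `min_weight` threshold is a hypothetical parameter, and weighted degree is just one simple way to spot hub words.

```python
from collections import defaultdict


def build_word_network(pair_counts, min_weight=1):
    """Turn pair counts into an undirected weighted graph (dict of dicts)."""
    graph = defaultdict(dict)
    for (a, b), weight in pair_counts.items():
        if weight >= min_weight:
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


def weighted_degree(graph):
    """Sum the edge weights attached to each word."""
    return {word: sum(neighbours.values()) for word, neighbours in graph.items()}


# Invented co-occurrence counts, for illustration only.
pair_counts = {
    ("goals", "team"): 3,
    ("ideas", "team"): 2,
    ("job", "security"): 2,
    ("security", "team"): 1,
}
graph = build_word_network(pair_counts)
degree = weighted_degree(graph)
print(max(degree, key=degree.get))
```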

5. Networks of words with directional relations: it complements the word network by showing the direction of the relations between words, derived from simulating movements and causalities between the variables of the network.


Complementing the word network, the idea of having an open plan office was perceived as facilitating communication within the group. Another important point concerns the need to create a culture of collaboration that transcends the boundaries of individuals, the area, the department and even the organisation.
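The source does not say how direction is derived in the study. As a simplified illustration, the Python sketch below assumes that an edge points from a word to the word that immediately follows it, which is only one possible definition. The stop-word list and the sentences are invented.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in"}


def directed_edges(sentences, stop_words=STOP_WORDS):
    """Count directed edges (word -> next word) within each sentence."""
    edges = Counter()
    for sentence in sentences:
        tokens = [t for t in re.findall(r"[a-z]+", sentence.lower())
                  if t not in stop_words]
        edges.update(zip(tokens, tokens[1:]))
    return edges


# Invented sentences, for illustration only.
sentences = ["An open plan office", "The open office improves communication"]
edges = directed_edges(sentences)

# Out-degree: total weight of edges leaving each word.
out_degree = Counter()
for (source, _target), weight in edges.items():
    out_degree[source] += weight
print(out_degree.most_common(2))
```

Unlike the undirected network, the edge from "open" to "plan" here does not imply an edge from "plan" to "open".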

6. Text clustering: it applies cluster analysis to the categorization and consequent grouping of words with conceptual relations between them.


In the cluster of words, two large branches can be observed. The first focuses on managerial aspects of the organization and relates to the development of new ideas, decision making and greater flexibility at work. The second points to the importance of developing the collaborative culture at a more micro level, centred on groups and individuals, with the participation of external actors.
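The branching structure described above is typical of hierarchical clustering. The Python sketch below is one hedged way to group words: it assumes each word is described by the set of responses containing it, uses Jaccard distance with average linkage, and stops at a chosen number of clusters. The study's actual distance measure and linkage are not stated in the source, and the data are invented.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of documents the two words have in common."""
    return 1 - len(a & b) / len(a | b)


def cluster_words(word_docs, n_clusters=2):
    """Agglomerative clustering of words with average linkage.

    `word_docs` maps each word to the non-empty set of document ids containing it.
    """
    clusters = [[word] for word in sorted(word_docs)]

    def linkage(c1, c2):
        dists = [jaccard_distance(word_docs[a], word_docs[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters; i < j, so deleting j keeps i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented word-to-response mapping, for illustration only.
word_docs = {
    "ideas": {1, 2}, "decision": {1, 2}, "flexibility": {2},
    "groups": {3, 4}, "individuals": {3, 4}, "external": {4},
}
clusters = cluster_words(word_docs)
print(clusters)
```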

7. Analysis of word communities: it complements the network of words with directional relations when using a community detection algorithm. A community is a grouping of words with strong connections to each other and weaker relationships with other groups of words.


The analysis of communities reinforces the previous observations on the importance of developing teams through shared common goals. The central community concerns the need to involve people with more skills or external members, including the possibility of involvement in other areas and departments.
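The source does not name the community detection algorithm it uses. As an illustration of the idea, the Python sketch below applies a simple, deterministic form of label propagation to an undirected weighted graph: each word repeatedly adopts the label that carries the most edge weight among its neighbours. Ignoring edge direction is a simplification, and the edges are invented.

```python
from collections import defaultdict


def build_graph(edges):
    """Build a symmetric weighted graph from (word, word, weight) triples."""
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)


def label_propagation(graph, max_iter=20):
    """Group words by repeatedly adopting the heaviest neighbouring label."""
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):  # fixed order keeps the result reproducible
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            # Highest total weight wins; ties go to the alphabetically first label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return sorted(communities.values(), key=sorted)


# Invented edges: two tightly linked triangles joined by one weak edge.
edges = [
    ("team", "goals", 3), ("team", "sharing", 3), ("goals", "sharing", 3),
    ("skills", "external", 3), ("skills", "members", 3), ("external", "members", 3),
    ("sharing", "skills", 1),
]
communities = label_propagation(build_graph(edges))
print(communities)
```

The weak link between "sharing" and "skills" is not enough to merge the two groups, which matches the definition of a community as words with strong internal and weaker external connections.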

8. Sentiment analysis: identification and categorization of opinions expressed in a text, especially to determine whether the respondent’s attitude toward a topic or product is positive or negative.


This graph shows the keywords most associated with positive and negative feelings about collaborative culture. The sentiment analysis suggests that some of the more general negative words relate to hard problems and a lack of collaboration. Greater support, respect, trust and transparency were the positive elements identified as necessary to promote better collaboration at work.
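A very simple form of sentiment analysis counts words from a positive and a negative lexicon. The Python sketch below is a minimal illustration, assuming two tiny hypothetical lexicons seeded with words mentioned in this article; it is not the method or the lexicon used in the study, and it ignores negation and context.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"support", "respect", "trust", "transparency"}
NEGATIVE = {"problems", "lack", "issues", "mistakes"}


def polarity(text):
    """Return (score, label), where score = positive hits minus negative hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


# Invented responses, for illustration only.
print(polarity("We need more trust and respect"))
print(polarity("Lack of collaboration causes problems"))
```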

9. Sentiment analysis word cloud: it shows the most recurring positive and negative feelings, as well as the polarity of each feeling according to its position and width in the word cloud.


Complementing the previous graph, the analysis of negative feelings pointed to a greater general use of adjectives than nouns to describe the culture of collaboration, which suggests greater dissatisfaction with this dimension. Words such as issues, problems and mistakes appear alongside difficulty, lack, loss, effort and challenges in fostering a culture of collaboration.

Comparisons between different dimensions

If more than one dimension is being studied and potential relations between them are pursued, further analyses can be carried out:

1. Descending Hierarchical Classification (DHC): it uses a cluster analysis to define word classes when different qualitative questions are compared to each other.

In the following example, the “collaborative culture” dimension is compared to the “training and career development opportunities” dimension. The DHC findings show that three classes emerged instead of two. Class 2 (green), linked to collaborative culture, joined Class 3 (blue), which reveals negative elements linked to the culture of collaboration as well as concerns about labour issues, including work benefits such as health care costs and retirement plans. Class 1 (red) clearly relates to the training and career development opportunities dimension.


2. Correspondence analysis (CA): analogous to principal component analysis in that it extracts factors, CA provides a means of displaying a two-way table summarising a set of data in two-dimensional graphical form.


The results of the correspondence analysis show that Class 1 (red) has its words positively positioned on factor 1, whereas the words of the collaborative culture dimension (Class 2, green) load onto the second factor. Class 3 (blue) occupies a negative position in relation to both factors 1 and 2, though it is on the same side as Class 2 (green).
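For readers who want the computation behind such a map, the standard formulation of correspondence analysis is summarised below. The notation is ours and not taken from the study: N is the two-way table (assumed here to be words by classes), n its grand total, r and c the row and column masses, and D_r and D_c the diagonal matrices built from those masses.

```latex
\begin{aligned}
P &= \tfrac{1}{n}\,N, \qquad r = P\mathbf{1}, \qquad c = P^{\top}\mathbf{1},\\
S &= D_r^{-1/2}\,\bigl(P - r c^{\top}\bigr)\,D_c^{-1/2} = U \Sigma V^{\top},\\
F &= D_r^{-1/2}\,U \Sigma, \qquad G = D_c^{-1/2}\,V \Sigma .
\end{aligned}
```

The first two columns of F (rows) and G (columns) give the coordinates plotted as factors 1 and 2, and the squared singular values in Σ sum to the total inertia, which equals the chi-square statistic of the table divided by n.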

3. Similitude analysis: it presents a network analysis of the DHC classes, using their most statistically significant words integrated as a single set. It shows a graph representing the links between the words of the textual corpus. From the co-occurrence between words, this analysis makes it possible to infer how the text is structured and which topics are relatively important.


For the dimension collaborative culture, the word “work” is at the centre of the network, connecting other elements such as persons and team goal.


Advances in text mining research cut across the development of new techniques in natural language processing (NLP) and machine learning. The examples shown here, carried out using R statistical software, represent some of the traditional and modern developments in these fields, with broader applications to organisational behaviour and HRM practices. This is not an exhaustive list of the techniques that can be used to understand how humans express themselves through words. For a more comprehensive list of the latest techniques in NLP, please refer to this website.