Visualizing a cloud of skills with Python

As a consultant and freelance programmer, I need to know which skills are most in demand in my area of interest. Until now I scanned contract openings and tried to remember how often certain skills appeared in the specifications. But people are bad intuitive statisticians, as Daniel Kahneman has shown. At least I am. It is hard for me to objectively aggregate statistical information in my head, and it's rather boring to keep notes and crunch the numbers manually. It's much more fun to have a picture, which, as they say, is worth a thousand words. So I wanted to visualize the demand for skills as a tag cloud.

Input

First I dump all skill specifications into a file, skills.txt. Skills are separated by commas or slashes; leading and trailing dashes and spaces are insignificant.

Here is a sample:

- C++
- Forex/FX, J2EE, Spring, Hibernate, REST
- Clojure
- Scala

Java Enterprise
JEE/J2EE
Hibernate
Spring
Struts

Java 7, Java 8, Agile, TDD, NoSQL
Kafka, Scala Test, Scala, Python, Hazelcast, Cassandra, Docker
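To make the parsing rules concrete, here is how one of the sample lines would split into terms; the script below applies the same splitting and normalization to every line:

import re

line = "- Forex/FX, J2EE, Spring, Hibernate, REST"
terms = [t.strip(' -').lower() for t in re.split("[,/]+", line)]
# terms == ['forex', 'fx', 'j2ee', 'spring', 'hibernate', 'rest']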

Script

A cloud image is generated with a small Python script using the wordcloud package, which has to be installed either with pip

pip install wordcloud

or from source.
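A source install can also be done with pip by pointing it at the project's Git repository (assuming the upstream GitHub repository; adjust the URL if you use a fork):

pip install git+https://github.com/amueller/word_cloud.git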

Getting a cloud image with wordcloud is pretty straightforward. The generate_from_frequencies method expects a mapping from terms to their frequencies; recent versions take a dict such as {'java': 15, 'scala': 12, 'spring': 12} (older releases took a list of (term, frequency) tuples instead).
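As a minimal, self-contained sketch (with made-up counts and a hypothetical output file name), rendering a cloud from such a mapping takes only a few lines:

from wordcloud import WordCloud

frequencies = {"java": 15, "scala": 12, "spring": 12}  # made-up counts for illustration
cloud = WordCloud(width=400, height=300).generate_from_frequencies(frequencies)
cloud.to_file("example-cloud.png")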

There are two ways to assign weights to the terms: by rank and by frequency. The relative_scaling parameter defines the ratio in which these two strategies are mixed. With relative_scaling=0, only ranks are considered; with relative_scaling=1, only frequencies are considered. In the example above, java has rank 1, scala rank 2, and spring rank 3, while the frequencies are 15, 12, and 12.

When the frequency distribution is smooth, with many terms sharing the same frequency, the cloud looks better if the frequency-based strategy gets more influence, so that differences in font size are not too dramatic. In this example relative_scaling is set to 0.8.

import re
from collections import Counter

from wordcloud import WordCloud


with open("skills.txt") as f:
    lines = f.read().splitlines()

words = []

# split terms separated by commas and slashes
for line in lines:
    words.extend(re.split("[,/]+", line))

# strip leading and trailing whitespace and dashes, normalize case
words = [w.strip(' -').lower() for w in words]

# remove stopwords and empty strings
stop_words = {"agile", "tdd", "bdd"}
words = [w for w in words if w and w not in stop_words]

# count frequencies
cnt = Counter(words)

# generate a cloud image prioritizing frequency over rank with weight 0.8
wordcloud = WordCloud(width=800, height=600, relative_scaling=.8)\
    .generate_from_frequencies(cnt)
wordcloud.to_file("skills-cloud.png")

# Display the image. Remove the lines below if you don't have matplotlib installed 
# and don't want the interactive display. 
import matplotlib.pyplot as plt

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The image

The generated image is saved to a file and displayed on screen. The lines displaying the image can be removed if interactivity is not needed or the matplotlib package is not installed.

[Image: generated cloud image (skills-cloud.png)]

More things to do

  • Ignore comment lines starting with #, so that dates and job details can be kept in comments (a sketch of this and the next item follows below).
  • Read multiple files.
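Neither extension is implemented yet. A minimal sketch of both, assuming # as the comment marker and input files matching skills*.txt, might look like this:

import glob

lines = []
for path in glob.glob("skills*.txt"):
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            # skip comment lines carrying dates and job details
            if line.lstrip().startswith("#"):
                continue
            lines.append(line)

# 'lines' can then feed the same splitting and counting code as in the script above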
  1. Daniel Kahneman
  2. wordcloud on GitHub
  3. Tag cloud