As a consultant and freelance programmer, I need to know which skills are most in demand in my area of interest. Until now I scanned contract openings and tried to remember how often certain skills appeared in the specifications. But people are bad intuitive statisticians, as Daniel Kahneman’s research has shown. At least I am. It is hard for me to objectively aggregate statistical information in my head, and it’s kind of boring to keep notes and crunch the numbers manually. It’s much more fun to have a picture, which, as they say, is worth a thousand words. So I wanted to visualize the demand for skills as a tag cloud.
First I dump all skill specifications into a file, skills.txt. Skills are separated by commas or slashes. Leading and trailing dashes and spaces are insignificant. Here is a sample:
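For illustration, the sketch below shows a file in that format and one way its lines could be split into terms. The sample lines and the parse_skills helper are made up for this example, not taken from real contract openings. Lowercasing the terms is an extra assumption, so that Java and java count as the same skill.

```python
import re

# Hypothetical stand-in for the contents of skills.txt.
SAMPLE = """\
- Java, Spring / Hibernate
Scala, Akka, Java -
 - Python/Django
"""


def parse_skills(text):
    """Split text into skill terms on commas and slashes."""
    terms = []
    for line in text.splitlines():
        for raw in re.split(r"[,/]", line):
            # Leading and trailing dashes and spaces are insignificant.
            term = raw.strip(" \t-").lower()
            if term:
                terms.append(term)
    return terms


print(parse_skills(SAMPLE))
```

Note that splitting on slashes also breaks up terms that contain one, which is a consequence of the file format rather than of the code.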
A cloud image is generated with a small Python script using the wordcloud package, which has to be installed either with pip or from source.
Getting a cloud image with wordcloud is pretty straightforward.
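The generate_from_frequencies method expects a list of tuples, where the first member of each tuple is a term and the second is the term’s frequency, for example [('java', 15), ('scala', 12), ('spring', 12)]. (Newer releases of wordcloud expect a dict mapping each term to its frequency instead.)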
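One way to build such a list from parsed terms is collections.Counter from the standard library. This is a minimal sketch: count_terms is a hypothetical helper name, and the input list is invented to reproduce the numbers above.

```python
from collections import Counter


def count_terms(terms):
    """Return (term, frequency) tuples, most frequent first."""
    return Counter(terms).most_common()


# Made-up input that reproduces the frequencies used in the text.
terms = ["java"] * 15 + ["scala"] * 12 + ["spring"] * 12
frequencies = count_terms(terms)
print(frequencies)  # [('java', 15), ('scala', 12), ('spring', 12)]
```

Terms with equal counts keep the order in which they were first seen, so scala stays ahead of spring here.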
There are two ways to assign weights to the terms: by rank and by frequency. The relative_scaling parameter defines the ratio in which these two strategies are mixed. With relative_scaling=0, only ranks are considered. With relative_scaling=1, only frequencies are considered. In the example above, java has rank 1, scala rank 2, and spring rank 3, whereas the frequencies are 15, 12, and 12.
When the frequency distribution is smooth, with many terms sharing the same frequency, the cloud looks better if the frequency-based strategy gets more influence, so that differences in font size are not huge. In this example relative_scaling is set to 0.8.
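To get a feel for the mix, here is a rough model of how relative_scaling could turn frequencies into font sizes. It is a simplified illustration, not the library's layout code: each term's size is the previous term's size multiplied by a blend of the frequency ratio and 1. The real library also adjusts sizes so that words fit on the canvas, which this sketch ignores. The font_sizes helper and the starting size of 100 are assumptions.

```python
def font_sizes(frequencies, relative_scaling, start_size=100):
    """Simplified model: blend the frequency ratio with 'keep the previous size'."""
    sizes = []
    size = start_size
    previous = frequencies[0][1]
    for term, freq in frequencies:
        # relative_scaling=1 follows the frequency ratio exactly;
        # relative_scaling=0 leaves the size unchanged in this model.
        factor = relative_scaling * (freq / previous) + (1 - relative_scaling)
        size = round(size * factor)
        sizes.append((term, size))
        previous = freq
    return sizes


frequencies = [("java", 15), ("scala", 12), ("spring", 12)]
for rs in (0, 0.8, 1):
    print(rs, font_sizes(frequencies, rs))
```

In this model, relative_scaling=1 gives sizes of 100, 80, 80 and relative_scaling=0.8 gives 100, 84, 84, so a lower value pulls neighbouring terms closer together in size.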
The generated image is saved to a file and displayed on screen. The lines displaying the image can be removed if interactivity is not needed or the matplotlib package is not installed.
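The steps described above (generate, save, optionally display) might look roughly like the sketch below. It is not the original script: render_cloud, as_mapping, the output file name and the show flag are hypothetical, while WordCloud, generate_from_frequencies, to_file and to_array come from the wordcloud package. Both third-party imports sit inside the function, so matplotlib is only needed when show is true.

```python
def as_mapping(pairs):
    """Accept (term, frequency) tuples or a dict and return a dict."""
    return dict(pairs)


def render_cloud(pairs, out_path="skills.png", relative_scaling=0.8, show=False):
    """Generate a tag cloud, save it to out_path and optionally display it."""
    from wordcloud import WordCloud  # third-party package

    cloud = WordCloud(relative_scaling=relative_scaling)
    # A dict is passed so this works with releases that expect a mapping.
    cloud.generate_from_frequencies(as_mapping(pairs))
    cloud.to_file(out_path)

    if show:
        # Only needed for interactive display.
        import matplotlib.pyplot as plt

        plt.imshow(cloud.to_array(), interpolation="bilinear")
        plt.axis("off")
        plt.show()
    return cloud
```

Calling render_cloud([('java', 15), ('scala', 12), ('spring', 12)], show=True) would write skills.png and open a window with the image, while leaving show at its default skips matplotlib entirely.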
More things to do
- Ignore comment lines starting with #, so that dates and job details can be kept in comments.
- Read multiple files.
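Both items could be handled along the lines of the sketch below. The helper names read_terms and count_files are hypothetical, and treating indented lines starting with # as comments, as well as lowercasing terms, are assumptions.

```python
import re
from collections import Counter


def read_terms(lines):
    """Yield skill terms, skipping comment lines that start with '#'."""
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # comment line, e.g. a date or job details
        for raw in re.split(r"[,/]", line):
            term = raw.strip(" \t\r\n-").lower()
            if term:
                yield term


def count_files(paths):
    """Aggregate term frequencies over several files."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            counts.update(read_terms(handle))
    return counts.most_common()
```

With this in place, the file names could come from the command line, for example count_files(sys.argv[1:]) after importing sys.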