Creating tag clouds using logarithmic interpolation in Python

The reports of timefyme.com feature a tag cloud representation. One of the most popular and useful ways to create a tag cloud is to calculate the logarithm of each tag's usage count. Using the logarithm rather than the raw usage value results in a smooth gradation from the least to the most used tag.

[Image: the timefyme.com tag cloud]

In order to implement the feature, we searched the usual resources, such as Stack Overflow, for a suitable algorithm, but couldn’t find an approach that works for every data case (e.g. with few tags, or just one). So we went back and refreshed our memories of the interpolation methods we studied in undergraduate numerical analysis courses, and built the (admittedly simple) algorithm ourselves. We offer the algorithm here, which works perfectly for us, for anyone interested in tag clouds.

Let’s assume that the tag usage is a vector \mathbf{u} = [u_1, u_2, \ldots, u_n]; then the minimum and maximum usage values are \min \mathbf{u} and \max \mathbf{u}.

Using the formula:

l_i = \frac{\log u_i - \log \min \mathbf{u}}{\log \max \mathbf{u} - \log \min \mathbf{u}}

we get a linearised value l_i for each tag, varying from 0 to 1. Then we can use l_i, e.g., to calculate grey (RGB) values or to select one of a number of CSS classes.
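For example, l_i can drive a grey-scale colour for each tag. A minimal sketch of both steps (the helper names `linearized` and `grey_rgb` are ours, for illustration only):

```python
import math

def linearized(usages):
    """Map each usage count to a value in [0, 1] via logarithmic interpolation."""
    lo, hi = math.log(min(usages)), math.log(max(usages))
    span = (hi - lo) or 1.0  # avoid division by zero when all usages are equal
    return [(math.log(u) - lo) / span for u in usages]

def grey_rgb(l):
    """Map l in [0, 1] to a grey CSS colour: light grey (rare) to black (popular)."""
    v = int(round((1.0 - l) * 200))  # cap at 200 so the rarest tag stays readable
    return 'rgb(%d, %d, %d)' % (v, v, v)
```

For usages [2, 10, 100] this yields l values of 0.0, roughly 0.41 and 1.0, i.e. rgb(200, 200, 200), rgb(118, 118, 118) and rgb(0, 0, 0).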

A Python implementation of the above methodology follows, code is also available for download.

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vi:expandtab:tabstop=4 shiftwidth=4 textwidth=79

import math

TAG_CLOUD_CSS_CLASSES = 3
"""
Let's have three separate css classes for our example
"""

TAG_CLOUD_CSS_CLASS_TEMPLATE = '.tag-cloud-%d'
"""
This can be a css template, with ".tags-cloud-1" for the less used tag and
".tags-cloud-3" for the most used tag (assuming 3 distinct classes)
"""

def get_tags_cloud(data):
    """
    data should be a list containing tags items. Each item should be a
    dictionary containing at least the `name` of the tag and the number of tag
    occurrences named as `usage`.

    Returns the list sorted by `name`, with two fields added to each item: a
    `css_class` and `log_value`, the linear value resulting from the
    logarithmic interpolation.
    """
    if not data:
        return []

    # Find the maximum and minimum usage
    maximum = max(data, key=lambda x: x['usage'])['usage']
    minimum = min(data, key=lambda x: x['usage'])['usage']

    # Do the math: find the common subtractor and divider and calculate the
    # log value for each tag. Then, treating the log value as linearized, find
    # the integer class from 1 to TAG_CLOUD_CSS_CLASSES
    subtractor = math.log(float(minimum))
    divider = math.log(float(maximum)) - subtractor or 1.0 # 1.0 if min==max
    for item in data:
        log_value = (math.log(float(item['usage']))-subtractor) / divider
        d = int(round(log_value*(TAG_CLOUD_CSS_CLASSES-1) + 1))
        item['css_class'] = TAG_CLOUD_CSS_CLASS_TEMPLATE % d
        item['log_value'] = log_value

    # Sort results by name for displaying tags in an alphabetical order
    return sorted(data, key=lambda x: x['name'])


# An example

if __name__ == '__main__':
    import pprint

    TEST_SET = [
            {'name': 'popular tag', 'usage': 100},
            {'name': 'medium popularity tag', 'usage': 10},
            {'name': 'another medium popularity tag', 'usage': 15},
            {'name': 'obscure tag', 'usage': 2}
        ]

    pprint.pprint(get_tags_cloud(TEST_SET))



# This is a potential Django application of the tag cloud. Django code is
# untested, it is just to prove the concept. Let's assume two models, first
# model represents blog posts and second models tags being used in posts.
# 
# from django.db import models
# from django.db.models import Count
# 
# class BlogPost(models.Model):
# 
#     # Several blog fields defined here...
#     #
# 
#     tags = models.ManyToManyField("Tag", related_name='blog_posts')
# 
# 
# class Tag(models.Model):
# 
#     name = models.CharField(max_length=40)
# 
#     @staticmethod
#     def tags_cloud_data(limit=50):
#         data = (Tag.objects.
#                 values('name').
#                 annotate(usage=Count('blog_posts')).
#                 order_by('-usage')[:limit])
#         data = [item for item in data if item['usage']]
#         return data
# 
#     @staticmethod
#     def tags_cloud(limit=50):
#         return get_tags_cloud(Tag.tags_cloud_data(limit))

By running the example above we get this result:

[{'css_class': '.tag-cloud-2',
  'log_value': 0.5150539804460444,
  'name': 'another medium popularity tag',
  'usage': 15},
 {'css_class': '.tag-cloud-2',
  'log_value': 0.41140808993222105,
  'name': 'medium popularity tag',
  'usage': 10},
 {'css_class': '.tag-cloud-1',
  'log_value': 0.0,
  'name': 'obscure tag',
  'usage': 2},
 {'css_class': '.tag-cloud-3',
  'log_value': 1.0,
  'name': 'popular tag',
  'usage': 100}]
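To display the cloud, the sorted list can then be turned into markup in the template layer; a hypothetical rendering helper (`render_tag_cloud` is our name, not part of the code above) might look like:

```python
def render_tag_cloud(tags):
    """Render tag items (as returned by get_tags_cloud) as HTML spans."""
    spans = []
    for tag in tags:
        # css_class values look like ".tag-cloud-2"; the HTML class
        # attribute needs the name without the leading dot.
        css = tag['css_class'].lstrip('.')
        spans.append('<span class="%s">%s</span>' % (css, tag['name']))
    return ' '.join(spans)
```

Each `.tag-cloud-N` class can then set a different font size in the stylesheet.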

Set the correct python optimizations for online documentation of django-rest-framework API

The timefyme.com API is implemented with the amazing django-rest-framework toolkit. django-rest-swagger is used in addition to render nice API pages with full documentation. The documentation text is discovered automagically by django-rest-framework, by reading the docstrings of the API endpoint view classes.

[Screenshot from 2015-09-10: the rendered API documentation page]

When we started building our API documentation system, everything worked fine in the development environment. However, when we deployed to the production site, although the API listing was there, the documentation text was missing.

The culprit was the -OO Python optimization option used in the uwsgi setup.

https://docs.python.org/2.7/using/cmdline.html#cmdoption-O

`-OO` discards docstrings in addition to the `-O` optimizations.

We fixed the issue by setting, in the uwsgi.ini of the production system, the optimize value that corresponds to -O:

optimize = 1

See: http://uwsgi-docs.readthedocs.org/en/latest/Options.html#optimize
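Such a mismatch is easy to detect at start-up: `sys.flags.optimize` reports the optimization level the interpreter is running under, so an application that relies on docstrings can warn when the level is 2 (the level set by -OO). A small sketch:

```python
import sys

def docstrings_available():
    """Return False when Python runs with -OO (or PYTHONOPTIMIZE=2),
    which strips docstrings and so breaks docstring-based API docs."""
    return sys.flags.optimize < 2

# A sanity check that could run when the application starts:
if not docstrings_available():
    print('Warning: running under -OO, API docstrings have been stripped')
```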


Your Computer is Made Out of Magic!

About ten or more years ago I found this nice post, and I am so happy that it is still there! http://james.hamsterrepublic.com/technomancy/ I suggest it as reading for anyone with some reserves of geek humour. It also gave me the necessary food to make the initial post of my blog.

When I first read it I got my primary super-powers, being introduced to the world of Open Sourcery and Voodoo Debugging. Since then I have practised a lot of Voodoo Debugging but, to be an honest magician, that didn’t help me become a better sourcerer. So I quit Voodoo, and I try to seek the true and pure software magic.

http://james.hamsterrepublic.com/technomancy/

  • Open Sourcery
  • Open Sourcery
    Open Sourcery is the new magical approach to software design that is replacing the old machine-minded methods. Basically, it works like this; Someone sets up a CVS repository and a bug tracking system, and a mailing list, and most importantly a website to state the goals and status of the project. Then as many Open Sourcerers as possible start arguing about what the software should actually do (positive energy), and complaining that it isn’t being done fast enough (negative energy). Eventually, the software will write itself, and will continue to evolve itself gradually until it reaches the stage of maturity known to Open Sourcerers as Alpha (which is Latin for “Done”). Occasionally a piece of software will continue to grow beyond the alpha stage until it becomes Beta (which is Latin for “I’m bored, lets do something else”)
  • Voodoo Debugging
    Both hardware support and software testing can benefit from the skill of Voodoo Debugging. It’s very simple. When a problem arises, start changing things randomly. Occasionally re-test the problem, and as soon as it goes away, the last thing you changed becomes the cure. Repeat the last fix on every computer you can find, including and especially ones that never had the problem in the first place. This magic can be aided by chanting such mantras as “I always change this setting in the BIOS and it seems to help”