Maven: Building a Self-Contained Hadoop Job

Posted July 24, 2010 by mafr
Categories: java

Non-trivial Hadoop jobs usually have dependencies that go beyond those provided by the Hadoop runtime environment. If your job needs additional libraries, you have to make sure they are on Hadoop’s classpath when the job is executed. This article shows how you can build a self-contained job JAR that contains all your dependencies.

Berlin Buzzwords Conference 2010

Posted June 9, 2010 by mafr
Categories: misc

This week I attended Berlin Buzzwords Conference 2010, a two-day event aimed at software developers. The conference offered two tracks, one on search and the other one on NoSQL systems. Typical attendees seemed to be MacBook-wielding, twittering lifestyle geeks, often with SQL-induced childhood issues. The hype level was high – a bit too high for my taste – but given the conference’s title that was to be expected.

Quick Tip #4: Sorting Large Files

Posted May 23, 2010 by mafr
Categories: shell

With traditional Unix sort(1), the size of the files you can sort is limited by the amount of available main memory. As soon as the files get larger and your system has to swap, performance degrades significantly. Even GNU sort, which uses temporary files to get around this limitation, doesn’t sort in parallel. The only viable option for sorting very large files efficiently is to split them, sort the individual parts in parallel, and merge them.
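
The split, sort, and merge idea can be sketched in a few lines of Python. This is only an in-memory illustration, not the approach from the full post: the function name `parallel_sort` and its parameters are made up for this example, and a real tool for very large files would write each sorted part to a temporary file instead of keeping it in a list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(lines, chunk_size, executor):
    """Split `lines` into chunks, sort each chunk via `executor`, merge the results.

    `parallel_sort`, `chunk_size` and `executor` are hypothetical names
    chosen for this sketch.
    """
    # Split: cut the input into parts of at most chunk_size lines.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Sort: hand every part to the executor; map() keeps the chunk order.
    sorted_chunks = list(executor.map(sorted, chunks))
    # Merge: heapq.merge lazily combines already-sorted sequences.
    return list(heapq.merge(*sorted_chunks))


# Threads keep this demo portable. For CPU-bound sorting, a
# ProcessPoolExecutor would be the more likely choice.
with ThreadPoolExecutor(max_workers=2) as pool:
    result = parallel_sort(["pear", "apple", "fig", "kiwi", "banana"], 2, pool)
    print(result)
```

Because the merge step only ever compares the current head of each sorted part, it needs far less memory than sorting everything at once, which is what makes the approach attractive for files that don't fit into RAM.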

Are Link-Sharing Services Irrelevant?

Posted April 25, 2010 by mafr
Categories: misc

You can use RSS to follow a few high-profile websites, and link-sharing services like Slashdot or Digg to discover popular web content. But that’s like reading a classic newspaper and some magazines: The information provided may have a higher chance of being relevant to you, but there’s still a lot of noise that wastes your time.

In this article, I’ll discuss the shortcomings of link-sharing services using Dzone as an example. Dzone is a relatively small-scale service targeted at software developers and one of my most important sources of information.

Using TCP for Low-Latency Applications

Posted March 14, 2010 by mafr
Categories: misc

Last week I ran into a nasty little problem while implementing an application with soft real-time requirements. I was aiming at 1 ms or less for a TCP-based request-response roundtrip on a local network. Should be trivial, but why did my tests indicate that I wasn’t even getting close?

The Future of python-musicbrainz2

Posted March 13, 2010 by mafr
Categories: python

I started the python-musicbrainz2 project in January 2006 as the first client library to the newly designed MusicBrainz XML web service. It was my first Python project, and I learned quite a lot in the process. Now MusicBrainz is undertaking a major data model change that also changes and extends the web service. As a result, adjustments are needed.

Finding the Majority Item in a Stream

Posted February 21, 2010 by mafr
Categories: misc

Going through old CACM issues, I discovered a paper (PDF) on stream processing. A common problem in this field is to find frequent items in a data stream when you only get one pass through the data and you need answers in real time. This is interesting in situations where you don’t have enough memory to store counters for each distinct item you see in the stream. One example is keeping track of the most frequent destinations in a high-traffic network router.
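
To make the one-pass, constant-memory constraint concrete, here is a minimal sketch of the classic Boyer-Moore majority vote algorithm. The teaser above doesn't say which algorithm the paper or the full post uses, so treat the choice as an assumption; the function name `majority_candidate` is made up for this example.

```python
def majority_candidate(stream):
    """Return the only possible majority item of `stream` in a single pass.

    Boyer-Moore majority vote: one candidate and one counter, no matter
    how many distinct items appear. `majority_candidate` is a hypothetical
    name used for this sketch.
    """
    candidate = None
    count = 0
    for item in stream:
        if count == 0:
            # No surviving candidate: adopt the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels out one vote for the candidate.
            count -= 1
    return candidate


print(majority_candidate(["a", "b", "a", "c", "a"]))
```

If some item makes up more than half of the stream, it is guaranteed to be the returned candidate. If no such item exists, the result is arbitrary, so a second pass (or an external count) is needed to confirm it.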

Fun with Context Managers

Posted January 23, 2010 by mafr
Categories: python

Sometimes I need a simple stopwatch in my Python scripts to find out how expensive my code is in wall-clock time. The problem is trivial to solve, but I thought I’d give it a try using Python’s with statement and a context manager.
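
A stopwatch like that might look roughly as follows. This is a sketch, not the code from the full post: the name `stopwatch` and its `label` and `report` parameters are assumptions made for this example, and it relies on `contextlib.contextmanager` and `time.perf_counter` from the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, report=print):
    """Measure the wall-clock time spent inside the with block.

    `stopwatch`, `label` and `report` are hypothetical names for this sketch.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        report(f"{label}: {elapsed:.3f} s")


with stopwatch("summing"):
    total = sum(range(1_000_000))
```

Passing a different `report` callable (for example a logger method or `list.append`) sends the timing somewhere other than standard output.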
