A recent Rubyist goes to the Lone Star Ruby Conference

September 7th, 2008

I attended the Lone Star Ruby Conference (LSRC), held here in Austin over the past two days. As this was my first conference on the topic, I thought I would share some observations. First, a vigorous caveat emptor… I’m a former Java guy still relatively new to the Ruby world. At the very least I hope to provide the humble perspective of a transplant from the crowded streets of Javaopolis to the rolling hills of Ruby country. So here goes…

Rubyists aren’t afraid of change (and some churn)

I’ve long been an interested outsider to Ruby and Rails - following along on the fringes for a couple years now by reading articles, running irb occasionally and generating some scaffolds. At that distance, you basically see Ruby and Rails. What you don’t see is the constant churn of open source projects, plugins, development tools, GUI toolkits, ideologies and best practices. Gregg Pollack and Jason Seifer’s talk on innovations in the last year and Evan Phoenix’s keynote about Ruby memes made the currents of change pretty clear. Java experienced a lot of change during my 8 years but it definitely seemed to be at a more staid pace. There were a few bigger, more corporate, entities driving it (Sun, Apache, IBM) and fewer mythologized personalities like why. In the Ruby community, a lot of energy seems to be expended reinventing various wheels (as Evan pointed out with his list of ARGV processors). There’s a fine distinction between a healthy competition of ideas and a navel-gazing, ego-stroking churn that wastes time. That said, I would never stand in the way of the free market of ideas. So, by all means, (with no apologies to Mao) let 1000 ARGV processors bloom and let’s see which stick.

There are some surprising technology holes - but they’re being filled

As a former Java programmer who worked deeply with concurrency, Ruby’s after-thought of a concurrency model frightened me a bit. Java, though described by Dave Thomas on last night’s panel as “a blunt tool”, has a basement filled with extremely sharp swords known as threads and a well-defined concurrency model. But, you say, Ruby has Threads. Yes, it does, but it’s a shadow of what’s available in Java. There, you can wield the sharp tools of concurrency with great effect on difficult problems but you can certainly stab yourself without understanding memory consistency effects, atomicity concerns and everything else.

Until recently, it seems the Rubyist approach to most of these issues has been to ignore them by slapping in a Big Fat Mutex. As someone who’s dealt with connection thread pools for RDBMS access for a long time, I was surprised to learn how innovative the non-blocking MySQL connector being worked on by espace as part of NeverBlock was considered. Fibers, mentioned by both Matz and Bruce Williams in discussing Ruby 1.9, offer a light-weight cooperative approach to something resembling concurrency. It was also heartening to hear Matz talk about how scalability is a big concern for the future of Ruby. Of course, no mention of these issues in Ruby is complete without noting that Ruby (and Rails) has been able to achieve a great deal in this area by using processes. This is usually a pretty simple and straightforward approach but it’s fairly large-grained and could certainly stand to be augmented with more sophisticated fine-grained capabilities that live inside the interpreter.

Joy matters

Despite the above grumbling about concurrency, Ruby as a language is a beautiful thing. It really is damn pleasant to write Ruby code. I think this is a direct extension of Matz’s personality and concerns. In his keynote, he spoke powerfully on the need to see a programming language as part of the human interface to computers and the need to make that interface as joyful as possible to use. I don’t think I’ve once read “Joy” as a chief requirement in a JSR submitted to the Java Community Process for inclusion into Sun’s Java spec. That alone is probably enough reason to switch to Ruby. I do wonder how Matz’s vision of a humane language will hold up under the predicted onslaught of mindless Java-drones such as me that will create the 3 million new Ruby programmers expected over the next few years. As for that…

Don’t fear the hordes

We’re not all mindless and we’re not all drones. The unspoken sentiment when talking about the future growth of Ruby was “things are ok now because we’re all smart and dedicated craftsmen, unlike people who are using Java but will eventually invade beautiful ruby-land and make it ‘enterprisey’”. Growth is stressful for any group, especially one that gleefully defines itself as a minority in opposition to the mainstream as I believe Rubyists do. However, I think it benefits no one to assume that new Rubyists who may come to it later than you did are any less smart, dedicated or concerned about their work than you are. A community that doesn’t grow is bound to become introverted and ultimately stagnant. It would be a great loss if the Ruby community did that. The good news is I don’t think it will. Personally, I’ve felt welcomed by the Austin Rails community and I’ve also found plenty of resources elsewhere to help me learn. The best thing for Ruby’s future is to continue to embrace the hordes and show them what they’ve been missing.

And, in conclusion…

There are an incredible number of smart people that truly seem to love Ruby and want to see it succeed. Though there are ideologies, squabbles and disagreements, the part of the community assembled at LSRC was as close to a meritocracy as I’ve personally seen. The power in Ruby seems to lie with those that have most demonstrated their technical abilities and not with those that have the right corporate affiliations. This focus on merit leads to the sometimes duplicative and anarchic approach to developing new technology as people compete with their ideas. Ultimately, though, this is probably Ruby’s greatest strength and will serve it well in the future. I’m looking forward to being a part of it.

By the way, thanks to the Lone Star Ruby Foundation for putting on the conference. Great job. See you all at LSRC 2009.

1, 2, 3… ZiteFight!

July 30th, 2008

It’s been really quiet around here because I’ve been hard at work for Appozite these past few weeks. We’re very excited to announce the launch of our first app, ZiteFight. So what’s the deal with ZiteFight? Well, it’s the World Championship of Style. ZiteFight pits user-submitted photos against each other and lets the world pick which one has the best style. Now, I know this blog skews developer (what with the Hadoop posts and all) and developers aren’t exactly known for their style but I’m sure you know someone who thinks they’ve got good taste. Tell them to go prove it to the world by joining the fight at ZiteFight.

The web is still the web

July 4th, 2008

Neil McAllister at Fatal Exception, inspired by the recent announcement that some Flash data will be exposed to search engines, asks the very intriguing question, “Is the Web still the Web?” The reason for asking is the proliferation of Rich Internet Application (RIA) technologies such as the aforementioned Flash, Silverlight, Google Web Toolkit, and AJAX (sort of). As background, he invokes a history in which Tim Berners-Lee granted us simple text-only documents encoded in HTML. This is, apparently, The Way The Web Is Supposed To Be. He then draws the distinction between RIAs and HTML and asks:

Is it still the Web if it’s not really hypertext? Is it still the Web if you can’t navigate directly to specific content? Is it still the Web if the content can’t be indexed and searched? Is it still the Web if you can only view the application on certain clients or devices? Is it still the Web if you can’t view source?

My answer on all these counts: “Yes”. I’m pretty sure you could replace the term “RIAs” with “images” or “videos” in his argument at various points during the evolution of the web from nicely marked up physics documents all the way to YouTube. Point being that HTTP (as one of the key technologies which underpins the web) only asks that we be able to reference a resource via a URI but makes no claims to the representation of that resource. It’s a testament to the foresight of the original designers of web technologies that HTTP describes only how we locate, modify and de-reference resources and doesn’t come with a dependency on representing those resources in HTML. Neil seems to confuse “resource” with “HTML document”. They need not be the same thing. That would be poor design.

Text-based indexing and search as well as “view source” are (incredibly useful) byproducts of the fact that so many of the resources on the web are represented as HTML. Though it’s hard to remember a time before Google roamed the earth, it hasn’t been that long ago that text-based indexing and search didn’t really work either. In time these other representations of resources will be mined, indexed and made searchable. There’s a lot of money and a lot of smart people trying to make that happen.

As for whether or not it’s still the web if you can only view it on certain clients? Well, as anyone who’s ever tried to develop a standards compliant site that also works in IE6 can attest, even relatively simple HTML web resources have client-specific dependencies. As today’s limited devices get more powerful and as browsers (hopefully) converge towards a reasonable baseline of standards these issues, too, shall pass.

This leaves the hypertext question. The reason we call it the “web” is due to the web-like nature of the links going from one resource to another. HTML does a fantastic job of providing this web of links (the hypertext) with that simple <a> tag we know and love. If these new technologies don’t encourage connections between resources then they’re not contributing to the “web-ness” of the web. There are two parts to this: linking to other resources and allowing themselves to be linked to. Just because they’re not HTML doesn’t mean you can’t do these things. You can create links to other resources with these technologies and you can create URIs that can point to resources “within” a resource represented by these technologies. That’s not to say you can’t create a Flash site with no outgoing links and no URIs to hook into for incoming links. Of course you can just as easily create a dead-end HTML page with no anchors.

So yes, in my opinion, the web is still the web. Because of the great separation of concerns in the design of the web’s technologies, people have been able to extend it far beyond the original vision as a document sharing mechanism. It’s the greatest platform for experimentation in all the ways we can connect and deliver information yet conceived. Because of this, there will always be innovations that push the boundaries of how we’ve experienced it in the past. RIAs are just another part of the web and its continued evolution.

RSS and e-commerce post on dev.appozite

July 2nd, 2008

I’ve been swamped with annoying stuff like work lately so posting on here has been sparse. That should change soon, I hope. I did come up for air the other day long enough to write a post over on the dev.appozite blog discussing the use of RSS in e-commerce. Bottom line: it would be great for even moderately web-savvy users but it’s just not used very much. Do any of you out there have experience either implementing or using RSS/Atom in an e-commerce context? If so, I’d be really interested in hearing about how it went over in the comments on the dev.appozite post.

Look for more stuff here soon. I’ve got some more to say on the semantic web and I plan to get back to Hadoop soon as well.

Telling semantic lies

June 21st, 2008

Inspired by conversations with some smart people at a recent Semantic Web Austin event, I’ve undertaken to restart my education on semantic web technologies like RDF, RDFa, Microformats, etc. When I wear my web developer hat, I’m definitely an advocate of clean semantic markup that correctly describes the structure of the data on the page. These technologies take that approach further (in some cases much, much further). In general, that seems like an unquestionably good idea. More semantic structure means more data portability and data discovery and therefore a more powerful web. It’s probably even a necessary step towards a WebOS.

However, in my limited research to this point, it seems there’s an elephant in the room in all this advocacy. Inevitably discussions of semantic technologies include “better search” as a chief raison d’etre for their use. We’ll have search engines that “understand” the machine readable data on our pages or RDF descriptions which can then draw logical inferences from the relationships among the universe of web resources. But, what if the semantic data is incorrect or just downright dishonest? Over-reliance on easily spammed meta tags gave us garbage in and garbage out in Altavista and Excite back in the 90s. It would be trivial to take my RDFa structured blog post, move it to a spam blog, find the semantically marked-up creator element, change it to someone else and republish. Poof! My finely crafted blog post on the semantic web is now selling ads for herbal remedies to unsuspecting web users with poor search skills. Of course, it’s also easy to just out and out lie when describing content. Maybe I’m not really Angelina Jolie’s spouse or Bill Gates’ neighbor even though I swear I am in my XFN standard rel attributes.

I would imagine that one thing that sets these approaches apart from 90s meta tags is the fact that many of these are used to specify relationships between resources which must be symmetric. Angelina’s resource dereferenced from her URI must indicate that I’m her spouse as well for that XFN relationship to be “believed” by a semantic web search that understands XFN. (How Angelina or any of us feel about being boiled down to an authoritative web resource identified by a URI is another issue.) Of course some people will try to game any system but I’m sure the vast majority of web users (or publishing tools) will include this structured data for legitimate purposes. But all this does make me wonder how much search engines will ultimately be able to rely on semantic data for drawing the intelligent inferences we hope to see from them. Can any of you out there that know more about these technologies help me better understand how we can ensure semantic data isn’t telling lies? If so, leave a comment; I’d love to know more.

Running Hadoop on Windows

June 14th, 2008

What is Hadoop?

Hadoop is an open source Apache project written in Java and designed to provide users with two things: a distributed file system (HDFS) and a method for distributed computation. It’s based on Google’s published papers on the Google File System and MapReduce, which discuss how to build a framework capable of executing intensive computations across tons of computers. Something that might, you know, be helpful in building a giant search index. Read the Hadoop project description and wiki for more information and background on Hadoop.
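
If the MapReduce idea is new to you, here is a rough single-machine analogy in shell. This is not Hadoop’s API, just a sketch of the shape of the computation: a “map” step emits a word/1 pair for every word, a sort brings identical keys together, and a “reduce” step sums each group. The function names are made up for illustration.

```shell
# map: emit one "word<TAB>1" pair for every word read from stdin
map_words() {
  tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | awk 'NF { print $1 "\t1" }'
}

# reduce: input must be sorted by key; sum the counts of each run of equal keys
reduce_counts() {
  awk -F '\t' '
    NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
    { key = $1; sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# map, then group identical keys with sort, then reduce
wordcount() {
  map_words | LC_ALL=C sort | reduce_counts
}
```

Running something like printf 'the cat the hat\n' | wordcount prints one tab-separated word and count per line. Hadoop’s contribution is doing the same three phases across many machines and very large files.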

What’s the big deal about running it on Windows?

Looking for Linux? If you’re looking for a comprehensive guide to getting Hadoop running on Linux, please check out Michael Noll’s excellent guides: Running Hadoop on Ubuntu Linux (Single Node Cluster) and Running Hadoop on Ubuntu Linux (Multi-Node Cluster). This post was inspired by these very informative articles.

Hadoop’s key design goal is to provide storage and computation on lots of homogeneous “commodity” machines; usually a fairly beefy machine running Linux. With that goal in mind, the Hadoop team has logically focused on Linux platforms in their development and documentation. Their Quickstart even includes the caveat that “Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so this is not a production platform.” If you want to use Windows to run Hadoop in pseudo-distributed or distributed mode (more on these modes in a moment), you’re pretty much left on your own. Now, most people will still probably not run Hadoop in production on Windows machines, but the ability to deploy on the most widely used platform in the world is still probably a good idea for allowing Hadoop to be used by many of the developers out there that use Windows on a daily basis.

Caveat Emptor

I’m one of the few that has invested the time to set up an actual distributed Hadoop installation on Windows. I’ve used it for some successful development tests. I have not used this in production. Also, although I can get around in a Linux/Unix environment, I’m no expert so some of the advice below may not be the correct way to configure things. I’m also no security expert. If any of you out there have corrections or advice for me, please let me know in a comment and I’ll get it fixed.

This guide uses Hadoop v0.17 and assumes that you don’t have any previous Hadoop installation. I’ve also done my primary work with Hadoop on Windows XP. Where I’m aware of differences between XP and Vista, I’ve tried to note them. Please comment if something I’ve written is not appropriate for Vista.

Bottom line: your mileage may vary, but this guide should get you started running Hadoop on Windows.

A quick note on distributed Hadoop

Hadoop runs in one of three modes:

  • Standalone: All Hadoop functionality runs in one Java process. This works “out of the box” and is trivial to use on any platform, Windows included.
  • Pseudo-Distributed: Hadoop functionality all runs on the local machine but the various components will run as separate processes. This is much more like “real” Hadoop and does require some configuration as well as SSH. It does not, however, permit distributed storage or processing across multiple machines.
  • Fully Distributed: Hadoop functionality is distributed across a “cluster” of machines. Each machine participates in somewhat different (and occasionally overlapping) roles. This allows multiple machines to contribute processing power and storage to the cluster.

The Hadoop Quickstart can get you started on Standalone mode and Pseudo-Distributed (to some degree). Take a look at that if you’re not ready for Fully Distributed. This guide focuses on the Fully Distributed mode of Hadoop. After all, it’s the most interesting where you’re actually doing real distributed computing.

Pre-Requisites

Java

I’m assuming if you’re interested in running Hadoop that you’re familiar with Java programming and have Java installed on all the machines on which you want to run Hadoop. The Hadoop docs recommend Java 6 and require at least Java 5. Whichever you choose, you need to make sure that you have the same major Java version (5 or 6) installed on each machine. Also, any code you write for running using Hadoop’s MapReduce must be compiled with the version you choose. If you don’t have Java installed, go get it from Sun and install it. I will assume you’re using Java 6 in the rest of this guide.

Cygwin

As I said in the introduction, Hadoop assumes Linux (or a Unix flavor OS) is being used to run Hadoop. This assumption is buried pretty deeply. Various parts of Hadoop are executed using shell scripts that will only work on a Linux shell. It also uses passwordless secure shell (SSH) to communicate between computers in the Hadoop cluster. The best way to do these things on Windows is to make Windows act more like Linux. You can do this using Cygwin, which provides a “Linux-like environment for Windows” that allows you to use Linux-style command line utilities as well as run really useful Linux-centric software like OpenSSH. Go download the latest version of Cygwin. Don’t install it yet. I’ll describe how you need to install it below.

Hadoop

Go download Hadoop core. I’m writing this guide for version 0.17 and I will assume that’s what you’re using.

More than one Windows PC on a LAN

It should probably go without saying that to follow this guide, you’ll need to have more than one PC. I’m going to assume you have two computers and that they’re both on your LAN. Go ahead and designate one to be the Master and one to be the Slave. These machines together will be your “cluster”. The Master will be responsible for ensuring the Slaves have work to do (such as storing data or running MapReduce jobs). The Master can also do its share of this work. If you have more than two PCs, you can always set up Slave2, Slave3 and so on. Some of the steps below will need to be performed on all your cluster machines, some on just Master or Slaves. I’ll note which apply for each step.

Step 1: Configure your hosts file (All machines)

This step isn’t strictly necessary but it will make your life easier down the road if your computers change IPs. It’ll also help you keep things straight in your head as you edit configuration files. Open your Windows hosts file located at c:\windows\system32\drivers\etc\hosts (the file is named “hosts” with no extension) in a text editor and add the following lines (replacing the NNNs with the IP addresses of the master and the slave; each line lists the IP address first, then the name):

NNN.NNN.NNN.NNN master
NNN.NNN.NNN.NNN slave

Save the file.
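
To double-check the entries from Cygwin, here is a minimal sketch that looks a name up in a hosts-format file, assuming the standard layout of an IP address followed by one or more names on each line. The function name lookup_host is made up for this post, and it only reads the file you hand it; it doesn’t prove Windows is actually resolving the name.

```shell
# Print the IP address listed for a host name in a hosts-format file.
# Usage: lookup_host <hosts-file> <name>
lookup_host() {
  hosts_file="$1"
  name="$2"
  awk -v name="$name" '
    { sub(/\r$/, "") }             # tolerate Windows CRLF line endings
    /^[ \t]*#/ || NF < 2 { next }  # skip comment lines and blank or short lines
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break       # the rest of the line is a comment
        if ($i == name) { print $1; exit }
      }
    }
  ' "$hosts_file"
}
```

In Cygwin the Windows hosts file is reachable as /cygdrive/c/windows/system32/drivers/etc/hosts, so lookup_host /cygdrive/c/windows/system32/drivers/etc/hosts master should print the address you entered. If it prints nothing, the entry is missing or mistyped.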

Step 2: Install Cygwin and Configure OpenSSH sshd (All machines)

Cygwin has a bit of an odd installation process because it lets you pick and choose which libraries of useful Linux-y programs and utilities you want to install. In this case, we’re really installing Cygwin to be able to run shell scripts and OpenSSH. OpenSSH is an implementation of a secure shell (SSH) server (sshd) and client (ssh). If you’re not familiar with SSH, you can think of it as a secure version of telnet. With the ssh command, you can login to another computer running sshd and work with it from the command line. Instead of reinventing the wheel, I’m going to tell you to go here for step-by-step instructions on how to install Cygwin on Windows and get OpenSSH’s sshd server running. You can stop after instruction 6. Like the linked instructions, I’ll assume you’ve installed Cygwin to c:\cygwin though you can install it elsewhere.

If you’re running a firewall on your machine, you’ll need to make sure port 22 is open for incoming SSH connections. As always with firewalls, open your machine up as little as possible. If you’re using Windows firewall, make sure the open port is scoped to your LAN. Microsoft has documentation for how to do all this with Windows Firewall (scroll down to the section titled “Configure Exceptions for Ports”).

Step 3: Configure SSH (All Machines)

Hadoop uses SSH to allow the master computer(s) in a cluster to start and stop processes on the slave computers. One of the nice things about SSH is it supports several modes of secure authentication: you can use passwords or you can use public/private keys to connect without passwords (“passwordless”). Hadoop requires that you set up SSH to do the latter. I’m not going to go into great detail on how this all works, but suffice it to say that you’re going to do the following:

  1. Generate a public-private key pair for your user on each cluster machine.
  2. Exchange each machine user’s public key with each other machine user in the cluster.

Generate public/private key pairs

To generate a key pair, open Cygwin and issue the following commands ($> is the command prompt):
$> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now, you should be able to SSH into your local machine using the following command:
$> ssh localhost

When prompted for your password, enter it. You’ll see something like the following in your Cygwin terminal.

hayes@localhost's password:
Last login: Sun Jun 8 19:47:14 2008 from localhost

hayes@calculon ~
$>

To quit the SSH session and go back to your regular terminal, use:
$> exit

Make sure to do this on all computers in your cluster.
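
One thing that commonly trips up key-based logins: with its default StrictModes setting, sshd ignores authorized_keys if the file or the .ssh directory is writable by other users. Here is a minimal sketch of tightening the permissions, assuming that turns out to be the problem on your machine. The function name is hypothetical, and Cygwin’s mapping of these permissions onto Windows files may not behave exactly like a real Linux box.

```shell
# Restrict the .ssh directory and the key files inside it to the owner only.
# Usage: tighten_ssh_perms [ssh-dir]   (defaults to ~/.ssh)
tighten_ssh_perms() {
  dir="${1:-$HOME/.ssh}"
  chmod 700 "$dir" || return 1
  for f in "$dir/authorized_keys" "$dir/id_dsa"; do
    # Only touch the files that actually exist
    if [ -f "$f" ]; then
      chmod 600 "$f" || return 1
    fi
  done
}
```

Call tighten_ssh_perms with no arguments on each machine, then try ssh localhost again.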

Exchange public keys

Now that you have public and private key pairs on each machine in your cluster, you need to share your public keys around to permit passwordless login from one machine to the other. Once a machine has a public key, it can safely authenticate a request from a remote machine that is signed using the private key that matches that public key.

On the master issue the following command in cygwin (where “<slaveusername>” is the username you use to login to Windows on the slave computer):

$> scp ~/.ssh/id_dsa.pub <slaveusername>@slave:~/.ssh/master-key.pub

Enter your password when prompted. This will copy your public key file in use on the master to the slave.

On the slave, issue the following command in cygwin:

$> cat ~/.ssh/master-key.pub >> ~/.ssh/authorized_keys

This will append your public key to the set of authorized keys the slave accepts for authentication purposes.

Back on the master, test this out by issuing the following command in cygwin:

$> ssh <slaveusername>@slave

If all is well, you should be logged into the slave computer with no password required.
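
If you’d rather have a check that fails outright instead of quietly falling back to a password prompt, ssh’s BatchMode option disables password prompting. A small sketch, with a made-up function name:

```shell
# Succeeds only if key-based (passwordless) login to the target works.
# Usage: check_passwordless <user@host>
check_passwordless() {
  # BatchMode=yes makes ssh fail instead of prompting for a password;
  # ConnectTimeout keeps an unreachable host from hanging the check.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true; then
    echo "passwordless login to $1 works"
  else
    echo "passwordless login to $1 failed" >&2
    return 1
  fi
}
```

For example, check_passwordless <slaveusername>@slave on the master. The first connection to a new host may still fail here until you’ve accepted its host key once with a regular ssh login.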

Repeat this process in reverse, copying the slave’s public key to the master. Also, make sure to exchange public keys between the master and any other slaves that may be in your cluster.
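
With more than a couple of machines, the copy-then-append dance gets tedious. Here is a sketch that pushes this machine’s public key to several hosts in a loop. Note that it swaps the scp and cat steps above for a single ssh command that appends the key on the remote side. The function name and the KEY_FILE variable are my own inventions, and you’ll still be asked for each host’s password once.

```shell
# Append this machine's public key to authorized_keys on each target.
# Usage: push_key <user@host> [<user@host> ...]
# KEY_FILE (optional) overrides the default public key location.
push_key() {
  key="${KEY_FILE:-$HOME/.ssh/id_dsa.pub}"
  [ -r "$key" ] || { echo "no public key at $key" >&2; return 1; }
  for target in "$@"; do
    # The remote side creates ~/.ssh if needed and appends whatever arrives on stdin
    ssh "$target" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < "$key" || return 1
  done
}
```

On the master that would be something like push_key <slaveusername>@slave, adding one argument per extra slave. Run the same thing on each slave, pointing at the master, to cover the reverse direction.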

Configure SSH to use default usernames (optional)

If all of your cluster machines are using the same username, you can safely skip this step. If not, read on.

Most Hadoop tutorials suggest that you set up a user specific to Hadoop. If you want to do that, you certainly can. Why set up a specific user for Hadoop? Well, in addition to being more secure from a file permissions perspective, when Hadoop uses SSH to issue commands from one machine to another it will automatically try to login to the remote machine using the same user as the current machine. If you have different users on different machines, the SSH login performed by Hadoop will fail. However, most of us on Windows typically use our machines with a single user and would probably prefer not to have to set up a new user on each machine just for Hadoop.

The way to allow Hadoop to work with multiple users is by configuring SSH to automatically select the appropriate user when Hadoop issues its SSH command. (You’ll also need to edit the hadoop-env.sh config file, but that comes later in this guide.) You can do this by editing the file named “config” (no extension) located in the same “.ssh” directory where you stored your public and private keys for authentication. Cygwin stores this directory under “c:\cygwin\home\<windowsusername>\.ssh”.

On the master, create a file called config and add the following lines (replacing “<slaveusername>” with the username you’re using on the Slave machine):

Host slave
User <slaveusername>

If you have more slaves in your cluster, add Host and User lines for those as well.

On each slave, create a file called config and add the following lines (replacing “<masterusername>” with the username you’re using on the Master machine):

Host master
User <masterusername>

Now test this out. On the master, go to cygwin and issue the following command:

$> ssh slave

You should be automatically logged into the slave machine with no username and no password required. Make sure to exit out of your ssh session.

For more information on this configuration file’s format and what it does, go here or run man ssh_config in cygwin.
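
If you have several slaves, you can script the config entries instead of typing them. A minimal sketch follows; add_host_user is a hypothetical helper name, and it only appends a stanza when the file doesn’t already have one for that host.

```shell
# Append a "Host ... / User ..." stanza to an ssh config file, once per host.
# Usage: add_host_user <config-file> <host> <user>
add_host_user() {
  config="$1"
  host="$2"
  user="$3"
  # Do nothing if there is already a stanza for exactly this host
  if [ -f "$config" ] && grep -qxF "Host $host" "$config"; then
    return 0
  fi
  printf 'Host %s\n    User %s\n\n' "$host" "$user" >> "$config"
  # ssh may complain about a config file that other users can write to
  chmod 600 "$config"
}
```

On the master, add_host_user ~/.ssh/config slave <slaveusername> produces the same two lines shown above (the indentation before User is optional), and you can repeat it for slave2, slave3 and so on.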

Step 4: Extract Hadoop (All Machines)

If you haven’t downloaded Hadoop 0.17, go do that now. The file will have a “.tar.gz” extension which is not natively understood by Windows. You’ll need something like WinRAR to extract it. (If anyone knows something easier than WinRAR for extracting tarred-gzipped files on Windows, please leave a comment.)

Once you’ve got an extraction utility, extract it directly into c:\cygwin\usr\local. (Assuming you installed Cygwin to c:\cygwin as described above.)
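
Another option, if the tar and gzip packages were selected when you installed Cygwin (an assumption on my part, so check your install), is to extract from the Cygwin prompt itself. Inside Cygwin, c:\cygwin\usr\local is simply /usr/local. A small sketch with a made-up function name:

```shell
# Extract a .tar.gz archive into a destination directory.
# Usage: extract_tarball <archive.tar.gz> <dest-dir>
extract_tarball() {
  tarball="$1"
  dest="$2"
  # -x extract, -z gunzip, -f archive file, -C switch to the destination first
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}
```

Something like extract_tarball hadoop-0.17.0.tar.gz /usr/local would then do the job, where the archive name is my guess at what your download is called.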

The extracted folder will be named hadoop-0.17.0. Rename it to hadoop. All further steps assume you’re in this hadoop