Monday, February 25, 2013

Keep it Clean

As a developer, I was paid the ultimate compliment by a coworker last week.

"Hey, that PowerShell script you wrote is really clean," he said in passing.

This was a script that I had ported from another language, then tweaked and forwarded to him to help manage some server resources. It was an ugly hack originally intended to solve a personal need, but I refactored it to make it simpler, more modular and more readable, and added a few key comments.

The main benefit of clean code is that it's easy to come back months later and modify it to suit your needs without having to do some sort of digital archaeology. But the best outcome is when others can use it and maintain it without any additional assistance from you.

Would you like some Data with that?

 Water, water, everywhere, Nor any drop to drink.

One thing that has always bugged me about many of the talks and conferences I go to is the lack of good real-world datasets and examples. There is the ubiquitous use of AdventureWorks, which does fine for many demonstrations. Or the session based on the presenter's experience with his employer's assets, which are not accessible to the audience to view or play with. And there's the MVP speech with the sports statistics and the matching ball cap, discussing ERA or passer efficiency rating to audiences from other cultures that follow different sports. And there are vendors that offer tools to generate sanitized datasets. If you need small or large datasets, hopefully you don't always resort to these fall-backs, since there are terabytes of interesting public data available on the Internet.

Open or "public" data, as it is called, has been around for years. Before the WWW was in the public spotlight, you could order various data sets and source code on physical media from vendors. Two decades later, with the acceptance of the Internet and the increase in bandwidth, there's a plethora of sources offering a huge variety of data sets. One good starting point for an overview is Data.Gov, an aggregate of Open Federal Data sources and tools.

Before you dive in and start grabbing collections of miscellaneous agricultural and health care stats from online sources, you need to have an idea of what type, quality and quantity of data set you are seeking. It's probably better to pick a domain that you understand and have experience in. And it doesn't hurt to select a data set that may scratch a personal itch or solve a business problem.

What's the Frequency, Kenneth?

One of my favourite online databases to pull from is the FCC ULS database. The FCC (Federal Communications Commission) is responsible for managing RF (radio waves) and other communications in the United States. The ULS (Universal Licensing System) is a system to keep track of licenses, frequency allocations and other business related to the FCC. As an amateur radio license holder, it's fun to keep track of my license and those of several hundred thousand other "ham" license holders. As a database professional, it's an open, well-documented source of real-world addresses with which to test skills, geocoding and CASS certification. So let's grab the Amateur Radio Service License database.

The license database (l_mat.zip) is an archive over 400 MB in size when expanded, so make sure you have the resources to handle it. Once you have the data extracted, it's time to take inventory and break out the tool kit.
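As a rough illustration of what "taking inventory" might look like, here is a minimal Python sketch that lists each file in the archive with its expanded size and line count, without extracting anything to disk. The function names `inventory` and `print_inventory` are made up for this example, and the local path is an assumption based on the file name given above; nothing here depends on the specific layout of the ULS files.

```python
import zipfile
from pathlib import Path


def inventory(zip_path):
    """Return (member name, uncompressed bytes, line count) for each file in the archive."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Stream each member line by line so a large file never sits fully in memory.
            with archive.open(info) as member:
                line_count = sum(1 for _ in member)
            rows.append((info.filename, info.file_size, line_count))
    return rows


def print_inventory(zip_path):
    """Print one summary line per file in the archive."""
    for name, size, lines in inventory(zip_path):
        print(f"{name:<20} {size / 1_000_000:>10.1f} MB {lines:>12,} lines")


if __name__ == "__main__":
    # Assumed local path: the archive name as given in the post, saved in the current directory.
    archive_path = Path("l_mat.zip")
    if archive_path.exists():
        print_inventory(archive_path)
```

Knowing the row counts and sizes up front helps decide whether a file can be opened in a text editor or needs to go straight into a database.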

Pragmatism vs Partisanship


You ate Chinese food, so obviously you must hate Europeans...

Sounds silly, doesn't it? So was the reaction I got from a data professional when I showed him a new book on data analysis that I was excited to add to my library. The software language didn't match his worldview or career investment, so I was labeled a "Microsoft basher". Which is silly, since we were at an event for users of Microsoft software, I was using a Windows phone, and two out of the three operating systems I was running on my laptop were Windows 7 and Windows Server 2012. And I spent much of the time taking notes in OneNote and discussing PowerShell 3 and SQL Server 2012 with my cohort.

And the irony of the situation is that Microsoft and many of its employees and advocates recognize that not all the great tools and goodness flow from the mother-ship in Redmond. Buck Woody, an author and well-known Microsoft database and Azure evangelist, recommends installing OSS text-handling utilities when setting up your Data Science Laboratory. Another well-known Microsoft technologist, Scott Hanselman, suggests many third-party tools and has a recent post discussing GitHub and line endings. With the existence of CodePlex, the inclusion of Git support in Visual Studio and the offering of Linux VMs on Azure, Microsoft is becoming more pragmatic and inclusive with regard to OSS.

And OSS has been garnering growing commercial support. Red Hat has been making money for years. VMware supports both commercial and OSS hosts and guests. Some of the projects on CodePlex get adopted by commercial companies. And the data analysis tools featured in the book that was the seed of this post have commercial support from a company, Continuum Analytics, which just received a grant from DARPA to further develop its tools.

So, while I was disappointed in the reaction I received from this individual, I still respect him and hope to demonstrate the power of using both OSS and Microsoft tools together to tackle some tough data problems.