Thursday, September 19, 2013

Open Hatch's Scrabble Challenge: Let's Score and More

Scoring

So we've discussed and analyzed word length as part of our Scrabble Challenge. Now let's discuss scoring the value of the words in the list. The kind folks at OpenHatch give us a dictionary of letters and their values, so the "base" score of a word is simply the sum of the values of its letters. Let's build a function that uses this gift to calculate the score.


def word_score(input_word):
    # score the word
    # need to account for the blanks
    scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2,
              "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3,
              "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1,
              "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4,
              "x": 8, "z": 10}

    word_score = 0

    for letter in input_word:
        word_score = word_score + scores[letter]

    return word_score
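
As a quick sanity check, here's what the function returns for a couple of sample words in an interactive session (the words are just illustrative):

>>> word_score("cabbage")
14
>>> word_score("quiz")
22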


In actual play, scoring is more complex. The player has to consider the premium squares, blank tiles and the words that are already in play. These situations are a bit too complex to be calculated in our simple program, but we can assist the player by returning the base score of their possibilities.

Where to Store the Score?


We could calculate the scores and store the words and scores from the SOWPODS list in a large dictionary, similar to the one mapping letters to their scores. But this approach doesn't resolve some issues. We also need a place to store the length of each word. A dictionary can store a key/value pair, but we need the ability to sort by more than one value. We could use a list of tuples, but instead we are going to use a secret weapon included with Python: SQLite.

Loading the word list, word lengths and scores into a database solves a few problems in this challenge. We only have to load the data and the calculated values once, then use the fruits of that work again and again.


#!/usr/bin/env python
# -*- coding: ascii -*-
"""
Load the sowpods word list into a sqlite database table

Note: Rough Prototype
"""

from __future__ import print_function
import string
import sys
import sqlite3 as sqlite

def test_for_db():
    # test for existance of  sowpods database
    pass

def test_for_sowpods():
    # test for existence of sowpods  text file
    pass
  
def word_score(input_word):
    # score the word
    # need to account for the blanks
    scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2,
              "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3,
              "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1,
              "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4,
              "x": 8, "z": 10}

    word_score = 0

    for letter in input_word:
        word_score = word_score + scores[letter]

    return word_score
        
  
def word_list(input_file):
    # create a list of tuples containing the word, its length,
    # its letters sorted alphabetically, and its score

    sp_list = []
    f = open(input_file, 'r')
  
    for line in f:
        sp_word = line.strip().lower()
        sp_list.append((sp_word, len(sp_word), ''.join(sorted(sp_word)), word_score(sp_word)))
      
    f.close()
  
    return sp_list
      

def load_db(data_list):
  
    # create database/connection string/table
    conn = sqlite.connect("sowpods.db")
  
    cursor = conn.cursor()
    # create a table
    tb_create = """CREATE TABLE spwords
                (sp_word text, word_len int, word_alpha text, word_score int)
                """
    conn.execute(tb_create)
    conn.commit()
  
    # Fill the table
    conn.executemany("insert into spwords(sp_word, word_len, word_alpha, word_score) values (?,?,?,?)",  data_list)
    conn.commit()
  
    # Print the table contents
    for row in conn.execute("select sp_word, word_len, word_alpha, word_score from spwords"):
        print (row)

    if conn:
        conn.close()

def print_help():
    """ Help Docstring"""
    pass


def test():
    """ Testing Docstring"""
    pass

if __name__=='__main__':
    # test()
    sp_file = "sowpods.txt"
    load_db(word_list(sp_file))
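
Once the table is loaded, we can query it again and again without recomputing anything. Here's a rough sketch (not part of the loader above) of pulling exact anagrams of a rack, highest score first, by matching on the sorted letters stored in word_alpha:

import sqlite3 as sqlite

def exact_matches(rack):
    # sort the rack letters the same way the loader sorted each word
    rack_alpha = ''.join(sorted(rack.lower()))
    conn = sqlite.connect("sowpods.db")
    rows = conn.execute("select sp_word, word_score from spwords "
                        "where word_alpha = ? order by word_score desc",
                        (rack_alpha,)).fetchall()
    conn.close()
    return rows

print(exact_matches("retains"))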
  
  





Sunday, September 15, 2013

DFW Pythoneers 2nd Saturday Teaching Meeting, September 14, 2013

We had 11 Pythoneers show up for DFW Pythoneers 2nd Saturday Teaching Meeting. A special shout out to Jose and Jim from KForce Technology for setting us up with a great conference room and refreshments.

John Zurawski covered the news and interesting projects occurring in the Python world. He also covered his latest entry in the Ludum Dare contest and went into detail about his challenges with Python application installers. (John also loaned his Apple Mini DisplayPort to VGA Adapter to me which allowed me to present on the projector. Thanks a million, John!)

I made a presentation on Python Challenges and covered another step in solving the OpenHatch Scrabble Challenge.

If you missed the meeting due to the time change, please be aware that the scheduling may vary based on the venue, so double-check the time and location for each individual meetup. KForce offered us a great venue in a very central location, so it's probable that we will meet here in the future; 1:00 PM is the closing time on Saturday. If you have any firm alternative sites, please suggest them to either John, Kevin, Jeff or myself. We need room for 25 people, WiFi, a projector and restrooms.





Monday, August 26, 2013

Open Hatch's Scrabble Challenge: The Analysis Explained

How long has this been going on?


In the last post of this challenge we wrote a program to examine the SOWPODS word list. Why don't we just skip all this statistics nonsense and write the final code that solves the problem? Why do we care about word length?

We need to determine the scope of the project so we can choose a proper solution. If the SOWPODS word list had only 100 words, or had 7 million, the approaches to the solution might be drastically different. The length of a word can also be used to shrink the set of possible matches for a particular rack entered by the user. By analyzing the SOWPODS list, we see that there is a substantial number of longer words, what some call five-dollar words. If our user only gives us 7 or 8 letters, we can exclude longer words from the set of possible matches and speed up response time.
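
To make that concrete, here is a rough sketch of the kind of length filter the final program can apply before doing any heavier matching (it assumes sowpods.txt is sitting in the current directory):

# load the full word list
sowpods_words = [line.strip().lower() for line in open("sowpods.txt")]

rack = "pythons"

# words longer than the rack can never be played from it, so drop them up front
candidates = [word for word in sowpods_words if len(word) <= len(rack)]

print("{0} words total, {1} candidates for this rack".format(len(sowpods_words), len(candidates)))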

Let's take a look at the code and if you have questions about specific functions and commands, check out the official Python Language Reference. If you are an experienced Python programmer, you may want to skip through the pedantic explanations.

In the Beginning



#!/usr/bin/env python
# -*- coding: ascii -*-

"""
Sowpods stats
    - counts the words on the list
    - finds the longest word
    - breakdown of word length
 
    ToDo
    - Help
    - Error handling
 
"""

If you need some explanation of this section, you can review Executable Python Scripts, Source Code Encoding and Documentation Strings.

Back to the Future


from __future__ import print_function
import string
import sys

Here we use the print function from Python 3 by using the __future__ module. This allows us to easily port our code to Python 3 if need be.

Get Some Class


class LenCounter:
    def __init__(self):
        self.dict = {}
    def add(self, item):
        count = self.dict.get(item, 0)
        self.dict[item]  = count + 1
    def counts(self, desc=None):
        """Returns a list of (count, key) tuples sorted by count.
        Pass desc as 1 if you want a descending sort. """
        # map with None pairs each count with its key (a Python 2 idiom, like zip)
        result = map(None, self.dict.values(), self.dict.keys())
        result.sort()
        if desc: result.reverse()
        return result
 

Here we have created a LenCounter class that builds a dictionary counting how many words there are of each length. See if you can determine what each method in the class does.

Get the Stats


def get_stats():
    input_file = sys.argv[1]
    word_count = 0
    longest_length = 0
    lc = LenCounter()
 
    f = open(input_file, 'r')
 
    for line in f:
        word_count += 1
        lc.add(len(line.strip()))
        if len(line.strip()) > longest_length:
            longest_length = len(line.strip())
 
    print("Word Count: ", word_count)
    print("Longest Word Length: ", longest_length)
    for item in lc.counts():
        print(item)
 
    f.close()
     
         

The heart of the program. We open the SOWPODS file, read each line, trim the whitespace, feed the word length to the LenCounter instance, track the length of the longest word in the list and print the results.

The Main Event


if __name__=='__main__':
    # test()
    get_stats()

We will be running this program as a standalone utility, so this guard is necessary. Eventually we may modify this and use it as a module in a larger program.

Unfinished Business

You may have noticed sections of the stats program are like this:

def test():
    """ Testing Docstring"""
    pass

These functions have yet to be completed and aren't necessary to the core functionality of the program; for now they serve as placeholders for features yet to be implemented. But we will come back to finish them at a later date, since they will make for a more complete, correct and friendly program.



Friday, August 23, 2013

DFW Pythoneers Meeting August 22, 2013

We had 10 individuals show up at Taco Cabana last night throughout the evening for the Monthly Social meeting. There were some questions about the Scrabble Challenge, so laptops were setup for a short while to poke around some code.

There was some discussion on learning Python on a more intermediate level, so a suggestion was made to utilize Doug Hellmann's Python Module of the Week to learn the standard Python library. It's free and it's good. Plus if you like a dead tree copy, he's compiled the postings into a book.

If we start getting more people to attend, we can move to a nearby restaurant that has a separate meeting room. This would improve communication and space since Taco Cabana can get noisy and packed.

The next Second Saturday Teaching meeting is in three weeks so we need to firm up a location. If you have a facility that can host, let us know. Room for up to 30 or 40 people, WiFi and a projector is what we are looking for. You get the glory of being a sponsor and your HR people can make some connections.

If you want to suggest a space for an additional social or project night at a site other than Frisco, feel free to do so. There are a few people on the west side of the Metroplex that need some attention. I can travel or continue to host at Gazebo Burger.

If you need a Python programmer for a contract, Ralph's available. Send an email to the mailing list or to me with some contact info.

Tuesday, August 20, 2013

Open Hatch's Scrabble Challenge: The Analysis

Charting letter frequency

Our first step is to analyze the SOWPODS word list used in this challenge. If you read the Wikipedia entry, you will notice that there is a word distribution by number of letters. This is important to note if you are a tournament player or just completing this challenge.

Below is a program that will recreate this list, although in a slightly different manner. Read through the code and see if you can explain what each part does. Try to determine why we care about the length of the words.

stats.py



#!/usr/bin/env python
# -*- coding: ascii -*-

"""
Sowpods stats
    - counts the words on the list
    - finds the longest word
    - breakdown of word length
 
    ToDo
    - Help
    - Error handling
 
"""
from __future__ import print_function
import string
import sys

class LenCounter:
    def __init__(self):
        self.dict = {}
    def add(self, item):
        count = self.dict.get(item, 0)
        self.dict[item]  = count + 1
    def counts(self, desc=None):
        """Returns a list of (count, key) tuples sorted by count.
        Pass desc as 1 if you want a descending sort. """
        # map with None pairs each count with its key (a Python 2 idiom, like zip)
        result = map(None, self.dict.values(), self.dict.keys())
        result.sort()
        if desc: result.reverse()
        return result
 
def get_stats():
    input_file = sys.argv[1]
    word_count = 0
    longest_length = 0
    lc = LenCounter()
 
    f = open(input_file, 'r')
 
    for line in f:
        word_count += 1
        lc.add(len(line.strip()))
        if len(line.strip()) > longest_length:
            longest_length = len(line.strip())
 
    print("Word Count: ", word_count)
    print("Longest Word Length: ", longest_length)
    for item in lc.counts():
        print(item)
 
    f.close()
     
         
def print_help():
    """ Help Docstring"""
    pass


def test():
    """ Testing Docstring"""
    pass

if __name__=='__main__':
    # test()
    get_stats()


The Results


Word Count: 267751
Longest Word Length: 15
(124, 2)
(1292, 3)
(5454, 4)
(5757, 15)
(9116, 14)
(12478, 5)
(13857, 13)
(20297, 12)
(22157, 6)
(27893, 11)
(32909, 7)
(35529, 10)
(40161, 8)
(40727, 9)




Monday, August 19, 2013

Python Challenge Number 1: Open Hatch's Scrabble Challenge: Intro

Intro to the Intro:

If you are new to Python or new to programming, there are numerous Python tutorials and online courses. But once you get through Learn Python the Hard Way or the official tutorial, what's next?

The best way to learn may not be another course or tutorial, but a project. Projects may be work projects or personal projects, or if you haven't found an idea that strikes your fancy, a suggested programming challenge. There are thousands of projects and challenge sites out there, so let's narrow the scope down and pick a few interesting, educational, fun, but doable ones.

The Challenge:

OpenHatch has a list of Intermediate Python Workshops/Projects on their wiki that suit our requirements. The Scrabble Challenge is the first one we want to attempt.  Scrabble, especially in the form of "Words with Friends", is a popular pastime among many people, including my spouse.

 Scrabble has been around for generations in my family and has evolved in a few forms to be a more interactive game on Facebook and various smart phones. My spouse usually has several games or more ongoing with friends and many of them use "hint" or "cheater" web sites to "broaden their vocabulary" or gain an advantage.

The Scrabble challenge is to make a "Scrabble cheater", a CLI Python program that helps the player find words in their letter rack.

The Requirements:

You need to have a computer with Python 2.6 or greater, a text editor and a copy of the SOWPODS word list that is referenced in the Scrabble Challenge. You should attempt to complete as much of this challenge on your own as you can before resorting to the help of others.

The challenge web site does have some helpful guidelines and hints on how to break the problem down into easier pieces.  I'll post some code this week that may help you analyze the challenge.

Sunday, August 18, 2013

August DFW Pythoneers Meetings

This weekend a good chunk of the crew are at PyTexas 2013, so let's recap the local events so far.

Last Saturday we had the 2nd Saturday Teaching Meeting at ZeOmega. 22 Pythoneers showed up and we had a great time with presentations, networking, snacks and pizza. I ended up leading the beginners group and covered analyzing words in the word list for the Scrabble Solver Challenge.

Special thanks to ZeOmega for hosting Saturday. Also to Bill for the loaner laptop with Mint Linux; it was a lifesaver! If you have some SQL Server reporting skills, ZeOmega is looking for some good analysts.

Thursday night, 3 coders showed up at Project Night at Gazebo Burger in Frisco. I continued work on the Scrabble Solver Challenge, putting the word list in a SQLite database and calculating the word scores, among other functions.

Next Thursday night is Social Meeting at Taco Cabana in Addison. No laptop required, but if you want to talk code or design, feel free.



Monday, July 15, 2013

DFW Pythoneers 2nd Saturday Teaching Meeting, July 13, 2013

18 Pythoneers showed up at The Collide Center in McKinney. John Zurawski and Joseph Weaver found the location, and John was the gracious host and leader for the meeting. The venue was great, but unfortunately for us the meeting space will become more space for startups afterwards.

Upcoming Events

Kevin Horn announced that PyTexas 2013 is August 16-18 in College Station. Early registration ends July 16th. Friday is oriented towards tutorials and training. See the web site for details.

July 18th is Project Night at Gazeebo Burgers in Frisco from 6:30 to 8:30. There's a separate meeting room that I'll request and the WiFi is usually good. Topic is pandas and data analysis, so show up prepped with software loaded and data to rip. If you are having difficulty loading all the required packages, consider loading the Anaconda distribution from Continuum Analytics.

July 20th is Moon Day at Frontiers Of Flight Museum at Love Field in Dallas. Our buddies at DRPG will be demoing various robots, plus there will be other cool displays. Moon Day is my favorite unofficial holiday, so even if you can't attend, pause for a moment and realize what an awesome achievement the Apollo program was...

July 25 will be the normal casual meeting at Taco Cabana in Addison. This is an informal get-together to chat and network with other Pythoneers and techies. If you are seeking a solution for a problem, ask around and chances are someone can help you. Otherwise, just geek out and enjoy the company.

Group Discussions

One of the main topics of discussion revolved around meeting locations and meeting content.

Meeting Space

In the past the group has had corporate sponsors who had meeting facilities. Currently we need locations for Saturday teaching meetings and alternative sites for the 4th Thursday informal meetings. I can lead the 3rd Thursday Project nights at Gazeebo Burgers.

An ideal teaching location would be central, easy to find, with a room for 30 or 40 people, WiFi, power outlets and restrooms. We also need a slightly larger venue than Taco Cabana for the casual meeting, since the largest table there seats about eight.

Meeting Topics

There was an active discussion about topics, teaching, presentations and projects/challenges. John proposed a meeting structure to open the meeting, and the group discussed various ideas for the "meat" of the meeting. It was recognized that people into Python have different needs, skill levels and interests. Some of the various topics brought forth:

- Web Frameworks
  - Flask
  - Django
  - Idea: Framework Shootout - Simple web app spec; write it in diff frameworks
    Related: find a way to allow beginners to work on subject area before talk

- Database?
  - ORM's

- REST API's in Python

- Network Programming
  - Twisted / Tornado

- Scientific Computing
  - Pandas

- Game Programming

- IPython
  - IPython Notebook

- Python eco-system/community

- Best Practices
  - PEP8
 

Sharknado

The breakout presentation of the meeting was John Zurawski's pixel-accurate clone of Sharknado, done in cocos2d. Most awesome use of Python, ever!

Challenges

It was observed that programmers don't learn unless they have projects or challenges. So I'm borrowing a Python challenge from elsewhere and will be working on this particular project in my spare time over the next couple of months. Let's swap notes and review code at a future meeting if you are interested. :)



Monday, July 1, 2013

DFW Pythoneers Meeting June 27, 2013

13 Python enthusiasts showed up at the Taco Cabana in Addison for the monthly meeting. There was some confusion about the meeting location and some participants were disappointed with the lack of formality. Overall most of the attendees enjoyed themselves and there was much discussion about Python and related topics. This was the largest group that has met up for the Thursday night meetings.

The good news: We may have found a space for the upcoming 2nd Saturday Teaching Meeting. Joseph Weaver mentioned that John Zurawski had found a possible meeting space in McKinney. John got in contact with me and will try to contact some of the more veteran members and leaders of the group. From what I've heard of the location, it will be good for the larger teaching meetings.

The challenge: With increasing attendance, it's difficult to have more than the most casual of meetings at Taco Cabana. I spoke with Jay, who said that this meeting has rarely been larger than 4 to 6 participants. If you know of a larger space that is more conducive to bigger meetings, we're open to suggestions. The key thing to consider is a central location, since we attract folks from 20 to 40 miles away.

Wednesday, June 26, 2013

Another Iteration...

Sometimes you have to throw a small script together to fix an issue. When you deal with third-party data that's manually generated, sometimes you have to take what they give you.

I had to determine the starting positions of the fields in a fixed-record-length text file by the positions of the double quotes in the file. After some research I concluded that a regular expression with an iterator would do the job.


import re

text = 'Some Long" Record with " lots of '
pattern = re.compile('"')
# print the index of each double quote in the record
print([m.start() for m in pattern.finditer(text)])
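
If the goal is to then carve out the text between those quotes, one rough follow-on (purely illustrative, and assuming the quotes come in pairs) is to pair up consecutive positions and slice between them:

import re

record = 'Some Long" Record with " lots of '
positions = [m.start() for m in re.finditer('"', record)]

# pair consecutive quote positions and slice out the text between each pair
fields = [record[start + 1:end] for start, end in zip(positions[::2], positions[1::2])]
print(fields)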

Tuesday, June 11, 2013

DFW Pythoneers Meeting June 8, 2013

16 Python enthusiasts showed up at the Gravity Center for the meeting. This will be the last meeting at the Gravity Center; a new meeting space will need to be found.

We discussed Python and other resources for young programmers. John discussed Cocos2d, Greg demoed Python koans, and Jeremy showed how to install Pelican and deploy a static site on Amazon S3.

Sunday, May 12, 2013

DFW Pythoneers Meeting May 14, 2013

I like to talk about Python. - Kevin Horn

11 Python enthusiasts showed up at the Gravity Center for the meeting. After brief introductions, Kevin was gracious enough to take charge of the presentation and demo virtualenv and pip.

Kevin also discussed various package repositories, Unofficial Windows Binaries for Python Extension Packages, scipy, numpy, ipython, mingw and Chocolatey for Windows.

Mention was made of the forthcoming PyTexas conference in August and of Pyvideo.org, where various talks and tutorials are available for viewing.

The main focus of the second half of the meeting was data, databases and Python. Topics of discussion were: CSV, json, pickle, dbapi, SQLite, SQLAlchemy, sqlautocode and Alembic.

After the meeting a group of us went to Cafe Brazil for food and further discussion.

Tuesday, March 12, 2013

Netflix and Python

There's a nice post on the Netflix technical blog about how they use Python.

I found this section quite interesting:

Data Science and Engineering

Our Data Science and Engineering teams rely heavily on Python to help surface insights from the vast quantities of data produced by the organization. Python is used in tools for monitoring data quality, managing data movement and syncing, expressing business logic inside our ETL workflows, and running various web applications to visualize data.

One such application is Sting, a lightweight RESTful web service that slices, dices, and produces visualizations of large in-memory datasets. Our data science teams use Sting to analyze and iterate against the results of Hive queries on our big data platform. While a Hive query may take hours to complete, once the initial dataset is loaded in Sting, additional iterations using OLAP style operations enjoy sub-second response times. Datasets can be set to periodically refresh, so results are kept fresh and up to date. Sting is written entirely in Python, making heavy use of libraries such as pandas and numpy to perform fast filtering and aggregation operations.


Here's the video from PyCon 2013: http://pyvideo.org/video/1743/python-at-netflix

Monday, March 11, 2013

Text processing: The Bottom Line

You get all types of data formats when you deal with clients and financial data. Some send you nicely delimited text files with a current data dictionary. Some send Excel files that look like the intern's preschooler designed them. But sometimes you end up with a report consisting of pages of fixed-width text designed to be printed on green-bar paper by the office AS/400.

If you need assistance in parsing text files, you can use commercial applications designed to handle the job, like Monarch. There are also many tools and utilities designed to view and parse text files. Both Scott Hanselman and Buck Woody have detailed lists that you should peruse and explore.
But let's use our skills and tackle the problem programmatically.

The nice thing about many of these fixed-width text reports is that they are very consistent in layout and organization, making them easy to parse. If they are generated from an accounting system that includes the GL (General Ledger) account number on each row, then you probably have the key to pulling out the information you need on a periodic basis. Let's see a small example.


GL Example: Federal Borrowings Program
Notice the layout is very regimented, with nicely formatted columns, descriptive headers and unique account numbers. The normal way a novice handles this type of file is to hand-edit it and then try to clean up the result in Excel. (Shudder!) This report's organization makes it easy to write a simple utility to rip out the needed values. Even if the file is in a printable "report" format with headers on each page, it's a simple task to ignore those rows by focusing on the ACCOUNT column.

Sometimes you don't need every row: rather than loading the data back into a database, you want to pull out specific totals and subtotals. It's easy enough to feed a list of account numbers or GL items to a routine, along with the positions and widths of the account/item numbers and the positions and widths of the balances. You then end up with a dictionary (Python, C#), a data structure that you can reference for calculations or export/return to be handled by another process. The process is something like this (a minimal Python sketch follows the list):

  1. Pass file name and list of items to a routine
  2. Create a dictionary structure with the list of items as the key values
  3. Read in each line of the file, looking for matching keys, (using position and width)
  4. If match found, populate the value for the matching key, (using position and width)
  5. Continue till done with file.
  6. Export the dictionary to files, do calculations, or whatever.
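
Here is a minimal sketch of that process in Python. The file name, account numbers and column positions below are made up for illustration; adjust them to the layout of your own report:

def pull_balances(file_name, items, key_pos, key_width, bal_pos, bal_width):
    # step 2: build a dictionary keyed by the items we're looking for
    balances = dict.fromkeys(items)

    # step 3: read each line and check the key columns for a match
    with open(file_name, 'r') as f:
        for line in f:
            key = line[key_pos:key_pos + key_width].strip()
            if key in balances:
                # step 4: populate the value from the balance columns
                balances[key] = line[bal_pos:bal_pos + bal_width].strip()

    # step 6: hand the dictionary back for calculations or export
    return balances

# hypothetical usage: two GL account numbers and guessed column positions
totals = pull_balances("gl_report.txt", ["140100", "140200"], 0, 10, 60, 18)
print(totals)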
Note: If you intend to do calculations on the values and use them as numbers, you will need to convert the text to numeric types. This means you will probably have to clean up the currency characters and thousands separators. Easy to do in Python, but sometimes tricky in C#. In the case of C#, include: using System.Globalization;

Then use the following method:

public static decimal getFinancials
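
On the Python side of that note, stripping the currency characters and separators before converting is only a line or two; a quick illustrative snippet:

from decimal import Decimal

raw = "$1,234,567.89"
value = Decimal(raw.replace("$", "").replace(",", ""))
print(value)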

Monday, February 25, 2013

Keep it Clean

As a developer, I was paid the ultimate compliment by a coworker last week.

"Hey, that PowerShell script you wrote is really clean." he said in passing.

This was a script that I had ported from another language, then tweaked and forwarded to him to help manage some server resources. It was an ugly hack originally intended to solve a personal need, but I refactored it to make it more modular, simpler and more readable, and added a few key comments.

The main benefit of clean code is that it's easy to come back months later and modify it to suit your needs without having to do some sort of digital archaeology. But the best outcome is when others can use it and maintain it without any additional assistance from you.

Would you like some Data with that?

 Water, water, everywhere, Nor any drop to drink.

One thing that has always bugged me about many of the talks and conferences I go to is the lack of good real-world datasets and examples. There is the ubiquitous use of AdventureWorks, which does fine for many demonstrations. Or the session based on the presenter's experience with his employer's assets, which aren't accessible for the audience to view or play with. And there's the MVP speech with the sports statistics and the matching ball cap, discussing ERA or passer efficiency rating to audiences from other cultures that follow different sports. And there are vendors that offer tools to generate sanitized datasets. If you need small or large datasets, hopefully you don't always resort to these fallbacks, since there are terabytes of interesting public data available on the Internet.

Open or "public" data as it is called as been around for years. Before the WWW was in the public spotlight, you could order various data sets and source code on physical media from vendors. Two decades later, with the acceptance of the Internet and the increase of bandwidth, there's a plethora of sources of a huge variety of data sets available. One good stopping point for an overview is Data.Gov, an aggregate of Open Federal Data sources and tools.

Before you dive in and start grabbing collections of miscellaneous agricultural and health care stats from online sources, you need to have an idea of what type, quality and quantity of data you are seeking. It's probably better to pick a domain that you have understanding of and experience in. And it doesn't hurt to select a data set that may scratch a personal itch or solve a business problem.

What's the Frequency, Kenneth?

One of my favourite online databases to pull from is the FCC ULS database. The FCC (Federal Communications Commission) is responsible for managing RF (radio waves) and other communication in the United States. The ULS (Universal Licensing System) is a system to keep track of licenses, frequency allocations and other business related to the FCC. As an amateur radio license holder, it's fun to keep track of my license and those of several hundred thousand other "ham" license holders. As a database professional, it's an open, well-documented source of real-world addresses with which to test skills, geocoding and CASS certification. So let's grab the Amateur Radio Service license database.

The license database (l_mat.zip) is an archive over 400 MB in size when expanded, so make sure you have the resources to handle it. Once you have the data extracted, it's time to take inventory and break out the tool kit.
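
As a rough starting point for that inventory (the archive name matches the one above; everything else is just one way to do it):

import os
import zipfile

# unpack the downloaded archive into a working directory
archive = zipfile.ZipFile("l_mat.zip")
archive.extractall("l_mat")
archive.close()

# take inventory: list each extracted file and its size in MB
for name in sorted(os.listdir("l_mat")):
    size_mb = os.path.getsize(os.path.join("l_mat", name)) / (1024.0 * 1024.0)
    print("{0}  {1:.1f} MB".format(name, size_mb))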

Pragmatism vs Partisanship


You ate Chinese food, so obviously you must hate Europeans...

Sounds silly, doesn't it? But that was the type of reaction I got from a data professional when I showed him a new book on data analysis that I was excited to add to my library. The software language didn't match his worldview or career investment, so I was labeled a "Microsoft basher". Which is silly, since we were at an event for users of Microsoft software, I was using a Windows phone, and two out of the three operating systems I was running on my laptop were Windows 7 and Windows Server 2012. And I spent much of the time taking notes in OneNote and discussing PowerShell 3 and SQL Server 12 with my cohort.

And the irony of the situation is that Microsoft and many of its employees and advocates have recognized that not all the great tools and goodness flow from the mother-ship in Redmond. Buck Woody, an author and well-known Microsoft database and Azure evangelist, recommends installing OSS text-handling utilities when setting up your Data Science Laboratory. Another well-known Microsoft technologist, Scott Hanselman, suggests many third-party tools and has a recent post discussing GitHub and line endings. With the existence of CodePlex, the inclusion of Git support in Visual Studio and the offering of Linux VMs on Azure, Microsoft is becoming more pragmatic and inclusive in regards to OSS.

And OSS is garnering growing commercial support. Red Hat has been making money for years. VMware supports both commercial and OSS hosts and guests. Some of the projects on CodePlex get adopted by commercial companies. And the data analysis tools featured in the book that seeded this post have commercial support from a company, Continuum Analytics, which just received a grant from DARPA to further develop their tools.

So, while disappointed in the reaction I received from this individual, I still respect him and hope to demonstrate the power of using both OSS and Microsoft tools together to tackle some tough data problems.

Thursday, January 3, 2013

Cleaning House, Shifting Gears

I've always practiced safe internet. But over the holidays I ended up spending more time reinstalling software than writing code on my laptop. No matter how careful you are in browsing and locking down your environment, there's a punk, a crook or a TLA that has the kryptonite to your defenses. The vendors can't keep up with the vandals. The out-of-band security updates from Redmond and other vendors came out one day after my firewall reported strange outbound traffic and blocked it.

This is something I anticipated would eventually happen, so I had backups and a plan of action ready. Since the laptop had to be zapped, I decided to remodel its contents. Instead of virtualizing Linux on a Windows 7 host, I reversed the roles. This allowed me to create an optimized Windows 7 image that can be used for various purposes. I already had Server 2008R2, 2012 and Windows 8 in VMs.

Why Linux? I debated installing Windows 8 as the host for a nanosecond, but I really don't feel any love for that mess of a UI. The tile interface I like on my phone doesn't work for me on my production desktop. Add the fact that Dell's recovery partition didn't recognize the hardware it's supposed to recover, and Linux does. But switching host systems allowed me to play with VMs and set up the tools for my latest digital explorations.

Back in the '90s I used to play with all the OSes on the block. At one time I was using seven distinct operating systems as part of my job. At the house we had even more esoteric hardware and software. When it came along, Linux was a bear to set up and configure. Fast-forward 15 years, a couple of generations of geeks and users, and billions in corporate contributions later, and Linux and much of the OSS universe have been refined and polished.

Along with Linux, Android, the BSDs and their ilk, many of the programming and data tools have matured. The most interesting code comes from the scientific data community. And some of the best packages work with my favorite programming language: Python.