Tuesday, March 12, 2013

Netflix and Python

There's a nice post on their technical blog about how Netflix uses Python.

I found this section quite interesting:

Data Science and Engineering

Our Data Science and Engineering teams rely heavily on Python to help surface insights from the vast quantities of data produced by the organization. Python is used in tools for monitoring data quality, managing data movement and syncing, expressing business logic inside our ETL workflows, and running various web applications to visualize data.

One such application is Sting, a lightweight RESTful web service that slices, dices, and produces visualizations of large in-memory datasets. Our data science teams use Sting to analyze and iterate against the results of Hive queries on our big data platform. While a Hive query may take hours to complete, once the initial dataset is loaded in Sting, additional iterations using OLAP style operations enjoy sub-second response times. Datasets can be set to periodically refresh, so results are kept fresh and up to date. Sting is written entirely in Python, making heavy use of libraries such as pandas and numpy to perform fast filtering and aggregation operations.

Here's the video from PyCon 2013: http://pyvideo.org/video/1743/python-at-netflix

Monday, March 11, 2013

Text processing: The Bottom Line

You get all types of data formats when you deal with clients and financial data. Some send you nicely delimited text files with an current data dictionary. Some send Excel files that look like the intern's preschooler designed them. But sometimes you end up with a report consisting of pages of fixed-width text designed to be print off on the green-bar paper printer by the office AS/400.

If you need assistance in parsing text files, you can use commercial applications designed to handle the job like Monarch. There's also many tools and utilities designed to view and parse text files. Both Scott Hanselman and Buck Woody have detailed lists that you should peruse and explore.
But let's our skills and tackle the problem programatically.

The nice thing about many of the fixed-width text reports is they are very consistent in layout and organization, making them easy to parse.  If they are generated from an accounting system that includes the GL, (General Ledger), account number on each row, then you probably have the key to pulling out the information needed on a periodical basis. Let's see an small example.

GL Example
Federal Borrowings Program
Notice the layout is very regimented with nicely formatted columns, descriptive headers and unique account numbers. The normal way a novice handles this type of file is to hand edit it and then try to clean up the result in Excel. (Shudder!) This report's organization makes it easy to write a simple utility rip the needed values. Even if the file is in a printable "report" format with headers on each page, it's a simple task to ignore these rows by focusing on the ACCOUNT column.

Sometimes you don't need every row since you don't want to load the data back into a database, you want to pull out specific totals and sub totals. It's easy enough to feed a list of account numbers or GL items to a routine, along with a list of position and widths of the account/items and the position and widths of the balances. You then end up with a dictionary, (Python, C#), a data structure that you can reference for calcutions or export/return to be handled by another process.  The process is something like this:

  1. Pass file name and list of items to a routine
  2. Create a dictionary structure with the list of items as the key values
  3. Read in each line of the file, looking for matching keys, (using position and width)
  4. If match found, populate the value for the matching key, (using position and width)
  5. Continue till done with file.
  6. Export dictionary to files, do calulations, or whatever.
Note: If you intend to do calculations on the values and wish to use them as numeric values, you will convert the text to numeric. This means you will probably have to clean up the currency characters and thousands separators. Easy to do in Python, but sometimes tricky in C#. In the case of C# include:  using System.Globalization;

Then use the following method:

public static decimal getFinancials