Monday, March 11, 2013

Text processing: The Bottom Line

You get all types of data formats when you deal with clients and financial data. Some send you nicely delimited text files with an current data dictionary. Some send Excel files that look like the intern's preschooler designed them. But sometimes you end up with a report consisting of pages of fixed-width text designed to be print off on the green-bar paper printer by the office AS/400.

If you need assistance in parsing text files, you can use commercial applications designed to handle the job like Monarch. There's also many tools and utilities designed to view and parse text files. Both Scott Hanselman and Buck Woody have detailed lists that you should peruse and explore.
But let's our skills and tackle the problem programatically.

The nice thing about many of the fixed-width text reports is they are very consistent in layout and organization, making them easy to parse.  If they are generated from an accounting system that includes the GL, (General Ledger), account number on each row, then you probably have the key to pulling out the information needed on a periodical basis. Let's see an small example.


GL Example
Federal Borrowings Program
Notice the layout is very regimented with nicely formatted columns, descriptive headers and unique account numbers. The normal way a novice handles this type of file is to hand edit it and then try to clean up the result in Excel. (Shudder!) This report's organization makes it easy to write a simple utility rip the needed values. Even if the file is in a printable "report" format with headers on each page, it's a simple task to ignore these rows by focusing on the ACCOUNT column.

Sometimes you don't need every row since you don't want to load the data back into a database, you want to pull out specific totals and sub totals. It's easy enough to feed a list of account numbers or GL items to a routine, along with a list of position and widths of the account/items and the position and widths of the balances. You then end up with a dictionary, (Python, C#), a data structure that you can reference for calcutions or export/return to be handled by another process.  The process is something like this:

  1. Pass file name and list of items to a routine
  2. Create a dictionary structure with the list of items as the key values
  3. Read in each line of the file, looking for matching keys, (using position and width)
  4. If match found, populate the value for the matching key, (using position and width)
  5. Continue till done with file.
  6. Export dictionary to files, do calulations, or whatever.
Note: If you intend to do calculations on the values and wish to use them as numeric values, you will convert the text to numeric. This means you will probably have to clean up the currency characters and thousands separators. Easy to do in Python, but sometimes tricky in C#. In the case of C# include:  using System.Globalization;

Then use the following method:

public static decimal getFinancials

No comments:

Post a Comment