Water, water, everywhere, Nor any drop to drink.
One thing that has always bugged me about many of the talks and conferences I go to is the lack of good real-world datasets and examples. There is the ubiquitous use of AdventureWorks, which does fine for many demonstrations. Or the session based on the presenter's experience with his employer's assets, which are not accessible to the audience to view or play with. And there's the MVP speech with the sports statistics and the matching ball cap, discussing ERA or passer efficiency rating with audiences from other cultures that follow different sports. And there are vendors that offer tools to generate sanitized datasets. If you need small or large datasets, hopefully you don't always resort to these fall-backs, since there are terabytes of interesting public data available on the Internet.
Open or "public" data, as it is called, has been around for years. Before the WWW was in the public spotlight, you could order various data sets and source code on physical media from vendors. Two decades later, with the acceptance of the Internet and the increase in bandwidth, there's a plethora of sources offering a huge variety of data sets. One good stopping point for an overview is Data.Gov, an aggregate of open federal data sources and tools.
Before you dive in and start grabbing collections of miscellaneous agricultural and health care stats from online sources, you need to have an idea of what type, quality and quantity of data set you are seeking. It's probably better to pick a domain in which you have some understanding and experience. And it doesn't hurt to select a data set that may scratch a personal itch or solve a business problem.
What's the Frequency, Kenneth?
One of my favourite online databases to pull from is the FCC ULS database. The FCC (Federal Communications Commission) is responsible for managing RF (radio waves) and other communications in the United States. The ULS (Universal Licensing System) is a system to keep track of licenses, frequency allocations and other business related to the FCC. As an amateur radio license holder, I find it fun to keep track of my own license and those of several hundred thousand other "ham" license holders. As a database professional, I see it as an open, well-documented source of real-world addresses with which to test skills, geocoding and CASS certification. So let's grab the Amateur Radio Service License database.
The license database (l_mat.zip) is an archive that is over 400 MB in size when expanded, so make sure you have the resources to handle it. Once you have the data extracted, it's time to take inventory and break out the tool kit.
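If you want to size things up before committing disk space, a few lines of Python can take that inventory straight from the archive without extracting it. This is only a rough sketch using the standard zipfile module. The function names inventory and total_megabytes are my own inventions for illustration, and the code assumes nothing about what the archive contains beyond it being a valid zip file.

```python
import zipfile


def inventory(archive):
    """Return (member name, uncompressed size in bytes) pairs, largest first.

    `archive` may be a path to a zip file or an open binary file object.
    Only the archive's directory is read; nothing is extracted.
    """
    with zipfile.ZipFile(archive) as zf:
        members = [(info.filename, info.file_size) for info in zf.infolist()]
    return sorted(members, key=lambda member: member[1], reverse=True)


def total_megabytes(members):
    """Sum the uncompressed sizes and express the total in megabytes."""
    return sum(size for _, size in members) / 1_000_000
```

Calling inventory("l_mat.zip") on the downloaded file and passing the result to total_megabytes should tell you roughly how much room the expanded data will need, and which files inside the archive account for most of it.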