Piles of Data

(2011-02-12)

Reading Dumped on by Data has brought to the forefront a huge problem facing the Internet-connected world of today: now that we have all the data we could ask for, how do we make sure we can store it and make sense of it all?

In the university research I've done, collecting digital data and getting it into the "correct" format for future work has been a huge percentage of the total effort, and I suspect that holds for research in all fields. With the advent of the Internet, we have the potential to form connections between disparate data points like never before, but oftentimes the time cost of munging the data into the correct format, bandwidth limits, political / IP concerns, and data trust issues prevent it from being distributed to a wider audience. Some of the worst offenders are those who receive federal aid for their research and then provide no publicly accessible raw data behind their findings. Reproducibility is a major component of any valid scientific experiment, but it is impossible unless third parties have access to the original sources to validate experiments.

Being limited by physical hard disk space is almost inexcusable in this day and age. In some cases, it results in the willful deletion of raw data to comply with some antiquated IT / business policy at a given institution. I believe this is where the possibilities of cloud computing really shine -- as an effectively unlimited, persistent, durable resource. Google's technological advances are made possible by their pervasive use of Bigtable internally, allowing a clear separation of concerns between those doing the data collection and those doing data persistence (two jobs that are too often shoveled off to the same person in a university setting). This facilitates technologies like Dremel, their amazing near-real-time distributed query and analysis tool.

Imagine this technology on a much larger scale -- one that's used by large portions of the university research community. Right now, if a statistician wants historical weather data for a given region and time period, it's available to a large extent on NOAA's site. Sometimes, however, readings are missing and the data in general might need to be normalized. Researchers end up doing this cleanup independently, and they can't or won't share the results for the reasons mentioned above.
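As a rough illustration of the kind of cleanup that gets repeated from lab to lab, here is a minimal sketch that fills missing readings by linear interpolation between the nearest known values. The function name `fill_gaps` and the list-of-floats input are assumptions made for this example; this is not NOAA's data format, and real normalization would need far more care (units, station changes, long outages).

```python
def fill_gaps(readings):
    """Return a copy of readings with interior None values filled in.

    Each gap is filled by linear interpolation between the nearest
    known readings on either side. Leading and trailing gaps have no
    neighbor on one side, so they are left as None.
    """
    filled = list(readings)
    # Indices of the readings that are actually present.
    known = [i for i, value in enumerate(filled) if value is not None]
    # Walk each pair of consecutive known readings and fill between them.
    for left, right in zip(known, known[1:]):
        span = right - left
        delta = filled[right] - filled[left]
        for i in range(left + 1, right):
            filled[i] = filled[left] + delta * (i - left) / span
    return filled


# Hypothetical daily temperatures with two missing days and a trailing gap.
temperatures = [10.0, None, None, 16.0, None]
cleaned = fill_gaps(temperatures)
```

If every researcher writes a slightly different version of this, the "same" dataset quietly diverges, which is part of why sharing the cleaned data matters.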

Of course, even if all the data in the world were suddenly and easily accessible, getting everything into cooperative formats is a huge effort in itself. This is partially mitigated at companies like Google, as there can be some top-down guidance for standardizing data formats. Determining a canonical data format is an open question when one is dealing with a heterogeneously-owned environment like a research community, but scientists have been good about cooperating on standards when given decent proposals (e.g., Berkeley sockets) and clear value. The Long Now project is partially concerned with this question of finding future-proof canonical formats for our data today.

Like any good technology, this theoretical ubiquitous persistence and sharing of data only becomes pervasive when two things happen -- 1: anyone can use it, and 2: no one really has to put any cognitive effort into making it work. Managing data has a long way to go in this regard, but it's great to see companies out there making this something realizable.

A holistic view of nearly everything

(2010-10-14)

Finding a connection between disparate fields fascinates me -- the similarities between work habits of hackers and painters, for example. I don't believe time spent studying a field indirectly related to one's field is wasted; although the benefits are often not immediately quantifiable, better understanding other aspects of the world cannot be discounted. Relating this to my university experience, I remain an advocate for hard science majors taking soft science courses.

Technical skills are a given to do well in any CS specialization, but the truly great thinkers call on knowledge from all parts of academia. John Holland took gene replication and turned it into a well-known family of CS algorithms, genetic algorithms. Regarding business decisions, understanding the incentives of your customer is often a sociological problem. As Paul Graham says, you have to make something people want. I see that phrase as a reminder to engineer types that a technical solution is not really a solution unless it solves a real problem a real person has.

I would argue a holistic view of CS/business problems is necessary to truly be successful at solving them in an effective way. I highly recommend Thinking in Systems: A Primer by Donella Meadows, a wonderfully simple book on looking at problems from a systems view, which is really just another way of saying, "look at a problem in terms of its inputs and outputs."

There are few constants in life, if Thinking in Systems is to be taken seriously. Ray Kurzweil hints that we as humans have a fundamental difficulty in understanding non-linear concepts, so we necessarily simplify our ontologies of the huge dynamism of the world in order to find some sort of sensible path through all of it.

Scheming to Learn Recursion

(2010-08-01)

Reading Gödel, Escher, Bach opened my eyes to how much of life is based on one simple idea, recursion, and what a beautiful one it is -- the idea that something can be composed of itself.

I've since been on a kick trying to learn Scheme, a dialect of Lisp known for its minimal standard library, which IMO makes learning its fundamentals easier. I tried starting off with Peter Seibel's Practical Common Lisp, but I found it to be more of a book about learning the Lisp language than about learning the underlying philosophy of Lisp.

My first serious run-in with Lisp came from being a fan of Paul Graham's essays for a while, especially his book Hackers and Painters, and more recently from Steve Yegge's blog posts and his strong advocacy of the Lisp language. I was also trying to work through Daniel Friedman's Essentials of Programming Languages, which assumes a working knowledge of Scheme in addition to being pretty academic, so I figured I'd start by learning the fundamentals of Scheme before trying to get any further.

With that said, what I'm reading now is The Little Schemer, also by Daniel Friedman. This book's style is a complete turnaround from Essentials; it assumes no programming experience, and does a great job of going through recursive problems step by step. It's also a pretty short read. I'd highly recommend it to anyone who feels they're missing some background in solving recursive problems.
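To give a flavor of the step-by-step recursion the book drills, here is a minimal sketch of two list exercises written in that spirit. It is in Python rather than Scheme, and the function names are illustrative choices for this example, not code taken from the book.

```python
def is_member(atom, items):
    """Return True if atom occurs anywhere in the list items."""
    if not items:                 # base case: the empty list holds nothing
        return False
    if items[0] == atom:          # found it at the head of the list
        return True
    return is_member(atom, items[1:])   # otherwise ask about the rest


def remove_first(atom, items):
    """Return a new list with the first occurrence of atom removed."""
    if not items:                 # base case: nothing left to remove from
        return []
    if items[0] == atom:          # drop the head and keep the rest as-is
        return items[1:]
    # Keep the head, and recur on the rest of the list.
    return [items[0]] + remove_first(atom, items[1:])


menu = ["lamb", "chops", "and", "mint", "jelly"]
without_mint = remove_first("mint", menu)
```

Both functions follow the same pattern: answer the question for the empty list first, then for the head of the list, then hand the rest of the list back to the same function.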

On a side note: It's a shame that a lot of CS schools have either not advocated learning a Lisp to begin with or are phasing it out (MIT, I'm looking at you). I've never really understood the argument that Python is the best language for people who have never programmed before; I've found in practice that Python has as many quirks ("closures", etc.) as most other languages I enjoy programming in, but I would agree it has a strong resemblance to pseudocode, which implies something good about it I suppose.
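One commonly cited closure quirk, which may or may not be the one meant above, is late binding: a function created in a loop looks up the loop variable when it is called, not when it is defined. A minimal sketch, with function names made up for this example:

```python
def make_adders_late():
    """Build three adders that all share the same loop variable n."""
    adders = []
    for n in range(3):
        # The lambda closes over the variable n itself, not its value
        # at this moment, so every adder sees n's final value.
        adders.append(lambda x: x + n)
    return adders


def make_adders_bound():
    """Build three adders that each capture n's value at creation time."""
    adders = []
    for n in range(3):
        # A default argument is evaluated when the lambda is created,
        # which freezes the current value of n for this adder.
        adders.append(lambda x, n=n: x + n)
    return adders


late_results = [add(10) for add in make_adders_late()]     # [12, 12, 12]
bound_results = [add(10) for add in make_adders_bound()]   # [10, 11, 12]
```

A newcomer reading the first version would reasonably expect 10, 11, 12, which is the sort of surprise that undercuts the "reads like pseudocode" argument.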

If I could be taught over again, I'd much rather have someone teach me the core CS concepts in an extremely pure language like Scheme than have them taught concurrently with the syntax and quirks of C/C++, which is for the most part what went on at the University of Michigan. I believe this is a result of trying to serve many needs at once (practicality, abundance of EE/CE majors). Ideally these would be taught separately.

Learning Perl

(2010-07-09)

After what was probably too long trying to avoid it, I'm finally diving head-first into Perl. I feel learning Perl takes more investigative work than some languages, as what counts as best practice is somewhat subjective and has changed since a few of the more popular books were published (e.g., Programming Perl). I've made a few observations:

  • Ruby borrows Perl's syntax more than I realized (e.g., postfix conditionals)
  • The breadth of libraries available is amazing -- by far the best of any language I've used -- and people really seem to be good about writing good docs for their code.
  • cpanminus is by far the best way to use CPAN; it works exactly how I would expect a package management system would in 2010, unlike the bevy of features you need to configure when using the perl -MCPAN -eshell CLI.
  • Moose is an impressive object system, much more feature-rich and competitive with today's OO languages than Perl's default.
  • Pervasive context-awareness took me a while to understand, but now it seems natural to me. References add another layer of complexity, and having to deal with both at the same time can be intimidating to new users.
  • Perl's argument passing reminds me of JavaScript; or should that be the other way around?
  • The default variables ($_ and @_) seem like a bad idea for readability, but they really cut down on the tedium of dealing with variable names in cases where the meaning is obvious. They're great for short programs, probably not so much for modules.
  • Perl's efforts to be backwards-compatible have left it with a lot of vestigial syntax; in a way it suffers from the same problems C++ does in trying to please everyone. The solution for a lot of teams using C++ is to use a well-defined subset of the features, and I believe the Perl parallel is making sure to stick to the latest features of the language.

Back in Seattle

(2010-06-27)

I didn't mentally plan for being in Seattle again, but it's here. This time I'm here for the foreseeable future.

I picked up some great records already at Zion's Gate and Half-Price Books. I have a feeling finding good music is going to be pretty easy. Here's what I picked up:

Tuftian

(2010-02-10)

If you're at all interested in visual / graphic design, I'd highly recommend picking up a copy of Edward Tufte's The Visual Display of Quantitative Information. I just finished reading through it after having set it aside for a while. It concisely highlights features of great graphic design. He wrote it pre-information age, but I found that it translates very well to good design for the web and computer interfaces.

One of the main ideas is that great graphic design reveals the greatest number of ideas in the shortest time, with the least ink, in the smallest space. He provides clear examples for each point he makes.

In any case, it's a good coffee table book.

txting

(2010-01-08)

One thing I find interesting about text messaging is the constraint the 160-character limit imposes on what you can say. Our future poets will Tweet.

mPark working

(2010-01-04)

I've finally got mPark running again after it was down for a month or so. The goal is to provide a better service for finding any kind of parking in Ann Arbor. It currently uses canned data, but I should (hopefully) have it pulling from craigslist pretty soon. Users can search by distance from home and/or price, and it shows all the spots on a Google Map complete with Street View so you can see what each one looks like.

pdlib released

(2010-01-03)

Today marks the initial release of pdlib, which will hopefully encourage more Pure Data development on the iPhone. Also, I'm planning on having wordy and mPark up (and possibly revamped) relatively soon.

Happy new year

(2010-01-01)

Happy new year everyone!

I'm in the middle of redesigning the site. I'd like to increase the activity level of the site as there are a few projects I've been working on which need a home online. One is iJam, a mobile music app for the iPhone that I and a few others are working on. We hope to have it out sometime in the spring. It uses another library called pdlib which I'm excited to be able to share with other developers.