—
(2011-02-12)
Reading Dumped on by Data has brought to light, first and foremost, a huge problem
facing the Internet-connected world of today: now that we have all the data we
could ask for, how do we make sure we can store it and make sense of it all?
In any university research I've done, collecting digital data and getting it into
the "correct" format for future work
has been a huge percentage of the work, and the same is true of research in all fields.
With the advent of the Internet, we have the potential to form
connections between disparate data points like never before, but often the
time cost of munging the data into the correct format, bandwidth limits, political / IP
concerns, and data trust issues prevent it from being distributed to a wider audience. Some of
the worst offenders are those that receive federal aid for their research, and
then go on to not provide any publicly-accessible raw data from their findings.
Reproducibility is a major component of any valid scientific experiment, but
this is impossible unless any and all third parties have access to original sources to
validate experiments.
Almost inexcusable in this day and age is being limited by physical hard
disk space. In some cases, this results in a willful deletion of raw data to
comply with some antiquated IT / business policy for a given institution. I
believe this is where the possibilities of cloud computing really shine -- as an
unlimited, persistent, durable resource. Google's technological advances are
made possible by their pervasive use of
BigTable internally, allowing a clear separation of concerns between those
doing the data collection vs. those doing data persistence (which is too often
shoveled off to the same person in a university setting). This facilitates
technologies like Dremel, their amazing near-real-time distributed query &
analysis tool.
Imagine this technology on a much larger scale -- one that's
used by large portions of the university research community. Right now, if a
statistician wants to get historical weather data for a given region and period,
it's available to a large extent on NOAA's site.
Sometimes, however, readings are missing and the data in general might need to be
normalized. Researchers end up doing this cleanup independently and can't/won't share
the results for the aforementioned reasons.
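As a minimal sketch of the kind of cleanup meant here, the Python below fills gaps in a series of readings by linear interpolation. This is only one possible approach, not something the post or NOAA prescribes; the function name `fill_gaps`, the use of `None` for a missing reading, and the sample values are all assumptions made for illustration.

```python
def fill_gaps(readings):
    """Linearly interpolate None entries that sit between two known readings.

    Gaps at the very start or end have no neighbour on one side,
    so they are left as None rather than guessed.
    """
    filled = list(readings)
    # Indices of the readings that are actually present.
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Hypothetical hourly temperatures with missing readings.
hourly_temps = [10.0, None, 14.0, None, None, None, 22.0, None]
print(fill_gaps(hourly_temps))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, None]
```

Every researcher who needs the series ends up writing some variant of this, which is exactly the duplicated effort a shared, already-normalized dataset would remove.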
Of course, even if all the data in the world were suddenly and easily
accessible, getting everything into cooperative formats is a huge effort in itself.
This is partially mitigated at companies like Google, as there can be some
top-down guidance for standardizing data formats. Determining a canonical data
format is an open question when one is dealing with a heterogeneously-owned
environment like a research community, but scientists have been good about cooperating
on standards when given decent proposals with clear value (e.g., Berkeley sockets).
The Long Now project is partially concerned with this question of
finding future-proof canonical formats for our data today.
Like any good technology, this theoretical ubiquitous persistence and sharing of
data only becomes pervasive when two things happen -- 1:
anyone can use it, and 2: no one really has to put any cognitive effort into
making it work. Managing data has a long way to go in this regard, but it's
great to see companies out there making this something realizable.
—
(2010-10-14)
Finding a connection between disparate fields fascinates me -- the similarities
between work habits of hackers and painters, for example.
I don't believe time spent studying a field indirectly related to one's own is wasted;
although the benefits are often not immediately quantifiable, the value of better understanding
other aspects of the world cannot be discounted.
Relating this to my university experience, I remain an advocate for hard
science majors taking soft science courses.
Technical skills are a given to do well in any CS specialization,
but the truly great thinkers call on knowledge from all parts of academia.
John Holland took gene
replication and turned it into a well-known CS algorithm.
Regarding business decisions, understanding the incentives of your customer is
often a sociological phenomenon. As Paul Graham says, you have to
make something people want. I see that phrase as a reminder to engineer
types that the technical solution is not really a solution unless it solves a
real problem a real person has.
I would argue a holistic view of CS/business problems is necessary to truly be successful
at solving them in an effective way. I highly recommend Thinking
in Systems: A Primer
by Donella Meadows, a wonderfully simple book on looking at
problems from a systems view, which is really just another way of saying, "look
at a problem in terms of its inputs and outputs."
There are few constants in life, if Thinking in Systems is to be taken
seriously. Ray Kurzweil hints that we as humans have a fundamental difficulty in
understanding non-linear concepts, so we necessarily simplify our
ontologies of the huge dynamism of the world in order to find some
sort of sensible path through all of it.
—
(2010-08-01)
Reading Gödel, Escher, Bach
opened my eyes to how much of life is based on one
simple idea, and what a beautiful one it is -- the idea that something can be composed
of itself.
I've since been on a kick trying to learn Scheme,
which is a dialect of Lisp known for its minimal standard library, which IMO makes
learning its fundamentals easier. I tried starting off with Peter Seibel's
Practical Common Lisp, but I found it to be more
of a book about learning the Lisp language than learning the underlying philosophy
of Lisp.
My first serious run-in with Lisp came from being a fan of Paul Graham's essays
for a while, and especially his book
Hackers and Painters, and more recently from Steve Yegge's blog posts and his strong advocacy of the Lisp
language. I was also trying to work through Daniel Friedman's
Essentials of Programming Languages,
which assumes a working knowledge of Scheme in addition to being pretty academic,
so I figured I'd start off learning the fundamentals of Scheme before
I tried to get any further.
With that said, what I'm reading now is
The Little Schemer,
also by Daniel Friedman. This book's style is a complete turnaround from Essentials; it
assumes no experience programming, and does a great job of going through
recursive problems step by step. It's also a pretty short read. I'd highly recommend it
to anyone that feels they're missing some background in solving recursive problems.
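The step-by-step recursive style described here can be sketched in a few lines. The example below is written in Python rather than Scheme and is not taken from the book; `count_atoms` is a hypothetical name, and treating any non-list value as an "atom" is an assumption made for illustration.

```python
def count_atoms(items):
    """Count the non-list values in an arbitrarily nested list."""
    if not isinstance(items, list):
        return 1  # base case: a single atom
    if not items:
        return 0  # base case: the empty list
    # Recursive case: the first element plus the rest of the list.
    # A nested list is handled by the same function, so the structure
    # "composed of itself" is walked by a function defined in terms of itself.
    return count_atoms(items[0]) + count_atoms(items[1:])


print(count_atoms([1, [2, 3], [[4], []]]))  # 4
```

The shape is always the same: name the simplest cases first, then reduce everything else to a smaller instance of the same problem.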
On a side note:
It's a shame that a lot of CS schools have either not advocated learning a Lisp
to begin with or are phasing them out (MIT, I'm looking at you).
I've never really understood the argument that Python is the best language for people
that have never programmed before; I've found in practice that Python has as many quirks
("closures", etc.) as most other languages I enjoy programming in, but I would agree it
has a strong resemblance to pseudocode, which implies something good about it I
suppose.
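The post doesn't say which closure quirk it has in mind. One commonly cited candidate, assumed here purely for illustration, is late binding: a closure looks up a variable when it is called, not when it is defined. The function names below are hypothetical.

```python
def make_multipliers_late():
    # Each lambda looks up `i` when it is *called*; by then the loop
    # has finished and `i` holds its final value, 2.
    return [lambda x: x * i for i in range(3)]


def make_multipliers_bound():
    # A default argument is evaluated when the lambda is *defined*,
    # so each lambda keeps its own copy of `i`.
    return [lambda x, i=i: x * i for i in range(3)]


print([f(10) for f in make_multipliers_late()])   # [20, 20, 20]
print([f(10) for f in make_multipliers_bound()])  # [0, 10, 20]
```

A newcomer would reasonably expect both lists to read `[0, 10, 20]`, which is the sort of surprise that undercuts the "reads like pseudocode" argument.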
If I could be taught over again, I'd much rather have someone teach me the core CS
concepts in an extremely pure language like Scheme than trying to teach that
concurrently with the syntax and quirks of C/C++, which is what for the most part
went on at the University of Michigan. I believe this is a result of trying to serve
many needs at once (practicality, abundance of EE/CE majors). Ideally these would
be taught separately.
—
(2010-07-09)
After what was probably too long trying to avoid it, I'm finally diving head-first
into Perl. I feel learning Perl takes more investigative work than some languages,
as what's best practice is somewhat subjective and has changed since a few of the
more popular books were published (e.g., Programming Perl).
I've made a few observations:
- Ruby borrows Perl's syntax more than I realized (e.g., postfix conditionals)
- The breadth of libraries available is amazing -- by far the best of any language
I've used -- and people really seem to be good about writing solid docs for their code.
- cpanminus is by far the best way to use CPAN; it works exactly
how I would expect a package management system would in 2010, unlike the bevy of
features you need to configure when using the perl -MCPAN -eshell CLI.
- Moose is an impressive object system, much more
feature-rich and competitive with today's OO languages than Perl's default.
- Pervasive context-awareness is something that took me a while to understand, but
now it seems natural to me. In addition, dealing with references adds another layer of
complexity that can be intimidating to new users having to deal with both at the
same time.
- Perl's argument passing reminds me of JavaScript; or should that be the other way around?
- Topic variables (@_ and $_) seem like a bad idea for readability,
but really cut down on the tedium of dealing with variable names in cases where
it's obvious. They're great for short programs, probably not so much for modules.
- Perl's efforts to be backwards-compatible have left it with a lot of vestigial
syntax; in a way it suffers from the same problems C++ does in trying to please
everyone. The solution for a lot of teams using C++ is to use a well-defined
subset of the features, and I believe the parallel in Perl is making sure to stick to
its latest features.
—
(2010-06-27)
I didn't mentally plan on being in Seattle again, but here I am.
This time I'm here for the foreseeable future.
I picked up some great records already at Zion's Gate
and Half-Price Books. I have a feeling finding
good music is going to be pretty easy. Here's what I picked up:
—
(2010-02-10)
If you're at all interested in visual / graphic design, I'd highly recommend
picking up a copy of Edward Tufte's Visual Display of Quantitative
Information.
I just finished reading through it after having set it aside for a while.
It concisely highlights features of great graphic design. He wrote
it pre-information age, but I found that it translated very well to good design for the
web and computer interfaces.
One of the main ideas is that great graphic design reveals the greatest number
of ideas in the shortest time, with the least ink, in the smallest space. He
provides clear examples for each point he makes.
In any case, it's a good coffee table book.
—
(2010-01-08)
One thing I find interesting about text messaging is the constraint the
160-character limit imposes on what you can say. Our future poets will Tweet.
—
(2010-01-04)
I've finally got mPark running again after it being
down for a month or so. The goal is to provide a better service to find any
kind of parking in Ann Arbor. It currently uses canned data, but I should
(hopefully) have it pulling from Craigslist pretty soon. Users can search by
distance from home and/or price, and it shows all the spots on a Google Map
complete with StreetView so you can see what it looks like.
—
(2010-01-03)
Today marks the initial release of pdlib, which will
hopefully encourage more Pure Data development on the iPhone. Also, I'm
planning on having wordy and mPark up (and possibly revamped) relatively soon.
—
(2010-01-01)
Happy new year everyone!
I'm in the middle of redesigning the site. I'd like to increase the activity
level of the site as there are a few projects I've been working on
which need a home online. One is iJam, a
mobile music app for the iPhone that I and a
few others are working on. We hope to have it out sometime in the spring. It
uses another library called pdlib which I'm excited to be
able to share with other developers.