Main

July 01, 2013

First, Philosophy, then Data Science

One of the most telling articles on "Data Science" appeared in the NYTimes in April[1]. We are facing a massive shortage of data scientists, it read. "There will be almost half a million jobs in five years, and a shortage of up to 190,000 qualified data scientists."

Trouble is, the same article says, "Because data science is so new, universities are scrambling to define it and develop curriculums."

So -- we don't know what they are, but we need 500,000 of them.

Continue reading "First, Philosophy, then Data Science" »

November 03, 2012

Immovable Objects

There are many things that we treat every day as immovable objects. Immovable? More correctly, they are things that we believe to exist in a particular context, and to always exist in that context. We see Mount Rushmore and quickly -- unconsciously -- put it in the context of South Dakota, the United States, or perhaps the movie 'North By Northwest'. We see hula dancers, and unconsciously think of Hawaii. We see steamboats going along a river, and unconsciously think of the Mississippi, the Missouri or perhaps the Ohio; maybe we think of Mark Twain or 'Showboat'.

We make these cognitive leaps because we have become accustomed to certain signals in the data -- in this case, visual data -- that trigger assumptions about other related things we know.

When we are dealing with data, software and even business processes, we often make these same unconscious leaps. They are unconscious because we don't recognize that we are making the connection. And because we don't recognize that, we don't challenge it or demand proof of its validity; we don't question it. We make the leap and move on.

That's a reasonable thing to do in most cases. Challenging every assumption would waste an enormous amount of time, become extremely tiresome and tedious and annoying, and cost a lot of money.

Fair enough. We don't challenge obvious things. But we need to always be prepared to discover that the obvious connection is invalid. We must always be prepared to discover that some unquestionable belief is in doubt -- in fact, that it is wrong, and that it is leading us to make bad decisions.

Case in point -- telephone numbers.

Some years ago, the United States telephone system adopted the concept of an "area code". This 3-digit prefix to the US phone number (which, by then, was standardized on a 7-digit number format) was assigned to specific geographical areas of the country. Hawaii's area code was 808. Michigan had three area codes (517, 616 and 313). Other states, and Canadian provinces, got their selection of area codes.

Over time, as populations grew and telephone usage became more concentrated, the number of area codes expanded. High-density areas were divided into two area codes, then divided again, and yet again, as population -- or more specifically, as telephone usage -- grew. In my case, over the course of 35 years, I have lived in three different area codes, yet neither I nor my phone moved.

In due course, the area code became recognized not only as part of your telephone number but -- more interestingly -- as an indicator of your location. Your phone number starts with '212'? You're a New Yorker. '415'? San Francisco Bay Area. '313'? Detroiter. '312'? Chicago.

Enter the profilers and aggregators. Given a 10-digit phone number, we can now derive your location. From your location, we may be able to infer something else about you -- your demographic, your interests (downhill skiing if you're in Colorado; surfing in Southern California; theater if you're in New York; country music if you're in Nashville). A small piece of data intended to make the phone numbering system more manageable is now revealing something about you -- because we made the (now unconscious) leap from area code to location to probable interest.
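
To make that chain concrete, here is a minimal sketch in Python. The lookup tables are invented stand-ins, not real profiling data; the point is how short, and how assumption-laden, the path from a phone number to a "probable interest" is.

```python
# Illustrative sketch of the area-code-to-interest leap.
# The tables are invented examples, not real profiling data.

AREA_CODE_TO_REGION = {   # assumption 1: the area code tells you where someone is
    "212": "New York",
    "303": "Colorado",
    "615": "Nashville",
}

REGION_TO_INTEREST = {    # assumption 2: where someone is tells you what they like
    "New York": "theater",
    "Colorado": "downhill skiing",
    "Nashville": "country music",
}

def probable_interest(phone_number: str) -> str:
    """Derive a 'probable interest' from nothing but a 10-digit phone number."""
    area_code = phone_number[:3]
    region = AREA_CODE_TO_REGION.get(area_code, "unknown")
    return REGION_TO_INTEREST.get(region, "unknown")

print(probable_interest("2125551234"))  # 'theater', valid only while both leaps hold
```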

Enter the mobile phone.

Before the entry of the consumer mobile phone, a telephone number was associated with a location. It was in my house at a particular address, or in an office in a particular building, or even in a phone booth on a particular street corner. The phone number pointed to a physical, geographical location. Thus it was completely reasonable to jump from phone number (specifically, area code) to location, and from location to interests.

But the mobile phone has no location. It moves from place to place. It is -- what's that word? -- right, it's "mobile".

With the mobile phone, you can't make any assumption about its location at a point in time. However, you can still make an assumption about the owner of a mobile phone. People will buy a mobile phone at a store near their home. That store will assign a number based on its location which is likely to be "close enough" to the customer's regular location. So, for a time, the assumption is safe, though a little shaky.

Enter the ability to "port" your landline phone number to a mobile phone.

In time, those phone numbers that did have a physical location -- your landline phone number -- became mobile. It was possible to replace a landline with a mobile phone, but keep the number. And, of course, the mobile phone was mobile, so a mobile phone with a landline phone number made that landline phone number mobile as well.

Suddenly, phone numbers which were reliably associated with a location became disassociated. In my case, I have a phone number that had been a landline in a location 5,000 miles away from my home. Beyond that, I took my mobile phone to a third location -- neither my home nor the original location of the phone number -- for an extended period of time.

Needless to say, the systems and users who had already made the phone number-to-location assumption did not change their assumptions. The 5 or 6 hour time difference between my phone number's apparent location and my actual location created some particular annoyance -- that telemarketing call at dinner time became a call waking me up at midnight. When the charity was scheduling a route to pick up discards in "my neighborhood", I would get a call asking if I had a donation to put on a porch 5,000 miles away. And -- this is, after all, election season -- the robo-calls would wake me in the middle of the night to urge me to vote for someone who did not represent me in any government.

Immovable objects sometimes move. That's the obvious lesson.

The more important lesson goes to our readiness to recognize that an assumption that served us well in the past has begun to fail. If we are ready for that shift, then we keep careful track of the assumptions we've made; as importantly, we keep track of the conclusions we've drawn that are dependent on those assumptions. When the shift comes, and the assumption must be abandoned, all the dependent conclusions also have to be adjusted -- either abandoned in turn, or defended on the basis of some other known and reliable information.

When we build our software or data models, we usually base the model on the business rules that we are given. It is nonetheless our responsibility, as software architects and data architects, to be vigilant about inferences, assumptions and conclusions that are built on convention rather than on a solid and lasting characteristic. If we design according to convention or common usage, we have to be prepared for that convention to be discarded, and with it, significant portions of our design.
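
One way to prepare, sketched below in Python with hypothetical names, is to confine the convention-based leap to a single place and to tag every value derived from it with the assumption it rests on. When the convention fails, the dependent conclusions can at least be found.

```python
from dataclasses import dataclass

@dataclass
class Derived:
    """A derived value that remembers the assumption it depends on."""
    value: str
    assumption: str

# The convention lives in exactly one place; discarding it means revisiting
# this function and anything holding a Derived value tagged with its assumption.
_AREA_CODE_TO_REGION = {"212": "New York", "313": "Detroit"}  # illustrative only

def region_from_phone(phone_number: str) -> Derived:
    region = _AREA_CODE_TO_REGION.get(phone_number[:3], "unknown")
    return Derived(value=region,
                   assumption="an area code reliably indicates the subscriber's location")

home = region_from_phone("3135551234")
print(home.value, "|", home.assumption)
```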

Software Metaphysics calls for the careful vetting of all assumptions, all inferences, and all derived information and practice that may -- however reliable it seems at the time -- one day be relegated to the dustbin of history.

May 01, 2012

The Key is "Just Be Natural"

A questioner asked how to select a natural key in a business data model. In this case, the business is modeling client organizations across international boundaries. He wrote:
In a business data model I was wondering about something like National Registration Number (varies by country), Country, and National Registration Number Qualifier (if there could be multiple registering organizations - which raises other considerations).
A natural key (as opposed to, say, a surrogate key) has the two-fold goal of being (1) unique, and (2) "naturally" memorable. A memorable attribute of a person or an organization is their name, but -- alas -- that is almost never guaranteed to be unique. In a business setting, where the business has control of the organization's name, the name is a much better candidate. However, it's not clear from the question whether the business has control of the name.

It is important to know that the "National Registration Number" is a surrogate, and not a natural key. At least this is so in the United States, where this is the "Federal Employer Identification Number" (EIN) or the "Taxpayer Identification Number" (TIN) -- a number assigned by the Federal government but not indicative of the business itself. Like any surrogate key, the number is arbitrarily assigned and usually forgotten by the business. (Technically, the TIN is not guaranteed to be unique either, though in practice it is.)

My point, I guess, is that the purpose of having a natural key is to simplify look-up -- and typically you are simplifying the look-up of a surrogate key. Within the data model, all relationships are by the surrogate key, a system-defined identifier that is guaranteed to be unique (though not memorable, meaningful or derivable).

So a suitable natural key could be the name of the organization or an organizational identifier, coupled with some geographic qualifier (in the US, the country is not specific enough; many business names are only unique within a state).
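
A minimal sketch of that arrangement, using SQLite through Python's standard library; the table names, column names and values are mine, invented for illustration, not taken from the questioner's model. The surrogate key carries every relationship, while the natural key of name plus state exists solely as the user's point of entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organization (
    org_id     INTEGER PRIMARY KEY,     -- surrogate key: system-assigned, not memorable
    org_name   TEXT NOT NULL,
    state_code TEXT NOT NULL,
    ein        TEXT,                    -- the 'registration number': another surrogate, assigned elsewhere
    UNIQUE (org_name, state_code)       -- natural key: memorable, unique, the user's way in
);
CREATE TABLE contract (
    contract_id INTEGER PRIMARY KEY,
    org_id      INTEGER NOT NULL REFERENCES organization (org_id)   -- relationships use the surrogate
);
""")

conn.execute("INSERT INTO organization (org_name, state_code, ein) VALUES (?, ?, ?)",
             ("Acme Widgets", "MI", "38-1234567"))

# The natural key simplifies look-up: name + state resolves to the surrogate key,
# and everything else in the model is navigated from there.
(org_id,) = conn.execute(
    "SELECT org_id FROM organization WHERE org_name = ? AND state_code = ?",
    ("Acme Widgets", "MI")).fetchone()
print(org_id)   # 1
```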

But it's important, I think, to keep in mind the reason why a natural key is desired: it is memorable (naturally) and it is unique. That only matters where the user enters into the data model -- it doesn't matter, and shouldn't be used, to navigate relationships within the model.

February 12, 2012

Big Data in the age of punched cards

In 1973, when I entered the computer field, the standard format for data was an 80-column punched card and a 132-character line printer. Data was forced to fit into those boundaries, and many creative solutions were crafted to make that happen.
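
For readers who never met a card layout: a minimal sketch, with invented field positions, of how a record was forced into 80 fixed columns. Nothing but the position of a character told you what it meant.

```python
# An invented 80-column card layout; every shop defined its own, on paper.
card = ("0012345" + "SMITH".ljust(20) + "JOHN".ljust(10) + "19730615" + "A").ljust(80)

record = {
    "account":    card[0:7],
    "last_name":  card[7:27].rstrip(),
    "first_name": card[27:37].rstrip(),
    "hire_date":  card[37:45],           # YYYYMMDD, squeezed into 8 columns
    "status":     card[45:46],
}
print(len(card), record)   # 80 {'account': '0012345', 'last_name': 'SMITH', ...}
```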

If we needed to store the data, we would put it on paper tape or magnetic tape -- 800 bpi, then high density (1600 bpi, 6250 bpi) tape. One day, we installed a disk drive -- 5 MB -- and could update single records without copying whole files.

Today, we talk about "Big Data" and propose massive reconstruction of our technical mindset to deal with this novel problem. "Big Data" -- in quotes and capitalized -- is the raging problem of the day.

Does this mean that "Big Data" is more a statement of what limits our technology (and our skills) have, than a statement of what is actually going on in the real world?

Continue reading "Big Data in the age of punched cards" »

January 23, 2012

"In today's information-driven economy"

My e-mail graced me with an invitation -- well, yes, a sales pitch -- to a webinar. I read on, despite the obvious sponsorship by CA Technologies advertising their ERwin product. And the opening sentence was:

"In today’s information-driven economy, data drives your business."

Now, I have held the belief for the last 30 years that "data drives your business". That, after all, is why I have been engaged in IT and focused on databases and data architecture, rather than on networking, server management, user interfaces, programming, security or any of the dozens of other disciplines within IT. At one time, before an audience in a small auditorium, I explained my interest in databases by saying simply "everyone has data".

So, what is it about "today's information-driven economy" that makes it different from the 1970's, 1980's or any previous decade?

Perhaps there is more discussion about data today. Perhaps there is more awareness. Perhaps there is more data.

That last point -- that there is more data -- is the underpinning of the current "Big Data" fad. Yes, I consider it to be a fad, not an "emerging technology" or some other major paradigm shift. "Big Data" holds that there is more data now than ever before, and as a consequence we need to attend more seminars, read more journal articles, buy more software and hire more consultants -- but only those rare (read, high-priced) consultants who understand "Big Data".

My advice is somewhat different, I think: don't.

Don't what? Well, don't rush into this without asking and answering some very, very fundamental questions. Here are a few to start with:
  1. What data are you using to drive your business today?
  2. What data are you ignoring today? Why?
  3. If you paid attention to that data, what would you learn?
  4. Are you ready to change your business, if the data tells you to?
  5. and my very favorite question, when talking about "Big Data": What is data? How is it different from "noise"?
Today's economy is information-driven, just as the economy has been for decades. And it is important to manage both the quality and quantity of data, to ensure that you're spending your time on data that actually matters. To this end, the threatened tsunami of massive volumes of data can't be blithely ignored. But, importantly, it can be intelligently ignored, if you have asked and answered the fundamental questions about data: what is it? does it matter?


While this write-up was triggered by an invitation to a webinar by CA Technologies, it is not an endorsement of CA or of ERwin.

May 22, 2006

Identity Exposure is an Architecture Failure

Today's software story is on the front page of the day's news: 

Monday, May 22, 2006; Posted: 5:46 p.m. EDT (21:46 GMT)

WASHINGTON (CNN) -- Personal information on 26.5 million veterans was stolen from the home of a data analyst in what appears to have been a random burglary, Veterans Affairs Secretary Jim Nicholson said Monday.

The computer records include names, Social Security numbers and dates of birth, Nicholson said. The Department of Veterans Affairs disclosed the theft Monday and said it has seen no indication that the information has been misused.

The analyst took the data home without authorization, Nicholson said. Department spokesman Matt Burns said the employee has been put on administrative leave while the investigation is conducted.

What makes this a story about software? Exactly this: Why did the software architecture permit this personal data to be available to anyone in the VA?

Continue reading "Identity Exposure is an Architecture Failure" »
