
July 01, 2013

First, Philosophy, then Data Science

One of the most telling articles on "Data Science" appeared in the NYTimes in April[1]. We are facing a massive shortage of data scientists, it read. "There will be almost half a million jobs in five years, and a shortage of up to 190,000 qualified data scientists."

Trouble is, the same article says, "Because data science is so new, universities are scrambling to define it and develop curriculums."

So -- we don't know what they are, but we need 500,000 of them.



October 11, 2012

What's the best answer?

In a recent dialog on a web forum, someone asked how to find the "best" way to construct an SQL query. They had posed a problem and received many solutions, each different in syntax, in approach, and in detail. Reasonably, they asked: "How do I find the 'best' solution?"

That's a question we should always be asking when we craft a solution to any problem. In IT, and in SQL particularly, I use these criteria to evaluate an answer:

(1) it must return the correct answer on the target platform
(2) it should use standard syntax and constructs
(3) it should be easy to understand by someone who didn't write it
(4) it should be as fast as necessary... it needn't be as fast as possible
(5) it should play well with others in a multi-user environment
(6) it must be ready-to-run on time and under budget

The "best" answer is the one which blends all of these. Notice that only two of these are "must", the others are "should".

Let's take these one by one:

(1) it must return the correct answer on the target platform

You would think this goes without saying, but an astonishing number of solutions don't solve the problem, or solve a different problem. Thankfully, that doesn't happen frequently, yet it happens often enough that you are well-advised to verify the result before you worry about whether the solution is a 'quality' solution or not.

It's important also to make sure the solution works on the target platform. Testing in a lab or a test-bed or in a database 'just like' the real one is no assurance that the solution works on the target. Why not? Usually because of the scale of the problem -- a table of tens of millions of rows partitioned over multiple servers can turn your simply elegant solution into an out-of-control monster. If scale is not the problem, data often is -- your test-bed data is carefully crafted and follows all of the rules, but the real-life platform may have outliers, offenders and other difficulties.

(2) it should use standard syntax and constructs

This is where the first 'religious' argument begins. On one side are those who argue that you should take best advantage of the platform-specific features afforded by the vendor. They might argue, further, that the vendor's ANSI SQL syntax is just a wrapper over its native implementation, and that the wrapper itself exacts a performance penalty.

The counter-argument is, of course, that the adoption of standard syntax and constructs enables the query to work with any tool, or any tool user -- including any SQL programmer. That provides long-term value and benefit, as tools and staff are replaced over time.
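For a concrete illustration -- entirely hypothetical, with table and column names of my own invention -- here is the same "latest five orders" request written both ways, assuming a SQL Server target. I hold the two renderings as Python strings for side-by-side comparison:

    # Two renderings of the same hypothetical query. Names are invented.

    # Vendor-specific form: SQL Server's TOP keyword. Concise, but it runs
    # only on that one platform.
    vendor_specific = """
        SELECT TOP 5 order_id, order_date, total
        FROM orders
        ORDER BY order_date DESC
    """

    # Standard form: the SQL:2008 FETCH FIRST clause. Slightly more verbose,
    # but portable to any conforming engine -- and any future tool or programmer.
    ansi_standard = """
        SELECT order_id, order_date, total
        FROM orders
        ORDER BY order_date DESC
        FETCH FIRST 5 ROWS ONLY
    """

Both return the same rows; only the second survives a change of vendor.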

Ultimately, this question resolves to whether you are taking a short-term or long-term view, whether the solution is a single-use item (to be replaced when needed) or it is the foundation for re-use (to be adopted and extended when needed).

Having been an eyewitness to decades of churn in both technology and staff, I am always an advocate for standards, reuse and long-term viability.

(3) it should be easy to understand by someone who didn't write it

By the same token -- taking the long-term view -- the syntax that is used, and the solution strategy that is adopted, need to be intelligible to anyone who might chance to come across it. In the immediate case, you will be obligated to demonstrate to someone -- the user, your team partner, a peer review -- that your solution works on paper, that it is a sensible approach, that it is safe and will cause no harmful side-effects. In the longer case, you may be called upon to debug it six months from now, or someone else may have to do so, when it suddenly stops working, when a user wants a small change to it, or when the vendor upgrades their platform. If the solution is not easily understood, then a minor maintenance task may turn into hours or days of experimentation.

Being easy to understand might mean adding appropriate documentation to the programmer's manual, or embedding inline comments at appropriate places in the code. My preference, however, is to use common and recognizable idioms, to avoid "magic tricks", to use verbose language and names, to organize the code visually, and to use standard syntax (see previous item).
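To see what that buys you, compare two renderings of the same invented aggregate -- one a "magic trick", one organized for the next reader (again sketched as Python strings, with hypothetical names):

    # The same hypothetical query, written twice.

    # Terse: correct, but the next maintainer pays the price.
    terse = (
        "select c.customer_id,sum(p.amount) from customer c,payment p "
        "where c.customer_id=p.customer_id group by c.customer_id"
    )

    # Readable: verbose names, explicit standard JOIN, visual organization.
    readable = """
        SELECT customer.customer_id,
               SUM(payment.amount) AS total_paid
        FROM customer
        JOIN payment
          ON payment.customer_id = customer.customer_id
        GROUP BY customer.customer_id
    """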

(4) it should be as fast as necessary... it needn't be as fast as possible

This is the religious argument, the sequel -- how fast is fast enough?

On the forum previously mentioned, the first answer predictably said: "In the world of databases, usually, the fastest solution is the best answer." Is it? My answer is 'No' (obviously -- or else there would only be one criterion here, not six).

Ask anyone who has tried to set a speed record -- on the Bonneville Salt Flats, for example -- and they will tell you that it takes some very heroic, and very expensive, effort to be "the fastest". Even there, the goal is never to be "the fastest" -- instead, these speed-seekers only try to be faster than the last guy -- to be as fast as necessary to take the first position.

Why not go for the fastest possible? Because it takes too long, costs too much, and carries too much risk. Instead, as fast as necessary is a reasonable -- though difficult -- approach.

In IT, there is an unhealthy, even wasteful, focus on speed. Granted, in some cases, speed is critical -- but these are the rare exception, not the rule. The rule is that your solution is fixing a real-life problem that has real goals and timelines and costs, and that probably has little to do with technology. It probably has a lot to do with the business' domain -- banking, health care, government, education, manufacturing, retail or some other concern.

When your attention is directed to getting the solution that is "the fastest", you are likely distracted from the business that you are there to serve. That business will tell you that a solution you think is "too slow" is "fast enough" to solve their problem. When they tell you that, stop working on the solution.

(5) it should play well with others in a multi-user environment

It's very common to hear of a solution to a poorly-behaved database query that begins with "Your database table is wrong. You need to create a new index/column/table. Then you can simply...".

And it is equally common for the poor programmer to reply: "I can't change the database" -- the DBA or the vendor or the legacy system has already decided how the database will be structured. Or, more likely, implementing a database change is a months-long effort involving planning by four or five organizations, refactoring many tools and programs, and planning a data migration. By contrast, the user needs a solution by Monday morning.

Playing well with others requires that you take notice that your little problem is a small part of a much larger, more complex system. Therefore, your solution must fit within its allotted space -- memory-wise, space-wise, processor-wise and network-wise -- within that system complex.

(6) it must be ready-to-run on time and under budget

For those of us who don't have the luxury of working with an unlimited budget, without a clock or calendar, who are not doing problem-solving just for the pure joy of it -- this last criterion might be the first, or only, criterion on the list.

Every problem that is presented to you comes with the questions: "When can you get this fixed?" and "How much will it cost?" We are fond of saying that anything is possible, it just takes time and money. This is where time and money have their way with your beautiful, elegant, all-encompassing, everlasting solution.

Someone is waiting -- impatiently -- for your solution. And someone else is waiting for that person to finish what they were asked for, so they can get on with their own job. You, the solution provider, the expert, are at the end of a long chain of people waiting to do their jobs.

I have never seen a solution that couldn't use just a few more tweaks -- that wouldn't be just a little better if we could make one more change.

But inevitably, the time and the money run out. Your solution must be ready and deliverable before that happens.

So -- how do you know which solution is the best solution? Make sure of these two things -- that the solution works and that you've spent no more time and money than you can afford. If you've met those criteria, then you can use the others to evaluate how much better one solution is than another. If you've got reasonable-to-high scores in all criteria, you can be comfortable knowing you've got as good an answer as you might expect... until the next time.


May 05, 2012

Why "Metaphysics"?

A very short blog entry this time. Read this article: Data Management Is Based on Philosophy, Not Science

My field of study at the university was philosophy (after a couple years of political theory and constitutional democracy). I still keep the works of Aristotle on my shelf. I've always thought philosophy was the perfect way to prepare for a career in information systems. I make that recommendation to parents asking what their computer-savvy child should major in. They always look at me like I've grown an extra eye!

But philosophy is the study of ... well, of everything, and metaphysics is the precursor to philosophy.

Hence -- software metaphysics -- the fundamental thinking that prepares the ground for all software.

Worth a read.

July 23, 2010

Software for Idiots

Tonight, I tried to install an HP Photosmart C4700 printer at home.

Now, I'm a fan of HP. They've brought me a lot of business over the last 8 years. I have HP PCs and laptops. I'm one of the few people in the world, apparently, with an HP TV.

So, I want this HP printer to replace the worn-out Lexmark. It's a nice printer, an all-in-one copy/scan/print machine. But the key (besides the low price) is that it's wireless. I need a network printer, and this little baby fits the bill.

Until, that is, I tried to install it. Everything works -- except the wireless. Hmmmm. I click on the "configure wireless connection" in the HP Solution Center, and my PC spins for a bit and then -- nothing.

Nothing. No light flashing. No message box. Nothing. It just starts up like it's going to do an install and then it quits.

Eventually, I figure out that, although I've installed all the software already, I need to have the original CD in the drive to configure the wireless. Huh? Try again.

This is better, but not by much. It starts the program and begins scanning for my Wi-Fi.

Okay, here's where I tell you that I am paranoid about my network. I take lots of Wi-Fi security measures. But the first Wi-Fi security measure is -- don't broadcast your network name (the SSID). So I don't.

HP seems to think that you would -- and should. So they search, and search, then show me some networks from the neighbors' houses.

I'll stop there, because here's my point -- every software package I get seems to have adopted the most obnoxious behaviors of AOL, Microsoft and others. And that is to assume that you (the consumer) are too ignorant to know whether you need help or not.

So software will scan every hard drive attached to your machine -- no matter how many terabytes -- looking for some software or configuration file. And it will make you wait until it's done. Sure, they could have just asked: "What program do you want to use?" or "What is the name of your network?" But that would assume an intelligent user. Software doesn't assume that. On the contrary, it assumes you are ignorant -- too ignorant to be allowed even the option of taking a shortcut.

When you write software like this -- and more and more vendors do -- you are making an architectural choice. You are choosing to treat your customer like he/she is an idiot.

The significance of that choice is profound. It seeps into all of your design decisions, all of your marketing decisions, all of your support decisions. And it certainly seeps into the friendliness or hostility of your product.

It is a simple thing to ask, when installing or configuring software -- do you know how to install this? do you know where your programs/files are? do you know where your network is? Maybe 90% of the customers will say "No, please help me". But some of your customers will say "Yes, thank you, my time is valuable. I'm installing this software because I'm trying to get something done. My goal is to get back to real work, so let me speed this process up. Please."
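In code, the choice is as small as one question asked up front. A hypothetical sketch in Python (my names, not HP's actual installer):

    from typing import Callable

    def find_network(scan_all_networks: Callable[[], str]) -> str:
        """Offer the knowledgeable user a shortcut; scan only as a fallback."""
        answer = input("Do you know the name of your network? [y/N] ")
        if answer.strip().lower().startswith("y"):
            return input("Network name (SSID): ")
        return scan_all_networks()  # only those who asked for help pay for the scan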

Part of Software Metaphysics' mission is to examine some principles that form the basis of how we think about software -- principles that we don't even realize we hold.

How we envision our users defines how we build our software.

Do yourself a favor -- think about users as pretty smart folks.


July 20, 2010

ETL 3.0

I have started a separate blog entitled "ETL 3.0". It is an exciting new way of looking at ETL: how it has evolved, and where it needs to go next.

I invite you to look at "ETL 3.0".

March 12, 2009

ETL Architecture - Core Principles

IT at the Speed of Business

The driving factor of the modern IT shop is to operate at the speed of business.

More than anything else, this means being able to respond rapidly to changes in the business climate. These changes come from the business units within the enterprise, from trading partners outside the enterprise and – for IT – from the continuous advancements of technology itself.

Whenever the business makes a new demand, IT must be prepared to satisfy that demand. It must do so rapidly. It must do so without major disruption. And it must be free to move forward when it needs to move forward.

The approach to technology that underlies a system contributes to the responsiveness of IT. And the core principles underlying the system’s architecture drive that approach.


Core Principles for ETL Architecture

It is important to give appropriate weight to the principles that drive the architectural foundation of the Extract, Transform and Load (ETL) system. These principles – listing the most important first – are:

  • Accuracy
  • Reliability
  • Flexibility
  • Extensibility
  • Autonomy
  • Cost of Ownership
  • Scalability
  • Speed

Three of these principles – Flexibility, Extensibility and Autonomy – have the greatest impact on IT’s ability to respond to changing business demand.

Flexibility is the key principle to guide all design, adoption and implementation choices. Flexibility means being able to adapt to forces of change, easily, swiftly and with minimum risk. Technology, products, the marketplace, or – especially – the business may impose necessary and beneficial change. Flexibility is essential to avoid “tear-up” when the inevitable changes occur.

Extensibility ranks right behind Flexibility. It is especially important when Flexibility must be compromised because of a limitation in a product or design choice. Extensibility means being able to take a product beyond its intended capabilities. This is the enabler of discovery and invention, two key elements of a vibrant IT organization. Extensibility allows you to overcome limitations, not with “workarounds”, but with solutions that are well-designed and architecturally sound.

Autonomy is the IT organization’s capacity for moving forward at its own pace. Autonomy is enabled by Flexibility, Extensibility and a skilled workforce. If an architected solution supports Autonomy, the IT organization can take an active role in creating what is needed, when it is needed, to support the specific business demand. The IT organization is not dependent on, or encumbered by, the ability or desire or timetable of vendors or markets.


The ETL Model

A model for this ETL architecture is as simple and complete as that shown in Figure 1.

[Figure 1. ETL Model -- three circles (Extract, Transform, Load) joined by two arrows]

EXTRACT is platform-specific. Its role is to optimally collect the data that needs to be shipped to the Transformer, including both the core data and its context. The Extractor may be consuming resources on a highly-active, highly-volatile system. It must be able to take advantage of platform-specific features, and deal with platform-specific limitations, in order to minimize disruption to the platform.

TRANSFORM is platform-agnostic. Its role is to mediate between two, sometimes conflicting, players: resolving the differences between the two, preserving (or adding to) the value of the data from the source extractor, normalizing it for general usage and delivering it to the target loader.

LOAD is platform-specific. Its role is to optimally organize and store the data that has been pulled from the Application platform and mediated by the Transformer. Like the Extractor, the Loader must be able to leverage the platform-specific features and avoid the platform-specific limitations of its host system.
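To make the model concrete, here is a minimal sketch in Python -- my own illustration, not code from any product. The three components are coupled only by a plain stream of records, so each knows nothing about its partners:

    from typing import Iterable, Iterator

    Record = dict  # the "standard interface": a stream of plain records

    def extract(source_rows: Iterable[tuple]) -> Iterator[Record]:
        """Platform-specific: knows the source layout, nothing of the target."""
        for row_id, payload in source_rows:
            yield {"id": row_id, "payload": payload}

    def transform(records: Iterable[Record]) -> Iterator[Record]:
        """Platform-agnostic: applies business rules to the records,
        whatever their physical origin or destination."""
        for rec in records:
            rec["payload"] = rec["payload"].strip().upper()  # a stand-in rule
            yield rec

    def load(records: Iterable[Record], target: list) -> None:
        """Platform-specific: knows the target layout, nothing of the source."""
        for rec in records:
            target.append((rec["id"], rec["payload"]))

    warehouse: list = []
    load(transform(extract([(1, " alpha "), (2, " beta ")])), warehouse)
    print(warehouse)  # [(1, 'ALPHA'), (2, 'BETA')]

Swap out any one piece -- a new source extractor, a different loader -- and the other two are untouched.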


Essential Characteristics

The essential characteristics of the system depicted in the ETL model are:

Encapsulation – Each component of the ETL processing is functionally encapsulated. This provides the degree of isolation that is required to allow each component to incorporate whatever optimizations are most appropriate to achieve its objective.

This encapsulation limits the component’s scope of awareness – it “knows” only about its own environment and (literally) knows nothing about its partners. For example, Load knows the details of its physical database design and implementation, but knows nothing about the source system, the transform engine or the business rules that moved the data from one form to another. Likewise, Extract knows the details of the source data and perhaps knows the details of the application that produced the data, but knows nothing about the target system (or systems).

Because its scope of awareness is constrained, each component is also unaffected by change to either of the other components. This constraint offers more opportunities for adaptation and flexibility as new sources, new targets and new business rules emerge.

Loose Coupling, Standard Interfaces – The components are loosely coupled – that is, they communicate only through standard, open interfaces. The rules of encapsulation require that each component knows its partner only through the coupling interface. Using loose coupling respects that encapsulation.

Loose coupling promotes extensibility. A standard interface permits the insertion of additional components which can add functionality to the standard model. For example, a “fan-out” requirement – in which a single transform feeds multiple load targets simultaneously – can be implemented by inserting a “one-to-many” distribution component between the transform and the loads. Each load remains encapsulated, unaware of its sibling loads. The transform remains encapsulated, unaware of the “fan-out”.
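Here is a sketch of that fan-out under the same illustrative assumptions (the names are mine): the distributor speaks the record-stream interface on both sides, so neither the transform feeding it nor the loads consuming it need to change.

    from itertools import tee
    from typing import Iterable, Iterator

    def fan_out(records: Iterable[dict], n: int) -> list[Iterator[dict]]:
        """One-to-many distributor: duplicate a single record stream into
        n independent streams; each consumer sees the full stream."""
        return list(tee(records, n))

    # A stand-in transform output and two stand-in load targets.
    transformed = iter([{"id": 1, "payload": "ALPHA"}, {"id": 2, "payload": "BETA"}])
    warehouse_a: list = []
    warehouse_b: list = []

    stream_a, stream_b = fan_out(transformed, 2)
    warehouse_a.extend((r["id"], r["payload"]) for r in stream_a)  # load one
    warehouse_b.extend((r["id"], r["payload"]) for r in stream_b)  # load two
    # Each load is unaware of its sibling; the transform is unaware of the fan-out.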

Likewise, using a standard interface promotes autonomy. Since the interface is non-proprietary, the IT organization can add functionality without waiting for the product vendor to incorporate that functionality into the product. This capability is essential in allowing IT to respond rapidly at its own pace to changing business demands.

Platform Awareness – The ETL model allows for platform-awareness. Because the platform-related components – Extract and Load – are encapsulated, they can freely take advantage of those features specific to their respective platforms. This allows special utilities, known only to the platform, to be used to their best advantage, for performance or other purposes.

The Transform component is not platform-related – its domain is the data itself. The rules of encapsulation and loose coupling dictate that the Transform component is unaware of the specific nature of the (physical) source or target. In this respect, the Transform is platform-agnostic. It must work with the data, regardless of its physical origin, applying the transformation rules imposed by the business requirements. The Transform component is therefore free to receive from any source and deliver to any target, trusting that its Extract and Load partners know what is to be done with the data.

It is important to encapsulate platform-specific characteristics, capabilities and requirements within the platform-specific processes. This allows those processes to flex or expand as they must to leverage the platform.

ETL Model

[Figure 1. ETL Model]

This simple model of an ETL application - "3 circles, 2 arrows" - is all that is needed to highlight the key architectural principle of Flexibility. Each component is separated from the other processing components by a "wall". They communicate with one another through a loose coupling based on an exchange of messages. This is the long-understood requester-server model, which is not adhered to often enough.

return to ETL Architecture - Core Principles


February 20, 2009

Architecture's Business Stakeholder

Whenever an IT architecture discussion begins, all the participants line up on different sides.

There are, of course, those who just look around puzzled, wondering "What architecture?" -- or, more frequently, "Why architecture?"

But those who call themselves "architects" quickly take a side:

  • Architecture is how all the machines are deployed, and what kind of network links the machines together
  • Architecture is separating machines by function: database server, web server, security server, application server, middle tier distributor
  • Architecture is how the vendor's product is distributed, scaled, and sped up

I go, first, to the more fundamental question -- the "technology-free" question -- of "What are we trying to accomplish, and why?"

It's the "Why?" that creates the greatest silence in a roomful of techies.

  • Why do we have a web site?
  • Why are we building a data warehouse?
  • Why are we creating a real-time EAI system?
  • Why is the data secured? Why isn't it?

Unless the answer to "Why?" is "because we're a technology research firm and this is what we study", then I would expect the people who write the check -- the Business people -- to offer up the only relevant answers. And they are rarely technical.

What would the business stakeholder say about the choice between EAI and real-time data warehouse?

How does the business stakeholder feel about Open Source?

As it turns out, even these geek debates have a business implication. It's a business problem specifically because, at some point, one solution will cost more than the other. That's dollars. And dollars are a business problem.

  • What is the cost of Open Source? How does that compare against a proprietary software suite, like Oracle, Microsoft or IBM?
  • What is the cost of a vendor-specific software solution? What is the "lost opportunity" cost of being unable to change from one vendor to another without a multi-million dollar rewrite of a working system?

There is -- almost -- always a business interest. All technical questions reduce, ultimately, to a business question, because they always come down to costs -- and costs are measured in business dollars.