
The impact of RFID on data warehousing
By Dan Linstedt
The amount of information that will be generated
by radio frequency identification (RFID) tags and other micro-devices
is on the verge of exploding. That leaves us with questions like
"What happens to data quality? What data should we capture, and
how often should we capture it? What about 'white noise'?"
While we can't address every issue regarding
the coming data avalanche, we can highlight some of the more "front
of mind" concerns surrounding RFID.
All about RFID
While a lot has been written about RFID, not a lot of thought
has been given to the information architecture necessary to support
the data streaming in from this technology. RFID tags are just
another data source, right? What does this mean?
The data from RFID tags can easily overwhelm
any interface in use today. Teradata, while the most capable of
all RDBMSs at handling this influx, currently doesn't have remote
filtration capabilities. Thus, today we can't code a business
rule in Teradata and then mark it for distribution to the RFID
listener. The ability to maintain a single point of business rules,
and then allow distributed processing of these business rules,
may soon become a necessity in order to support the data generated
by RFID interfaces. However, Teradata, as I note later in this
article, is best positioned to take on the challenge.
Let's step back for a minute. How does
RFID function?
From a simplistic definition standpoint, an RFID
tag consists of a transponder and an embedded silicon chip with
encoded data. The tag is placed on an object, and when the object
passes within range of an antenna broadcasting radio waves on
a specific frequency, the transponder "wakes up" and sends the
chip's data to a transceiver, sometimes over distances up to 20
feet.
What does the transceiver do?
The transceiver collects the data from each RFID tag, decodes
it and transmits it to a data store or central processing computer.
From there, the data can be analyzed and used according to specific
requirements.
What does this have to do with active data
warehousing?
Good question, because it is the crux of this article. There are
several areas to explore when discussing active data warehousing
and RFID technology. From an architectural perspective, we are
now faced with the following challenges:
- Data streaming constantly from massively parallel transponders
- Astronomical amounts of information if every single product or item is tagged
- The ability, or lack thereof, to tag at a granular level
- The need to filter incoming data
- Dynamic maintenance of the business rules, which must be capable of being distributed to the transponders for action at the point of collection
- Melding of hardware and software systems
- White noise and radio frequency interference
This is only the tip of the iceberg. There are
many other concerns to consider, including privacy, GPS locations,
bad data, compromised or damaged RFID tags and the distribution
of rules and filters. How do we decipher what data is meaningful
and what is meaningless?
"Distributed" active data warehouse
The result of all this RFID data activity is something I call
a "distributed" active data warehouse. There are massive sets
of data arriving in parallel streaming modes to the transponders,
and the data is distributed because the rules for filtration must
be running on the transponders themselves; otherwise, the active
warehouse would be overwhelmed.
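To make the idea concrete, here is a minimal sketch of a single set of filtration rules that is defined once and then shipped to each transponder for local execution. Everything in it is an assumption for illustration: the rule schema, the field names (`tag_id`, `signal_strength`), the `publish_rules` function and the `EdgeFilter` class are hypothetical, not part of Teradata or any RFID product.

```python
import json

# Hypothetical central rule set: the single point of maintenance.
# Schema and field names are illustrative assumptions.
CENTRAL_RULES = [
    {"field": "signal_strength", "op": "min", "value": -70},   # drop weak, noisy reads
    {"field": "tag_id", "op": "prefix", "value": "CANDY-"},    # keep only our own tags
]


def publish_rules(rules):
    """Serialize the central rules so they can be sent to a transponder."""
    return json.dumps(rules)


class EdgeFilter:
    """Runs on the transponder and applies whatever rules it was sent."""

    def __init__(self, rules_json):
        self.rules = json.loads(rules_json)

    def accepts(self, read):
        for rule in self.rules:
            value = read.get(rule["field"])
            if value is None:
                return False
            if rule["op"] == "min" and value < rule["value"]:
                return False
            if rule["op"] == "prefix" and not str(value).startswith(rule["value"]):
                return False
        return True


edge = EdgeFilter(publish_rules(CENTRAL_RULES))
incoming = [
    {"tag_id": "CANDY-001", "signal_strength": -55},
    {"tag_id": "CANDY-002", "signal_strength": -90},  # too weak, treated as noise
    {"tag_id": "OTHER-9", "signal_strength": -40},    # not one of our tags
]
forwarded = [read for read in incoming if edge.accepts(read)]
print(forwarded)
```

Only the first read would be forwarded to the warehouse. Because the rules travel as data, changing them in one place and republishing is enough to change what every transponder filters.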
Today, there may not be enough channels or fast
enough networks to connect the transponders directly to the data
warehouse itself. It's similar to the problem that disk-drive
manufacturers faced years ago. (That problem was partially solved
through fiber-optic connectivity.)
Two things that force changes to our architectures
and designs are latency and volume. RFIDs are active on both fronts.
Let's examine a hypothetical example to explore latency and volume.
Suppose we have a carton of candy bars, and
each candy bar wrapper is tagged with an RFID tag. Now assume
that the manufacturer has transponders at the plant, and the data
from the transponders begins streaming into a centralized data
warehouse the minute the candy bar is wrapped. Through the packaging
process the candy bars are put in boxes (20 at a time). The boxes
are then shrink-wrapped and put on a pallet for distribution.
Let's say 500 boxes fit on a pallet. Now from one pallet alone,
the transponders are receiving and transmitting data from 10,000
tags.
The questions that arise at this point might
be: How frequently do we "access" the RFIDs through the transponders?
Do we want 10,000 signals every second, every minute or less frequently?
If we have more than one transponder in the plant, how do we eliminate
duplicate signals?
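One possible answer to the duplicate question, sketched under stated assumptions: treat a tag as "already seen" for a fixed sampling window, no matter which transponder reported it. The `TagRead` record, its field names and the 60-second window are hypothetical choices, not something prescribed by RFID hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRead:
    tag_id: str       # identifier decoded from the RFID tag
    reader_id: str    # which transponder reported the read (hypothetical field)
    timestamp: float  # seconds since an arbitrary start


def dedupe_reads(reads, window_seconds=60.0):
    """Keep at most one read per tag per window, across all transponders."""
    last_kept = {}  # tag_id -> timestamp of the last read we kept
    kept = []
    for read in sorted(reads, key=lambda r: r.timestamp):
        previous = last_kept.get(read.tag_id)
        if previous is None or read.timestamp - previous >= window_seconds:
            kept.append(read)
            last_kept[read.tag_id] = read.timestamp
    return kept


reads = [
    TagRead("bar-0001", "dock-A", 0.0),
    TagRead("bar-0001", "dock-B", 0.4),   # same tag seen by a second transponder
    TagRead("bar-0002", "dock-A", 1.0),
    TagRead("bar-0001", "dock-A", 61.0),  # falls in the next sampling window
]
unique = dedupe_reads(reads)
print(len(unique))
```

The window length doubles as the answer to "how frequently": a longer window means fewer rows reach the warehouse, at the cost of coarser tracking.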
Lest we forget, it's all about business driving
technology, not the other way around. We have this hypothetical
example because the business feels the need to track all products
from inception to consumer, and possibly back again. So what's
important to the business user here, and how do we get the desired
answers?
From an active data warehouse perspective, 10,000
transactions per transponder every second is not too bad, considering
that most Enterprise Application Integration (EAI) tools run from
10,000 to 25,000 transactions per second (depending on the technology
and the performance and tuning done in the environment). But that's
just from one pallet. What happens when we have 100 pallets in
the room?
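The arithmetic is worth spelling out. The sketch below uses only the figures from the example (20 bars per box, 500 boxes per pallet, 100 pallets) and assumes, for simplicity, that every tag is read exactly once per polling interval with no duplicates.

```python
BARS_PER_BOX = 20
BOXES_PER_PALLET = 500
PALLETS_IN_ROOM = 100

tags_per_pallet = BARS_PER_BOX * BOXES_PER_PALLET
tags_in_room = tags_per_pallet * PALLETS_IN_ROOM


def reads_per_second(tag_count, polling_interval_seconds):
    """Average read rate if every tag is read once per polling interval."""
    return tag_count / polling_interval_seconds


print(f"tags per pallet: {tags_per_pallet:,}")
print(f"tags in room:    {tags_in_room:,}")
for label, interval in [("every second", 1), ("every minute", 60), ("every hour", 3600)]:
    rate = reads_per_second(tags_in_room, interval)
    print(f"polling {label}: {rate:,.0f} reads/sec")
```

Polled every second, the room produces 1,000,000 reads per second, 40 times the upper end of the 25,000 transactions per second cited above for EAI tools. Polled once a minute, the same room averages roughly 16,700 reads per second, which is back inside that range.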
This is just the active feed side of the data
warehouse. What if we stored all the incoming feeds (such as location
from global positioning devices, for instance)? In that case,
we would have a massive set of derived or computed items that
can be produced from each product, such as: time on a pallet,
time to ship, time in a stock room and time on the shelf.
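As a rough illustration of how such derived items could be computed, the sketch below takes a time-stamped list of stage changes for one tagged item and works out how long it spent in each stage. The stage names, the dates and the `dwell_times` function are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical event log for one tagged item: (stage, time the item entered it)
events = [
    ("pallet", datetime(2024, 3, 1, 8, 0)),
    ("truck", datetime(2024, 3, 1, 14, 0)),
    ("stock_room", datetime(2024, 3, 2, 9, 0)),
    ("shelf", datetime(2024, 3, 3, 9, 0)),
    ("sold", datetime(2024, 3, 5, 9, 0)),
]


def dwell_times(events):
    """Return the total time spent in each stage, keyed by stage name."""
    ordered = sorted(events, key=lambda event: event[1])
    durations = {}
    # Pair each event with the one that follows it; the gap is the dwell time.
    for (stage, entered), (_next_stage, left) in zip(ordered, ordered[1:]):
        durations[stage] = durations.get(stage, timedelta(0)) + (left - entered)
    return durations


durations = dwell_times(events)
for stage, spent in durations.items():
    print(stage, spent)
```

The final stage ("sold" here) has no following event, so it gets no duration. Note that the calculation depends entirely on reliable time stamps from every transponder along the way.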
What else can we derive from this information?
We can determine when the product might spoil, whether the vendor
is selling enough of a particular product and when product is
sold out. We might even predict when the product will sell out,
and ship more just in time to restock the shelves. I could go
on and on.
If we look at the technical aspects of implementing
the system, what should we consider?
From a Teradata standpoint, the pipes into the database must be
wide open and capable of handling massive parallelism. Alternatively,
we could change the rate and volume of information coming in through
the use of business rules.
Teradata might need an extra management component
to handle the registration of transponders as a data source. It
might also require a partnership with a business-rules vendor
for dynamic, data-driven business rules that can be deployed to
these transponders.
One thing is certain: not everyone will want
all the data all the time. In this light, a business interface
will be necessary to customize the amount and type of data revealed
to an end-user. Another technical component that is often overlooked
by database vendors is change data capture. In 2002, I wrote an
article for Teradata Magazine about EAI vs. ETL vs. RDBMS (http://www.teradata.com/t/go.aspx/?id=115378).
I suggested that these technologies would begin to merge together
and push on each other for functionality.
RFIDs drive this case home. Supporting them will require
the use of EAI rules to move the data through the transponders
and networks, the use of "transformation" (the T in ETL) in the
database and the power of the RDBMS to be massively parallel and
capable of handling high-speed, high-volume data feeds.
What happened to ETL?
Extracting from transponders will be a moot point unless storage
devices are actually built into the transponders themselves; perhaps
they become distributed operational data stores instead of dynamic
feeds. This might be required for fault tolerance and failover.
However, devices like network routers and hubs will need to become
"smart" and run filters to rid the network of bad feeds and undesirable
information. Data quality and change data capture will have to
operate on these distributed nodes.
Transformation will occur in two stages: on
the transponders themselves and on the warehouse receiving the
RFID transmission information. Time and date stamping will become
a central requirement of database processing. The ability
to be time/date and geo-location aware will become a competitive
advantage.
What happens if a transponder receives bad
data?
Bad data can be generated (theoretically) by a defective RFID
tag or an RFID virus (let's hope not). Transponders must have
change data capture logic programmed in, along with parallel authentication
devices to confirm that the data from the RFID is indeed bad. We
may want to capture this information and record the fact that
the RFID is bad so it can be replaced. We may even want to know
how to replace it and how to keep it from "infecting" other nearby
RFIDs. We can take a lesson from the credit card processing companies
here. In an active data warehouse, they have flags that signal
possible fraudulent activities. A similar rating system might
be employed to detect bad data from the RFIDs and to either re-program
them remotely or shut them down. Either way, the transponders
must be connected to an active data warehouse in order for these
decisions to be made.
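A rating system of this kind might look something like the sketch below. It is only an illustration: the validity rule (an eight-character hexadecimal payload), the thresholds and the function names are assumptions, and a real system would use whatever checks suit its tags.

```python
from collections import defaultdict

HEX_DIGITS = set("0123456789ABCDEF")


def is_valid(payload):
    """Hypothetical validity rule: payload must be 8 uppercase hex characters."""
    return len(payload) == 8 and all(char in HEX_DIGITS for char in payload)


def rate_tags(reads, threshold=0.5, min_reads=3):
    """Return {tag_id: failure_ratio} for tags that look defective.

    reads is an iterable of (tag_id, payload) pairs. A tag is flagged only
    after min_reads reads, so one garbled transmission does not condemn it.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tag_id, payload in reads:
        totals[tag_id] += 1
        if not is_valid(payload):
            failures[tag_id] += 1

    suspect = {}
    for tag_id, total in totals.items():
        ratio = failures[tag_id] / total
        if total >= min_reads and ratio >= threshold:
            suspect[tag_id] = ratio
    return suspect


reads = [
    ("T1", "0A1B2C3D"), ("T1", "0A1B2C3D"), ("T1", "0A1B2C3D"),
    ("T2", "0A1B2C3D"), ("T2", "ZZZZ"), ("T2", "????????"), ("T2", ""),
]
flagged = rate_tags(reads)
print(flagged)
```

Here tag T2 fails three of its four reads and is flagged for replacement, while T1 is left alone. The history needed to compute the ratio is exactly the kind of data that has to live somewhere central.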
What happens to Teradata?
Maybe a better question is what should happen to Teradata? When
we peek into the future, no one can say for sure, but here's what
I think may come about as only one result of RFID technology:
- Teradata will manage dynamic business rules
engines and contain a warehouse of information about the transponders,
their locations, their filtration devices, their throughput rates,
their success/failure points and a prediction of potential future
failure.
- Teradata will contain additional transformation rules that include
powerful time-, date- and geo-location-based comparisons.
- Teradata will generate cubes of space and time information to
resolve complex geo-spatial formulas. The answers will be fueled
by business questions like tracking, production speed, supplier
management, manufacturing time, etc.
- Teradata will disperse the business rules pertaining to white
noise and filtration to all the registered transponders. These
will be maintained in the single warehouse mentioned previously.
- Teradata will change the core engine technology, allowing parts
of the engine to run directly on the transponders themselves.
They may begin to operate in a grid fashion, with ODSs on each
of the transponders to manage the flow of information. Queries
will be executed against distributed transponders in parallel,
as well as at the data warehouse level.
- The lines of the ODS and the data warehouse will blur. Users
will no longer "think" about the question as being strategic or
tactical; instead, all the data will be available all the time.
- Teradata's core engine will be the EAI, ETL and ELT engine of
the future. These technologies will still be utilized, but only
to source data outside the transponder world.
- Teradata will extend its networking capabilities to allow
transponders to be parallel, fault-tolerant, and redundant.
These are just a few ideas that come to mind
for the Teradata engine when we consider the implementation and
application of RFIDs to data warehousing. Teradata may or may
not implement these suggestions; they are just logical extensions
to the technology at hand that Teradata is so well equipped to
handle. The other RDBMS engines have a long way to go to play
catch-up in this space, particularly when it comes to distributed
computing power, parallel everything, redundancy and fault tolerance.
In summary
Teradata is sitting smack-dab in the middle of the future. It
has the capacity to handle RFID technology and beyond. A few modifications
may be needed to accommodate a single version of the business
rules, but that single version will be essential as we move forward.
RFIDs, like any other technology, will bring
change: to our lives, to our data architectures, to our designs
and our implementations. We should not sit on the sidelines and
watch the technology go by.
Dan Linstedt, of Myers-Holum, Inc., can be reached at daniel.linstedt@MyersHolum.com.