Hadoop and the Data Warehouse: When to Use Which Q&A


 QUESTIONS & ANSWERS

From the April 20, 2013 webinar “Hadoop & the Data Warehouse: When to Use Which,” presented by Eric14 of Hortonworks and Stephen Brobst of Teradata.

 

Q:  Will we have access to a replay of the presentation?

A:  The replay version is available here.

 

Q:  What is the Twitter hashtag for this event and the ongoing dialogue?

A:  #EDWandHadoop

 

Q:  In the existing Hadoop implementations how is data level security achieved?

A:  In existing Hadoop deployments there are multiple methods for achieving security. Basic data security is achieved through Kerberos, and user authentication can be achieved through integration with user management systems such as LDAP. Additionally, a new open source project called Knox has been proposed to unify security across the Hadoop stack. It will provide a single point of access and user management, bringing uniformity to the security model.

 

Q:  Where can I get more information about eBay and Linkedin that were referenced in the webinar?

A:  Check out more about eBay and LinkedIn.

 

Q:  There is great demand in the data storage/retrieval/analytics arena, but data integration is still not catching up. Any insight?

A:  A great deal of work has been done in the data integration space to create extensions for Hadoop and big data technologies.  Every vendor has both a roadmap and delivered features.

 

Q:  Isn't Hadoop being held back by the lack of software applications that companies can use in conjunction with it to make the data they are storing useful? We really need to see a proliferation of big data-analysis applications.

A:  Yes, agreed. Much of what we are doing with SQL-H is to abstract the user/application and bring FULL SQL capabilities to the data in Hadoop. Teradata Aster is building more discovery analytic applications and visualizations, as well, which can leverage data in Hadoop and help unlock the value/insights.

 

Q:  How would Hadoop compare with NoSQL databases like Mongo and Cassandra?

A:  Generally speaking, Hadoop and NoSQL are complementary data management and analysis systems.  NoSQL database systems can either exchange data with Hadoop or use HDFS as a data store to read and write data.

 

Q:  What are the advantages of using Hadoop as the primary ETL staging area for a data warehouse rather than doing ETL in the warehouse platform?

A:  Staging ETL in Hadoop frees up data warehouse cycles for higher-value analytic use cases, leverages the right platform for the right job, and enables data integration processes to be deployed quickly for exploratory analytics.
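
As a rough sketch of that pattern in HiveQL (the paths, tables, and columns here are hypothetical), raw files are landed in HDFS, transformed on the Hadoop cluster, and only the refined result is handed to the warehouse:

    -- Raw, untyped files landed in HDFS are exposed as an external staging table
    CREATE EXTERNAL TABLE stg_orders (
      order_id STRING,
      order_ts STRING,
      amount   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/staging/orders/';

    -- The transform runs on Hadoop, freeing warehouse cycles; the refined table is
    -- what a bulk connector or SQL-H would then expose to the data warehouse
    CREATE TABLE refined_orders AS
    SELECT order_id,
           CAST(order_ts AS TIMESTAMP) AS order_ts,
           CAST(amount AS DOUBLE)      AS amount
    FROM stg_orders
    WHERE order_id IS NOT NULL;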

 

Q:  In what format does Hadoop store the data?

A:  Hadoop can store data in any format, including raw text files, binary formats, images, video, XML, and serialized objects. Certain file formats let Hadoop-related tools run considerably faster; for example, using the ORC file format with Hive can yield significant performance improvements.
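
For illustration, the storage format in Hive is just a clause on the table definition; the tables below are hypothetical, but the STORED AS syntax is standard HiveQL:

    -- Plain text storage: flexible, but every query re-parses the raw bytes
    CREATE TABLE page_views_text (user_id STRING, url STRING, view_ts STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Columnar ORC storage: the same logical schema, typically much faster to scan
    CREATE TABLE page_views_orc (user_id STRING, url STRING, view_ts STRING)
    STORED AS ORC;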

 

Q:  When would you perform Time Series analysis in Teradata using their features and when would you perform them in Hadoop?

A:  It depends on the latency requirements, the number of iterations you expect in the data discovery process, and the skill of the analyst.  Aster is best suited for time-series analysis, offering a SQL interface and low latency. The analysis can also be done in Teradata by an analyst with strong SQL skills, or in Hadoop by a data scientist, but Aster is the workload-specific best choice for iterative time-series path and pattern analysis.

 

Q:  In the context of Hadoop being financially viable, what about the operational overheads with respect to having good MapReduce programmers, the massive effort in terms of development, testing etc.  Any stats around savings after accounting for these overheads?

A:  To optimize the value between MPP RDBMS and Hadoop, a balance needs to be struck between the size of the data and the usage and development costs.  Hadoop provides low cost per terabyte, whereas MPP RDBMS provides superior economics the more the data is reused and the higher the frequency of access.

 

 

Q:  Is it a must to have knowledge of Java/C to work on Hadoop?

A:  A working knowledge of Java can be helpful if you are programming Hadoop directly with MapReduce or Pig, and when administering Hadoop at a low level. However, the ecosystem has done significant work to expand SQL-like tooling for Hadoop. For example, Hive is the de facto standard for querying data in Hadoop and has a SQL-like language called HiveQL. If you know SQL, you can write HiveQL queries without any knowledge of Java.
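
For example, a hypothetical HiveQL query (the page_views table and its columns are made up for illustration) reads just like SQL, with no Java involved:

    -- Top ten users by page views on a given day
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2013-04-20'
    GROUP BY user_id
    ORDER BY views DESC
    LIMIT 10;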

 

 

Q:  What kinds of frameworks exist for cataloging all the data stored forever on the HDFS?

A:  There are many frameworks for cataloging data within Hadoop.  For instance, you can now use HCatalog to share table structures across MapReduce, Pig, and Hive.
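
As a hedged sketch of that sharing (the table below is hypothetical): a table defined once in Hive is registered in the metastore, and HCatalog exposes the same definition to Pig and MapReduce jobs so they do not have to redeclare the schema:

    -- Defined once in Hive / HCatalog...
    CREATE TABLE clickstream (user_id STRING, url STRING, click_ts STRING)
    STORED AS ORC;

    -- ...other tools can then read this same table definition through HCatalog,
    -- for example a Pig script using HCatLoader, without respecifying the columns.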

 

Q:  Can you clarify the Agile methodology thread with respect to Hadoop? Are you implying that RDBMS BI projects can't adopt Agile and therefore increase the cost of a project?

A:  We are not implying that “Agile methodologies” do not work with RDBMS BI projects.  Agile development methodologies have been implemented very successfully by numerous Teradata accounts when developing projects in an MPP RDBMS, both to lower costs and to accelerate time to value.  The point Stephen Brobst was making is that new big data technologies, such as Hadoop, provide even greater options for agility (lower-case agile) when an organization is in the process of discovering new data signals or transformation rules.  This is due mostly to the fact that Hadoop allows an organization to ingest and store more data economically, and allows end users to experiment with applying different forms of structure to the data, which can then either be discarded or added to a more formal development sprint.  Stephen also used the term agility to make the point that, due to Hadoop's late-binding approach to processing data, data scientists are not locked into a pre-defined schema and can alter definitions by applying new rules against the raw data.  In some cases, this agility provides significant value.

 

Q:  Given that the big data platform would store "all data forever", I still do not understand how this is economically better than storing only the needed data. Can you please clarify by giving an example?

A:  The difference is that you don't necessarily know what data is "needed" when dealing with Big Data. Once you discard a piece of data, you cannot go back and ask new questions of it, such as trying newly developed algorithms or combining it with newly available data. By making it possible to store all of the data for an extended period of time, you always have the flexibility to go back and ask new questions in new ways.  Also, having complete historical context allows you to answer temporal questions in ways you never could before.

 

Q:  I'm used to relational data, and I'm not sure I even have any "non relational" data lying around here.  Can I take some relational data and throw it into a Hadoop environment to begin learning how Hadoop works?

A:  Yes, absolutely. You can export tables of data from a relational system and begin to play around with them. A great place to start is the Hortonworks Sandbox, a single-node virtual machine that includes an easy-to-use interface and tutorials that step you through Hadoop basics.  As for "unstructured" content: ultimately, you cannot analyze data without structure.  One of the most interesting features of Hadoop is that you need a schema only on read, not on write.  You can throw data into Hadoop without concern for schema and then apply a schema when you extract it.
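
A minimal sketch of that schema-on-read pattern, assuming a CSV export of a relational table (the file path, table, and columns are illustrative): the files go into HDFS untouched, and the structure is declared only when you want to query them:

    -- Files are copied into HDFS as-is, e.g. with: hadoop fs -put customers.csv /data/customers/

    -- The schema is applied at read time by an external table; the files are not rewritten
    CREATE EXTERNAL TABLE customers (
      customer_id STRING,
      name        STRING,
      city        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/customers/';

    -- Query immediately; no load step is required
    SELECT city, COUNT(*) AS customer_count
    FROM customers
    GROUP BY city;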

 

Q:  Since Hadoop is not as efficient with query optimization as an RDBMS, why should I use a BI tool against Hadoop?

A:  The main reason to use a BI tool against Hadoop is that it enables you to query data types that you would not traditionally load into a data warehouse or database, and to combine data types that you wouldn't traditionally use together.  So although performance and responsiveness will likely be slower than users are accustomed to, the ability to execute queries that can't be run in other systems provides significant value.

 

Q:  Are you saying that since Hadoop mainly leverages commodity hardware, there will be a net economic gain storing big data over the RDBMS approach?

A:  There are two key economic advantages with Hadoop: free open source software and commodity hardware.  This results in the lowest cost per terabyte when compared to MPP RDBMSs like Teradata.  The economic advantages of an MPP RDBMS come from resource optimization, performance, reliability, and manageability.  The more the data is reused, the more it makes sense to manage it in an MPP RDBMS, whereas less frequently accessed data is best placed in Hadoop, where the management infrastructure of an MPP RDBMS would exceed the value of the incremental insights the data provides.

 

Q:  Our general business users have access to the Teradata Data Warehouse.  Is this the same audience that will have access to Hadoop data, and what training will they need to access that data?

A:  If you use Teradata SQL-H, you can give them access to data in Hadoop without additional training. You will likely need services or someone knowledgeable in Hadoop to do the initial implementation and to administer HCatalog/Hive over time; this can be the DBA, with some training.

 

Q:  What are the current methods to better integrate data between Hadoop and Teradata?

A:  For a brief overview on Hadoop and Teradata integration, check out this page.  For a more detailed look, check out this whitepaper.

 

Q:  What does Google use for their BIG Data?

A:  Google has its own internally developed systems. Apache Hadoop has its genesis in two papers released by Google, on MapReduce and the Google File System. Google uses derivatives of these two capabilities internally, in addition to a wide range of other technologies, of course.

 

 

Q:  What do you mean by Iterative versus Batch Analysis when comparing MPP RDBMS and Hadoop?

A:  We mean low-latency, fast lookups, interactive queries with fast response time and fast development of analytic functions within the discovery process.

 

Q:  I thought all lookups and queries have to be re-written into a map/reduce function structure.

A:  Many Hadoop-related projects "compile" their code into MapReduce jobs that execute natively against the data. However, an increasing number of systems execute directly against HDFS, meaning they bypass the MapReduce access path and pull the data directly into their own execution engines.

 

Q:  How are they solving the single point of failure with Hadoop?

A:  There are multiple strategies for mitigating risk around the NameNode in Hadoop, including cold and hot copies of the NameNode and a shared file system such as NFS. Which one you use depends on your tolerance for the time required to restart the cluster. Additionally, the community has developed built-in failover mechanisms for the NameNode as part of Hadoop 2.0; that code will soon be released as production ready. It should also be said that the current instantiation of the NameNode has proven to be extremely reliable in production.

 

Q:  Can you share a couple of use cases where it would make sense to use Hadoop and Teradata technologies in combination?

A:  Hadoop can process text, voice records from call centers, documents, and other "non-relational" data very efficiently.  Call center information about customer interactions, insurance claims with adjuster notes, and the like can then be joined with other customer data in the EDW to gain deeper and broader insight into customer interactions, sentiment, and profitability.  Another use case would be capturing a large data set in Hadoop and then refining it so that the high-value, frequently accessed data is placed in the relational system.

 

Q:  Is multi-tenancy supported within a Hadoop cluster?

A:  In Hadoop 2.0, multi-tenancy is native to YARN, which will be a native component of Hadoop.  Today, Hadoop can already process multiple jobs from multiple submitters.

 

Q:  With respect to evolving schema: when the values of a structured attribute change over time, how will such cases be handled in Hadoop?  What would be the overhead of handling this in Hadoop compared to a more structured approach like Teradata?

A:  To be clear, "evolving schema" refers to the problem of evolving a database schema to adapt it to change in the modeled reality.  Evolving schema would not be applied to a case where new values are created for an anticipated column (e.g. a new product code is created), as that is easily handled by any RDMBS.  Evolving schema applies to changes in the underlying data model - such as creating a new value to be captured in a website.  Traditionally, that would require a new column to be added and ETL scripts to be created, which breaks down when the velocity of the schema evolves at a rate that makes such work impracticable.  Big data technologies, such as Hadoop and Aster, use modeling techniques such as Name-Value-Pairs, where the new values introduced into the data platform are self referenced.  In order to then query those values, the overhead comes in the form of late-binding, whereby the structure is rendered at run time rather than load time.

 

Q:  What about Cloudera Impala, Shark, Apache Drill and the Stinger Initiative?

A:  There is a lot of buzz and interest around providing more interactive query capability on Hadoop. Different organizations have taken different strategies for providing it, and Impala, Shark, Drill, and Stinger represent some of those efforts. Each has its merits in areas such as supportability, architecture, and cost, and you should thoroughly investigate each against your specific needs and requirements.

 

Q:  How is Information Lifecycle Management (ILM) implemented on Hadoop?

A:  ILM in Hadoop is at an early stage but progressing rapidly. A new Apache project called Falcon has just been proposed to introduce ILM capabilities into Hadoop. Falcon is an open source project that enables you to set data policies, for example to move data from one cluster to another.

 

Q:  When comparing Hadoop to MPP RDBMS in terms of security, Hadoop is listed as "N/A".  What kind of security is currently available for Hadoop, and where do you see Hadoop security moving in the future?

A:  To be clear, the category was “fine-grained security.”  Eric14 called out specific security features that are available only in an RDBMS, as well as others that are much more advanced there.

 

Today, basic data security in existing Hadoop deployments is achieved through Kerberos, and user authentication can be handled through integration with user management systems such as LDAP. Looking ahead, a new open source project called Knox has been proposed to unify security across the Hadoop stack; it will provide a single point of access and user management, bringing uniformity to the security model.

 

Q:  Is there really no "vendor lock-in" for Hadoop?  We recently switched from a Cloudera distro to Hortonworks, and infrastructure personnel have been working on the conversion for over a month. This may not apply to all development, but it is still an effort.

A:  There are no proprietary components in the Hortonworks Data Platform. Moreover, all of the projects are Apache projects, meaning they are open to contributions from any developer. Competing distributions use Apache-licensed projects that are not Apache-governed, meaning those vendors solely control what happens in those projects. So yes, there really is no vendor lock-in with HDP.

 

Q:  How do you plan to provide perimeter security within Hadoop?  Do you have some kind of gateway that provides that?

A:  Yes. The proposed open source Knox project is aimed at exactly that: it acts as a gateway providing a single point of access and user management for the Hadoop stack, bringing uniformity to the security model. Beyond the perimeter, basic data security is achieved through Kerberos, and user authentication can be achieved through integration with user management systems such as LDAP.

 

Q:  I like the Big Data capabilities of Hadoop but MapReduce is not adequate for random access queries. Does Teradata have some real time access to the HDFS data?

A:  Yes; through SQL-H we go directly against HDFS, without the MapReduce overhead of Hive.

 

Q:  Are there any limitations to the type of files used on the Hadoop file system? For example, does Hadoop support mainframe based datasets in EBCDIC format and COMP-3 data types or files with records in different layouts?

A:  You can load any file type into HDFS.

 

 

Q:  We are using the Windows Azure cloud, so I would be interested in the integration of Teradata with a Windows Azure cloud application.

A:  You can connect to Teradata via ODBC or JDBC and VPN, or do a cloud-to-cloud connection with Teradata Cloud Services on a designated IP address.

 

Q:  How do I make the walk from a very rigid structured data warehouse model into the Hadoop operating model?

A:  If the question is about support for an actual data warehouse data model, Hadoop can support table constructs via Hive, HBase, or Cassandra, to name just three examples; the model is not the issue. More significant is the logic that must be supported on the Hadoop platform, both in terms of the complexity of SQL logic that is feasible and its performance characteristics. SQL logic from a traditional database may need to be redesigned to function properly within a Hadoop environment, and it will have different performance characteristics that may need to be addressed for a production implementation. Other potential challenges to consider are security and workload management; these are well established in traditional databases and much less so in Hadoop. Keep in mind that Hadoop is a maturing platform, and some functionality that is taken for granted in traditional databases (locking, for example) may not be available in the current Hadoop-resident services.

 

Q:  How about leveraging reduced cost of storage in Hadoop for retention of structured data? Like keeping current year in EDW and old data in HDFS.

A:  Yes, we have customers that do that today. Perfect use case for SQL-H to provide transparent access as needed.

 

Q:  Can we implement a complete data warehouse infrastructure on Hadoop which can replace the current MPP RDBMS based infrastructure?

A:  Analytical relational databases were created for rapid access to large data sets by many concurrent users. Typical analytical databases support SQL and connectivity to a large ecosystem of analysis tools. They efficiently combine complex data sets, automate data partitioning and index techniques, and provide complex analytics on structured data. They also offer security, workload management, and service-level guarantees on top of a relational store. Thus, the database abstracts the user from the mundane tasks of partitioning data and optimizing query performance. 

 

Since Hadoop is founded on a distributed file system and not a relational database, it removes the requirement of data schema. Unfortunately, Hadoop also eliminates the benefits of an analytical relational database, such as interactive data access and a broad ecosystem of SQL compatible tools. Integrating the best parts of Hadoop with the benefits of analytical relational databases is the optimum solution for a big data analytics architecture.

 

Q:  We are looking for real-time analytics, and I am wondering how Hadoop will fit into real-time data mining instead of batch processing?

A:  The term "real-time analytics" is an overloaded marketing phrase for sure. Both Hadoop and MPP RDBMSs support loading data in near-real-time.  Complex Event Processing (or CEP) is not native to either MPP RDBMS or Hadoop, and provides near-real-time decisioning capabilities, but certainly isn’t “data mining.” It is a workload-specific platform optimized to handle certain data analysis use cases for streaming data and other cases. This is why we advocate the Teradata Unified Data Architecture which combines the best of each workload-specific platform into a cohesive and transparent analytics architecture, essentially providing the right tool for each job.  If the use case requires multiple iterations of the data mining algorithm, and performance is a key driver of value, then MPP RDBMSs, or specifically a Discovery Platform such as Teradata Aster, have superior capabilities with respect to low latency.  Both MPP RDBMSs and Hadoop support sub-second delivery of resulting insights from batch data mining processing through Active Data Warehousing and HBase respectively.  

 

Q:  How is ETL best done between Hadoop and the Data Warehouse?

A:  Through several connectors. See Teradata Enterprise Access for Hadoop on Teradata.com. Specifically, check out SQL-H, Teradata Studio, and Bulk Connectors.

 

Q:  Can you share steps for getting started with Hadoop? Is there any way to explore it on a small scale before making a big investment in Hadoop?

A:  Yes, try the Hortonworks Sandbox, a single node VM with tutorials and videos to help you get started. It's on the Hortonworks website.

 

Q:  What front-end tools are available for use with Hadoop that are comparable to Cognos, Microstrategy, etc?

A:  MicroStrategy and Tableau have product integration features for Hadoop, but that’s not to say they work the same as when deployed against a relational database that supports SQL.  Nearly all of the major (and minor, for that matter) BI vendors have announced connectivity to Apache Hadoop via the Hive ODBC driver. There are also natively built tools from Datameer and Karmasphere, as well as newer tools from companies like Platfora.

 

 

Q:  If Open Source is best for Hadoop, should that same approach also be applied to EDW? If not, why not?

A:  One of the core value propositions of state-of-the-art enterprise data warehouses, as discussed by Eric14 of Hortonworks, is the "tight vertical integration" of resources, such that resource management and utilization, parallelism, and consistency of service are optimized.  The sweet spot for open source is projects where highly independent features can be spread out among a vast community of developers.

 

Q:  How does Hadoop compare with QlikView?

A:  QlikView is a BI tool, and the company is a Teradata partner.

 

Q:  What is the main difference between Cloudera distribution & Hortonworks distribution? Is there any advantage of one over the other?

A:  The high-level differences come down to market approach. Hortonworks has a 100% open source strategy, meaning there are no proprietary holdbacks or offerings, which frees the organization from vendor lock-in. Cloudera uses an open-core model, in which the core components are given away under an open source license but several key pieces are licensed as proprietary software. For example, Cloudera's management console is proprietary software that requires a license agreement, whereas Hortonworks uses the open Apache Ambari project for monitoring and management.

 

Q:  Where can we find the best detailed use cases for Hadoop?

A:  There is a range of sources for Hadoop use cases. We suggest starting with vendor web pages (www.hortonworks.com, among others). You should also review the patterns-of-use white paper on the Hortonworks resources site to get an understanding of general use cases for Hadoop.

 

Q:  Since Hadoop is not relational, can you give an example of how Hadoop analysis RELATES data points?

A:  It is true that Hadoop does not use a relational model to organize data; however, it is possible to relate data together. In particular, you can write jobs in MapReduce or Pig that associate data in different tables to find and analyze relationships. You can also use tools like Hive, which provides a SQL-like language for analyzing data in Hadoop. Because of the way Hadoop works, these operations will likely be slower than similar queries in a relational system, but you can also join data in ways that are not practical in those systems.
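
For example, a hedged HiveQL sketch (the tables and columns are illustrative) that relates a web log to a customer reference set with an ordinary join:

    -- Relate two data sets on a shared key, much as a relational join would
    SELECT c.segment, COUNT(*) AS events
    FROM web_logs w
    JOIN customers c
      ON w.customer_id = c.customer_id
    GROUP BY c.segment;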

 

Q:  I would like to know how Hadoop deals with legal hold issues for data.

A:  Please send an inquiry through Hortonworks’ website and we would be glad to discuss strategies for securing data with you.

 

Q:  What is to stop Teradata, Oracle and IBM (and others) from extending functionality to include ability to ingest big data?

A:  Analytic environments for the enterprise should ideally have coexisting technologies that exploit the assets of both approaches. That has certainly been Teradata’s methodology with the Teradata Unified Data Architecture, which leverages the Teradata Integrated Data Warehouse in concert with Teradata Aster to analyze and discover patterns and Hadoop for loading, storing and refining data.

 

Q:  I heard that Hortonworks is creating a Hadoop distribution for Windows. Other companies have moved their MPP products from UNIX/Linux to Windows only to satisfy two or three customers. Why does Hortonworks want to go down this path?

A:  It turns out that Windows Server has more than 70% of the data center market, and many organizations want a 100% Windows-native stack. Additionally, there is a tremendous Windows developer and partner ecosystem, and by putting Hadoop on Windows we are further expanding and enabling the community.