Saturday, January 25, 2014

Choosing MySQL or Oracle for your Hadoop Repositories

When setting up a Hadoop cluster the administration team has to decide which relational databases to use for the Hadoop metadata repositories.  I strongly recommend that one type of relational database be used for all the repositories instead of using different database vendors for different frameworks.

Hadoop requires metadata repositories (relational databases) for Ambari (management),  HiveServer2 (SQL), Oozie (scheduler and workflow tool) and Hue (Hadoop UI).  Choices include Postgres, MySQL, Oracle or derby.  The databases holding the Hadoop metadata repositories have to be backed up and maintained like any other database server.

I recommend using MySQL for the following reasons:
  • Oracle is too heavyweight of a database server that it's full resources will not be utilized.  The Oracle database server will take extra memory, disk space and CPU that will not be taken advantage of.
  • Postgres is a good solid database but it has no tipping point.   I do not see a lot of Postgres databases when I go to customers and I do not see Postgres increasing in the market.
  • Derby  (used with Ozzie) and SQLite (used with Hue) are not robust enough to be used in a heavy production environment.  I would only use these databases if I was going to create a small Hadoop cluster for personal development.
MySQL has a lot of features that make it ideal as a database repository for different Hadoop frameworks.  They include:
  • Extremely fast and lightweight.
  • Relatively easy to administer and backup.
  • Replication is very easy to set up and maintain.
  • MySQL has extremely high adoption and it is easy to find resources to manage it.
If Oracle is the corporate standard and the database and Hadoop administration team prefer to use Oracle, I have provided links for setting up Oracle for the primary Hadoop frameworks.



No comments:

Post a Comment