neodata686 All American 11577 Posts |
So my company is moving from a Postgres/Greenplum infrastructure to Hadoop over the next 6-8 months. Very excited about this but it's a learning process for everyone in the company.
Saw there are no threads on the topic. Anyone have any exposure to HDFS, MapReduce, etc.? We've chosen Hortonworks as our vendor and I've been going through some of the tutorials on Hive, Pig, etc. 1/17/2014 3:45:41 PM |
0EPII1 All American 42550 Posts |
Quite a few MSA graduates here. They have done it in their program. 1/17/2014 3:49:13 PM |
neodata686 All American 11577 Posts |
Neato. Yeah, my company provides an analytics software platform with a lot of big hosted data, so it's a big leap for us in terms of processing power.
I've been interested in going back to school. Unfortunately I moved to Denver, so it would have to be something local. 1/17/2014 3:52:12 PM |
smoothcrim Universal Magnetic! 18968 Posts |
you should look into EMR if you want to do this cost-effectively. hadoop on your own hardware in this day is hard to justify unless you're doing steady state jobs 24x7. even then, it's still going to be hard without significant scale. http://aws.amazon.com/elasticmapreduce/ http://www.bigdatahpc.com 1/17/2014 4:13:34 PM |
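For anyone curious what the EMR route looks like in practice, here is a minimal sketch of launching a transient cluster with boto3; the cluster name, instance types, and S3 log bucket are made-up placeholders, not anyone's real setup.

```python
# Hypothetical sketch: launch a transient EMR cluster that runs its steps and
# then shuts itself down, so you only pay while the job is running.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-analytics",                    # placeholder job name
    ReleaseLabel="emr-4.7.0",                    # any release bundling Hadoop/Hive/Pig
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Pig"}],
    LogUri="s3://example-emr-logs/",             # placeholder log bucket
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,    # terminate once the steps finish
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

Steps (MapReduce jars, Hive or Pig scripts) can be passed in the same call, which is what makes the pay-per-job model attractive for bursty workloads.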
neodata686 All American 11577 Posts |
Quote : | "hadoop on your own hardware in this day is hard to justify unless you're doing steady state jobs 24x7." |
We already have the hardware infrastructure, nodes, etc., because the underlying architecture is very similar to Greenplum in regard to nodes, distribution, and so on. Not to mention we have many of the top Fortune 50 companies as clients, and data integrity and security are always a concern. Much simpler to host our own data or have our clients host it themselves. 1/17/2014 4:20:57 PM |
smoothcrim Universal Magnetic! 18968 Posts |
I was suggesting you host it in Amazon... 1/17/2014 4:50:49 PM |
neodata686 All American 11577 Posts |
That's my point. We already have the infrastructure and it's harder to get contracts where data is hosted elsewhere. From what I understand it makes more sense to host it ourselves. 1/17/2014 6:02:38 PM |
Noen All American 31346 Posts |
^Definitely, and if he WERE going to go with a public cloud infrastructure for hadoop, he would be using http://www.windowsazure.com/en-us/solutions/big-data/ anyway
Back to the OP: yes, the product I design (http://blogs.msdn.com/b/visualstudioalm/archive/2013/11/13/announcing-application-insights-preview.aspx) has been building on Hadoop. The biggest continual problem is finding the happy medium between data latency and compute cost. We want to deliver data as close to realtime as possible, but that starts costing insane $$$ once you hit certain thresholds.
1/17/2014 8:43:50 PM |
Tarun almost 11687 Posts |
I am interested in learning about Hadoop too. Any good tutorials/online courses out there? I looked into the MSA program a few years ago but cannot afford to go back to school full time. 1/21/2014 9:21:45 AM |
Tarun almost 11687 Posts |
I know not a lot of you are in the DC area, but I'm still posting it in case anyone is interested:
IBM Big Data Developer Day
https://www-950.ibm.com/events/wwe/grp/grp004.nsf/v17_agenda?openform&seminar=FDDQVFES&locale=en_US 1/27/2014 8:48:09 AM |
neodata686 All American 11577 Posts |
^^I've been going through these:
http://hortonworks.com/tutorials/
They're pretty good for a basic understanding of the different components of Hadoop. 1/27/2014 11:01:25 AM |
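Those tutorials start with the usual word count; for a feel of what MapReduce looks like without writing Java, here is a minimal Hadoop Streaming version in Python (file names and the streaming jar path below are assumptions and vary by distribution).

```python
#!/usr/bin/env python
# mapper.py -- reads raw text on stdin, emits "<word>\t1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums counts per word; Hadoop sorts by key before the reduce
# phase, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

It would run with something like `hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`; the exact jar path differs between Hortonworks, Cloudera, and vanilla Hadoop.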
y0willy0 All American 7863 Posts |
I read this as hard poop 1/28/2014 10:59:12 AM |
neodata686 All American 11577 Posts |
Update here. So we're advancing with both Cloudera and Hortonworks. Once we complete our Hadoop data lake and fully convert our software over to Hadoop, we're told we'll have the largest Hadoop lake that either distributor is helping deploy/support. Pretty neato! 3/20/2015 4:44:57 PM |
neodata686 All American 11577 Posts |
In Cloudera training this week! Woohoo. 4/27/2015 12:51:33 PM |
CaelNCSU All American 7132 Posts |
We had a sales pitch/training for Amazon Kinesis/Redshift and EMR. They have ways to run Hive and/or Pig directly on S3 or DynamoDB.
I'm rewriting some of our analytics ranking algorithms and the last step will be to use one of those platforms. 4/30/2015 10:29:09 AM |
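For what it's worth, the "Hive directly on S3" part mostly comes down to pointing an external table at a bucket. A hedged sketch follows: the bucket, schema, and cluster id are invented, and the boto3 step assumes a newer EMR release that ships command-runner.jar.

```python
# Hypothetical DDL: an external Hive table whose data lives in S3, so queries
# run against the bucket without copying anything into HDFS first.
HIVE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
    user_id   STRING,
    url       STRING,
    event_ts  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-analytics-bucket/clickstream/';
"""

# Assuming the DDL above has been saved to S3 as clickstream.hql, one way to
# run it is as a Hive step on an already-running EMR cluster (id is a placeholder).
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",
    Steps=[{
        "Name": "create-clickstream-table",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://example-analytics-bucket/scripts/clickstream.hql"],
        },
    }],
)
```

Pig can read s3:// paths in much the same way; the DynamoDB integration goes through a Hive storage handler rather than a LOCATION clause.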
smoothcrim Universal Magnetic! 18968 Posts |
kinesis + spark is the new realtime hotness imo https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
then pass the data to s3 for later redshift ingest or EMR processing 4/30/2015 12:51:17 PM |
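A bare-bones PySpark sketch of that flow, loosely following the linked integration guide: the app/stream names, region, and S3 prefix are placeholders, and the spark-streaming-kinesis-asl package has to be on the classpath (e.g. via --packages).

```python
# Hypothetical Kinesis -> Spark Streaming -> S3 pipeline; all names are placeholders.
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-to-s3")
ssc = StreamingContext(sc, batchDuration=60)       # one-minute micro-batches

records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-to-s3",                # also names the DynamoDB checkpoint table
    streamName="example-events",                   # placeholder Kinesis stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=60,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

# Land each micro-batch in S3 as plain text for later Redshift COPY or EMR jobs.
records.saveAsTextFiles("s3n://example-bucket/kinesis-batches/part")

ssc.start()
ssc.awaitTermination()
```

From there a scheduled Redshift COPY from that S3 prefix (or an EMR batch job over it) picks the data up on whatever cadence the latency budget allows.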