You don't have big data (mongohq.com)
99 points by liz_mongohq on Feb 6, 2014 | 48 comments


> “Big Data” data tends to be cold data, that is, data that you aren’t actively accessing and, apart from analyzing it, probably never will. In fact, apart from analysis, it could be regarded as frozen. It may be fed with fresh rapidly cooling records and the cooling records analyzed for up-to-date analysis, but the “Big Data” pool should be at least conceptually separated from the live data; mingling the two’s requirements can easily end up in an unsatisfactory lowest-common-capability situation where neither is optimal.

So in other words, “Big Data” is what used to be known, somewhat less sexily, as “data warehousing”?


I think of data warehousing as being concerned with what the article might call "formerly hot" data. That is to say that it deals with data that was used in critical business functions at previous points in time, e.g. a grocery store's orders for potato chips in the third week of November, 2003. This is in contrast to big data such as the number of times the store's automatic door opener operated during the same period.

Big data is about finding the most obscure statistical correlations between phenomena of obvious relevance and seeming irrelevance. It's the mythical relationship between Wal-Mart, strawberry Pop-Tarts, and hurricanes:

http://www.hurricaneville.com/pop_tarts.html


So the author is suggesting.

I think that the two are different, though. The difference: scale. Data warehousing solutions simply can't scale to "big data". Many companies have a "data warehouse" that is usually a single (albeit very beefy) server running SQL Server. But "big data" is the kind of data that can't be supported by a single server or by a traditional database. It's the difference between warehousing activity in a network of 10k employees vs. warehousing activity on a website with a million daily uniques.

That's my impression, anyway -- "data warehousing" is dominated by the RDBMS, but "big data" is dominated by highly scalable, distributed databases.


Most people who use the term "big data" are wannabe big data people, but will never have data too large to fit/process on a single server. In this use, it is really just OLTP vs OLAP all over again, which leads us naturally to ETL our data into a DW.


I'm going to somewhat disagree. Traditional data warehouses (Teradata, for example) can handle data well into the petabytes. I think eBay's main Teradata cluster was around 15 or 20 PB last I knew.

I think you're heading in the right direction with the single-server vs. N-server distinction. Every data warehouse I've worked on in the past decade or so has been multi-server.


Well, I think Teradata probably falls under the big data umbrella, given that it's basically a distributed database, right? I was thinking more about just plain Oracle or SQL Server or what have you, which can scale pretty big, big enough for 99% of companies, but not "big", and are notoriously difficult to cluster/distribute.


Fair enough. I think the reason people struggle with classifying this stuff is that there are a lot of gray areas and it's difficult to verbalize all the details meaningfully. One could say "Oh... RDBMSes!"... but Teradata is still relational; it's just distributed. "Oh! OLAP vs OLTP!"... but it's not really that either. One _could_ use Teradata for a couple-gigabyte data mart just fine (and argue that it's not big data). That's why I tend to fall back on the three or four Vs that have come to typify "big data": volume, velocity, variety, and veracity.


There is a qualitative difference in the way people are analyzing large amounts of data. Data warehousing typically means fairly traditional SQL queries on vast amounts of data; a window function might be considered sophisticated.

Now, technologies that can do much more are becoming more accessible. Graph, time series, path analysis, fraud detection, customer churn analysis, statistical methods, etc.

Disclaimer: I work for a "big data" company that is also a data warehousing company.


As a data warehousing/BI specialist I'd like to share my interpretation.

'Data' data (as in your company's data warehouse) is primarily internal data. By this I mean: data generated primarily by your source systems or associated business processes.

'Big' data is external data. It is tweets that express opinion about your brand. Weather patterns that might influence supply logistics. Demographics of markets you wish to penetrate.

I think there are a great number of slightly different interpretations out there but pragmatically speaking, what I have suggested here makes sense.


I think that _can_ be a starting point; however, there are definitely cases where it doesn't hold. Take eBay or Facebook, for example. I'm pretty sure their "data" data is "big" data... would you concur? I think external data naturally increases the variety, and (depending on the sources) also the volume and velocity, potentially pushing it over the edge from "data" data to "big" data.


>would you concur?

I see what you're driving at and I don't disagree, but I bet both eBay and Facebook have a bit of data in a ruddy old fact/dimension warehouse and some neat dashboards and reports over the top of it.

In which case I'd stick with my definition and simply state that in these examples, they have both types of data.

I just realised I know zero about the infrastructure of either of those businesses so I have some reading to do... oh, and everything I said about them might be crap.


Precisely.


> “Big Data” data tends to be cold data

My experience is the opposite: big data tends to be a lot of rapid streaming data, near-real-time, with inputs from many sources and sensors. This overwhelms traditional databases.

For us, big data means we want stream filtering, heuristic sampling, map reduce, and the like.


How often do you go over the raw data, in a manner that aggregates couldn't help you with, for data from 6+ months ago?


Never. "Big Data" for this use case is entirely about velocity, not volume. I suppose a better catchphrase would be "Fast Data". :)


Yes, I think "streaming data" and "big data" really are different things, and I like your term "Fast Data". From a lower-level perspective they involve very different challenges.

If you need to store the data (Twitter-like data), then storage becomes your primary concern. Analysis can be done later and in less-than-real time. But if your data are coming in so fast that keeping up is the problem, then storage isn't even a consideration (you generally just don't store it) and analysis becomes the challenging part: you need to aggregate to be able to update your metrics on the fly.
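
A minimal sketch of that kind of on-the-fly aggregation, here just a running count and mean per metric, updated as events arrive (the metric names and values are made up for illustration):

    from collections import defaultdict

    class RunningStats:
        """Incrementally track count and mean per key; no raw events are stored."""
        def __init__(self):
            self.count = defaultdict(int)
            self.mean = defaultdict(float)

        def update(self, key, value):
            # Incremental mean update: no need to keep the event history around.
            self.count[key] += 1
            self.mean[key] += (value - self.mean[key]) / self.count[key]

    # Hypothetical usage: events stream in faster than you could persist them raw.
    stats = RunningStats()
    for event in [("api_latency_ms", 12.0), ("api_latency_ms", 48.0), ("queue_depth", 3.0)]:
        stats.update(*event)
    print(dict(stats.mean))  # metrics are always current, storage stays tiny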

So they really aren't even overlapping problems outside of the fact that they both deal with lots of data.


Manufacturing systems are one area where you really care about historical data for many years, for operational reasons as well as for data-retention compliance.


The three typical traits used in discussing big data are volume (size of the data), velocity (speed of data entry/exit, or, as you call it, how "hot" the data is), and variety (number of different sources, schemas, etc.), with veracity (trust of the data) sometimes added as a fourth. Is DJ/Liz's assertion that if you don't have volume, you're not dealing with big data?

I guess my response would be... who cares?

While the techniques and infrastructure for high volume can be different from those for high velocity (near-time/real-time analytics is different from the more batch-based, whole-hog analytics associated with high volumes), they are often related. Similarly, there are often commonalities in dealing with the various Vs, which may just differ in scale.

Would they argue that one shouldn't be looking at using Chef/Ansible/Salt, etc. as long as you're dealing with a handful of machines? If so, I would think most would counter that laying a good baseline before growth occurs is a good thing to do. From a "big data" perspective, would they then argue that one doesn't need an ETL/ELT pipeline process? That one shouldn't think about how to deal with high volume? That would seem... less than optimal.


We, as an industry, feel the need to invent catchy names for existing things so that we can sell/market our products. Big Data, Cloud, SOA, (and the worst of all) Web 2.0, etc. Then we get to debate forever trying to fit definitions to those names... sigh. There's a saying where I come from: "The village fool threw a stone in a well; forty wise men couldn't pull it out."


That's what I was thinking. As I read, I kept thinking "but you're just making up terms and definitions and pretending that there's some objective standard." Even for fairly standard tech terminology, usage drifts all over the place.


Translating this blog post:

"MongoDB can't handle Big Data so we are going to redefine Big Data for our own convenience as cheerleaders of MongoDB and then assert that Big Data, per our new definition, is really not that important."

The blog post makes quite a few dubious assertions, apparently for the sole purpose of justifying MongoDB's inadequacies as a large-scale data platform.


We don't dispute the importance of big data (whatever the definition). What we tend to find, though, is that customers want to optimize for big data problems they'll never have. You can run a Mongo cluster for 100s of TB of data (is that big data?), but it means making application and schema compromises that most people don't need to make. This post, more than anything, is to help our customers think properly about their data problems.

So I don't think your translation is correct. I actually think it's pretty far off. We're probably some of the most cynical MongoDB users you'll meet.

(disclosure: I'm one of the founders of MongoHQ)


Being required to make application and schema compromises in order to scale to tens or hundreds of terabytes is a symptom of the inadequacy I was referring to. It is not a property of databases generally, it is a property of MongoDB.

I get the argument that customers should not over-engineer their database systems but in other databases a lot of that "over-engineering" comes almost for free in terms of user effort.

Also, for many types of analytics, there really isn't a concept of "cold" data. A single query should be able to access data inserted milliseconds ago and data inserted a month ago as though it were in the same table. A lot of "real-time" analytics work this way. This does not need to be done purely in-memory if the storage engine is designed well. The old OLTP/OLAP dichotomy of the 1990s has been slowly fading for a long time.


Just to clarify, I meant "compromise" in a more broad sense. Scaling databases requires continuous compromise, usually of flexibility. There's nothing unique to Mongo about this. In the relational world (actually, with Mongo too) you end up denormalizing data, which creates complexity. You may also give up joins, secondary indexes, constraints, etc.

Some DBs handle this by requiring compromise at the very beginning, which makes sense when you're going to have a huge amount of data from the get go.

There's no DB on the planet that lets you go up into the TB range without having to make some compromises (either up front or down the road).


You do have to give up the transaction theoretic elements once you get into the hundreds of nodes. Or at least, you will notice the sub-linear behavior in the scaling. Complex updates across multiple records will show some limitations on performance.

For things like joins, query selectivity on multiple columns, etc not so much. You don't even need secondary indexing or denormalization to do things like graph analysis or polygon searches on a table (in the same query even) at scale. All of the access method related operations can scale very efficiently to massively parallel systems if you use the appropriate data structures and algorithms.

MongoDB uses an approximation of the correct algorithms for gigabyte scale systems. Those algorithms are just wildly inappropriate when you start talking about terabyte scale systems. I have no investment in MongoDB negative or positive, but like all databases it is going to be lousy outside of the implicit scope supported by the design and architecture. In the specific case of MongoDB, and as someone that has designed their share of database engines, the internals are not designed to support non-small databases to any significant extent.

And honestly, a terabyte is a pretty trivial database these days. That is the kind of thing you run on a single server with ease. Smoothly scaling that to dozens of nodes as though it was a single system is something you can buy. I really don't understand the assertion that scaling to 10TB is difficult or requires anything different than scaling to 10GB. That is demonstrably untrue.


Every company builds a narrative around their products. I don't really see a problem with that. "Big Data" has already been redefined or misinterpreted enough times already, so I don't really think they're doing any additional damage to the terminology.

Personally, I think there are some holes in the story they are telling in general, but this is not one of them. The question I have about their product is: why not just store the JSON in a SQL database, many of which are adding JSON as a native type? Did they really need to reinvent an entire database system? Why not just take postgres and build on that like everyone else does?
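
For what it's worth, a rough sketch of the "just store the JSON in a SQL database" idea against Postgres's native json type via psycopg2; the DSN, table name, and document shape here are invented for illustration:

    import json
    import psycopg2

    # Assumes a reachable Postgres instance; connection string and table are illustrative.
    conn = psycopg2.connect("dbname=app user=app")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, doc json)")

    doc = {"user": "alice", "action": "login", "meta": {"ip": "10.0.0.1"}}
    cur.execute("INSERT INTO events (doc) VALUES (%s)", (json.dumps(doc),))

    # Query inside the stored document with Postgres's JSON operators.
    cur.execute("SELECT doc->>'user' FROM events WHERE doc->>'action' = %s", ("login",))
    print(cur.fetchall())
    conn.commit()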


There's probably a parallel somewhere here between Jevons Paradox and this idea of what big data is all about. Jevons Paradox was the observation that the consumption of coal increased, rather than decreased, after steam engines came along (they originally thought steam engines using less coal would lead to less coal consumption overall). But what happened (obvious to us in the 21st century) was that steam engines made transportation itself more energy-efficient, which created an increase in demand for fuel, of which coal was the primary type available at the time.

In other words: the less costly a useful task becomes, the more we tend to do it "just because".

Seems like the author is trying to say: storing a lot of data vertically "just because" does not big data make. But the author does not fully explain why this is significant.

It is significant, though, because Jevons Paradox would predict that eventually companies will want to access so-called "frozen" data more frequently and will want to make that frequent access as cost-effective (and payload-efficient) as possible, which distributed NoSQL-type DBs do very well.


A college professor of mine observed something quite similar: he'd been in administration and had returned to teach for a final year before retiring. His observation: whenever you build a computer system, you've got to design it for much more than the scale you think you'll need, because people will come up with uses for it which you hadn't anticipated.

Granted, this was the 1980s when mainframe / centralized computing was big (client-server was still a few years off), and the desktop revolution was just getting underway.


What exactly is the author's point? Is he just upset about the terminology usage? Is he implying that he has "big data" like it is some kind of pissing contest? He never even explains what kind of scale he considers to be "big data". He just says everything is lots of data and nothing is big data at all.

Also his usage of childish memes makes me unable to take it seriously.


I think the point is that people often use the wrong tools thinking they have "big data" and therefore must use the same techniques someone like Google uses. The difference is that Google most likely has many orders of magnitude more data and does orders of magnitude more processing on it.

I think it is useful to have some rules of thumb as to when you need to apply more exotic techniques and tools vs. something where simple stuff works. So in that sense, asking whether you have "big data" or not can be useful...


So let's skip the discussion about how much data is a lot.

Around which dataset sizes are which methods appropriate?


There was one meme, and it was just a picture of a dog.


It's from a mongodb shop, so big data means web scale!


Also, MongoDB likes to use a lot of storage space to actually store data - we see it using about twice the number of bytes as the same data in JSON.

The solutions are obvious - store it compressed https://jira.mongodb.org/browse/SERVER-164 - and tokenize repeated values like field names https://jira.mongodb.org/browse/SERVER-863 - note how many years those have been open without progress!
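
A back-of-the-envelope sketch of both ideas, compressing documents and tokenizing repeated field names, using only the Python standard library (the sample documents are invented):

    import json
    import zlib

    # Invented sample: many small documents sharing the same verbose field names.
    docs = [{"customer_identifier": i,
             "transaction_amount_usd": i * 1.5,
             "transaction_timestamp": "2014-02-06T00:00:00Z"} for i in range(10000)]

    raw = json.dumps(docs).encode()

    # Tokenize repeated field names: store them once, refer to them by position.
    fields = sorted(docs[0])
    tokenized = json.dumps({"fields": fields,
                            "rows": [[d[f] for f in fields] for d in docs]}).encode()

    print("raw bytes:      ", len(raw))
    print("tokenized bytes:", len(tokenized))
    print("raw compressed: ", len(zlib.compress(raw)))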

Not that it really matters - MongoDB's single per-database lock means you'll have trouble inserting, updating or deleting data quickly enough for web scale!


To understand "big data", you need to read this book (it's free): http://infolab.stanford.edu/~ullman/mmds.html. Anything else anyone blogs about regarding the definition of big, cold, or whatever buzzwords (especially on a Mongo-related site) is water under the bridge.


So "big data" is just log data that has no apparent use? I would tend to disagree: you can perform "big data" operations on the same data sets you perform traditional analytics, you're just looking for different things.

Really, why don't we put down the pitchforks and call "big data" what it really is: data mining. You can perform data mining on small sets and large sets. Nearly infinitely large sets, given the tools to manage that data. "Big data" is just a buzzword that doesn't mean anything more than data mining (which includes machine learning, AI methods, etc) on a really large data set.


Data size is relative to company size. If I work with a 120GB clicklog file for a company with 10 employees, where no other employee has the tools or know-how to work with such a dataset, then that data is (treated as) big data. In the hands of Google, Yahoo, MS or Facebook it would probably look like a floppy.

No, you probably do not need a Hadoop cluster to work with a 120GB file. With Python and pandas you could probably run through it on a budget laptop. But data size will always be relative to your company's size and current know-how. In the most banal way: data can be big data because some manager can't open it in Excel.
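
A minimal sketch of that laptop-scale approach, streaming a click log through pandas in chunks and keeping only running aggregates (the file name and column are hypothetical):

    import pandas as pd

    # Hypothetical 120 GB click log with a 'url' column; never loaded whole into RAM.
    clicks_per_url = {}
    for chunk in pd.read_csv("clicklog.csv", usecols=["url"], chunksize=1_000_000):
        for url, count in chunk["url"].value_counts().items():
            clicks_per_url[url] = clicks_per_url.get(url, 0) + count

    # Top ten most-clicked URLs across the whole file.
    top = sorted(clicks_per_url.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top)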


I've done work for fairly small regional companies who feel that now that they have an eCommerce site, it means they have Big Data and they now need to spend a ton of money to manage this Big Data, when all they really have is worthless data and no understanding of what they could gain if they were collecting the right data.

Just because you have a lot of data and you don't know how to put it together doesn't mean you have Big Data. More often than not (in my experience), it means you have useless data. Take that money and hire someone who actually understands the right metrics to capture.


I tend to subscribe to the ideas that Joyent has around their Manta product -- it's about "data gravity" and "data velocity". There was an article submitted earlier to HN that didn't get many votes -- and no comments -- but is well worth a read:

"Building a Black Swan: Disrupting NetApp, EMC, Amazon’s S3 … maybe all BigData": https://medium.com/money-banking/b8427c23bf0f

https://news.ycombinator.com/item?id=6028334

Also worth skimming is the blog post introducing Manta:

"Hello, Manta: Bringing Unix to Big Data": http://www.joyent.com/blog/hello-manta-bringing-unix-to-big-...

And, more closely related to "you don't have big data", I found this little post about Manta also interesting: http://building.wanelo.com/post/54110156963/a-cost-effective...

(as an example of how keeping things simple (here: simple log files) can combine to provide the scalability needed to collect "BigData" -- and then the Manta architecture can help with turning that data into information)

(No affiliation with Joyent/Manta, but it does seem like they have a pretty good product concept, and even if you're not using Manta or Joyent, building on their ideas and architecture seems like something that might also be useful for smaller installations.)


The article's described origin of Big Data ignores the original big data problem: Insurance, specifically actuarial tables. The origins of big data are actually far older than 1990, going back to the first mass computing systems in the old days which were literally aisles of women working in offices computing actuarial tables[0].

One could even go further back to the 16th century (and earlier if you're really adventurous), where the idea that life statistics apply to groups (but not individuals) was first explained.

In short, 1990 is sort of an arbitrary date and does not accurately reflect the origin of Big Data. We have wanted to record the sum total of humanity for as long as we have had the ability to record things; it's simply becoming more reasonable to attempt today.

[0]http://www.officemuseum.com/1907_Actuarial_Division_Metropol...


Big data is also a moving target.

We can look at size and we can look at rate (read or write).

One size classification is: data we can store in memory vs. on drive. I wouldn't refer to anything <64GB as "big" data. If you can store it on one machine it probably shouldn't be considered big from a size perspective (so let's say 4TB as some sort of threshold).

More generally, if you need more than a single server to store or do real-time processing of your data I'd say it qualifies as big. Otherwise probably not.

Another factor a lot of people often don't look at is efficiency. Yes, if everything is a JSON string with a lot of static annotations we can make a small amount of data look very big. You need to look at a more information theoretic "size". This often applies to processing... Yes, you can spend many CPU cycles parsing incoming HTTP requests but that doesn't mean the data is inherently big.


There needs to be a time-dependent definition of "Big Data", because this term is frequently abused and results in confusion all around.

In my view there has always been "big" data. As others have pointed out, none of this is really any different from the data warehousing days. How much data you can store is a function of how much money you are willing to spend, and all that has changed is the amount of disk storage you get per dollar.

I would propose a definition as follows:

S_t > N * H_t

where S_t is the size of your data at time t, N is some constant number of hard drives, and H_t is the size of the average consumer HDD at time t; the data counts as "big" when the inequality holds.

So if we assume that today the average HDD is 2TB and we define N as, say, 200, big data starts at 400TB. Ten years ago "big data" would have been, say, 20TB. Simple.
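
A trivial sketch of that moving threshold, with N and the average drive size as assumed inputs:

    def big_data_threshold_tb(avg_hdd_tb, n_drives=200):
        """S_t > N * H_t: data only counts as 'big' past N times the average consumer HDD."""
        return n_drives * avg_hdd_tb

    print(big_data_threshold_tb(2.0))  # ~2014: 400 TB
    print(big_data_threshold_tb(0.1))  # ~2004: 20 TB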


Whenever I see an algebraic formula written like that, I have this odd compulsion to try and pronounce it as if it were English.


I read the underscores like a "u"... so I guess a mnemonic for this formula could be SuT 'N HuT... Sutton hut?!

Maybe some other letters would make the mnemonic more interesting :)


8-core procs are dirt cheap. Dual-socket mobos are cheap. A quarter terabyte of RAM is a couple grand. A terabyte drive costs fifty bucks. A 1,000-core GPU card is a couple hundred.

Til you max out a box with those specs, you don't need big data.


1 terabyte is definitely small data. There's only one company consistently doing big data and that's, of course, the Goog. I like the dynamic definition that big data is only what is not easily handled by existing tools. The tools of the past few years (Hive, Cascalog, Elasticsearch, etc.) and some of the emerging ones (Shark, Impala, Druid, etc.) have radically raised the bar for what problems smaller shops can handle.

We're handling a very modest small data set (1,000 messages/sec) with just a few people. That's thanks to the awesome tools coming out of this fruitful (and overhyped) industry.


I tell everyone that it's not big data that they're handling, but it's like speaking to deaf people. It's even in radio spots, "handle and organize big data with our tools."

If you want to refer to real big data, perhaps just switch to the term "50PB of data" or whatever order of magnitude it is.


Not if you use mongo



