The mobile space is constantly evolving, and a recent survey published by Red Hat found that back-end integration and security were the two most rapidly changing areas of the mobile landscape.
Smart farming is a concept quickly catching on in the agricultural business. With high-precision crop control, useful data collection, and automated farming techniques, a networked farm clearly has many advantages to offer.
Deciphering the "nuts and bolts" of individual customer journeys (and deducing intent) is core to improving customer experience and driving brand loyalty.
Written by Tim Brock.

Scatter plots are a wonderful way of showing (apparent) relationships in bivariate data. Patterns and clusters that you wouldn't see in a huge block of data in a table can become instantly visible on a page or screen. With all the hype around big data in recent years, it's easy to assume that having more data is always an advantage. But as we add more and more data points to a scatter plot, we can start to lose these patterns and clusters. This problem, a result of overplotting, is demonstrated in the animation below.

The data in the animation above is randomly generated from a pair of simple bivariate distributions. The distinction between the two distributions becomes less and less clear as we add more and more data.

So what can we do about overplotting? One simple option is to make the data points smaller. (Note that this is a poor "solution" if many data points share exactly the same values.) We can also make them semi-transparent. And we can combine these two options. These refinements certainly help when we have ten thousand data points. However, by the time we've reached a million points, the two distributions have seemingly merged into one again. Making the points still smaller and more transparent might help; nevertheless, at some point we may have to consider a change of visualization. We'll get on to that later.

But first let's try to supplement our visualization with some extra information. Specifically, let's visualize the marginal distributions. We have several options. There's far too much data for a rug plot, but we can bin the data and show histograms, or we can use a smoother option: a kernel density plot. Finally, we could use the empirical cumulative distribution. This last option avoids any binning or smoothing, but the results are probably less intuitive. I'll go with the kernel density option here, but you might prefer a histogram. The animated GIF below is the same as the GIF above but with the smoothed marginal distributions added. I've left scales off to avoid clutter and because we're only really interested in rough judgements of relative height.

Adding marginal distributions, particularly the distribution of variable 2, helps clarify that two different distributions are present in the bivariate data. The twin-peaked nature of variable 2 is evident whether there are a thousand data points or a million, and the relative sizes of the two components are also clear. By contrast, the marginal distribution of variable 1 has only a single peak, despite coming from two distinct distributions. This should make it clear that adding marginal distributions is by no means a universal solution to overplotting in scatter plots.

To reinforce this point, the animation below shows a completely different set of (generated) data points in a scatter plot with marginal distributions. The data again comes from a random sample of two different 2D distributions, but both marginal distributions of the complete dataset fail to highlight this separation. As previously, when the number of data points is large, the distinction between the two clusters can't be seen from the scatter plot either.

Returning to point size and opacity, what do we get if we make the data points very small and almost completely transparent? We can now clearly distinguish two clusters in each dataset, though it's difficult to make out any fine detail. Since we've lost that fine detail anyway, it seems apt to question whether we really want to draw a million data points at all: it can be tediously slow, and in certain contexts impossible.
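As a concrete illustration of the small-and-transparent approach, here is a minimal sketch in R with ggplot2. The simulated two-cluster dataset and all of its parameters are invented for illustration; they stand in for, rather than reproduce, the data behind the animations.

```r
# A minimal sketch (simulated data, not the article's): a million points
# drawn from two overlapping bivariate normal clusters.
library(ggplot2)

set.seed(1)
n <- 1e6
cluster <- sample(1:2, n, replace = TRUE, prob = c(0.7, 0.3))
df <- data.frame(
  x = rnorm(n, mean = c(-1, 1)[cluster]),
  y = rnorm(n, mean = c(1, -1)[cluster])
)

# Tiny, nearly transparent points: ink accumulates where the data is dense,
# so structure survives far longer than with default-sized opaque points.
p <- ggplot(df, aes(x, y)) +
  geom_point(size = 0.1, alpha = 0.01)
p
```

Smoothed marginal distributions like those described above can then be attached with, for example, ggExtra::ggMarginal(p, type = "density").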
2D histograms are an alternative. By binning the data we can reduce the number of points to plot and, if we pick an appropriate color scale, pick out some of the features that were lost in the clutter of the scatter plot. After some experimenting I picked a color scale that runs from black through green to white at the high end. Note that this is (almost) the reverse of the effect created by overplotting in the scatter plots above. In both 2D histograms we can clearly see the two different clusters representing the two distributions from which the data is drawn. In the first case we can also see that there are more counts in the upper-left cluster than in the bottom-right cluster, a detail that is lost in the scatter plot with a million data points (but more obvious from the marginal distributions). Conversely, in the case of the second dataset, we can see that the "heights" of the two clusters are roughly comparable.
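Continuing the same illustrative sketch, binning the simulated data into a 2D histogram with a black-through-green-to-white fill scale along the lines described above (the bin count and exact colors are arbitrary choices):

```r
# Same simulated data frame df as in the sketch above: draw bin counts
# rather than a million individual points.
ggplot(df, aes(x, y)) +
  geom_bin2d(bins = 100) +
  scale_fill_gradientn(colours = c("black", "green4", "white"))
```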
3D charts are overused, but here (see below) I think they actually work quite well in terms of providing a broad picture of where the data is and isn't concentrated. Feature occlusion is a problem with 3D charts, so if you're going to go down this route when exploring your own data, I highly recommend using software that allows for user interaction through rotation and zooming.

In summary, scatter plots are a simple and often effective way of visualizing bivariate data. If, however, your chart suffers from overplotting, try reducing point size and opacity. Failing that, a 2D histogram or even a 3D surface plot may be helpful. In the latter case, be wary of occlusion.

The last few years have seen a tremendous boom in the number of online sources relaying information about restaurant quality. Whether it's review sites or more general social media, there is no shortage of feedback on how people have found a particular restaurant.

I wrote a few years ago about a project from the University of Rochester that aimed to mine Twitter for mentions of eating out, with the hope of producing a detailed and comprehensive map of food hygiene standards across restaurants in New York. The system, called nEmesis, analyzed millions of tweets, hunting for people reporting an attack of food poisoning after visiting a restaurant. You might think, or at least hope, that this would be a relatively small number, but over a four-month period they found 480 such mentions in New York City alone, from a total of 23,000 restaurant visitors. What's more, the data collected correlated well with public health data on those diners.

Crowdsourcing food hygiene

A recent Harvard-led project hopes to provide similar assistance to the Boston food hygiene authorities by giving them more intelligent information on which to base their inspection checks. Rather than using Twitter for data, however, the Harvard project is turning to the review website Yelp. They have launched a Netflix-style competition to create an algorithm that can search through the ratings of restaurants in Boston and produce recommendations for which restaurants warrant a visit from the hygiene police. The competition, organized by the data company DrivenData, will see the raw data posted online and an army of data scientists charged with solving the puzzle. The founders observed that whilst the collection of machine-readable data is now mandated by the government, a data-literacy problem leaves much of that data dormant and unused.

Bringing data science to the masses

And so the competition was born, to try to make data science affordable for organizations with a clear social need but no budget for what are still very expensive skill sets. Of course, the food hygiene challenge is but one of the challenges on the DrivenData website, with the venture having come a long way from its first challenge, to build a better algorithm for improving spending in schools. The organization tries to ensure that the winning entries that emerge from its competitions receive support and help to grow and improve. The winner of that initial competition, for instance, eventually turned their algorithm into a software tool for schools to use. The eventual aim is to establish a community of data scientists who are happy to deploy their talents for socially worthwhile endeavors.

"Our mindset has grown; we want to solve the big-picture data literacy and data capacity problems in the social and public sectors," the creators say. "We think competitions are a great mechanism to do that right now, but our goal is to do more, to serve that community in other ways."

Suffice it to say, challenges have come a long way from their beginnings in the 18th century, when the UK government launched such a competition to help determine longitude more easily. The likes of the X Prize have taken them to newfound heights, and it's great to see organizations like DrivenData applying the concept to more manageable challenges. Of course, they aren't the only organization seeking to make algorithms more accessible.
I wrote last year about the Algorithmia social network, which aims to connect organizations that have lots of data with algorithms that are being under-utilized. The hope is that this match-up will create not just new insights but extra profits. Data science is undoubtedly a burgeoning field, and one with a great many exciting developments in it.
There has been a wide array of successful consumer apps, like Angry Birds, WhatsApp, and Dropbox. After years in the publicity spotlight, these consumer-app giants have finally understood the importance of offering enterprise-grade features, and in the last few years the focus has suddenly shifted to enterprise mobile apps. Rapid development, tracking and monitoring apps, wearable apps, Internet of Things apps, geo-location technologies like iBeacon and geofencing in business apps: the list of emerging app niches and technologies is long. Let us take a quick look at some of the most definitive app niches and technologies of recent times.

Enterprise apps

While smartphones and mobile devices continue to move off the shelves and millions of apps keep the app stores brimming with energy, activity, and competition, most consumer apps still fail to earn enough to survive beyond the year of their launch. This has been the sordid storyline for consumer apps for years, so for some time the focus of developers has been shifting towards enterprise apps. Moreover, businesses are now bent on going mobile and keen to develop apps that make their business processes more productive. Although enterprise mobile apps have only just started to take off, this new and broad app niche has already shown huge promise to overtake consumer apps within little more than a year.

Rapid development

As enterprises now focus all-out on embracing mobile apps in their business processes, the demand for enterprise-grade apps has made rapid development cycles inevitable. When winning the competition boils down to a fast, user-focused mobile presence, fast-paced development naturally becomes the rule. This overwhelming demand for business apps and enterprise-grade software has made rapid development a key criterion, and shortening the development lifecycle has become a major focus for most mobile app development companies around the world.

Mobile monitoring apps

The wide adoption of mobile devices and apps across all age groups in recent years has given rise to certain concerns: child safety, parental concern about negative influences on children, and employers' concerns about employee productivity and information security, among others. iOS and Android monitoring software, child phone-tracker apps, mobile spy software, and text-message tracking apps are a few of the app types becoming increasingly popular to address these concerns in family and workplace environments.

Internet of Things (IoT) apps

The world around us is becoming connected through mobile devices, and the gadgets and appliances around us increasingly come equipped with a mobile control interface. This new horizon of interconnected devices is referred to as the Internet of Things, or IoT. Now an electric toaster can be controlled from its companion app on a mobile device. Similarly, a music system with its own mobile app can be turned on and off, tuned, and given other commands. This new breed of apps is being called IoT apps.

Wearable apps

Smartphones and smart mobile devices now play the central role in connecting all types of wearable smart devices. Most smartwatch apps are, in character, still only extensions of their mobile counterparts. But as the smartwatch slowly emerges as the next big device platform and the most common wearable, a new breed of apps is being developed that targets smartwatch and wearable users alongside the respective mobile apps. From smart jewelry to health trackers and fitness bands to head-mounted computers like Google Glass, these new wearable devices will be a target development platform for a vast majority of mobile app developers in the time to come.

More user-optimized mobile UI design

UI design is presently the area of mobile app development receiving the most attention worldwide. Experimentation and analysis aimed at making UIs better and more user-optimized continue, and a wide variety of new techniques and design approaches are producing unprecedented levels of user experience. From motivational design to flat design to playful interfaces, we have seen quite a few dominant design trends and techniques.

Geo-location technologies

Contextual, user-specific push notifications are the new way to engage users with a mobile app and to generate revenue in the process, and nothing enables this better than knowing the user's location. When you know a user is close to your retail shop, you can send them an offer to grab their attention and nudge them towards a visit to your store. Knowing the user's location thus translates into far better contextual, business-driven messaging and notifications. Several mobile-friendly geo-location technologies, such as iBeacon, geofencing, and geomagnetic positioning, let you integrate location-based user engagement features into your app.
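At its core, a geofence is just a proximity test. Here is a minimal sketch in R using the haversine formula; the store coordinates, user position, and radius are all made-up values, and a real app would instead register the fence with the platform's location APIs (region monitoring in Core Location on iOS, the Geofencing API in Google Play services on Android).

```r
# Illustrative sketch (hypothetical values): decide whether a user is
# within a store's notification radius using the haversine formula.
haversine_m <- function(lat1, lon1, lat2, lon2) {
  r <- 6371000  # Earth radius in metres
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

store <- list(lat = 40.7580, lon = -73.9855, radius_m = 200)  # made-up store
user  <- list(lat = 40.7592, lon = -73.9846)                  # made-up fix

# Inside the geofence? If so, the app would trigger the push notification.
haversine_m(user$lat, user$lon, store$lat, store$lon) <= store$radius_m
```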
Written by Nicole White.

It's a sad but true fact: most data scientists spend 50-80% of their time cleaning and munging data and only a fraction of their time actually building predictive models. This is most often true in a traditional stack, where most of this data munging consists of writing lines upon lines of some flavor of SQL, leaving little time for model-building code in statistical programming languages such as R. These long, cryptic SQL queries not only slow development time but also prevent useful collaboration on analytics projects, as contributors struggle to understand each other's SQL code.

For example, in graduate school, I was on a project team where we used Oracle to store Twitter data. The kinds of queries my classmates and I were writing were unmaintainable and impossible to understand unless the author was sitting next to you. No one worked on the same queries together because they were so unwieldy. This not only hindered our collaboration efforts but also slowed our progress on the project. If we had been using an appropriate data store (like a graph database), we would have spent significantly less time pulling our hair out over the queries.

Why Today's Data Is Different

This data-munging problem has persisted in the data science field because data is becoming increasingly social and highly connected. Forcing this kind of interconnected data into an inherently tabular SQL database, where relationships are only abstract, leads to complicated schemas and overly complex queries. Yet several NoSQL solutions, specifically in the graph database space, exist to store today's highly connected data; that is, data where relationships matter.

A lot of data analysis today is performed in the context of better understanding people's behavior or needs, such as:

How likely is this visitor to click on advertisement X?
Which products should I recommend to this user?
How are User A and User B connected?

People, as we know, are inherently social, so most of these questions can be answered by understanding the connections between people: User A is similar to User B, and we already know that User B likes this product, so let's recommend this product to User A.

The Good News: Data-Munging No More

Data science doesn't have to be 80% data munging. With the appropriate technology stack, a data scientist's development process is seamless and short. It's time to spend less time writing queries and more time building models by combining the flexibility of an open-source NoSQL graph database with the maturity and breadth of R, an open-source statistical programming language. The combination of Neo4j's ability to store highly connected, possibly unstructured data and R's functional, ad-hoc nature creates the ideal data analysis environment. You don't have to spend an hour writing CREATE TABLE statements. You don't have to spend all day on Stack Overflow figuring out how to traverse a tree in SQL. Just Cypher and go.

Learn More at OSCON 2015

At my upcoming OSCON session we will walk through a project in which we analyze #OSCON Twitter data in a reproducible, low-effort workflow without writing a single line of SQL. For this highly connected dataset we will use Neo4j, an open-source graph database, to store and query the data while highlighting the advantages of storing such data in a graph versus a relational schema. Finally, we will cover how to connect to Neo4j from an R environment for the purposes of performing common data science tasks, such as analysis, prediction, and visualization.
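As a small taste of that workflow, here is a minimal sketch using the RNeo4j package against a local Neo4j instance. The graph model (User, Tweet, and Hashtag nodes joined by POSTED and TAGGED relationships) is an assumption for illustration rather than the session's actual schema.

```r
# A minimal sketch: query Twitter data stored in Neo4j directly from R.
# Assumes a Neo4j server on localhost and a hypothetical graph model of
# (:User)-[:POSTED]->(:Tweet)-[:TAGGED]->(:Hashtag).
library(RNeo4j)

graph <- startGraph("http://localhost:7474/db/data/")

# Who tweets most about #oscon? One Cypher pattern match replaces what
# would be several joins in SQL; the result arrives as an R data frame.
query <- "
MATCH (u:User)-[:POSTED]->(t:Tweet)-[:TAGGED]->(h:Hashtag {name: 'oscon'})
RETURN u.screen_name AS user, COUNT(t) AS tweets
ORDER BY tweets DESC
LIMIT 10
"
top_users <- cypher(graph, query)
head(top_users)
```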