

Building Big Data Confidence – Continuing the Conversation

Confidence in big data is essential. Without confidence, decision-makers may not act on big data insights – which would completely negate the benefits of big data in the first place. Understanding the level of confidence in big data, and selectively improving confidence to the required level, ensures the successful adoption of big data and analytics.

Last week IBM launched new innovations in Information Integration and Governance at an event titled “Building Confidence in Big Data”. The term “confidence” really resonated with the clients, analysts, and press in attendance. It’s a business issue that organizations are struggling with – and their ability to understand and improve confidence is directly related to their ability to leverage big data.

One of the speakers at last week’s event was Michele Goetz from Forrester. She unveiled new research strongly linking big data success with the presence of mature information integration and governance (IIG). Organizations with mature IIG technology and practices were far more likely to be doing big data projects, and also more likely to be successful with them. One of the interesting aspects uncovered is the notion that big data is governed in ‘zones’: certain types of big data require certain types of governance, and specific big data use cases have specific requirements for governance. This changes the old notion of governing the data once and then using it; the new approach is to understand the usage and the data, then govern to the appropriate level – that is Agile Governance. The research, based on surveys of hundreds of organizations, contains many other interesting conclusions and recommendations; you may download it here: http://ibm.co/17DNTvS

This is an important topic in the big data market, and we plan on continuing the conversation on big data confidence. Our next conversation will be Tuesday September 17, on a webcast that Michele and I will host, entitled “Building Confidence in Big Data with Information Integration and Governance”. I hope you can join us tomorrow at 2 PM EST – you can register here – http://bit.ly/19p2WgI

Source: http://corrigandavid.wordpress.com/



3 Data Integration Technologies, 1 Common Foundation

I’ll be speaking with Eric Thoo of Gartner on a webcast on June 13 entitled “Data Integration Styles: Choosing an Approach to Match Your Requirements.” Click here to register – http://bit.ly/KLhZ7J

In the webcast, we will go into detail on three styles of integration: bulk data movement, real-time, and federation. Bulk data integration involves the extraction, transformation, and loading of data from multiple sources to one or more target databases. Key capabilities of bulk integration are extreme performance and parallel processing: batch windows continue to shrink, data volumes continue to grow, and the new wave of big data puts even more emphasis on batch integration performance. Real-time integration involves replication and low-latency integration. It is often used to synchronize operational databases and to power real-time reporting and analysis. Federation is a completely different approach – it leaves data in place and allows users to access it via federated queries. This style of integration is very important for operational systems, and it is a cost-efficient complement to batch integration – only move what is necessary, leave the rest in place, and access it as required. In the webcast Eric Thoo will provide details on each style and the uses for each.

These three styles of integration should not be independent and discrete from one another. They should share something in common – a foundation that establishes trust in information: a foundation that profiles data quality, improves the accuracy and completeness of data, tracks its lineage, and exposes enterprise metadata to facilitate integration. Clients derive real value from a common approach to all three styles because they leverage a common foundation for information trust – common rules for data quality, metadata, lineage, and governance.

On the webcast we will explore the specific requirements for which each style is suited. If you look at the larger IT project, typically all three styles are required. For example, supplying trusted information to a data warehouse will require bulk data integration, but for specific reporting needs it may also need real-time integration, and potentially even federation to access other data sources. Building and managing a single view with MDM will again require bulk integration to populate MDM, real-time integration both to and from the MDM system, and federation to augment MDM’s business services to blend data stored within MDM and data stored in other source systems.

The common foundation of trust and governance, and the need to use multiple technologies are the keys to making a strategic choice of technology – one that you can leverage across the life of a project and into other projects that require integration.

Please join us on June 13, when we will share more details on this topic.

Register here http://bit.ly/KLhZ7J

Source: http://corrigandavid.wordpress.com/



Big Data Integration & Governance – Leveraging Your Existing Technology

In September 2013, IBM announced exciting innovations in Information Integration & Governance (IIG) that represented an evolution of these products for the new era of computing – big data, cloud, mobile, and social. The key point was evolution: because this product line was designed from the outset to handle large volumes, and was built to be modular so it could accommodate new components and technologies, it was able to adapt to new big data requirements and adopt new big data technologies with ease. Fortunately, the product set was already designed to handle the key requirements of big data – massive scalability and parallel processing.

Some of the key things IBM announced were agile integration and governance, confidence in big data, and automated integration. It’s been six months since those announcements, and these capabilities are really taking hold in the market.

First let’s examine data confidence. This has become a top market issue in the past year. Numerous companies I’ve met with are increasingly concerned with measuring confidence in data, and improving it, for their big data & analytics projects. The rise of the Chief Data Officer (CDO) is evidence of this point – companies are charging CDOs with understanding and improving data so it may be utilized, and trusted, for analytics. There are numerous press articles highlighting the importance of “confidence”, “metadata”, and “lineage” with respect to big data projects. Here’s one such article that appeared in Forbes – http://onforb.es/1hU1VOt. But most importantly, clients are making this part of their big data projects – I’ve seen a noticeable shift toward utilizing metadata and lineage to approximate confidence levels in big data.

IBM also announced advancements in automated integration, a capability it was the first to champion as ‘the next frontier for data integration’. The hypothesis was simple: in the era of big data, you’ll have more sources and more repositories for specialized analytic workloads, therefore you’ll have more integration, so it should be made simple. And that’s exactly what has played out in the market. Companies have adopted IBM’s Data Click to speed integration between big data sources and targets, and to offer business departments ‘self-service’ access to big data. This has proven extremely valuable, as more companies look to centralize big data in a ‘data lake’, and technical users in business departments want to bring that data into their own repositories for ad hoc analysis.

And IBM’s capabilities in Agile Governance have received a warm reception from the market. The ability to apply appropriate levels of governance at different points in time is the key to quickly, yet safely, harnessing the power of big data. Many organizations have adopted IBM’s agile data security capabilities to monitor, mask and protect sensitive big data.

IBM’s Information Integration & Governance portfolio is the market leader in Information Governance, and it was designed from the beginning to handle the biggest and most complex requirements. That original design placed it in a fortunate position – it was already prepared for the era of big data in terms of performance and scalability, and simply had to adapt to newer requirements such as variety, and to new technologies such as Hadoop and NoSQL. So while competitors have played catch-up and reinvented their portfolios for big data, IBM has invested to move forward. Be on the lookout for some exciting new innovations in IIG in 2014.

For more information, go to IBM Big Data Hub.

Source: http://corrigandavid.wordpress.com/



Evolving Integration and Governance for Big Data Requirements

I had the pleasure of Tweet-chatting with Jim Harris, James Kobielus, Tim Crawford, and Richard Lee on the topic of Evolving Integration & Governance for Big Data. We started the discussion by asking whether there will be a backlash to high-profile mistakes with big data & analytics, such as the recent OfficeMax issue, when the company sent a marketing mailing with the name “Mike Seay – Daughter Killed in Car Crash” in the envelope window. Everyone agreed there would be some backlash, both from mistakes such as these and from consumer reaction to the ‘big brother’ feeling that some big data campaigns evoke (maybe that company knows a little too much about me). Regulations will force companies to address fundamental issues, such as knowing the origin of data and its intended use. We all agreed – there would be some consumer backlash and an increase in regulations, and therefore organizations must be ready to respond with a more agile approach to big data security. All were in agreement that this should in no way slow down the adoption of big data. In fact, the ability to protect sensitive big data could become a major differentiator for some firms.

Next, we drilled into how to find data and establish a level of confidence in it. It simply takes too long to gather data on each big data project – today I heard estimates ranging between 40 and 80% of project time being consumed by that one task. That’s obscene. What’s worse – companies pay that same tax repeatedly by failing to leverage an integration & governance platform. Integration technology could surely help reduce that number dramatically, with the aid of automated discovery and classification and self-service data integration capabilities. The issue of confidence brought out varied opinions. While some suggested that confidence was low, or always in question by business users, others asserted that confidence was high until proven otherwise, or at least high in reports that users have used previously (we trust what we’ve always assumed to be true). Confidence, while subjective and difficult to quantify, is very important to the adoption of big data and analytics. If users lack confidence in data, they will lack confidence in the results.

Another topic which sparked debate was the best way to fix these issues of confidence and rapid big data discovery – can existing integration and governance technology evolve and adapt to new requirements, or does it need to be reimagined and reinvented? Everyone agreed that evolution was the desirable and logical path. Existing integration and governance technologies should adapt to big data scale and adopt a wider variety of data types. They should also evolve to include new big data technologies to address those broader requirements. The conclusion was clear – there’s no need to reimagine and reinvent when the core products are fundamentally sound and built for big data – simply evolve them for these new requirements.

These tweet chats are hosted weekly under the hashtag #bigdatamgmt and I encourage you all to join the discussion. Also check out the latest blogs, videos, and infographics on big data integration & governance at ibmbigdatahub.com.

Source: http://corrigandavid.wordpress.com/


Self-Replicating 3D Printer

The “BI V2.0” printer from Boots Industries is one of the first 3D printers that can replicate itself. Not completely, of course, but the design of the basic frame is stored digitally and supplied by the developers. It is a delta-style 3D printer, which allows for a wide print area. A triple roller system enables large, high-quality 3D-printed parts: pieces of up to 300 mm x 300 mm in diameter can be produced.


Because the printer can print its own parts, broken parts can be replaced quickly, and you can help friends build an inexpensive 3D printer of their own. Assembly of the printer takes roughly 30 to 60 minutes. The most complicated step was tensioning the belts on the pulleys, which precisely determine the height and position of the print head. The printer is mainly operated from a PC, but with an external LCD board it can also be used and configured without one. This LCD board runs its software from an SD card.
The video below shows the printer in action once more:
Unfortunately it was filmed in portrait orientation.

Thanks to a good print head, it neither leaves strings nor builds weak spots into the part.

Extensions for the rod assembly can be installed to achieve a larger build volume.
Even then, the print head's layer resolution remains very fine at 0.05 mm, roughly half the layer height of standard 3D printers in this price category.

Here are a few example prints from the 3D printer.

Now I will go through the printer's construction; the description of each picture appears beneath it.

Closeups

Here is the triple roller system for precise adjustment of the position.

Triple roller system drive

The drive at the bottom uses a 32-microstep motor, which makes the high resolution possible. The belt is static (i.e. not stretchable) and tested to 23 kg.

All components

All of the components included in the kit, which can then be assembled in 30-60 minutes. Anyone who has never built a 3D printer and is relatively inexperienced will need more like 70.

Delta platform with cables that are still too long

The delta platform with slightly too-long cables. They still need to be shortened, but they do not cause any problems.

Corners with steel reinforcement

A corner reinforced with metal to ensure higher precision.

Electromagnetic self-stabilizing head

The electromagnetic, self-adjusting mount of the print head. This ensures that the print head never jams, and even on a 300 mm print there is no frustration from the print head melting through already finished parts: it glides elegantly over them.

Extruder on top of the delta tower

The extruder on the corner

Heated base plate for a fast start-up

The base plate can be heated, so printing can start sooner. It can also be adjusted to millimeter precision to achieve a clean and accurate print.

LCD board for standalone operation

A large LCD board (included in the kit) that lets you operate the printer without a PC. Print head, temperature, and speed can be set here, and print jobs can be started from the SD card (4 GB, enough for about 2,500 print files).

Linear height adjustment

The guide rails for the print head

SketchUp model of every component

A CAD model of one component; models of all components are supplied with the printer.

Overview
And a small overview of the printer

All in all, a pretty cool project for bringing cheaper 3D printers to the market. Financially it makes the most sense to buy one of these printers as a group and then build one for each member, or to recoup the cost by selling printed parts on eBay ;) The complete kit costs €480; current delta printers cost around €502 but have the disadvantage that they cannot be replicated as easily (here the CAD files are included).

Definitely worth a look for anyone interested in 3D printers and CAD.

The project is on Kickstarter and was developed by a Canadian. Funding has already shot far past its goal: he asked for $30,000 and has so far received $130,000. That is €73,405.27 more than expected…

Info:
The developer's website: http://bootsindustries.com/
Kickstarter: http://www.kickstarter.com/projects/1784037324/bi-v20-a-self-replicating-high-precision-3d-printe

 



Tuning the performance of Naiad. Part 1: the network
by Rebecca Isaacs
We have recently been talking about Naiad’s low latency (see, for example, Derek’s presentation at SOSP). If you have ever tried to coax good performance out of a distributed system, you may be wondering exactly how we get coordination latencies of less than 1ms across 64 computers with 8 parallel workers per machine. In this series of posts we’re going to reveal to our curious – and possibly skeptical – users what we did in gory detail.

One of the most significant sources of performance issues is the network. Here, I will explain how we configured our cluster and TCP settings to deal with the problems we encountered, and also try to give you some insight as to why we had these problems in the first place.

The cluster
Our test cluster consists of 64 machines, each with two quad-core 2.1 GHz AMD Opteron processors, 16GB of memory, and an Nvidia NForce Gigabit Ethernet NIC. The machines are running Windows Server 2012 and are placed on two racks, 32 on each rack. Each rack has a Blade Networks G8052 switch, with a 40Gbps uplink to the core switch (a Blade Networks G8124) shared by the rest of the data center. There is no interference from non-Naiad traffic.

Naiad establishes all-to-all TCP connections between the cluster machines, so that each computer handles 63 simultaneous connections. The software is structured so there are two threads dedicated to each TCP connection, one for sending, the other for receiving.

Network traffic patterns
Every worker in Naiad executes the entire program over some partition of the data. For more description of how this works, I urge you to read an earlier post. Since programs are represented as dataflow graphs, and there can be synchronization between stages of the graph, it is typically the case that each worker executes a vertex from the same stage at roughly the same time (modulo skew in the data, stragglers, etc.). In between stages the workers will often, but not always, exchange a bunch of data records over the network at roughly the same time.

In addition to high-volume data exchange, the computers involved in a Naiad program also participate in a distributed coordination protocol (see the “progress protocol” in the paper for details). If coordination messages are not sent and received promptly, then actual progress is slowed down. For example, a batching operator like the Min vertex not only needs to receive all of the input data records before it can run, but it also needs to receive the information that those records have all arrived.

Thus, we have two types of traffic with conflicting requirements: high throughput and low latency. And not only that, because the data exchanges between workers tend to happen almost simultaneously, we also potentially see transient congestion in the network and/or an incast problem, where buffers at the switch or at the receiver’s NIC can’t absorb the spike in traffic volume and overflow, dropping packets.

To summarize, the entire Naiad program will temporarily stall when either a data exchange is slow to complete, or progress protocol messages are delayed, between any pair of machines. These pauses are examples of what we term micro-stragglers, and they can collectively add substantial overhead to the running time of the program. Micro-stragglers have numerous causes throughout the software stack, but in the network they are primarily produced by packet drops at the switch or receiver, as well as congestion avoidance and back-off mechanisms.

Configuring switches and NICs
To try to avoid network congestion at the physical layer, we enabled Ethernet flow control on the rack switches and configured the host NICs to use large receive buffers. Despite this, we still saw a non-trivial amount of packet loss at the receivers, most likely because of the incast problem that is inherent in Naiad’s traffic pattern. Increasing the receive buffers didn’t alleviate the packet loss because TCP is designed to keep the buffers full, so any sudden increase in load to a single destination will inevitably overwhelm the receiver. Therefore we decided to tackle the problem using TCP’s end-to-end congestion control, as described below.

We also took some care to constrain the CPU costs of network processing. We enabled TCP offload on the NICs, and configured Receive Side Scaling (RSS) to balance the load across multiple cores. In practice, RSS will be more important at 10Gbps than at the 1Gbps of our network. It is worth noting that since we published the SOSP paper, some new RSS options for low latency scenarios have been introduced on Windows Server 2012, which could possibly have some benefit for Naiad.
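For concreteness, here is a minimal sketch of the kind of NetAdapter cmdlets involved in this tuning; the adapter name, buffer size, and advanced-property display name are illustrative and vary by NIC driver:

# Enable Receive Side Scaling on the host NIC (adapter name is illustrative)
Enable-NetAdapterRss -Name "Ethernet"
# Enable checksum offload on the same adapter
Enable-NetAdapterChecksumOffload -Name "Ethernet"
# Enlarge the NIC receive buffers; the property display name varies by driver
Set-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Receive Buffers" -DisplayValue 2048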

Configuring TCP
Nagle’s algorithm for TCP reduces per-packet overheads by coalescing small payloads: if there is an unacknowledged segment in flight, TCP will wait until it has enough data to send a full-sized packet. Essentially, Nagle’s algorithm increases throughput at the expense of latency for small packets, which can be made worse by the well-known poor interaction with delayed acknowledgements. In Naiad, where small packets are typically involved in the progress protocol, this is exactly the wrong trade-off and so we disable Nagling with the TCP_NODELAY socket option.
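As an illustration only (not Naiad’s actual code), disabling Nagle on a .NET TCP socket looks roughly like this; the peer address and port are placeholders:

# Create a TCP connection with Nagle's algorithm disabled (TCP_NODELAY)
$client = New-Object System.Net.Sockets.TcpClient
$client.NoDelay = $true                 # disable Nagling for this socket
$client.Connect("10.0.0.2", 2666)       # placeholder peer address and port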

The minimum retransmit timeout (MinRto) is another important TCP setting. Round trip times in our cluster are on the order of tens of microseconds, but the default TCP configuration for Windows Server 2012 sets the minimum timeout to 300ms – orders of magnitude larger! This doesn’t matter much when lots of segments are in flight and packet loss can be detected by the continuous stream of acknowledgements. The problem for Naiad is when the message only comprises a single packet, as is usually the case for progress protocol notifications, and that single packet is lost. Since the protocol is highly delay sensitive, it’s critical to set the MinRto and the delayed acknowledgement timer to their minimum values of 20ms and 10ms respectively, which we do using the following PowerShell script:

# Read the list of machines into a string array
$mcs = Get-Content .\cluster.lst
# Make a new session
$cluster = New-CimSession $mcs
# Set Rto to 20ms on every machine
Set-NetTCPSetting -CimSession $cluster -MinRtoMs 20 -InitialRtoMs 300 -DelayedAckTimeoutMs 10
# Confine effects to TCP port 2666
New-NetTransportFilter -CimSession $cluster -RemotePortStart 2666 -RemotePortEnd 2666 -LocalPortStart 0 -LocalPortEnd 65535
The script also sets the initial retransmit time to the minimum allowed value of 300ms, which helps the Naiad job to set up its all-to-all mesh of TCP connections faster. One unexpected problem we ran into as we scaled up to use more machines was difficulty in establishing these connections. Apparently the sudden onslaught of connection requests was triggering Memory Pressure Protection (MPP), which protects against TCP denial of service attacks by dropping SYN packets and closing existing connections. The feature can be disabled in the PowerShell script above using the option -MemoryPressureProtection Disabled.
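For completeness, a sketch of that extra setting applied through the same CIM sessions as in the script above (assuming the $cluster variable is still in scope):

# Disable Memory Pressure Protection so bursts of SYNs are not dropped
Set-NetTCPSetting -CimSession $cluster -MemoryPressureProtection Disabled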

What about a modern congestion control technique?
Ideally we would also use Data Center TCP (DCTCP) for congestion control, but that requires Explicit Congestion Notification (ECN), which our rack switches don’t support. DCTCP uses ECN packet marks to indicate the extent of congestion, which allows the sender to react by reducing its TCP window size in proportion to the fraction of packets that are marked. As a result, packet loss is not needed to signal congestion, as in regular TCP, which leads to shorter queues and better end-to-end latency. Since low latency is one of our requirements, DCTCP should be beneficial – we plan to try it out if we can find a large enough cluster with ECN-capable switches.
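For readers whose switches do support ECN, here is a hedged sketch of how DCTCP can be selected on Windows Server 2012 with the stock NetTCPIP cmdlets; we have not exercised this configuration ourselves, and the port range is illustrative:

# Choose the DCTCP congestion provider for a customizable TCP setting template
Set-NetTCPSetting -SettingName DatacenterCustom -CongestionProvider DCTCP
# Map the data port onto that template
New-NetTransportFilter -SettingName DatacenterCustom -LocalPortStart 2666 -LocalPortEnd 2666 -RemotePortStart 0 -RemotePortEnd 65535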

Decoupling control and data
It is tempting to decouple the control plane traffic requiring low latency (i.e. the progress protocol) from the data plane traffic requiring high throughput. We initially tried this by adding a high priority queue for outgoing protocol messages, but there was a catch: at the time, it was not safe for the progress protocol messages to overtake the data to which they pertain, so the planes could not operate completely independently without compromising the integrity of the computation. With the current version of the progress protocol this safety concern would no longer apply, but eagerly sending progress messages removes the opportunity for an optimization that reduces the volume of protocol traffic, so the approach was abandoned. If you are interested in more detail on the safety properties of the progress protocol, the paper published at the 2013 Conference on Formal Techniques for Distributed Systems presents a formal specification. Our SOSP 2013 paper describes some of the progress protocol optimizations and their impact on overall performance.

Although TCP offers the in-order, reliable transmission that we need, in many respects its throughput-oriented mechanisms for controlling congestion and achieving high utilization are inappropriate for Naiad. In particular, short progress protocol messages require timely and reliable delivery, but will never contribute significantly to congestion nor stress network capacity. Therefore, we added a command-line option to optimistically send progress protocol messages over multicast UDP and then to transmit them again over unicast TCP. The first transmission is unreliable, but may arrive sooner since sending a single multicast UDP datagram is very cheap compared to sending a message on each of 64 TCP connections.
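To make the idea concrete, here is an illustrative sketch (not Naiad’s implementation) of the optimistic multicast send; the group address, port, and payload are assumptions:

# Join an illustrative multicast group and send one small progress message
$group = [System.Net.IPAddress]::Parse("239.1.1.1")
$udp = New-Object System.Net.Sockets.UdpClient
$udp.JoinMulticastGroup($group)
$payload = [System.Text.Encoding]::UTF8.GetBytes("progress-update")
$endpoint = New-Object System.Net.IPEndPoint($group, 2667)
[void]$udp.Send($payload, $payload.Length, $endpoint)
# The reliable copy is still sent afterwards over the unicast TCP connections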

Decoupling control and data in this way is yet another incarnation of the familiar bandwidth-latency tradeoff. We saw tangible improvements in running time for some programs, but not others, and it remains future work to systematically tease out the circumstances under which decoupling is the best option.

Debugging tools
A note on how we detect and debug network issues is warranted. Many of the pathological behaviors are triggered by the specific characteristics of Naiad jobs, and micro-benchmarks do not expose the problem. Therefore we need a visualization of how the entire system is executing, from packets on the wire right through to the causal relationships between the progress protocol and vertex execution in the Naiad program itself.

Fortunately, Windows ships with a high-performance, low-overhead tracing system called Event Tracing for Windows (ETW). It is straightforward to post events from your own code, and almost every product that Microsoft ships is extensively instrumented already, including most OS components and services, and the .NET runtime. Out-of-the-box tools downloadable from MSDN, such as Windows Performance Analyzer (WPA) and PerfView, can interpret an ETW trace in useful ways.
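As one example of collecting such a trace with built-in tooling (the provider named here is an assumption, not the exact set we instrument):

# Start an ETW session that captures the TCP/IP provider while a job runs
logman create trace NaiadNet -p "Microsoft-Windows-TCPIP" -o naiadnet.etl -ets
# ... run the Naiad job ...
logman stop NaiadNet -ets
# Open naiadnet.etl in Windows Performance Analyzer or PerfView afterwards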

In diagnosing and debugging performance we used a variety of publicly available tools, and we also wrote our own tools to provide the detailed visualizations of execution that we need for Naiad. Together these gave us a very powerful suite for performance debugging, which I will write about in a future blog post.

Conclusion
In the end, we were able to achieve pretty satisfactory performance on our cluster. Here is a plot from the SOSP paper showing the latency distributions for a global coordination micro-benchmark for up to 512 worker threads on 64 computers. Note the impact of micro-stragglers revealed in the 95th percentile values as the cluster size increases.

Latency distributions for a global coordination micro-benchmark

Although good, Naiad performance is by no means a solved problem, essentially because we haven’t been able to completely eliminate loss. If we are unlucky, TCP’s exponential back-off can result in 5s stalls, which is clearly way off the chart in terms of the latencies that we aspire to. If we are really unlucky, we see occasional TCP connection resets (RSTs) caused by the maximum number of retransmits being exceeded and as a result the entire job fails. In ongoing work we are moving to high-performance, reliable networking technology like 40Gb Ethernet, and using the Winsock RIO (Registered Input/Output) API, DCTCP, RoCE (RDMA over Converged Ethernet), Data Center Bridging etc.

I have not described in this post all of the things we tried and discarded. Some of the configuration options that sounded promising, for instance, setting TCP’s congestion window restart option, appeared to have little impact, but we do not have a good explanation as to why. It is most likely that any improvements were dominated by some other effect, but it is impractical to systematically explore the space of options manually, especially when the main focus is to develop a performant system.

Fortunately, this leads to some fun opportunities for future research. To start with we could make the configuration task easier with better diagnostics for the impact of, and interactions between, different options. A more advanced research question that arises from all this gore is whether it is possible to automate the end-to-end configuration for a particular network, TCP stack, and traffic pattern. There was an interesting paper at Sigcomm this year describing a program that automatically generates the TCP congestion control algorithm, given a specification of the network and an objective. Could we go further and automatically tune the network to the traffic in order to get the best bandwidth-latency tradeoff?

Making Naiad perform well involved much more than just the network and in future posts we will cover other parts of the system. Coming next: how we optimized synchronization between I/O and worker threads on individual machines.

Source: http://bigdataatsvc.wordpress.com/



You Can’t Forget What You Can’t Remember

In order to forget something, first you need to remember it. That simple premise will cause organizations a great deal of pain as consumer privacy legislation takes effect.

The concern about consumer data privacy is at an all-time high. 70% of Europeans are concerned about the reuse of their personal data.[1] 86% of Americans are concerned with data collection from internet browsing and how it is used to generate personalized banner advertisements.[2] Their primary concern is how that data may be used for other purposes, or packaged and resold to other entities. With data breaches and issues such as the NSA’s collection of private data making headlines each week, it’s no wonder that consumer sensitivity is heightened.

This will present a very large problem for companies, because lawmakers are starting to take action. The European Union announced changes to the 1995 Data Protection Directive to take effect starting in 2014.[3] It contains one very logical and innocent-looking directive – “the right to be forgotten” – which means that upon request from a consumer, an organization must delete all of their personal data. That sounds simple. It’s actually a wildly complex problem, because of the premise above – you cannot forget what you cannot remember. And most organizations aren’t particularly good at remembering their customers.

Click here to READ THE REST OF THIS BLOG on http://www.ibmbigdatahub.com – or go to http://ibm.co/1bEaJGS

Watch a video discussion on this topic here – http://ibm.co/1gWKqwF

[1] Forrester Research, “EU Regulations And Public Opinion Shift The Scope Of Data Governance,” Henry Peyret, October 17, 2013.

[2] Susan E. Gindin, “Perfect Storm For Behavioral Advertising: How The Confluence Of Four Events In 2009 May Hasten Legislation (And What This Means For Companies Which Use Behavioral Advertising).”

[3] Forrester Research, “EU Regulations And Public Opinion Shift The Scope Of Data Governance,” Henry Peyret, October 17, 2013.

Source: http://corrigandavid.wordpress.com/