<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
<title>Apache Spot</title>
<atom:link href="http://spot.incubator.apache.org/feed/" rel="self" type="application/rss+xml" />
<link>http://spot.incubator.apache.org/</link>
<description></description>
<lastBuildDate>Tue, 27 Sep 2016 18:37:15 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item>
<title>Strength in Numbers: Why Consider Open Source Cybersecurity Analytics</title>
<link>http://spot.incubator.apache.org/blog/strength-in-numbers-why-consider-open-source-cybersecurity-analytics/</link>
<comments>http://spot.incubator.apache.org/blog/strength-in-numbers-why-consider-open-source-cybersecurity-analytics/#respond</comments>
<pubDate>Fri, 21 Oct 2016 15:48:33 +0000</pubDate>
<dc:creator><![CDATA[oni-admin]]></dc:creator>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">http://spot.incubator.apache.org/?p=149</guid>
<description><![CDATA[By Rob Kent, Vice President of Marketing at Cybraics. Competition is widely considered to be a healthy and positive thing, traditionally viewed as providing options for customers and fueling innovation. In the world of cybersecurity there is no shortage of competition; in fact, cybersecurity is one of the most crowded and fast-growing areas of technology.... <a class="excerpt-read-more" href="http://spot.incubator.apache.org/blog/strength-in-numbers-why-consider-open-source-cybersecurity-analytics/" title="Read Strength in Numbers: Why Consider Open Source Cybersecurity Analytics">Read more &#187;</a>]]></description>
<content:encoded><![CDATA[<p>By Rob Kent, Vice President of Marketing at Cybraics</p>
<p>Competition is widely considered to be a healthy and positive thing, traditionally viewed as providing options for customers and fueling innovation. In the world of cybersecurity there is no shortage of competition; in fact, cybersecurity is one of the most crowded and fast-growing areas of technology. The problem is, with so much competition, are we losing sight of the real goal: protecting our customers against the adversary? With so much focus on competing and winning customers, are we negating one of the most fundamental advantages that we could have in the fight against cybercrime? Cooperation. Our adversaries are not shy about working together… their community is strong and growing, and while there is no doubt some healthy competition, the sharing of tools and techniques is certainly far more common than in the commercial world. Like open source software, security can only benefit and grow through peer-reviewed submissions. As Eric S. Raymond puts it in his book The Cathedral and the Bazaar, in the formulation he calls Linus’s Law, &#8220;Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix will be obvious to someone.&#8221; </p>
<p>This, of course, is not new information. People have been pointing this out for years. What is encouraging is that a few organizations are finally stepping up to the plate to try to change this paradigm in cybersecurity. When it comes to cybersecurity, many have touted big data analytics as one of the key initiatives to combat adversaries. The problem is that big data has its own set of challenges. One of them is that it is often seen as a costly storage and scaling problem rather than as another tool in an organization’s arsenal. Apache Spot, a project pioneered by Intel and Cloudera, is aiming to fix this problem. </p>
<p>By creating an open source community, the hope is that identifying security threats within large data sets will become a manageable task for all organizations, regardless of their size or scale. Organizations that are overwhelmed by their data, or that aren’t seeing results, will have a community to turn to along with a common reference against which to compare results. By working together on Apache Spot, organizations will be able to share how they have tackled, or still need to tackle, these issues, with a common system to reference. Data is never going to get smaller across organizations’ core systems, so in many ways an organization just now beginning to dig into its data stores is basically starting from scratch. Having an open common model to reference gives both those starting out and those with more experience the opportunity for an open exchange of knowledge. </p>
<p>By supporting Apache Spot, Cybraics is hoping to learn as much as we contribute while we collaborate and take part in this open source initiative. The end result should be a set of guidelines for organizations to deploy a data analytics platform to find threats within their ecosystem, and a community that will share its experiences with the next generation of data and security professionals. With both in place, perhaps we can all actually start working together to level the playing field. </p>
]]></content:encoded>
<wfw:commentRss>http://spot.incubator.apache.org/blog/strength-in-numbers-why-consider-open-source-cybersecurity-analytics/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Jupyter Notebooks for Data Analysis</title>
<link>http://spot.incubator.apache.org/blog/jupyter-notebooks-for-data-analysis/</link>
<comments>http://spot.incubator.apache.org/blog/jupyter-notebooks-for-data-analysis/#respond</comments>
<pubDate>Thu, 22 Sep 2016 21:28:12 +0000</pubDate>
<dc:creator><![CDATA[oni-admin]]></dc:creator>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">http://spot.incubator.apache.org/?p=136</guid>
<description><![CDATA[Why Does Apache Spot Include IPython Notebooks? The project team wants Apache Spot to be a versatile tool that can be used by anyone. This means that data scientists and developers need to be able to query and handle the source data to find all the information they need for their decision making. The... <a class="excerpt-read-more" href="http://spot.incubator.apache.org/blog/jupyter-notebooks-for-data-analysis/" title="Read Jupyter Notebooks for Data Analysis">Read more &#187;</a>]]></description>
<content:encoded><![CDATA[<p><strong>Why Does Apache Spot Include IPython Notebooks? </strong></p>
<p>The project team wants Apache Spot to be a versatile tool that can be used by anyone. This means that data scientists and developers need to be able to query and handle the source data to find all the information they need for their decision making. The IPython Notebook is an appropriate platform for easy data exploration. One of its biggest advantages is that it supports code execution and debugging in an interactive environment, hence the “I” in IPython, with parallel and distributed computing available as well.</p>
<p>The IPython notebook is a web-based interactive computational environment that provides access to the Python shell. While IPython notebooks were originally designed to work with the Python language, they support a number of other programming languages, including Ruby, Scala, Julia, R, Go, C, C++, Java and Perl. There are also multiple additional packages that can be used to get the most out of this highly customizable tool.</p>
<p>Starting with version 4.0, most notebook functionality is part of Project Jupyter, while IPython remains the kernel for running Python code in the notebooks.</p>
<img src="http://spot.incubator.apache.org/library/images/iPython-1.png" alt="ipython" class="aligncenter size-full wp-image-140" />
<p><strong>IPython with Apache Spot for Network Threat Detection</strong></p>
<p><em>NOTE:  This is not intended to be a step-by-step tutorial on how to code a threat analysis in Apache Spot, but rather an introduction to how to approach a suspected security breach.</em></p>
<p>Although machine learning (ML) will do most of the work detecting anomalies in the traffic, Apache Spot also includes two notebook templates that can get you started on this. The <em>Threat_Investigation_master.ipynb</em> is designed to query the raw data table to find all connections in a day that are related to any threat you select, even connections that were not necessarily flagged as suspicious by ML on a first run. This gives us the chance to get a new data subset, and here is where the fun begins.</p>
<p>If you suspect a specific type of attack in your network, you can get the whole story by answering the Five ‘W’s.</p>
<p><strong><em>What? </em></strong></p>
<p>Maybe there’s been an increase in the logs collected by the system, which indicates abnormal amounts of communication in your network. Or the number of POST requests in your network has risen overnight. This is the mystery that needs to be solved by researching the anomalies previously detected by ML.</p>
<p><strong><em>Who?</em></strong></p>
<p>Assuming you have a network context, you can identify the name of the infected machine inside the network, as well as the IP address or DNS name on the other side of the connection (if it is a known host). If you don’t have a network context or are using DHCP, this can be a little tricky to detect using only NetFlow logs. But that’s where DNS and proxy logs come in handy. Including a network context file with Apache Spot is really simple and can go a long way when identifying a threat.</p>
<p><strong><em>When?</em></strong></p>
<p>To gain broader visibility into the attack, you can customize the queries in the Threat Investigation notebook to review the data over a wider time window instead of just checking the current day. With this, you could find an increase in a certain type of request to one (or many) URIs and predict its future behavior.</p>
<p><strong><em>Where?</em></strong></p>
<p>When working only with DNS, having a destination URL might not say much about where your information is going, but Apache Spot allows you to connect to a geolocation database to identify the location of the suspected attacker’s IP address. Taking advantage of this option, you can visually locate the other end of the connection on a map. You might find that it’s pointing to a country banned by your company, indicating a leak.</p>
<p><strong><em>Why?</em></strong></p>
<p>The answer to “why” will depend heavily on the result of the analysis. For instance, an excessive number of POST requests from one machine inside the network to an unidentified URI can indicate a data mining attack. Tracing back to patient zero, you might find that this originated with a phishing email, malicious software installed by an employee or a one-time visitor’s infected machine that connected to your network.</p>
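<p>As a rough illustration of the POST-request example, the sketch below counts POST requests per client and URI and keeps the pairs that reach a threshold. The record layout, the field names (<em>client_ip</em>, <em>method</em>, <em>uri</em>), the helper functions and the threshold are all assumptions made up for this example; they are not the Apache Spot schema or part of its notebooks.</p>
<pre>
```python
from collections import Counter

# Hypothetical proxy log records; the field names are illustrative,
# not the actual Apache Spot schema.
records = [
    {"client_ip": "10.0.0.5", "method": "POST", "uri": "http://example.test/upload"},
    {"client_ip": "10.0.0.5", "method": "POST", "uri": "http://example.test/upload"},
    {"client_ip": "10.0.0.5", "method": "POST", "uri": "http://example.test/upload"},
    {"client_ip": "10.0.0.7", "method": "GET", "uri": "http://example.test/index"},
    {"client_ip": "10.0.0.7", "method": "POST", "uri": "http://example.test/form"},
]


def post_counts(rows):
    """Count POST requests per (client IP, URI) pair."""
    return Counter(
        (row["client_ip"], row["uri"]) for row in rows if row["method"] == "POST"
    )


def flag_heavy_posters(rows, threshold):
    """Return the pairs whose POST count reaches the threshold, busiest first."""
    counts = post_counts(rows)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]


suspects = flag_heavy_posters(records, threshold=3)
print(suspects)
```
</pre>
<p>A fixed threshold is only a starting point. In practice you would compare each machine against its own history before treating it as suspicious.</p>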
<p><strong>How to Get Answers to the Five Ws Questions</strong></p>
<p>All of the previous questions can be answered by looking at the raw data collected. Although running elaborate queries directly against your database can seem tempting, this type of analysis with Hive, or even Impala, can be very time consuming. A better approach would be to use Pandas to read and transform your dataset into a structured, relational dataframe. This lets you work with it as if it were an offline structured relational database.</p>
<p>Once you have your desired results and data subsets, you can use Matplotlib to easily graph your findings. (We cover this subject in more depth in another post.) Another advantage of the notebook is that you can download it as HTML or a PDF file to store locally and use it in a presentation, or just keep it for future reference.</p>
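<p>A minimal sketch of the Pandas approach might look like the following. It assumes the raw records have already been exported to CSV. The column names (<em>tstart</em>, <em>sip</em>, <em>dip</em>, <em>ibyt</em>) and the sample values are invented for illustration and are not taken from the Apache Spot tables.</p>
<pre>
```python
import io

import pandas as pd

# Hypothetical extract of flow records exported to CSV; the column names
# are illustrative and not the actual Apache Spot table schema.
csv_text = """tstart,sip,dip,ibyt
2016-09-20 10:00:00,10.0.0.5,203.0.113.9,1200
2016-09-21 11:30:00,10.0.0.5,203.0.113.9,5400
2016-09-21 12:15:00,10.0.0.7,198.51.100.4,300
2016-09-22 09:45:00,10.0.0.5,203.0.113.9,9100
"""

# Load the extract once, then work on it locally in memory.
flows = pd.read_csv(io.StringIO(csv_text), parse_dates=["tstart"])

# Keep only the connections that involve the suspicious destination.
suspect = flows[flows["dip"] == "203.0.113.9"]

# Total bytes per day, to see how the traffic evolves over several days.
bytes_per_day = suspect.groupby(suspect["tstart"].dt.date)["ibyt"].sum()
print(bytes_per_day)
```
</pre>
<p>Because the dataframe lives in memory, you can filter and regroup it repeatedly without sending another query to the cluster. In a notebook you would load the result of your query instead of the inline sample.</p>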
<p><strong>Wrap Up</strong></p>
<p>This post was meant to be just a brief introduction to how you can use IPython notebooks in Apache Spot to perform further data analysis and include it in your executive report (in addition to the already included Storyboard). Although this is not the only way you can do this, it is a very interactive and fun way to do it. You’ll also see that the overall processing time is very short, thanks to the IPython notebook’s support for task parallelism.</p>
<p>We want to hear from YOU! Have you used iPython notebooks before? How do you feel about having this tool in Apache Spot? If you’re interested in further data analysis through interactive charts, a new post is coming soon on D3 and jQuery data visualization. Also, check back soon to read more on this and other Cybersecurity subjects.</p>
]]></content:encoded>
<wfw:commentRss>http://spot.incubator.apache.org/blog/jupyter-notebooks-for-data-analysis/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Apache Spot (Incubating) and Cybersecurity — Using NetFlows to Detect Threats to Critical Infrastructure</title>
<link>http://spot.incubator.apache.org/blog/apache-spot-and-cybersecurity-using-netflows-to-detect-threats-to-critical-infrastructure/</link>
<comments>http://spot.incubator.apache.org/blog/apache-spot-and-cybersecurity-using-netflows-to-detect-threats-to-critical-infrastructure/#respond</comments>
<pubDate>Mon, 08 Aug 2016 08:00:32 +0000</pubDate>
<dc:creator><![CDATA[oni-admin]]></dc:creator>
<category><![CDATA[Cybersecurity]]></category>
<guid isPermaLink="false">http://spot.incubator.apache.org/?p=117</guid>
<description><![CDATA[The “first” documented cybersecurity case was the worm replication, which was initiated by Robert T. Morris on November 2, 1988. Wow! Here we are in 2016, 28 years later, with viruses and worms giving way to Trojan horses and polymorphic code. Nowadays, we are also fighting against DDoS, phishing, spear phishing attacks, command and controls... <a class="excerpt-read-more" href="http://spot.incubator.apache.org/blog/apache-spot-and-cybersecurity-using-netflows-to-detect-threats-to-critical-infrastructure/" title="Read Apache Spot and Cybersecurity — Using NetFlows to Detect Threats to Critical Infrastructure">Read more &#187;</a>]]></description>
<content:encoded><![CDATA[<p>The “first” documented cybersecurity case was the worm replication, which was initiated by Robert T. Morris on November 2, 1988. Wow! Here we are in 2016, 28 years later, with viruses and worms giving way to Trojan horses and polymorphic code. Nowadays, we are also fighting against DDoS, phishing, spear phishing attacks and command-and-control channels, along with APTs such as Aurora, Zeus, Red October and Stuxnet. What happened to our security controls in each attack?</p>
<p>Despite heroic efforts, internal and external security controls, no matter if they are preventive, detective or corrective, can be bypassed by different situations or misconfigurations. Capabilities to detect bugs or vulnerabilities in code, protocol, etc. are still limited. When we consider how to close these gaps, we have two options:</p>
<ol>
<li>Collect information from each device that is part of the environment.</li>
<li>Collect information from the critical infrastructure that is used for most, if not all, of the systems.</li>
</ol>
<p>This blog will describe the second approach.</p>
<p>Critical infrastructure, including nationally significant infrastructure, can be broadly defined as the systems, assets, facilities and networks that provide essential services. For nations, this means protecting the national security, economic security and prosperity as well as the health and safety of their citizenry. If we extrapolate this definition to enterprise IT environments, we can define critical infrastructure as the service, or services, that need to be up and running properly 99.99999% of the time, usually because they support other critical systems such as databases, HR systems, manufacturing systems, etc.</p>
<p>Effectively protecting critical infrastructure means that millions, if not billions, of different scenarios must be identified and monitored. To do this, the cybersecurity problem must be broken into small pieces.</p>
<p>Let’s consider DNS — your communications to the Web. How can you know which communications are being established by your critical servers? And, what if you want to do it for most of your infrastructure?</p>
<p>First idea: Use NetFlow, a network protocol for collecting IP traffic information and monitoring network traffic. NetFlow records the details of all the communications on your network. However, a typical enterprise environment generates billions of NetFlow events per day. To use this data to identify issues, it must be stored and analyzed. Storage alone is costly. Analyzing what amounts to a Big Data store is another challenge entirely.</p>
<p>Apache Spot offers a solution. It was designed to gather, store and analyze Big Data. In fact, Apache Spot is an ideal solution for this cybersecurity challenge. Apache Spot can integrate many different data sources in a data lake then add operational context to the data by linking configuration, inventory, service databases and other data stores. This helps you to prioritize the actions to take under different attack, malware, APT and hacking scenarios. With Apache Spot, attacks that bypass our external or internal security controls can be identified. By delivering risk-prioritized, actionable insights, Apache Spot can support the growing need for security analytics.</p>
<p>Not only can Apache Spot collect, store and analyze billions of NetFlow packets, but it can also be adapted to meet the unique requirements of your organization. How? Apache Spot is an open-source project.</p>
<p><strong>But Wait, There’s More</strong></p>
<p>Check out “<a href="http://spot.incubator.apache.org/blog/how-apache-spot-helps-create-well-stocked-data-lakes-and-catch-powerful-insights/"><u>How Apache Spot Helps Create Well-Stocked Data Lakes and Catch Powerful Insights</u></a>” to learn more about the underlying Apache Spot architecture.</p>
<p>This is the first of a series of blogs that we will be writing about cybersecurity, so check back to read more.</p>
]]></content:encoded>
<wfw:commentRss>http://spot.incubator.apache.org/blog/apache-spot-and-cybersecurity-using-netflows-to-detect-threats-to-critical-infrastructure/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>How Apache Spot Helps Create Well-Stocked Data Lakes and Catch Powerful Insights</title>
<link>http://spot.incubator.apache.org/blog/how-apache-spot-helps-create-well-stocked-data-lakes-and-catch-powerful-insights/</link>
<comments>http://spot.incubator.apache.org/blog/how-apache-spot-helps-create-well-stocked-data-lakes-and-catch-powerful-insights/#respond</comments>
<pubDate>Mon, 08 Aug 2016 07:59:41 +0000</pubDate>
<dc:creator><![CDATA[oni-admin]]></dc:creator>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">http://spot.incubator.apache.org/?p=113</guid>
<description><![CDATA[About four years ago, the era of Big Data analytics began. Paired with advanced analytics, massive volumes of data can be culled to not only inform critical decisions, but also to simulate sophisticated “what if” scenarios that allow companies to gain competitive advantages by generating and predicting different scenarios. For example, a financial services... <a class="excerpt-read-more" href="http://spot.incubator.apache.org/blog/how-apache-spot-helps-create-well-stocked-data-lakes-and-catch-powerful-insights/" title="Read How Apache Spot Helps Create Well-Stocked Data Lakes and Catch Powerful Insights">Read more &#187;</a>]]></description>
<content:encoded><![CDATA[<p>About four years ago, the era of Big Data analytics began. Paired with advanced analytics, massive volumes of data can be mined not only to inform critical decisions, but also to simulate sophisticated “what if” scenarios that allow companies to gain competitive advantages by generating and predicting different scenarios. For example, a financial services company can more accurately determine what other products to offer a customer, and in what order, based on a wide variety of data, then use advanced analytics to gather insights. Creating a data lake that can be effectively used for predictive analytics raises tough questions: What data sources should we use? How should this data be collected and ingested? What are the best algorithms to analyze the data, and how should we present these results to our decision makers?</p>
<p>Apache Spot can help to solve most of these issues. What follows is a description of Apache Spot, which is designed to facilitate Big Data analytics scenarios like the financial services company’s question about the right product to offer customers.</p>
<a href="http://spot.incubator.apache.org/library/images/ONI_Architecture-Diagram_1300_v4.png"><img src="http://spot.incubator.apache.org/library/images/ONI_Architecture-Diagram_1300_v4.png" alt="oni_architecture-diagram_1300_v4" class="alignnone size-full wp-image-114" /></a>
<h3><strong>Apache Spot Core Components</strong></h3>
<p>The Apache Spot Core is composed of three main components: data integration (collectors), data store (HDFS here, but it can also be a NoSQL database) and machine learning.</p>
<p>In this diagram, the top left shows Apache Spot Data Sources, which include the collection of the information that will be used to create a data lake. The process is simple. Define a pull or push from the source of information then capture this information on Apache Spot’s “collectors.” The collectors are processes that interpret the information that is sent, then write it to the HDFS system in the Apache Spot cluster. The HDFS stores the data lake and ensures that resources can grow while remaining economical at every size. The Apache Spot algorithms are part of machine learning and are used to detect the uncommon information in the data lake.</p>
<h3><strong>Operational Analytics</strong></h3>
<p>As part of operational analytics, Apache Spot executes different batch processes that add information to machine learning results to provide meaning and context. Using the financial services product example, basic customer data could be augmented with information about other customers in the same region along with information about which products those customers recommended or complained about. Basically, the data scientists can “play” with the data using different algorithms to identify insights.</p>
<h3><strong>Visualizing Results</strong></h3>
<p>The Apache Spot GUI displays the results that the machine learning algorithms generate. Results are presented so that it is easy both to identify the most common things and to find the most suspicious or uncommon information in the data lake.</p>
<h3><strong>Customizable Open Source</strong></h3>
<p>Because Apache Spot is an open-source project, most of the components depicted here can be modified by the end user.</p>
]]></content:encoded>
<wfw:commentRss>http://spot.incubator.apache.org/blog/how-apache-spot-helps-create-well-stocked-data-lakes-and-catch-powerful-insights/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
<item>
<title>Apache Spot (Incubating): Three Most-Asked Questions</title>
<link>http://spot.incubator.apache.org/blog/apache-spot-3-most-asked-questions/</link>
<comments>http://spot.incubator.apache.org/blog/apache-spot-3-most-asked-questions/#respond</comments>
<pubDate>Tue, 29 Mar 2016 05:48:19 +0000</pubDate>
<dc:creator><![CDATA[oni-admin]]></dc:creator>
<category><![CDATA[Security Analytics]]></category>
<category><![CDATA[github]]></category>
<category><![CDATA[open network insight]]></category>
<category><![CDATA[open source]]></category>
<guid isPermaLink="false">http://spot.incubator.apache.org/?p=62</guid>
<description><![CDATA[While this is not the first blog post about Apache Spot, it is the first one by a creator of the solution. As a security data scientist in Intel&#8217;s Data Center Group, I joined a small team to start thinking about solving really hard problems in cloud analytics. The team grew, and out of that... <a class="excerpt-read-more" href="http://spot.incubator.apache.org/blog/apache-spot-3-most-asked-questions/" title="Read Apache Spot: Three Most-Asked Questions">Read more &#187;</a>]]></description>
<content:encoded><![CDATA[<p>While this is not the first blog post about Apache Spot, it is the first one by a creator of the solution. As a security data scientist in Intel&#8217;s Data Center Group, I joined a small team to start thinking about solving really hard problems in cloud analytics. The team grew, and out of that effort came Apache Spot. Since we started talking about the project, these are the three questions I am asked the most.</p>
<p><strong>What Is Apache Spot?</strong><br />
Apache Spot is an open source, flow and packet analytics solution built on Hadoop. It combines big data processing, at-scale machine learning, and unique security analytics to put potential threats in front of defenders. While I am a data scientist today, I was a security investigator just a few years ago. I wanted to develop a solution that would put new tools and technology in play for defenders, but without requiring them to walk away from security and get a math degree.</p>
<p>We wanted to start with the hard problems, so we looked at the emerging need to analyze data that was produced at a scale beyond what a lot of security solutions could handle. The data is being created today, and lack of visibility into that data gives attackers a profound advantage. Also, in this new era of security, many defenders (public and private sector) have to answer to their citizens and customers when these threats occur. In other words, an event that says &#8220;this attack was blocked&#8221; is insufficient; an organization needs to see what happened before, during, and after a particular machine was attacked at a particular time. The problem is summarized in a slide from a <a href="http://www.youtube.com/watch?v=mOZjMuBLYyM" target="_blank">FloCon talk</a>:<br />
<a href="http://spot.incubator.apache.org/library/images/FloCon2015.png" rel="attachment wp-att-66"><img class="aligncenter size-full wp-image-66" src="http://spot.incubator.apache.org/library/images/FloCon2015.png" alt="open source packet and flow analytics" /></a></p>
<p>The gist is that while processing is a challenge at higher scales, the amount of insight gained is higher when analyzing flows and packets from key protocols (like DNS). And that&#8217;s how we got here.</p>
<p><strong>Why Intel?</strong></p>
<p>At Intel, I have worked in IT, for a security product company (McAfee), and in the Data Center Group. Intel IT was an early pioneer of the concept of proactive investigations to protect intellectual property. McAfee (now Intel Security Group) has a broad customer base in the realms of network, endpoint, and content security, to name only a few. And the Intel Data Center group has strategic partnerships with Cloudera and Accenture, as well as some pretty cool analytics efforts of their own. Add the performance benefits we achieve with Intel Architecture, especially the Intel MPI Library and Intel Math Kernel Library, and it certainly makes sense to me.</p>
<p><strong>Why Open Source?</strong></p>
<p>I learned from my earlier efforts in security analytics that open source software is an excellent choice for inviting collaboration from academia, the public sector, and the private sector. We are now seeking to build a community of developers, data scientists, and security enthusiasts to grow Apache Spot into something we can all be proud of. We have also chosen an Apache software license, so that it can enrich commercial software offerings as well.</p>
<p>The greatest thing for me since we announced at RSA is to hear OTHER people talk about Apache Spot (formerly Open Network Insight, or ONI). Here are some of my favorites, from <a href="http://vision.cloudera.com/apache-spot-changing-infosec-data-science-forever/" target="_blank">a Data Scientist @ eBay</a>, <a href="https://newsroom.accenture.com/news/accenture-introduces-the-accenture-cyber-intelligence-platform-to-help-organizations-continuously-predict-detect-and-combat-cyber-attacks.htm" target="_blank">a Security Provider</a>, and <a href="http://vision.cloudera.com/adaptive-security-at-big-data-scale-for-next-generation-digital-security/" target="_blank">a Big Data company</a>.</p>
<p>Fork us on GitHub!</p>
<p>Grant Babb</p>
]]></content:encoded>
<wfw:commentRss>http://spot.incubator.apache.org/blog/apache-spot-3-most-asked-questions/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
</item>
</channel>
</rss>