Saturday, May 23, 2015

Introducing FIDO: Automated Security Incident Response

http://techblog.netflix.com/2015/05/introducing-fido-automated-security.html


We're excited to announce the open source release of FIDO (Fully Integrated Defense Operation - apologies to the FIDO Alliance for acronym collision), our system for automatically analyzing security events and responding to security incidents.

Overview

The typical process for investigating security-related alerts is labor intensive and largely manual. To make the situation more difficult, as attacks increase in number and diversity, organizations deploy an increasing array of detection systems, which in turn generate even more alerts for security teams to investigate.

Netflix, like all organizations, has a finite amount of resources to combat this phenomenon, so we built FIDO to help. FIDO is an orchestration layer that automates the incident response process by evaluating, assessing and responding to malware and other detected threats.

The idea for FIDO came from a simple proof of concept a number of years ago. Our process for handling alerts from one of our network-based malware systems was to have a help desk ticket created and assigned to a desktop engineer for follow-up - typically a scan of the impacted system or perhaps a re-image of the hard drive. The time from alert generation to resolution of these tickets spanned from days to over a week. Our help desk system had an API, so we had a hypothesis that we could cut down resolution time by automating the alert-to-ticket process. The simple system we built to ingest the alerts and open the tickets cut the resolution time to a few hours, and we knew we were onto something - thus FIDO was born.
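As a rough sketch of that proof of concept - the endpoint and field names below are invented, and our actual help desk API differed - the alert-to-ticket step amounted to little more than:

import json

HELPDESK_URL = "https://helpdesk.example.com/api/tickets"  # fictional endpoint

def alert_to_ticket(alert: dict) -> dict:
    """Turn a malware alert into a help desk ticket payload."""
    return {
        "title": f"Malware alert on {alert['host']}",
        "body": (f"Detector {alert['detector']} reported {alert['threat']}. "
                 "Please scan or re-image the affected system."),
        "queue": "desktop-engineering",
    }

ticket = alert_to_ticket({"host": "laptop-042", "detector": "nids",
                          "threat": "Zeus variant"})
print(json.dumps(ticket))  # in the PoC, this payload would be POSTed to the API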

Architecture and Operation

This section describes FIDO's operation, and the following diagram provides an overview of FIDO's architecture.

[Figure: FIDO architecture overview]

Detection

FIDO's operation begins with the receipt of an event via one of FIDO's detectors. Detectors are off-the-shelf security products (e.g. firewalls, IDS, anti-malware systems) or custom systems that detect malicious activities or threats. Detectors generate alerts or messages that FIDO ingests for further processing. FIDO provides a number of ways to ingest events, including via API (the preferred method), SQL database, log file, and email. FIDO currently supports a variety of detectors (e.g. Cyphort, ProtectWise, CarbonBlack/Bit9), with more planned or under development.
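As an illustration of the ingestion idea, here is a minimal Python sketch - the event shape, field names, and payload keys below are invented for illustration, not FIDO's actual code. The point is that each detector, whatever its transport (API, SQL, log file, email), gets normalized into one common event before analysis:

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class SecurityEvent:
    detector: str                # e.g. "cyphort", "protectwise", "bit9"
    host: str                    # targeted machine
    user: Optional[str]          # targeted user, if known
    threat: str                  # detector's name for the threat
    observed_at: datetime
    raw: dict                    # original payload, kept for auditing

def normalize(detector: str, payload: dict) -> SecurityEvent:
    """Map one detector's payload onto the common event shape."""
    return SecurityEvent(
        detector=detector,
        host=payload.get("hostname", "unknown"),
        user=payload.get("username"),
        threat=payload.get("threat_name", "unknown"),
        observed_at=datetime.now(timezone.utc),
        raw=payload,
    )

event = normalize("cyphort", {"hostname": "laptop-042",
                              "threat_name": "Zeus variant"})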

Analysis and Enrichment

The next phase of FIDO operation involves deeper analysis of the event and enrichment of the event data with both internal and external data sources. Raw security events often have little associated context, and this phase of operation is designed to supplement the raw event data with supporting information to enable more accurate and informed decision making.

The first component of this phase is analysis of the event’s target - typically a computer and/or user (but potentially any targeted resource). Is the machine a Windows host or a Linux server? Is it in the PCI zone? Does the system have security software installed and the latest patches? Is the targeted user a Domain Administrator? An executive? Having answers to these questions allows us to better evaluate the threat and determine what actions need to be taken (and with what urgency). To gather this data, FIDO queries various internal data sources - currently supported are Active Directory, LANDesk, and JAMF, with other sources under consideration.

In addition to querying internal sources, FIDO consults external threat feeds for information relevant to the event under analysis. The use of threat feeds helps FIDO determine whether a generated event may be a false positive and how serious and pervasive the issue may be. Another way to think of this step is 'never trust, always verify.' A generated alert is simply raw data - it must be enriched, evaluated, and corroborated before any action is taken. FIDO supports several threat feeds, including ThreatGrid and VirusTotal, with additional feeds under consideration.
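To make the enrichment phase concrete, here is a minimal Python sketch. All data sources are stubbed with canned dictionaries; a real implementation would query the live directory, inventory, and threat feed services (Active Directory, LANDesk, JAMF, ThreatGrid, VirusTotal) instead:

# Stub data standing in for live internal sources and external feeds.
INVENTORY = {"laptop-042": {"os": "Windows", "pci_zone": False, "patched": True}}
DIRECTORY = {"alice": {"domain_admin": False, "executive": True}}
FEED_VERDICTS = {"Zeus variant": {"malicious": True, "prevalence": "high"}}

def enrich(host: str, user: str, threat: str) -> dict:
    """Attach machine, user, and threat-feed context to a raw event."""
    return {
        "machine": INVENTORY.get(host, {}),
        "user": DIRECTORY.get(user, {}),
        "feed": FEED_VERDICTS.get(threat, {}),  # corroborate; never trust raw alerts
    }

print(enrich("laptop-042", "alice", "Zeus variant"))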

Correlation and Scoring

Once internal and external data have been gathered about a given event and its target(s), FIDO seeks to correlate the information with other data it has seen and to score the event to facilitate ultimate disposition. The correlation component serves several functions. First, have multiple detectors identified this same issue? If so, it could potentially be a more serious threat. Second, has one of your detectors already blocked or remediated the issue (for example, a network-based malware detector identifies an issue, and a separate host-based system repels the same item)? If the event has already been addressed by one of your controls, FIDO may simply provide a notification that requires no further action. The following image gives a sense of how the various scoring components work together.

[Figure: FIDO scoring components]

Scoring is multi-dimensional and highly customizable in FIDO. Essentially, scoring allows you to tune FIDO's response to the threat and to your own organization's unique requirements. FIDO implements separate scoring for the threat, the machine, and the user, and rolls the separate scores into a total score. Scoring allows you to treat PCI systems differently than lab systems, customer service representatives differently than engineers, and new event sources differently than event sources with which you have more experience (and perhaps trust). Scoring leads into the last phase of FIDO's operation - Notification and Enforcement.
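Before moving on, here is an illustrative Python sketch of how the separate scores might combine. The weights, inputs, and the multiplicative rollup are invented for illustration; FIDO's actual scoring model is configurable and more involved:

def threat_score(feed: dict, detectors_seen: int, already_blocked: bool) -> float:
    """Score the threat itself, folding in correlation results."""
    score = 5.0 if feed.get("malicious") else 1.0
    score += detectors_seen          # multiple detectors agree: more serious
    if already_blocked:              # an existing control already remediated it
        score *= 0.2
    return score

def machine_score(machine: dict) -> float:
    return 5.0 if machine.get("pci_zone") else 1.0   # PCI systems weigh more

def user_score(user: dict) -> float:
    privileged = user.get("domain_admin") or user.get("executive")
    return 5.0 if privileged else 1.0

def total_score(threat: float, machine: float, user: float) -> float:
    return threat * machine * user   # context amplifies (or dampens) the threat

print(total_score(threat_score({"malicious": True}, 2, False), 1.0, 5.0))  # 35.0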

Notification and Enforcement

In this phase, FIDO determines and executes a next action based on the ingested event, collected data, and calculated scores. This action may be as simple as emailing details to the security team or storing the information for later retrieval and analysis. Or, FIDO may implement more complex and proactive measures such as disabling an account, ending a VPN session, or disabling a network port. Importantly, the vast majority of enforcement logic in FIDO has been Netflix-specific. For this reason, we've removed most of this logic and code from the current OSS version of FIDO. We will re-implement this functionality in the OSS version when we are better able to provide the end user reasonable and scalable control over enforcement customization and actions.
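To illustrate the dispatch idea only, here is a small Python sketch; the thresholds and actions are invented, and, as noted above, most real enforcement logic is not included in the OSS release:

def respond(total: float, host: str, user: str) -> str:
    """Pick a next action from the total score (thresholds are invented)."""
    if total < 10:
        return f"store event for {host}; no action required"
    if total < 50:
        return f"email security team with details about {host}"
    return f"disable network port for {host} and end VPN session for {user}"

print(respond(35.0, "laptop-042", "alice"))

In practice this action table is where organizational policy lives, which is exactly why enforcement logic tends to be site-specific.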

Open Items & Future Plans

Netflix has been using FIDO for a bit over four years, and while it is meeting our requirements well, we have a number of features and improvements planned. On the user interface side, we are planning an administrative UI with dashboards and assistance for enforcement configuration. Additional external integrations planned include PAN, OpenDNS, and SentinelOne. We're also working on improvements around correlation and host detection. And, because it's now OSS, you are welcome to suggest and submit your own improvements!
-Rob Fry, Brooks Evans, Jason Chan

Friday, May 22, 2015

Why tools like Docker, Vagrant, and Ansible are hotter than ever

http://opensource.com/business/15/5/why-Docker-Vagrant-and-Ansible

[Image: tools in a toolbox. Photo by Peter (CC BY-SA 2.0), modified by Rikki Endsley.]
The complexity of application stacks keeps going up. Way, way up. Application stacks have always been complicated, but never like this. There are so many services, so many tools, so much more compute power available, so many new techniques to try, and always the desire, and the pressure, to solve problems in newer and cooler and more elegant ways. With so many toys to play with, and more coming every day, the toy chest struggles to contain them all.
If you're not familiar with stackshare.io, have a look at it. It's a great resource to see which pieces companies are using to build their applications. In addition to being useful, it also can be pretty entertaining.
Spend a few minutes browsing through some of the stacks out there and you'll see that some of the technology collections people have assembled are fascinating. Here's an example I particularly like: (deep breath) EC2 S3 Qubole MongoDB Memcached Redis Django Hadoop nginx Cassandra MySQL Google Analytics SendGrid Route53 Testdroid Varnish Zookeeper.
So that's web server, web application server, caching proxy server, discovery service, a few services-as-a-service, and six "databases" of various flavors and functions. (All of it either open source or a proprietary service, of course. There tends to be very little in between anymore.)
It's highly unlikely that anyone ever stood in front of a whiteboard and wrote WE NEED SIX DATABASES!!! with a purple dry erase pen, but that's how things happen when your infrastructure expands rapidly to meet business demand. A developer decides that a new tool is best, rightly or wrongly, and that tool makes its way into production. At that moment, the cool new tool instantly becomes a legacy application, and you have to deal with it until you refactor it (ha!) or until you quit to go do something else and leave the next poor sucker to deal with it.

How to cope

So how can developers possibly cope with all of this complexity? Better than one might expect, as it turns out.
That awesome nextgen location-aware online combo gambling/dating/sharing economy platform is going to require a lot of different services and components. But every grand plan has a simple beginning, and every component of any ultrascalable mega-solution starts its life as a few chunks of code somewhere. For most teams, that somewhere is a few humble developer laptops, and a git repository to bind them.
We talk about the cloud revolution, but we tend to talk less about the laptop revolution. The developer laptop of today, combined with advances in virtualization and containerization, now allows complex multi-system environments to be fully modeled on a single machine. Multiple "machines" can now be a safe default, because these multiple, separate "machines" can all be trivially instantiated on a laptop.
The upshot: The development environment for a complex, multisystem application stack can now be reliably and repeatably installed on a single laptop, and changes to any part of the environment, or to all of it, can be easily shared among the whole team, so that everyone can rebuild identical environments quickly. For example, ceph-ansible is a tool to deploy and test a multi-node Ceph cluster on a laptop, using multiple VMs, built by Vagrant and orchestrated by Ansible, all with a single command: vagrant up. Ceph developers are using this tool right now.
This kind of complex multi-node deployment is already becoming commonplace, and it means that modeling the relationships between machines is now just as important as managing what's on those individual machines.
Docker and Vagrant are successful because they are two simple ways of saying, "This is what's on this machine, and here's how to start it." Ansible is successful with both because it's a simple way of saying, "This is how these machines interact, and here's how to start them." Together, they allow developers to build complex multi-machine environments, in a way that allows them to be described and rebuilt easily.
It's often said that DevOps, at its heart, is a conversation. This may be true, but it's a conversation that's most successful when everyone speaks the same language. Vagrant, Docker, and Ansible are seeing success because they allow people to speak the same languages of modeling and deployment.

Varnish Goes Upstack with Varnish Modules and Varnish Configuration Language

http://highscalability.com/blog/2015/5/6/varnish-goes-upstack-with-varnish-modules-and-varnish-config.html

This is a guest post by Denis Brækhus and Espen Braastad, developers on the Varnish API Engine from Varnish Software. Varnish has long been used in discriminating backends, so it's interesting to see what they are up to.
Varnish Software has just released Varnish API Engine, a high performance HTTP API Gateway which handles authentication, authorization and throttling all built on top of Varnish Cache. The Varnish API Engine can easily extend your current set of APIs with a uniform access control layer that has built in caching abilities for high volume read operations, and it provides real-time metrics.
Varnish API Engine is built using well known components like memcached, SQLite and most importantly Varnish Cache. The management API is written in Python. A core part of the product is written as an application on top of Varnish using VCL (Varnish Configuration Language) and VMODs (Varnish Modules) for extended functionality.
We would like to use this as an opportunity to show how you can create your own flexible yet high-performance applications in VCL with the help of VMODs.

VMODs (Varnish Modules)

VCL is the language used to configure Varnish Cache. When varnishd loads a VCL configuration file, it converts it into C code, compiles it, and then loads it dynamically. It is therefore possible to extend the functionality of VCL by inlining C code directly in the VCL configuration file, but since Varnish Cache 3 the preferred approach has been to use Varnish Modules, or VMODs for short, instead.
The typical request flow in a stack containing Varnish Cache is:
[Figure: typical request flow with Varnish Cache]
The client sends HTTP requests which are received and processed by Varnish Cache. Varnish Cache decides whether or not to look up each request in cache, and it may ultimately fetch the content from the backend. This works very well, but we can do so much more.
The VCL language is designed for performance, and as such does not provide loops or external calls natively. VMODs, on the other hand, are free of these restrictions. This is great for flexibility, but places the responsibility for ensuring performance and avoiding delays on the VMOD code and behaviour.
The API Engine design illustrates how the powerful combination of VCL and custom VMODs can be used to build new applications. In Varnish API Engine, the request flow is:
[Figure: API Engine request flow with the SQLite and memcached VMODs]
Each request is matched against a ruleset using the SQLite VMOD and a set of Memcached counters using the memcached VMOD. The request is denied if one of the checks fails, for example if authentication failed or if one of the request limits has been exceeded.

Example application

The following example is a very simple version of some of the concepts used in the Varnish API Engine. We will create a small application written in VCL that looks up the requested URL in a database containing throttling rules and enforces them on a per-IP basis.
Since testing and maintainability are crucial when developing an application, we will use Varnish's integrated testing tool: varnishtest. Varnishtest is a powerful tool used to test all aspects of Varnish Cache, and its simple interface means that developers and operations engineers can leverage it to test their VCL/VMOD configurations.
Varnishtest reads a file describing a set of mock servers, clients, and varnish instances. The clients perform requests that go via varnish to the server. Expectations can be set on content, headers, HTTP response codes, and more. With varnishtest we can quickly test our example application and verify that our requests are passed or blocked as per the defined expectations.
First we need a database with our throttle rules. Using the sqlite3 command, we create the database in /tmp/rules.db3 and add a couple of rules.
$ sqlite3 /tmp/rules.db3 "CREATE TABLE t (rule text, path text);"
$ sqlite3 /tmp/rules.db3 "INSERT INTO t (rule, path) VALUES ('3r5', '/search');"
$ sqlite3 /tmp/rules.db3 "INSERT INTO t (rule, path) VALUES ('15r3600', '/login');"
These rules will allow 3 requests per 5 seconds to /search and 15 requests per hour to /login. The idea is to enforce these rules on a per-IP basis.
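Before expressing this in VCL, the following Python sketch shows the intended fixed-window behavior in isolation. The in-memory counter store is a stand-in for Memcached's increment-with-expiry pattern, not the VMOD's actual implementation:

import time

_counters = {}  # key -> (count, window_expiry); stands in for Memcached

def incr_set(key: str, period: int) -> int:
    """Increment the counter for key, (re)creating it with TTL `period`."""
    now = time.monotonic()
    count, expires = _counters.get(key, (0, now + period))
    if now >= expires:                 # window elapsed: start a new one
        count, expires = 0, now + period
    count += 1
    _counters[key] = (count, expires)
    return count

def allowed(path: str, ip: str, requests: int, period: int) -> bool:
    """Rule 'RnT' (e.g. '3r5'): allow R requests per T seconds per IP."""
    return incr_set(path + "-" + ip, period) <= requests

for _ in range(4):                     # under rule 3r5 the 4th call is denied
    print(allowed("/search", "10.0.0.1", requests=3, period=5))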
For the sake of simplicity, we’ll write the tests and VCL configuration in the same file, throttle.vtc. It is, however, possible to include separate VCL configuration files using include statements in the test files, to separate VCL configuration and the different tests.
The first line in the file optionally sets the name or title of the test.
varnishtest "Simple throttling with SQLite and Memcached"
Our test environment consists of one backend, called s1. We will first expect one request to a URL without a rule in the database.
server s1 {
  rxreq
  expect req.url == "/"
  txresp
We then expect 4 requests to /search to arrive, matching the following expectations. Note that the query parameters are slightly different, making all of these unique requests.
  rxreq
  expect req.url == "/search?id=123&type=1"
  expect req.http.path == "/search"
  expect req.http.rule == "3r5"
  expect req.http.requests == "3"
  expect req.http.period == "5"
  expect req.http.counter == "1"
  txresp
  rxreq
  expect req.url == "/search?id=123&type=2"
  expect req.http.path == "/search"
  expect req.http.rule == "3r5"
  expect req.http.requests == "3"
  expect req.http.period == "5"
  expect req.http.counter == "2"
  txresp
  rxreq
  expect req.url == "/search?id=123&type=3"
  expect req.http.path == "/search"
  expect req.http.rule == "3r5"
  expect req.http.requests == "3"
  expect req.http.period == "5"
  expect req.http.counter == "3"
  txresp
  rxreq
  expect req.url == "/search?id=123&type=4"
  expect req.http.path == "/search"
  expect req.http.rule == "3r5"
  expect req.http.requests == "3"
  expect req.http.period == "5"
  expect req.http.counter == "1"
  txresp
} -start
Now it is time to write the mini-application in VCL. Our test environment consists of one varnish instance, called v1. Initially, the VCL version marker and the VMOD imports are added.
varnish v1 -vcl+backend {
  vcl 4.0;
  import std;
  import sqlite3;
  import memcached;
VMODs are usually configured in vcl_init, and this is true for sqlite3 and memcached as well. For sqlite3, we set the path to the database and the field delimiter to use on multi-column results. The memcached VMOD accepts a wide variety of configuration options supported by libmemcached.
  sub vcl_init {
      sqlite3.open("/tmp/rules.db3", "|;");
      memcached.servers("--SERVER=localhost --BINARY-PROTOCOL");
  }
In vcl_recv, the incoming HTTP requests are received. We start by extracting the request path, without query parameters and potentially dangerous characters. This is important since the path will be part of the SQL query later. The following regex will match req.url from the beginning of the line up until any of the characters ? & ; " ' or whitespace.
  sub vcl_recv {
      set req.http.path = regsub(req.url, {"^([^?&;"' ]+).*"}, "\1");
The use of {" "} in the regular expression enables handling of the " character in the regular expression rule. The path we just extracted is used when the rule is looked up in the database. The response, if any, is stored in req.http.rule.
      set req.http.rule = sqlite3.exec("SELECT rule FROM t WHERE path='" + req.http.path + "' LIMIT 1");
If we get a response, it will be in the format RnT, where R is the number of requests allowed over a period of T seconds. Since this is a string, we need to apply more regex to separate the two values.
      set req.http.requests = regsub(req.http.rule, "^([0-9]+)r.*$", "\1");
      set req.http.period = regsub(req.http.rule, "^[0-9]+r([0-9]+)$", "\1");
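As a quick sanity check of that parsing, the same two substitutions can be expressed in Python (an illustration only; Python's re is close enough to Varnish's regex syntax for these patterns):

import re

rule = "3r5"                                     # 3 requests per 5 seconds
requests = re.sub(r"^([0-9]+)r.*$", r"\1", rule)
period = re.sub(r"^[0-9]+r([0-9]+)$", r"\1", rule)
assert (requests, period) == ("3", "5")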
We do throttling on this request only if we got proper values from the previous regex filters.
      if (req.http.requests != "" && req.http.period != "") {
Increment or create a Memcached counter unique to this client.ip and path, with the value 1. The expiry time we specify is equal to the period in the throttle rule from the database. This way, the throttle rules can be flexible with regard to the time period. The return value is the new value of the counter, which corresponds to the number of requests this client.ip has made to this path in the current time period.
          set req.http.counter = memcached.incr_set(
              req.http.path + "-" + client.ip, 1, 1, std.integer(req.http.period, 0));
Check if the counter is higher than the limit set in the database. If it is, then abort the request here with a 429 response code.
          if (std.integer(req.http.counter, 0) > std.integer(req.http.requests, 0)) {
              return (synth(429, "Too many requests"));
          }
      }
  }
In vcl_deliver we set response headers showing the throttle limit and status for each request, which might be helpful for consumers.
  sub vcl_deliver {
      if (req.http.requests && req.http.counter && req.http.period) {
          set resp.http.X-RateLimit-Limit = req.http.requests;
          set resp.http.X-RateLimit-Counter = req.http.counter;
          set resp.http.X-RateLimit-Period = req.http.period;
      }
  }
Errors will get the same headers set in vcl_synth.
  sub vcl_synth {
      if (req.http.requests && req.http.counter && req.http.period) {
          set resp.http.X-RateLimit-Limit = req.http.requests;
          set resp.http.X-RateLimit-Counter = req.http.counter;
          set resp.http.X-RateLimit-Period = req.http.period;
      }
  }
The configuration is complete, and it is time to add some clients to verify that the configuration is correct. First we send a request that we expect to be unthrottled, meaning that there are no throttle rules in the database for this URL.
client c1 {
  txreq -url "/"
  rxresp
  expect resp.status == 200
  expect resp.http.X-RateLimit-Limit == <undef>
  expect resp.http.X-RateLimit-Counter == <undef>
  expect resp.http.X-RateLimit-Period == <undef>
} -run
The next client sends requests to a URL that we know is a match in the throttle database, and we expect the rate-limit headers to be set. The throttle rule for /search is 3r5, which means that the first three requests within a 5 second period should succeed (with return code 200) while the fourth request should be throttled (with return code 429).
client c2 {
  txreq -url "/search?id=123&type=1"
  rxresp
  expect resp.status == 200
  expect resp.http.X-RateLimit-Limit == "3"
  expect resp.http.X-RateLimit-Counter == "1"
  expect resp.http.X-RateLimit-Period == "5"
  txreq -url "/search?id=123&type=2"
  rxresp
  expect resp.status == 200
  expect resp.http.X-RateLimit-Limit == "3"
  expect resp.http.X-RateLimit-Counter == "2"
  expect resp.http.X-RateLimit-Period == "5"
  txreq -url "/search?id=123&type=3"
  rxresp
  expect resp.status == 200
  expect resp.http.X-RateLimit-Limit == "3"
  expect resp.http.X-RateLimit-Counter == "3"
  expect resp.http.X-RateLimit-Period == "5"
  txreq -url "/search?id=123&type=4"
  rxresp
  expect resp.status == 429
  expect resp.http.X-RateLimit-Limit == "3"
  expect resp.http.X-RateLimit-Counter == "4"
  expect resp.http.X-RateLimit-Period == "5"
} -run
At this point, we know that requests are being throttled. To verify that new requests are allowed after the time limit is up, we add a delay here before we send the next and last request. This request should succeed since we are in a new throttle window.
delay 5;
client c3 {
  txreq -url "/search?id=123&type=4"
  rxresp
  expect resp.status == 200
  expect resp.http.X-RateLimit-Limit == "3"
  expect resp.http.X-RateLimit-Counter == "1"
  expect resp.http.X-RateLimit-Period == "5"
} -run
To run the test file, make sure the memcached service is running locally and execute:
$ varnishtest throttle.vtc
#     top  TEST throttle.vtc passed (6.533)
Add -v for verbose mode to get more information from the test run.
Requests to our application in the example will receive the following response headers. The first is a request that has been allowed, and the second is a request that has been throttled.
$ curl -iI http://localhost/search
HTTP/1.1 200 OK
Age: 6
Content-Length: 936
X-RateLimit-Counter: 1
X-RateLimit-Limit: 3
X-RateLimit-Period: 5
X-Varnish: 32770 3
Via: 1.1 varnish-plus-v4
$ curl -iI http://localhost/search
HTTP/1.1 429 Too many requests
Content-Length: 273
X-RateLimit-Counter: 4
X-RateLimit-Limit: 3
X-RateLimit-Period: 5
X-Varnish: 32774
Via: 1.1 varnish-plus-v4
The complete throttle.vtc file outputs timestamp information before and after VMOD processing, to give us some data on the overhead introduced by the Memcached and SQLite queries. Running 60 requests in varnishtest on a local VM with Memcached running locally returned the following timings per operation (in ms):
  • SQLite SELECT, max: 0.32, median: 0.08, average: 0.115
  • Memcached incr_set(), max: 1.23, median: 0.27, average: 0.29
These are by no means scientific results, but they hint at performance that should prove fast enough for most scenarios. Performance is also about the ability to scale horizontally, and the simple example provided in this article will scale horizontally with global counters in a pool of Memcached instances if needed.
[Figure: horizontally scaled setup with a shared pool of Memcached instances]

Further reading

There are a number of VMODs available, and the VMODs Directory is a good starting point. Some highlights from the directory are VMODs for cURL usage, Redis, Digest functions and various authentication modules.
Varnish Plus, the fully supported commercial edition of Varnish Cache, is bundled with a set of high quality, support-backed VMODs. For the open source edition, you can download and compile the VMODs you require manually.
