After Gizmodo's investigation into the data smart homes expose about our lives, many of you asked how you could monitor the digital emissions from your own homes. Well, you're in luck.
Here, I'll explain the methodology behind that story, for which Kashmir Hill set up her home with a slew of internet-connected gadgets, and I set up a system to monitor all the data her smart home transmitted to her internet service provider.
Before I dive in, I want to warn DIYers that this post is intended for a technical audience: people who are comfortable operating a computer from the command line, who know what Node.js is and how to run scripts that use it. A basic understanding of computer networks and how packets travel through them will also help.
Additionally, this setup was designed to work for us internally, so it is by no means the best or only way to do this. But hopefully it will give you some insights and starter code for approaching this problem yourself.
Our main objective was to monitor the traffic coming and going from Kashmir's home continuously and without interruption. This meant that we needed a way to capture the traffic and then put it somewhere that could easily store a vast quantity of information. An additional challenge was that Kashmir lives in San Francisco and I live in New York, so we needed to store the data in a place where it could be accessed remotely.
The approach with the best balance of convenience, robustness, and cost was to build a router to which Kashmir could connect all her smart devices, so that it could capture their network activity. We built ours using a Raspberry Pi 3 and wrote a custom script to capture the traffic and send it to Amazon Web Services' S3 data store.
If you are interested in doing this yourself, you'll need to buy a Raspberry Pi. You might want to check out this W3 Schools tutorial for how to get it set up with Node.js. If you've never used a Pi before, watch this video to see how to hook it up to a screen and keyboard. You will also need an internet connection to download the script we run. The easiest way to do this is to connect your Pi to your home router via the Ethernet port.
The Raspberry Pi 3 comes with wi-fi hardware built in, so it's fairly easy to configure it to work as a router. If you haven't done this before, this tutorial should be helpful (skip the mitmproxy part).
The Raspberry Pi we used. The black USB cable goes to power and the yellow Ethernet cable was used to provide internet access to the Pi.
Once you have the router set up, give it a unique name, and you can use your smartphone or laptop to see if it works. Check for nearby wi-fi networks. If you see the one you just created, that's a good sign! (I named mine "iotea.") Connect to the network from your device and see if the internet works normally. If it does, hooray! You are halfway there.
We called the network "iotea"
Now that the Pi is set up to act as a wi-fi router, it's time to add the script that monitors the network traffic. To do this, you'll need to know how to use Git and GitHub. (Here's a tutorial in case you've never done that before.) You can download our code from the GitHub repository, which also has information on how it needs to be configured.
(It's worth noting that existing tools like Wireshark or mitmproxy already do this and a lot more. While those tools are very powerful, installing them on the Pi and monitoring them remotely is non-trivial.)
Usually what you would do at this point is set up a server somewhere. Then you'd store the traffic by configuring the script on the Pi to point to that server. What we ended up doing was using AWS's Kinesis service, which basically took care of setting up the server for us and streamed the data to the S3 data store. The service provides you with the URL to which you'll have your Pi send your data.
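To give a rough idea of what sending data to Kinesis involves, here's a sketch of how one captured-packet summary could be turned into a Kinesis record. This isn't our exact code: the stream name and record fields (timestamp, MAC, host, byte count) are illustrative, and the actual `putRecord` call (shown in a comment) requires the `aws-sdk` package plus AWS credentials configured on the Pi.

```javascript
// Build the parameters for a Kinesis putRecord call from one packet summary.
// The record fields here are illustrative, not our exact schema.
function buildKinesisRecord(streamName, packet) {
  return {
    StreamName: streamName,
    // Kinesis treats the payload as opaque bytes, so we serialise to JSON.
    Data: JSON.stringify(packet),
    // Partitioning by device MAC keeps each device's records in order.
    PartitionKey: packet.mac,
  };
}

// With the aws-sdk package installed and credentials configured, the Pi
// would send each record with something like:
//   const AWS = require('aws-sdk');
//   const kinesis = new AWS.Kinesis({ region: 'us-east-1' });
//   kinesis.putRecord(buildKinesisRecord('iot-traffic', packet), callback);
```

Keeping the record small (a summary per packet rather than the full payload) also keeps Kinesis costs down.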
This is not essential; you can send this data to your own server and monitor it as you wish. You can also modify the script to simply log the data to a file on the Pi itself, though you will probably have to copy it to another location periodically, as it can fill up the memory quickly.
The reason we used Kinesis is that it allowed us to run each incoming packet through AWS's Lambda service, which let us parse the data and feed it into a database without spinning up another server. In other words, while it may have been overkill, using these AWS services reduced the number of moving parts we needed to maintain and also allowed us to look up the incoming traffic in real time without much effort.
Once I built the Raspberry Pi, I mailed it to Kashmir, and she plugged it into her Netgear router using an Ethernet cable. She then connected all of her devices to its "iotea" wi-fi network.
Since we weren't sure what we were going to be collecting, it was important to minimise network downtime. There are several thousand ports that a networked device can use to send data, many of which are already assigned to existing services; you can view the full list here. For our experiment, we captured only HTTP and HTTPS traffic, which usually travels over ports 80 and 443. We limited ourselves to those ports for the sake of simplicity, and because early tests suggested that most devices used them to communicate with their servers.
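As an illustration of that filter, here's how you might decide whether a raw captured frame is web traffic by reading its TCP destination port. This is a sketch, not our production code; the offsets assume a standard Ethernet II frame carrying IPv4.

```javascript
// Return true if an IPv4/TCP frame is headed for port 80 (HTTP) or 443 (HTTPS).
function isWebTraffic(frame) {
  const ETH_HEADER = 14; // Ethernet II header length in bytes
  // EtherType 0x0800 means IPv4; skip anything else (ARP, IPv6, ...).
  if (frame.readUInt16BE(12) !== 0x0800) return false;
  // The low nibble of the first IP byte is the header length in 32-bit words.
  const ipHeaderLen = (frame[ETH_HEADER] & 0x0f) * 4;
  // IP protocol number 6 means the payload is TCP.
  if (frame[ETH_HEADER + 9] !== 6) return false;
  // The TCP destination port is bytes 2-3 of the TCP header.
  const destPort = frame.readUInt16BE(ETH_HEADER + ipHeaderLen + 2);
  return destPort === 80 || destPort === 443;
}
```

Note that devices aren't obliged to use the standard ports, which is one reason a filter like this can miss traffic.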
We then had to figure out the MAC address for each device so that we could track what information was being sent by which devices. It helped to have Kashmir note the time that she connected various devices and when she used them.
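The MAC address itself comes straight out of the captured Ethernet frame, so attributing a packet to a device is simple once you've matched MACs to gadgets. A rough sketch:

```javascript
// Pull the source MAC address out of an Ethernet frame so traffic can be
// attributed to a device. In Ethernet II, bytes 0-5 are the destination
// MAC and bytes 6-11 the source MAC.
function sourceMac(frame) {
  const parts = [];
  for (let i = 6; i < 12; i++) {
    parts.push(frame[i].toString(16).padStart(2, '0'));
  }
  return parts.join(':');
}
```

The first three bytes of a MAC identify the manufacturer, which is a handy sanity check when you're matching addresses to devices.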
Summary view of front end interface
For analysis, we first determined how much traffic was being sent to and from the devices, and to which domains it was going. This gave us a sense of how chatty each device was, and also what backend infrastructure it used. Unsurprisingly, most of the devices used Amazon's AWS servers, so we saw a lot of traffic going to those domains. For devices that were sending unencrypted information, we first analysed the requests themselves to see if we could determine what was being asked for. If the information being requested was images or other assets, we then pulled those down separately. This is how we were able to collect all the images from Netflix. As a quick reminder, here are some of the shows I know Netflix suggested to Kashmir based on the images being requested:
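The first analysis step, tallying traffic per device and destination domain, is straightforward once the packet summaries are in one place. A rough sketch (the field names are illustrative, matching whatever your capture script records):

```javascript
// Tally bytes per (device MAC, destination domain) pair from a list of
// packet summaries, to see which devices are chatty and where they talk.
function trafficByDomain(packets) {
  const totals = {};
  for (const p of packets) {
    const key = p.mac + ' -> ' + p.host;
    totals[key] = (totals[key] || 0) + p.bytes;
  }
  return totals;
}
```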
So that is how we did it. Hopefully there is enough information in this post to help you get started and allow you to monitor your smart home. Let us know what you find!