
Narrator
China's Great Firewall, or GFW, is often spoken about but is rarely understood. It is one of the most sophisticated and opaque censorship systems on the planet, and it shapes how over a billion people interact with the global Internet, influences the design of privacy and proxy tools worldwide, and continues to evolve in ways that challenge researchers, developers and policymakers alike. Jackson Sip is a PhD researcher at the University of Colorado Boulder whose work focuses on uncovering how national-scale censorship systems operate. Jackson recently helped lead a groundbreaking study analyzing a previously undocumented GFW technique that quietly broke fully encrypted proxy protocols across China for more than a year. In this episode, Jackson joins Gregor Vand to discuss how the Great Firewall works at a technical level, the 2021-2023 blocking event, the popcount-based detection algorithm his team reverse engineered, the cat-and-mouse ecosystem of censorship circumvention, and what these findings mean for the future of the open Internet. Gregor Vand is a security-focused technologist, having previously been a CTO across cybersecurity, cyber insurance and general software engineering companies. He is based in Singapore and can be found via his profile at Vand HK or on LinkedIn.
Gregor Vand
Hello and welcome to Software Engineering Daily. My guest today is Jackson Sip.
Jackson Sip
Hey, happy to be here.
Gregor Vand
Yeah. So we're going to be getting into a pretty interesting topic with Jackson today around what is called the GFW. So the Great Firewall. And we'll get into what even that is. But also Jackson's done a lot of research into how it operates and actually sort of how it has operated through the years, because it's been changing as well. So before we get into all that, as we like to do, maybe just sort of, Jackson, what is your background? How did you get into this kind of research in the first place?
Jackson Sip
Yeah, sure. So yeah, I am currently a PhD student at the University of Colorado Boulder and I've been here for about five years now. I'm advised by Dr. Eric Wustrow, who's been doing this work for way longer than that. And as I first started my PhD, I got involved with an organization called the GFW Report that has focused on censorship within China. And so through working with them, I've done a number of different projects sort of exploring the relationship between China and censorship.
Gregor Vand
Awesome. So maybe let's just get into what it is. I always kind of get the acronym mixed up because I think there should be a C in there somewhere, because it's sort of like the Great Chinese Firewall perhaps. But yeah, maybe, what is the GFW?
Jackson Sip
So the GFW, the Great Firewall, as it's known, tends to refer to the censorship mechanism in China. That acronym has sort of expanded as it's gained popularity. Sometimes people refer to Iran's GFW. So we often refer to it as China's GFW now. And it's a collection of different techniques and deployments that are spread around China in order to filter the Internet access of citizens in China. It affects a number of different protocols. We see it on DNS traffic, on TLS traffic, on QUIC traffic these days. And it's also been used almost as a weapon against organizations or services that haven't complied with the censorship desire of the nation. So one of my favorite facets of the GFW is a tool called the Great Cannon, which is sort of uncommon, something that not many people have heard of. But it was this technique that they deployed in 2015 against GitHub, actually. So GitHub was allowing for proxies to sort of run through GitHub Pages, I think it was. And the idea was that China couldn't get them to comply with, hey, we don't want you to be this proxy source. And so what they did instead is they looked at all of the HTTP traffic, unencrypted traffic, that they could going to and from China, and they would inject a JavaScript request in that traffic that would then make a request to GitHub. Right. So effectively they designed the largest denial of service attack ever by injecting these JavaScript requests into every HTTP request that went across the GFW.
Gregor Vand
Wow, okay. I think I had heard of GitHub playing a part somewhere, but yeah, that's new to me as well. So I guess looking at the fact that you've done a lot of research in this space, again, before we get into what the research turned up, again, just for the audience, some people in the audience may have been to China may understand roughly what happens from the Internet perspective, but equally, I think, let's just assume zero knowledge. So why is this kind of research difficult in the first place?
Jackson Sip
Sure. So the goal of the GFW is to restrict Chinese citizens, specifically their access to information outside of China that the government deems undesirable. Right. So this might be like Western entertainment that maybe goes against the goals of the CCP, the Chinese Communist Party. And what we really struggle with is finding ways to detect that censorship, not being citizens located in China. Right. So one of the biggest kind of logistical issues we have is getting access to vantage points. Right. Places in China that we can connect to and then send requests that may or may not get blocked by the GFW. Additionally, because we aren't part of the CCP or the Chinese government. Right. We don't necessarily know when we've reached ground truth. Right. At the end of the day, this is sort of a black box system and we can to some extent only speculate. Right. It takes a number of experiments to really determine whether or not what we have observed is what we think it is or just some other effect of the network.
Gregor Vand
Okay, so before we kind of get into what the research was, there's also kind of like a time frame thing here, which is helpful. So something kind of happened in, I believe it was November 2021. Let's start there. What happened then? Maybe did that kind of almost kick off why the research was then undertaken or you can help us out?
Jackson Sip
Absolutely. Yeah, sure. So I think that the incident you're referring to in November 2021 was the deployment of a new technique by the GFW to detect and then block fully encrypted protocols. And I think to understand that, we need to sort of get back to how Chinese citizens have responded to the GFW. Right. Which is there are plenty of people in China who disagree with it. Right. They would like to be able to access maybe whatever is latest on Netflix or something else. And so they will use proxies in order to circumvent this censorship and connect out to whatever content they otherwise wouldn't be able to access. Right. And these proxies use a number of different techniques to be censorship resistant. Right. If you think about a proxy that you've accessed before, or maybe like a VPN that you've connected to before, that traffic, it's obvious. Right. We can look at it in tools like Wireshark or observe it on the network somehow and see that that is clearly a VPN connection or some type of standard proxy connection. So the GFW has designed ways of detecting and blocking that. Right. And so proxy developers kind of had to get more creative and they had to develop protocols that would be less obvious to some type of network observer like the GFW. And these tools took two different forms, the first of which being mimicry. Right. We are going to try and make our proxy look as close as possible to all the other traffic on the Internet. And if you aren't familiar, almost all the traffic on the Internet looks like TLS, because everybody's doing something that involves a website. Right. And so the first thing they did is tried to mimic TLS, right, and look as close to normal TLS as possible. But with TLS kind of having sharp edges, there were always these issues that would exist and the censor could then block that traffic because they would see that, say, like the encryption scheme was wrong. Right. They picked the wrong cipher suite, something like that. 
And so this led to a different approach. If we can't look exactly like everything else, let's try to look like nothing. Let's try and blend into the background. And that led to these protocols that we refer to as fully encrypted. And a fully encrypted protocol involves some type of key exchange between the client and the server. And then after that key exchange has occurred, every single byte that is sent from the client to the server and vice versa is encrypted. There's no protocol header, there's no exchange of information about size or anything like that. Right. It is just going to be encrypted bytes as the payload of like a TCP connection. And the idea was, okay, this just sort of blends in. There's nothing to worry about. They won't be looking for this type of traffic. But these tools became pretty prolific. They became the standard, right? And every proxy had sort of implemented their own version of this fully encrypted protocol. And so in November of 2021, users all of a sudden just couldn't access their proxies. Right. And it wasn't just one. Right. It wasn't just one particular implementation, but it was widespread. It was all of these different proxy developers, right. Shadowsocks and V2Ray are two names that are pretty popular in the Chinese censorship circumvention space. And both of their services had sort of fallen over as a result of this attack. And this was alarming because this had been a very stable approach for years and users were trying to understand what had happened. And there was no clear answer of, oh, this one developer has decided to pull the site, things like that.
Gregor Vand
Yeah, this is very interesting because on the layman side, I do remember there was just like a point in time when. So I was living in Hong Kong, yeah, I guess in November 2021, and beforehand. And there was this watershed moment where people started saying, oh, my VPN just doesn't work in China anymore. And it was sort of like, make sure you get maybe the quote correct VPN before you get there because you've got no chance of getting it whilst you're there, that kind of thing. So it kind of became this, which one works and some people, I've even seen posts today where I think people still go into mainland China and haven't fully anticipated the kind of tools, if you want to call it that, that they'll need to just be able to access things the way that they would hope to be accessing them.
Jackson Sip
Absolutely.
Gregor Vand
Yeah. Great. So let's move to the research that you and your colleagues did. I think this is fairly groundbreaking research into how this all worked. Yeah, let's just talk through that. I believe it was a sort of six month setup that you had to put in place. But I mean, let's kind of start there. Where did the research begin and how did you go about it?
Jackson Sip
Yeah, so as I said, I've worked extensively with this organization called GFW Report. They were sort of the leads on this project. They were the first people to be notified of what was going on. And there was no explanation to start. Right. It was sort of this widespread event where no one knew exactly what tools were working, what tools weren't. And it started with effectively just seeing, okay, which of these tools have all of a sudden stopped working and what techniques were they using. Right. And that led to this realization that, okay, it is in fact these encrypted protocols that seem to be the issue. Right. If we send a regular TLS connection or TLS-style proxy connection, there doesn't seem to be a problem. But the fully encrypted traffic is an issue. And so what we ended up doing is setting up a number of VPSs in China, primarily in the Tencent and Baidu clouds. And then we set up some sink servers in the U.S. Right. So we had a number of different university partners, Boulder as well as University of Maryland and the University of Massachusetts Amherst. And we set up servers at all of these universities that would just sort of listen to any TCP traffic that came to them. Nothing particularly interesting on that end. But what it allowed us to do is we could now send traffic from inside of China outside to the U.S., right, crossing that border and effectively triggering any censorship that may occur there. And while this sort of set up our architecture, we then sort of had to go into, okay, how are we going to determine what exactly is fully encrypted traffic? Right. This is a hard question. And so what is encrypted traffic and how did we figure it out? Well, the first thing we did is started sending in these TCP payloads. Just random bytes, right? We see that this is getting blocked. Okay, that makes sense. What if we send all 0 bytes or all 1 bytes? Right. And I'm sitting in my advisor's office late at night. We're trying to figure this situation out. 
And we see that certain bytes would get by when others wouldn't. Right. When we just sent a long string of the same byte. And ultimately what we found was that they were determining the entropy of the payload based off of the number of bits that had been set in the total payload. Right. So they were doing a count, what we refer to as a pop count, on each byte. And if it was roughly 50% of bits set 0 and 1, then they would consider that to be high entropy traffic and then they would ultimately block those connections.
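The repeated-byte experiment described here can be sketched in a few lines. This is only an illustration, not the team's tooling: the helper names `popcount` and `probe` are made up, and the 3.4 to 4.6 window is the figure quoted later in the interview. A payload made of one repeated byte has an average popcount equal to that byte's own popcount, so only byte values with exactly four bits set land inside the window.

```python
# Sketch only: names are hypothetical, thresholds are the ones quoted later on.
LOW, HIGH = 3.4, 4.6

def popcount(byte: int) -> int:
    """Number of bits set to 1 in a single byte value (0..255)."""
    return bin(byte).count("1")

def probe(byte: int, length: int = 200) -> bytes:
    """Build a test payload consisting of one byte value repeated."""
    return bytes([byte]) * length

# Byte values whose repeated-byte probe would fall inside the window.
in_window = [b for b in range(256) if LOW < popcount(b) < HIGH]

print(len(in_window), "of 256 byte values fall inside the window")
```

This ignores the ASCII and protocol exemptions discussed later, so some of these byte values (printable characters, for instance) would still get through in practice.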
Gregor Vand
So we're going to get into PopCount shortly, just kind of just sticking on, I guess, the setup here. Were there any kind of risks with this? I don't know. Did you ever think that effectively within China something would get triggered and sort of there might be any risks associated with this, or is that just something that comes along with the research and you have to just accept that that may be the case?
Jackson Sip
Yeah, that's a great question. So the risks with this particular work were relatively low. In our sort of threat model, if you will, we try to reduce the amount of information that we leave on these servers. Right. We assume that at any point in time they could get flagged as being owned or controlled by us and just shut down and we'd lose access to whatever was on them. So we are very careful in how we access those machines, making sure that all the data that we're collecting is removed quickly so that we don't, of course, have setbacks, as well as reducing what may be accessible to an adversarial nation like China, were they to conduct forensics on one of these servers. But aside from the sort of technical risks there, there's also kind of these risks to the people doing the research. Right. In this particular work, again, there wasn't much of a concern. We were just doing this sort of observation of what was occurring. But in some of our other work, we have really struggled with these concerns and even struggled to have people accept the research that we have done due to the legal risks associated with it. Some see some of our work as being an offensive attack against these services that are deployed in China, which is
Gregor Vand
very interesting because at least from what you described so far, you're simply allowing traffic to leave China and go, in this case, I guess, into the US There is no sort of threat data there. It's not like you're doing a kind of reverse pen test or something. It's just information.
Jackson Sip
Exactly. You're exactly right in this case. Now, in some of our other research, in one particular attack, we found a memory leak in their DNS censorship that allowed us to leak vast amounts of memory off of these servers. That was one of the attacks that we were maybe a little more concerned about. But it sort of goes to show how just observing this traffic or sending traffic back and forth isn't something to be as concerned about.
Gregor Vand
So you mentioned popcount. That is an interesting concept. As you've just touched on, it refers to sort of the density, or the bit density, in the payloads. But let's talk a bit more about that because it seems like that was such a critical kind of heuristic for how the GFW now operates.
Jackson Sip
Yeah, absolutely. So popcount is kind of a technical term for, or an implementation of, Hamming weight, right? So with Hamming weight, we're concerned with like the number of nonzero symbols in some type of data. And with popcount, we are just counting the number of nonzero bits in some form of data, right? So for each byte, we simply count, okay, how many bits have been set and, you know, eight bits in a byte. So if we see four of those bits set, we can assume that this is relatively high entropy. Right. The closer we are to 50%, the higher the likelihood that this is a packet containing high entropy data, and encrypted data will cluster around that number. Right. Because if we have to flip a coin eight times. Right. We would expect four heads and four tails. Same thing applies when setting bits in a byte. And ultimately this sort of speaks to how the GFW wants to execute this censorship, right? They are doing the best they can to come up with crude but effective techniques. Right? So this is a relatively simple thing to do. Right? We're just counting bits. But that's the sort of technique that they have to deploy in order to stay relatively optimized and keep up with the vast amounts of traffic that they are constantly ingesting.
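As a rough illustration of the metric described above, the following sketch computes the average number of set bits per byte. The function names are made up for this example, and a seeded pseudo-random generator stands in for real ciphertext, which is an assumption: actual encrypted traffic is only expected to behave similarly.

```python
import random

def popcount(byte: int) -> int:
    """Hamming weight of one byte: how many of its 8 bits are set."""
    return bin(byte).count("1")

def avg_popcount(data: bytes) -> float:
    """Average number of set bits per byte across a payload."""
    return sum(popcount(b) for b in data) / len(data)

# Stand-in for ciphertext (assumption): uniformly random bytes.
rng = random.Random(0)
random_like = bytes(rng.getrandbits(8) for _ in range(4096))

print("all zeros  :", avg_popcount(b"\x00" * 16))
print("all ones   :", avg_popcount(b"\xff" * 16))
print("random-like:", round(avg_popcount(random_like), 2))
```

The coin-flip intuition shows up directly: uniformly random bytes average close to four set bits per byte, while structured data such as long runs of a single byte can sit anywhere between zero and eight.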
Gregor Vand
Okay, and just to back up for a second, the sort of overall thing that was kind of discovered here was the idea that rather than selectively blocking specific things, it's kind of the other way around now, which is that everything is blocked. Unless, I guess, in this case, the bit density can be determined to be. Well, I'm looking at the notes here, which is that it has to fall within a very specific range, is that right? For it to be.
Jackson Sip
So it's funny, the range was exactly 3.4 to 4.6. If the count was within that threshold, they would consider that traffic to be high entropy, and they would ultimately block it. How they came up with this number, we're not exactly sure. Right. That was just in our test. That's what we observed as the exact threshold for which they would consider it to be high entropy.
Gregor Vand
Right. And so then for the GFW to sort of accept that this data is random and should pass effectively, as you touched on, that sort of clusters around, like the 4.0 of bit density, is that correct?
Jackson Sip
Exactly, exactly. So count all the bits that you have in the packet and see how many are set to 1. And then divide the number set to 1 by the total number of bytes, right? And if that value shows up between 3.4 and 4.6, then you can say, okay, this is clearly an encrypted payload and we're going to block it. Otherwise, we allow it.
Gregor Vand
Gotcha. Yeah. So then there are kind of like other, I guess, exemption rules that I think you discovered as well, around ASCII and protocol fingerprinting. So maybe if we could just dive into those as well.
Jackson Sip
Absolutely. And so I think what's important to understand here is that that entropy count actually came last. Right. So we refer to our rules as exemptions, right? So we're trying to find a way to exempt traffic from blocking. Sort of like you said, block everything and then decide what not to block. And when we think about popcount, right, we have to look at every single bit in that payload in order to conduct that entropy test. Right. And that can be considerably resource intensive when you're considering the scale of traffic they're observing. So they tried to identify other tests that they could run first to sort of filter off that traffic. And there were two different ways that they did this. The first being that they would look for large amounts of ASCII. So if they saw a continuous string of encoded ASCII characters, they would exempt that traffic. If it started with ASCII characters, they would exempt that traffic. And that was effectively just saying, okay, this is unlikely to be encrypted data. Right? So even though we don't know what this protocol is, we can allow it. Right. The other approach they took to filter off traffic before they reached that entropy measurement was through protocol fingerprinting. It was really a crude protocol fingerprint, but effectively they chose a handful of protocols that they knew they saw a lot of, for example TLS, looked at what the first few bytes of a TLS connection were, and then if they saw those bytes at the start of any packet, they would say, this isn't going to be considered a fully encrypted packet. Right. And that did two things. In the case of TLS, you're able to filter off the vast majority of your traffic. In the measurements that we conducted here, about 80% of traffic would get exempted right there just by looking for TLS fingerprints. 
But in addition to that, if you think about a TLS packet, right, we've got a couple of header bytes at the top, and then we have an encrypted payload. So if you aren't filtering off that traffic, you will have an insanely high false positive rate because so much of the Internet is in fact encrypted. It's just encrypted in a way that we're okay with. Right. As long as it's TLS, that's not a problem. Or as long as it's SSH, that's not a problem.
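The ordering described above (cheap exemptions first, the full bit count last) can be sketched as a single decision function. Everything numeric here other than the 3.4 to 4.6 window is an assumption for illustration: the interview does not give the prefix length, the run length, or the exact fingerprint bytes, and whether the window boundaries are inclusive is not stated either.

```python
LOW, HIGH = 3.4, 4.6      # popcount window quoted in the interview
ASCII_PREFIX_LEN = 6      # assumed value, not given in the interview
ASCII_RUN_LEN = 20        # assumed value, not given in the interview
# Assumed, simplified fingerprints: TLS record type + major version, HTTP verbs.
FINGERPRINTS = (b"\x16\x03", b"\x17\x03", b"GET ", b"POST ", b"HEAD ")

def is_printable(byte: int) -> bool:
    """True for printable ASCII (space through tilde)."""
    return 0x20 <= byte <= 0x7E

def longest_printable_run(data: bytes) -> int:
    """Length of the longest contiguous run of printable ASCII bytes."""
    best = run = 0
    for b in data:
        run = run + 1 if is_printable(b) else 0
        best = max(best, run)
    return best

def would_block(payload: bytes) -> bool:
    """Apply the exemptions in cheapest-first order; block only if none match."""
    if not payload:
        return False
    if payload.startswith(FINGERPRINTS):            # protocol fingerprint
        return False
    prefix = payload[:ASCII_PREFIX_LEN]
    if len(prefix) == ASCII_PREFIX_LEN and all(map(is_printable, prefix)):
        return False                                # starts with ASCII
    if longest_printable_run(payload) > ASCII_RUN_LEN:
        return False                                # long ASCII run
    avg = sum(bin(b).count("1") for b in payload) / len(payload)
    return LOW < avg < HIGH                         # entropy test comes last
```

The point of the ordering is cost: the fingerprint and prefix checks look at a handful of bytes, while the final test has to touch every byte of the payload.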
Gregor Vand
Yeah. I wanted to then touch on false positive effectively as sort of, you call it sort of collateral damage to like, legitimate traffic, if you like. I believe in the research estimated around 0.6% of the normal Internet traffic would be false positives. How did you, I guess, arrive at that validation?
Jackson Sip
Absolutely. So the false positive rate we concluded was 0.6%. You're exactly right. And the way that we did that is by taking these rules that we had determined and testing them against benign traffic that we had here at the University of Colorado. So we are fortunate to have a resource here that allows us to see all of the network traffic going to and from the university. And we rolled out these rules and basically looked at the traffic that would or wouldn't be blocked if the blocking was conducted here at CU. And what that allowed us to do is say, okay, no one at a university in the United States is going to be using censorship circumvention tools or proxies. Right. Because there's no blocking occurring. Why would they go to the trouble? And so we could sort of infer that this may be the false positive rate. And that allowed us to identify a couple of other exemptions that they were doing. We took some of the bytes that we observed in those packets that would have gotten blocked here at CU. And we found some other TLS resumption headers that they were allowlisting. And we found that the large majority of the traffic that would have gotten blocked here at CU ultimately belonged to torrent services, those specific protocols for torrenting. And so we believe that there may have been some reason that they still wanted to block that traffic anyway. So their false positive rate might have been seen as even lower based off of that.
Gregor Vand
Yeah, very interesting. I guess that's sort of an interesting piece of doing it within a university. There's obviously a lot of, just a lot of traffic generally. So, yeah, I mean, was there any sort of, like, if you were to do this again, for example, would there be any benefit to doing it in a sort of more siloed environment, if you like, or do you think that would not really affect the results.
Jackson Sip
Do you mean for calculating the false positive rate?
Gregor Vand
Yeah, exactly.
Jackson Sip
Well, I don't think so. And the reason being is that having access to this sort of network that has just a large amount of random traffic should be, in theory, representative of the type of traffic that would be coming and going across that national border. Right. And so our hope is that we're modeling that as close as possible.
Gregor Vand
Yeah, that's a really good point. Okay, so if we look at the idea of popcount manipulation, I think I'm going to let you take us through this. So, I mean, you are a programmer, and I believe a lot of this was done in Rust. So I know we've got a lot of Rust listeners as well on the podcast. So could you maybe just take us through sort of what even is popcount manipulation? And why was that part of the research?
Jackson Sip
Absolutely. So once we had sort of determined, okay, this is how they're doing this blocking, right? What can we do to help proxy developers get around this blocking? Right. This is one of the most common techniques for censorship circumvention. We've got to get it working again. And our solution was, we knew that they didn't have some fancy metric for determining whether or not it was encrypted, right? It was just this popcount. So what if we stuff the payload full of additional bits? Right? We can check the popcount of our own packet that's about to go across the network. And if we see that that payload is highly entropic, which of course it's going to be because it's encrypted, we can instead add some ones or some zeros, depending on which way we need to go, and get outside of that encrypted threshold range. Right. And so that was one of the more sophisticated techniques that we deployed in order to stop this censorship from being effective. We would add bits to the payload using a scheme based off of the key for the encryption, so that it was pseudorandom, right? There wasn't just a bunch of ones at the beginning or a bunch of zeros at the beginning or something, and then we would add a few bytes at the end that would tell you exactly how many bits you need to remove. Right? And so then when the server or the other host receives this packet, they can decrypt how many bits they need to remove, and then they can determine which bits are supposed to be removed based off of that key. And ultimately this got implemented in Shadowsocks' Rust implementation, as well as the Shadowsocks Android implementation. And it has been working to this day, I believe.
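A toy version of the padding scheme described above might look like the following. It is a sketch under several loud assumptions and is not the Shadowsocks implementation: `random.Random` seeded with the key stands in for a proper keyed pseudo-random function, the 4-byte trailer is sent in the clear where the real scheme is described as encrypting it, and the names `pad` and `unpad` are invented for this example.

```python
import random
import struct
from itertools import count

LOW, HIGH = 3.4, 4.6  # window quoted in the interview

def ones(data: bytes) -> int:
    return sum(bin(b).count("1") for b in data)

def to_bits(data: bytes) -> list:
    return [(b >> (7 - i)) & 1 for b in data for i in range(8)]

def from_bits(bits: list) -> bytes:
    return bytes(int("".join(map(str, bits[i:i + 8])), 2)
                 for i in range(0, len(bits), 8))

def pad(payload: bytes, key: bytes) -> bytes:
    """Insert k fill bits at key-derived positions, then append k as a trailer."""
    fill = 1 if ones(payload) * 2 >= len(payload) * 8 else 0
    for k in count(0, 8):  # whole bytes of padding keep the body byte-aligned
        trailer = struct.pack(">I", k)
        total = ones(payload) + fill * k + ones(trailer)
        avg = total / (len(payload) + k // 8 + len(trailer))
        if avg <= LOW or avg >= HIGH:
            break
    n = len(payload) * 8 + k
    slots = set(random.Random(key).sample(range(n), k))  # toy PRF (assumption)
    src = iter(to_bits(payload))
    body = [fill if i in slots else next(src) for i in range(n)]
    return from_bits(body) + trailer

def unpad(packet: bytes, key: bytes) -> bytes:
    """Recompute the same positions from the key and drop the fill bits."""
    body, trailer = packet[:-4], packet[-4:]
    (k,) = struct.unpack(">I", trailer)
    bits = to_bits(body)
    slots = set(random.Random(key).sample(range(len(bits)), k))
    return from_bits([b for i, b in enumerate(bits) if i not in slots])
```

Because both sides derive the insertion positions from the shared key and the count in the trailer, the receiver can strip the padding without any extra negotiation, which matches the shape of the scheme described in the interview.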
Gregor Vand
Wow. And what kind of performance hit or overhead would you say this adds, if any?
Jackson Sip
Sure. So we estimate that the overhead is about 17%. So if you assume a worst case scenario of you have a pop count of exactly four. Right, meaning we had four zeros, four ones in every single byte on average. What we need to do is add just enough bits. In this case we'd be adding ones right. To exceed the pop count of 4.6. And what that works out to is about a 17.6% overhead, which we find to be very tolerable. Right. Given the overhead that's already required for doing this sort of multi layered proxy. We believe that a 17% overhead is something that you could apply to every single packet and not necessarily just the first few packets of the connection or to get around the censorship.
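The 17.6% figure can be reproduced with a short calculation. The setup below is a simplification assumed for illustration: a payload of $n$ bytes averaging exactly 4 set bits per byte, padded with $x$ bytes' worth of one-bits, and the requirement that the padded average reaches the upper threshold of 4.6.

```latex
\begin{align*}
\frac{4n + 8x}{n + x} &\ge 4.6 \\
4n + 8x &\ge 4.6\,n + 4.6\,x \\
3.4\,x &\ge 0.6\,n \\
\frac{x}{n} &\ge \frac{0.6}{3.4} \approx 0.176
\end{align*}
```

So roughly 17.6% extra bytes are needed in this worst case, which agrees with the number quoted in the interview.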
Gregor Vand
So is this, like, Shadowsocks Rust, to your knowledge, is this being used within VPNs, like commercially available VPNs, or how has this sort of been taken up beyond the sort of research stage?
Jackson Sip
Yeah, so Shadowsocks Android is an app that you can download and you can use as a client, right. Shadowsocks Rust provides like a command line implementation, but it's also designed as a library that can be added into other VPNs. Off the top of my head, I don't know the names of like the commercial ones that do use it, but it is heavily used by more user-friendly applications.
Gregor Vand
Yeah, this is something I'm probably going to go and do some extra research on post show, you know, where certain, I wouldn't sort of name names, but certain VPNs have said, oh, we've got this new protocol and it's faster, better, can get through where anything else might not have been working. So I'm curious if maybe this is what they've actually been referring to building into their commercial products. So that's a little note for me to go on afterwards. I think you also came up with other sort of circumvention methods. One sounds very simple, but I mean there must be a lot more to it, like adding an HTTP or TLS header. How does that even work?
Jackson Sip
You're exactly right, it is as simple as it sounds. And so we found that there were these exemption, basically, byte strings, right. That would be the first four bytes of a TLS connection. Right. And so we said, okay, can we just add those four bytes to the start of a fully encrypted or fully random payload, and the answer was yes. And we were like, whoa. Well, that was really simple. And so what we ultimately ended up doing is we provided this information immediately to the proxy developers, right? We said, hey guys, we're working on this. We're still not exactly sure how it works, but we do know that if you add these couple of bytes to the start of every payload, your traffic will get through no problem. And so we sort of used this as an emergency solution, right? A little band-aid to give out that would allow people to get their services back up and running while we developed the kind of more sophisticated approach behind the popcount manipulation.
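The band-aid described above amounts to framing the encrypted payload so that its first bytes look like a known protocol. The sketch below uses the standard 5-byte TLS record header layout (content type, two version bytes, two length bytes) with the application-data type `0x17` and record version `0x0303`. Which exact bytes the researchers handed to proxy developers is not stated in the interview, so the constants and the `wrap`/`unwrap` names are illustrative assumptions, and the length handling is simplified relative to real TLS.

```python
import struct

TLS_APPLICATION_DATA = 0x17   # TLS record content type 23
TLS_RECORD_VERSION = (0x03, 0x03)

def wrap(payload: bytes) -> bytes:
    """Prefix a fully encrypted payload with a TLS-record-like header."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    header = bytes([TLS_APPLICATION_DATA, *TLS_RECORD_VERSION])
    return header + struct.pack(">H", len(payload)) + payload

def unwrap(packet: bytes) -> bytes:
    """Strip the header again on the receiving proxy."""
    (length,) = struct.unpack(">H", packet[3:5])
    return packet[5:5 + length]
```

The cost is a fixed five bytes per packet, which is why it worked well as a quick fix while the more involved padding scheme was being developed.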
Gregor Vand
I mean, is there any evidence that the GFW is like actively probing suspected proxies like this or.
Jackson Sip
Absolutely. So active probing was a technique that was around long before this particular incident. And it's one of the primary concerns for proxy servers, right, is that the GFW will see a connection to an IP address, right? And they'll be like, that connection looks a little weird. We think it might be this proxy implementation, but we don't want to go block an IP address willy-nilly. So what we're going to do is we're going to make a couple of connections to that IP address and see if we can get it to talk to us in a way that is representative of one of these proxies, right? And so this is a common technique that they were using, and actually they were doing active probing with sort of this entropy check prior to the passive detection that was going on in this paper. And I do think that that's an important distinction to make, is that this was a passive attack where they were just blocking any traffic that met these parameters. And it wasn't affecting specific IPs. There were subnets that were specifically targeted, but it wasn't like, this specific IP is a proxy and we're not going to allow traffic to it.
Gregor Vand
And, I mean, is there any sort of defense against this, or is this just kind of part of the landscape for anyone that's trying to do this? Is this just sort of a cat and mouse thing?
Jackson Sip
Well, and cat and mouse is the term that we frequently use, right? How are we going to get just one step ahead this week? And there are a couple of different defenses or strategies that we can use to prevent active probing. There are techniques like trying to behave the same for every single connection, right? So you might try to have a standard web server, right? And it will receive standard TLS connections, standard HTTP requests, and it will respond to those. But when it gets special connections, proxy connections, it will handle those differently. And the hope is that by using some type of cryptographic key that's sent by the client, the censor will never be able to determine that it is in fact a proxy. Right. They won't have access to that key in order to get that host to respond in a specific way. There are other approaches in a similar vein, which use some sort of other application as a front, right? So NaiveProxy is known to basically use Chromium as its front application, and then the proxy runs on top of Chromium, right? So everything looks like it's a standard Google Chrome TLS connection, but in reality it's a tunnel for NaiveProxy. The third approach that we also use here, and that has been a large chunk of our research, is refraction networking. And in this case what we do is we say, okay, you can send a connection to any IP in this range that we can observe, right? Say, the University of Colorado's IP range. And using this network tap that we have access to, we'll just look for connections that look like they belong to us, and when we see one, we'll pick that connection up, we will start our proxy tunnel with that connection and redirect all that traffic to its intended destination. Right. 
And why that's so advantageous and sort of defeats the cat and mouse is that now the censor has to decide, okay, in order to block this proxy, we have to block every single IP address associated with this American university. And so we really raise the stakes of the false positives associated with blocking that sort of traffic.
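Both the key-gated web server and the refraction tap described above rest on the same primitive: recognizing a connection as "ours" only when it carries something that can be verified with a secret key. A minimal Python sketch follows, assuming a pre-shared key and an HMAC tag sent in the first bytes. The names `make_client_hello`, `is_ours`, and `handle` are hypothetical, and real tools hide the tag inside an ordinary-looking TLS handshake and defend against replay, which this sketch does not attempt.

```python
import hashlib
import hmac
import os

NONCE_LEN = 16
TAG_LEN = 32  # SHA-256 output size


def make_client_hello(shared_key: bytes) -> bytes:
    """Client: random nonce followed by HMAC(key, nonce)."""
    nonce = os.urandom(NONCE_LEN)
    tag = hmac.new(shared_key, nonce, hashlib.sha256).digest()
    return nonce + tag


def is_ours(first_bytes: bytes, shared_key: bytes) -> bool:
    """Server or tap: True only if the tag verifies under the shared key."""
    if len(first_bytes) < NONCE_LEN + TAG_LEN:
        return False
    nonce = first_bytes[:NONCE_LEN]
    tag = first_bytes[NONCE_LEN:NONCE_LEN + TAG_LEN]
    expected = hmac.new(shared_key, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


def handle(first_bytes: bytes, shared_key: bytes) -> str:
    """Route the connection: proxy for key holders, ordinary site otherwise."""
    return "proxy" if is_ours(first_bytes, shared_key) else "decoy-website"
```

A prober without the key gets the same "decoy-website" behavior as any ordinary visitor, which is the property the speaker is describing.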
Gregor Vand
Interesting. Okay, so if we look at deployment, once this was all discovered, if you like, did you have to do responsible disclosure, and who do you even make that to? And how does that then filter through to actual non-academic users, I guess?
Jackson Sip
Yeah, sure. So responsible disclosure is a funny subject to bring up when you're attacking a system like the GFW, right? There are kind of two sides to that coin. We are disclosing our findings to the circumvention tool developers. But then there's also this idea of, well, do we disclose something to the GFW? Is that something that needs to happen? And in this particular case, no. Right. We were just reverse engineering this technique that they had. And helping them figure out what we've learned is not usually something that we're particularly interested in doing. But we do work closely with the circumvention tool developers. So, like I said, in regards to prepending bytes, that seems very simple. That was information that went out very early in this process. January of 2022 is when we ultimately found that and started giving that information out to these developers so that they could start patching their tools as early as possible. And that gets into what you were asking about: how do we filter this out of the academic world and into the hands of, you know, people on the ground? And it's those same relationships, right, where we already work closely with the developers of these tools, and when we find something out, we can let them know before it goes public and they can try and get to their users as quickly as possible.
Gregor Vand
Yeah, I do find the whole, as you say, responsible disclosure and then thinking, would you disclose this to, what, the CCP? It's a strange one to think about. That said, do you have any understanding that this research has been taken in on that side, and any lines of communication around it, or implicit or explicit changes that you've seen as a result?
Jackson Sip
So with this particular research, no, there wasn't really anything we told them and there wasn't really anything that we could observe. We'll get into some more of the timeline, I think, in a little bit. But on other works, like I mentioned, we had this offensive attack on their DNS injector, and we did try to disclose that to them. We sent emails to CNCERT. That was who we decided was the best point of contact. We never heard back. But what we did find is that there was patching of this vulnerability after we had made this disclosure. Right. So we do believe that they to some extent are listening and to some extent are paying attention to what we're doing. But it's still that black box where you don't have 100% confirmation.
Gregor Vand
And yeah, as you alluded to, there is sort of another timeline marker here, which is, I think it's March 2023 that the GFW actually stops dynamic blocking. Could you talk to us about what you understand from that?
Jackson Sip
Yeah. And again, this is one of those parts where we can speculate a lot and I can give you some of my thoughts on why, but the facts are we don't know. Right. What we do know is that, you know, the blocking began on November 6th of 2021 and ended on March 15th of 2023. In that timeframe, there were a number of political things going on in China. Xi Jinping was running for reelection in the beginning of March of 2023. I believe he was, like, reelected two or three days before the blocking stopped. And we've seen that pattern occur not just in China, but in a number of other countries, where when politically sensitive events are occurring, they will ramp up their censorship. When those events settle down, they tend to turn off their censorship or ramp it back down. What's interesting with China is that usually when they find a technique that they like, they stick with it, they keep it going. And so what was different about this technique is unclear. We continue to do longitudinal measurements to see if this censorship has been turned back on. We still haven't seen any evidence that it has been re-enabled, but we can only speculate that potentially it was too computationally intensive or that the vantage points required were too valuable to be using for this technique.
Gregor Vand
Yeah, and I'm thinking of the dates again, so 2021 and then 2023. It intertwines with politics, as you've touched on, but the pandemic, that was obviously quite a sensitive time for China from an information standpoint. So I was living in Hong Kong at the time, and people often refer to Hong Kong and the lockdowns. And I can say categorically there were no lockdowns in Hong Kong, but there were for sure lockdowns in China, around being able to leave your house and so forth. And I think it was quite sensitive as to people within China being able to see what the rest of the world was doing, especially towards what the rest of the world would call the end of the pandemic, because that seemed to be then the worst part for China. And as you say, it was ramping up into a reelection, if you want to call it that, of Xi Jinping. So anything there that you think could be related as well?
Jackson Sip
I think it's very possible. Again, that's really one of the things that makes this research the most challenging, though: you never really know. You don't get copies of the memos sent around the government that say, hey, turn on this censorship on this day for this reason. Right. And so you're exactly right. It could have been certain events going on, like the tail end of the pandemic, that were causing the censorship to ramp up. It could have been the political timing. It's really hard to say.
Gregor Vand
Yeah, China in a nutshell.
Jackson Sip
Exactly.
Gregor Vand
Yeah. And today, here we are in 2025, what is the current status? Is it effectively the same, that it's off, or have you seen anything change?
Jackson Sip
Yeah. In regards to this specific technique, we haven't seen any activity. Right. That's not to say the GFW is turned off. Right. This is one small piece of the GFW and they're still deploying a number of other techniques. But the idea of passively observing and blocking fully encrypted traffic, we are not seeing any signs of today.
Gregor Vand
Looking at just speculation on how this was actually operating, even though as you've just touched on it, this specific part is probably turned off. But general GFW architecture and just a speculation on how that all works. What do you think it is?
Jackson Sip
Yeah. So from a technical perspective, there are a couple of different vantage points that they could have access to. Right. And the two that apply here are what we would refer to as an in-path censor or vantage point versus an on-path vantage point. An in-path vantage point sits quite literally in the path. It is one of the hops that your traffic takes between the two hosts. And what's special about that vantage point is that it has the ability to tamper with that connection. It could change the payload, or it could go so far as to just start dropping the packets. And because of the blocking technique that we observed in this research, we believe that they were using these in-path vantage points, because they were just dropping the packets afterwards. The other side of that is what's called an on-path vantage point. This is a less sophisticated or less valuable vantage point where they get a copy of all of your traffic and they can even send packets back to the hosts, but they aren't sitting in the path where they can manipulate and potentially drop that traffic. And it's an important distinction, because with those in-path vantage points, if something goes wrong, that could lead to Internet outages. Right. And you could have major issues there. So I believe those are potentially sensitive assets and they are very careful about what they allow to run on those. Most of the censorship that we observe ongoing tends to be more of an on-path attack.
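The difference between the two vantage points can be made concrete with a toy model. The Python sketch below is only an illustration of the distinction drawn above: `InPathBox`, `OnPathBox`, and the `b"FORGED-RESET"` marker are hypothetical names, and the injected packet is an assumption standing in for whatever an on-path device might send back to the hosts.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Packet = bytes


@dataclass
class InPathBox:
    """Sits on the forwarding path: whatever it does not forward is lost."""
    should_block: Callable[[Packet], bool]

    def forward(self, packets: List[Packet]) -> List[Packet]:
        return [p for p in packets if not self.should_block(p)]


@dataclass
class OnPathBox:
    """Receives a copy of traffic: cannot drop, but can inject extra packets."""
    should_block: Callable[[Packet], bool]
    injected: List[Packet] = field(default_factory=list)

    def forward(self, packets: List[Packet]) -> List[Packet]:
        for p in packets:
            if self.should_block(p):
                # Hypothetical injected packet sent toward the hosts.
                self.injected.append(b"FORGED-RESET")
        return list(packets)  # the originals still arrive
```

In this model a bug in `InPathBox.forward` loses real traffic, while a bug in `OnPathBox.forward` does not, which mirrors the outage risk mentioned above.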
Gregor Vand
And do you think there's any ML machine learning kind of going on there? Or it's kind of more of a static, I guess, implementation?
Jackson Sip
Yeah. So I think that specifically for what we observed, they had to have deterministic code that was running. They had if statements that were checking those bytes and checking that entropy, because ML inference is just too slow to be doing at line rate. However, it could be possible that they were using ML to try and determine some of those heuristics. Right. It's hard to say, though. We have confirmed that the GFW is using machine learning in other aspects. There's been a massive leak of information in the past couple of months coming out of a Chinese corporation called Geedge and a research lab called MESA. And we have found evidence in those repositories that they have been using machine learning to classify data and determine what traffic is coming from a sensor and not. However, I think that most of that research is being used in those active attacks. Right. Those attacks where they're making the connections. Because to try and do it passively at line rate is just unrealistic.
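The kind of deterministic "if statement" check described above might look roughly like the Python sketch below. The 3.4 and 4.6 bounds are the ones quoted in the episode summary; everything else is an assumption for illustration. In particular, `looks_like_tls` is a hypothetical stand-in for a protocol fingerprint exemption, the inclusive comparison at the bounds is a guess, and the other exemptions discussed in the episode are left out.

```python
def avg_popcount(payload: bytes) -> float:
    """Average number of 1 bits per byte in the payload."""
    if not payload:
        return 0.0
    ones = sum(bin(b).count("1") for b in payload)
    return ones / len(payload)


def looks_like_tls(payload: bytes) -> bool:
    """Assumed stand-in for a protocol fingerprint exemption."""
    return (
        len(payload) >= 3
        and payload[0] in (0x16, 0x17)
        and payload[1] == 0x03
    )


def would_block(payload: bytes, low: float = 3.4, high: float = 4.6) -> bool:
    """Block when the payload is not exempt and its bit density looks random."""
    if looks_like_tls(payload):
        return False
    return low <= avg_popcount(payload) <= high
```

Random or encrypted bytes average about four set bits per byte, so they fall inside the window, while most plaintext protocols sit outside it or match an exemption. The check is a handful of integer operations per packet, which fits the point about needing to run at line rate.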
Gregor Vand
Gotcha. So something that kind of comes up in my line of work is, again, I sit in Singapore, and so I'm working currently with products that do get used within China, and people sometimes come to me and say, oh, do you think there's going to be a problem with this team using the product within China? When I say problem, I mean a performance or just access problem. I give a very, at this point, layman answer. The last time I went into mainland China is probably 10 years ago at this point. I used to go quite often and then just stopped. But my understanding is always just that there is this massive speed issue, massive latency around traffic that goes in and out of the perimeter, if you want to call it that. And that's usually my, again, standard piece of advice, just if this is not something that has infra based within the GFW. So that's usually why people strike up deals. I believe Cloudflare has a product that's kind of like a deal where they've had to JV, joint venture, with a Chinese partner. So that's a way to enable your product to get over this latency. I think you sent me a little piece on this before we chatted today. It's called the Great Bottleneck. So yeah, maybe could you just talk to us about that? Because I think I would also love to learn a little bit more around what this actually is, rather than me just saying, oh, there's just a latency problem and that's what you should assume.
Jackson Sip
Sure, yeah. We've got to get more creative with our naming.
Gregor Vand
I mean, the GB something. Yeah, exactly.
Jackson Sip
Yeah, yeah, exactly. It's always got to be the great.
Gregor Vand
Yeah.
Jackson Sip
But yeah, so this has been an issue for, it seems like, a long time. And it's an issue that affects, you know, more people than just the citizens of China who are trying to access things they aren't supposed to. The Great Bottleneck is this widespread idea that traffic going outside of China is going to be slow or it's going to have some type of network issue. Right. And the research that you're referring to, I think, found that it was primarily on the download side. So when we are pulling data from outside of the country in, we have an issue. The upload doesn't seem to be as big of an issue. I think that this primarily isn't necessarily a censorship technique. I think that if it was a result of the censorship that we've observed, if they didn't have the compute resources to effectively censor this traffic, we would see it more on that upload side. When that HTTP request gets made, that's when you would see the slowdown, because that's the traffic that you are trying to observe for potential censorship. So I think it is truly just a lack of international infrastructure in China. I think the reason that they are doing that is because they have this goal of creating this isolated ecosystem, right? They want their users or their citizens to primarily be accessing Chinese-based sources, right? Chinese tools, Chinese services, what have you. And so I think that even though they could probably afford to spend the money to improve that international infrastructure, build some more undersea cables or what have you, they are choosing not to, almost as a way to incentivize both companies and users to use national services. And so yeah, I don't think it's necessarily malicious, but it's more pushing towards this goal of isolation.
Gregor Vand
Yeah, the cloud providers are sort of another proxy battle, I guess, where you've got, you know, Alibaba Cloud and others within China, and then obviously the big three that we know in the US and the rest of the world. So I think that's an interesting take on it, which is, yeah, this probably has as much of a commercial leaning as it does anything else. I think that's another thing where people go into China, and I asked a friend who goes more often than I do, I said, oh, my family member is going to China, what should I advise them? And he just said, oh, you must download this specific app. Yes, it's half in Chinese, but this is how you'll be able to do stuff like call a cab. And I mean, maybe it seems obvious, you go to another country and, okay, like in Singapore, say, hey, get the Grab app because Uber is not a thing here. But this is a whole extra layer, which is that you're going to be able to achieve nothing unless you actually download this super app that has maps and transport and food and all this kind of stuff. Everything else just isn't going to help you out, basically.
Jackson Sip
I think you're exactly right. This isn't something that's unique to China. I mean, and that sort of super app seems an extreme example. But just like you said, there tend to be ride sharing apps that exist in certain regional areas or Even like the US trying to block TikTok unless it gets sold to an American entity is really the same thing. Right. It's this kind of idea of we are going to prioritize national services before international services.
Gregor Vand
Yeah, absolutely. And yeah, I mean, just a carve-out with this is that Hong Kong still doesn't really experience this bottleneck effect. At least that was my experience when I was last there, I think last year. There is no GFW per se within Hong Kong. And is it used as any kind of proxy or something? I think we were touching on this before we started recording.
Jackson Sip
Yeah. So there is sort of this idea that Hong Kong tends to have a freer Internet in our perspective of things. I would definitely defer to you on any lived experience, but we do see that a number of proxies will kind of use that as their first hop. And there's even some speculation on if there are gray market links that exist between China and Hong Kong specifically for the transit of proxy traffic that's ultimately going to go other places. Some people even believe that a lot of the proxy traffic exists simply as a way to get around the bottleneck. Right. They might not be trying to access undesirable content. They just have some type of maybe business need to get outside of the country. They aren't at the enterprise level where they can afford these contracts with a Cloudflare and so they use these proxies as an alternative means.
Gregor Vand
Yeah. And this crosses into the realms of dark fiber, I think.
Jackson Sip
Yeah.
Gregor Vand
So I mean my kind of anecdote on this is just like I could be sitting in a Starbucks here and I'm overhearing conversation of someone who's clearly come from Hong Kong to sell a dark fiber contract to someone in Singapore. Could you maybe just touch on that briefly? How does that cross into this?
Jackson Sip
I'm not much of an expert on that. I've only heard, like you said, these anecdotal experiences of somebody having access to some type of link that goes from mainland China to Hong Kong, and then they're selling access to that to people in China. But I would love to know more.
Gregor Vand
This looked. I mean, at least to me, it kind of looked legit. But equally, I don't know, why is it being held in a Starbucks and not in someone's office? Maybe that's all we need to know. If we sort of. Then just look at any other countries that you at least believe may have adopted these techniques. I mean, maybe there's some obvious ones that people are already thinking of. But just from where you sit, which other places have you maybe seen adopt the GFW approach?
Jackson Sip
Yeah. So there are a couple of different things to consider there. Specifically for the GFW approach, I think it's pertinent to get back to that leak that I was referring to. And you may have heard about this. It hit some of the mainstream news sources, in that this massive trove of data got out of this Chinese corporation called Geedge. And this corporation's goal is effectively to commercialize censorship software and hardware and then distribute it to other nations. And in that leak, they found that Kazakhstan, Ethiopia, Pakistan, and Myanmar are all confirmed clients or customers of this organization, that is, Geedge. And so we know that they to some extent have the capabilities of the GFW through these business dealings. However, those aren't the countries that we typically think of, right? When we think of other nations that are doing censorship, I think one of the big ones that comes to mind is Iran. We don't know if there's any relationship between Iran and China's censorship infrastructure. They all tend to have their own sort of character, if you will. Iran is known to be very dynamic. They are constantly changing what's getting censored, how it's getting censored, where it's getting censored. And two neighboring ISPs in Iran can have completely different censorship experiences. And then another one is Russia. I mean, Russia, again, has its own interesting culture where they have more of a capitalist spin, if you will, on their infrastructure. Right. In China, you kind of have two or three major ISPs. They tend to have heavy involvement from the government. And so deploying that censorship infrastructure is very much a state activity. But when you look at Russia, the censorship is basically the state saying to all of these independent ISPs, hey, look, you're going to block this list of sites. We don't care how you do it. Figure it out. Right. 
And so Russia has its own way of doing it, where you see these weird characteristics where one ISP will look different from another and they seem to be doing almost this bare minimum approach. Right. Where it's very easy to get around and it's very patchwork. So, yeah, it's fascinating how much of a shift you see in the censorship landscape between all of these different countries.
Gregor Vand
Yeah, for sure. And I guess just looking ahead, if you want to call it that, any predictions where things will go? I mean, we've touched on a lot of the whole idea here is that things just seem to keep changing and there might not be any rhyme or reason, at least from the outside, as to why. But yeah, just if you were to kind of think ahead, anything that you suspect might change or happen, it's hard to say.
Jackson Sip
It is a dynamic landscape. But I do believe that censorship is going to become a little easier for the censors and it's going to become harder to circumvent. We're seeing, especially in China, this shift of middleboxes closer and closer to the end user, where due to a scarcity of IP addresses, they're implementing these NATs to where a local or residential address won't have a dedicated IP. Right. And so as we see more and more middlebox infrastructure getting closer and closer to that end user, it will become easier for them to deploy more of a distributed censorship network that can implement more advanced techniques because it's taking on less.
Gregor Vand
So just as we sort of start to wrap up here, what about your own research? Are you doing more into this kind of going forwards or are you kind of leaving this here and moving on to like a different part of the censorship landscape or. Yeah. Where are you taking things from here?
Jackson Sip
Yeah, so I've been looking extensively into that leaked information I referred to. That's a relatively new finding for us. And so we've been looking at that to sort of be that ground truth that I said doesn't exist, right, to confirm some of these statements that we have made in all of the research we've done so far, as well as trying to understand better what role machine learning can play in this space. So a lot of my research is focused around that network tap here at CU and having access to this massive corpus of Internet traffic that is real, and finding ways that we can better classify that traffic or even generate synthetic traffic is the direction I'm headed in next.
Gregor Vand
Awesome. Well, that all sounds fascinating. I hope maybe we get to catch up in a couple of years, hear what you've been up to since then, and just want to really thank you for your time, Jackson. And I've learned a ton today. I'm sure the audience has as well. So obviously, you're doing great work. It's very important that this information is understood, at least outside of China. So, yeah, thanks so much.
Jackson Sip
Absolutely. Thank you so much for having me.
Podcast: Software Engineering Daily
Date: February 19, 2026
Guest: Jackson Sip (PhD Researcher, University of Colorado Boulder)
Host: Gregor Vand
This episode explores the architecture, evolution, and recent breakthroughs in understanding China’s Great Firewall (GFW)—one of the world’s most sophisticated systems for internet censorship. Jackson Sippe, a leading researcher in internet censorship, joins host Gregor Vand to demystify how the GFW detects and blocks traffic, particularly focusing on a novel technique that disrupted encrypted proxies from 2021–2023. The conversation also delves into the cat-and-mouse landscape of circumvention tools, technical and political dynamics, and the global export of censorship technology.
Memorable Quote:
"One of my favorite facets of the GFW is a tool called the Great Cannon..."
— Jackson Sip (03:36)
Quote:
"We can... only speculate. Right. It takes a number of experiments to really determine whether or not what we have observed is what we think it is or just some other effect of the network."
— Jackson Sip (05:57)
Quote:
"Users all of a sudden just couldn't access their proxies. And it wasn't just one particular implementation, but it was widespread."
— Jackson Sip (09:38)
Key Segment:
"If that value shows up between 3.4 and 4.6, you can say, okay, this is clearly an encrypted payload and we're going to block it."
— Jackson Sip (21:36)
Quote:
"We found that the large majority of the traffic that would have gotten blocked here at CU... belonged to torrent services."
— Jackson Sip (25:37)
Notable Moment:
"Can we just add those four bytes to the start of a fully encrypted or fully random payload? The answer was yes. And we were like, whoa."
— Jackson Sip (31:50)
Quote:
"It's... one of those parts where we can speculate... but the facts are we don't know."
— Jackson Sip (39:22)
Quote:
"This isn't something that's unique to China... US trying to block TikTok unless it gets sold to an American entity is really the same thing."
— Jackson Sip (50:20)
| Timestamp | Segment |
|-----------|---------|
| 02:59 | What is the GFW? |
| 06:47 | The November 2021 blocking event explained |
| 11:37 | Research setup and challenges |
| 19:16 | Popcount blocking algorithm detailed |
| 22:13 | ASCII/protocol fingerprint exemptions |
| 24:30 | False positive analysis |
| 27:05 | Circumvention: Popcount manipulation and protocol masking |
| 31:36 | Quick fix: TLS header pre-pending |
| 39:02 | Political context: Why was dynamic blocking turned off? |
| 47:01 | The Great Bottleneck (international latency in China) |
| 50:39 | Hong Kong and the bottleneck bypass |
| 52:32 | Export of censorship tech to other countries |
| 55:16 | Future trends in censorship and circumvention |
This episode offers a rare, technical, and accessible window into how China’s Great Firewall operates and adapts. It highlights significant advances in reverse-engineering national censorship tools, the rapid evolution of circumvention tactics, and the broader implications of digital control systems—both in China and globally. For software engineers, policymakers, or anyone interested in internet freedom, Jackson Sippe’s insights provide a timely, inside look at an ever-changing digital frontier.