Automatic Parallelization
of Software Network Functions
Maestro is a tool that analyzes a sequential implementation of a Software Network Function (NF) and automatically generates an enhanced parallel version that carefully configures the NIC's RSS mechanism to distribute traffic across cores while preserving semantics. When possible, Maestro orchestrates a shared-nothing architecture in which each core operates independently, without shared-memory coordination, maximizing performance. Otherwise, Maestro choreographs a fine-grained read-write locking mechanism that optimizes operation for typical Internet traffic.
We analyzed and parallelized 9 NFs: a simple forwarder (NOP), a static bridge, a dynamic MAC-learning bridge, a policer, a firewall (FW), a network address translator (NAT), a connection limiter (CL), a port scan detector (PSD), and a load balancer (LB). We present here our findings on each NF, along with example RSS configurations found by Maestro that enable shared-nothing parallel solutions (when available).
NOP
The NOP NF keeps no state between packets, limiting its function to just forwarding them between ports. As such, Maestro finds that a shared-nothing parallel implementation is trivially obtained by generating random RSS keys.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0x9a, 0x21, 0xdd, 0x85, 0x4f, 0x59, 0x4e, 0xe2, 0x5c, 0xd2, 0xf7, 0x47, 0xfb, 0xe3, 0x07, 0x51, 0x80, 0x5b, 0x2f, 0x56, 0x44, 0x9b, 0x75, 0x8d, 0x44, 0x88, 0xd9, 0xe9, 0x79, 0x27, 0x63, 0x14, 0x48, 0x40, 0x99, 0x97, 0x9a, 0xe7, 0x79, 0xf6, 0xba, 0x71, 0x3d, 0xb5, 0x54, 0x44, 0x06, 0xd4, 0xa0, 0x35, 0x2a, 0xe4 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x63, 0x8b, 0x27, 0x9f, 0xdd, 0x8e, 0x4e, 0x87, 0xd1, 0xdc, 0x85, 0x20, 0xfa, 0xcf, 0x2c, 0xc4, 0x33, 0x61, 0x23, 0x03, 0xaf, 0xb7, 0x0f, 0x40, 0xcf, 0xf9, 0xd5, 0x7f, 0x0b, 0x83, 0xc8, 0x6e, 0x0f, 0xef, 0x0d, 0xec, 0x7d, 0x5b, 0x73, 0x4e, 0x38, 0xf8, 0x6e, 0x32, 0xc7, 0x9a, 0xf7, 0xfb, 0xfb, 0x1a, 0xfe, 0xaa };
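RSS-capable NICs typically spread flows across queues using a Toeplitz hash of selected packet fields, keyed by byte arrays like the ones above. The sketch below is an illustrative reimplementation of that hash, not Maestro's code; the name `toeplitz_hash` is our own, and it assumes the RSS-selected fields have already been concatenated into `data`.

```c
#include <stdint.h>
#include <stddef.h>

#define RSS_HASH_KEY_LENGTH 52

/* Toeplitz hash over `data`, the concatenation of the packet fields
 * selected by the RSS configuration (e.g., IP addresses and ports).
 * Requires len * 8 + 32 <= RSS_HASH_KEY_LENGTH * 8. */
uint32_t toeplitz_hash(const uint8_t key[RSS_HASH_KEY_LENGTH],
                       const uint8_t *data, size_t len) {
    uint32_t hash = 0;
    /* 32-bit window sliding over the key, one bit per input bit. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  | (uint32_t)key[3];
    size_t next_bit = 32; /* next key bit to shift into the window */
    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window; /* input bit set: accumulate the window */
            window <<= 1;
            if (key[next_bit / 8] & (0x80u >> (next_bit % 8)))
                window |= 1;
            next_bit++;
        }
    }
    return hash;
}
```

The NIC uses the low-order bits of this hash to index an indirection table that picks the receive queue (and thus the core), so a random key spreads flows effectively uniformly, which is all a stateless NF like NOP needs.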
Static bridge
The static bridge is configured with a list of MAC addresses associated with ports. For every packet, it looks up the destination MAC address in its table and forwards the packet to the corresponding port. If it fails to find the MAC address in its persistent state, it broadcasts the packet.
Maestro finds that the state that persists across packets is always read and never modified. This makes the NF easy to parallelize, as there is no need for careful sharding of state. Maestro detects this and generates a random RSS configuration, enabling shared-nothing parallelization.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0x7e, 0x08, 0xb8, 0x0d, 0x5d, 0x90, 0xdb, 0x54, 0xad, 0xe4, 0x35, 0xd6, 0x17, 0x1d, 0xaa, 0x3d, 0x71, 0xf7, 0x53, 0x22, 0x70, 0x8d, 0x96, 0x5d, 0xe2, 0xf4, 0xb5, 0x6f, 0xe5, 0xd8, 0x7c, 0x63, 0xe1, 0x35, 0x71, 0x3e, 0xc5, 0x4c, 0x92, 0x72, 0x30, 0xc8, 0x48, 0x47, 0xe5, 0xf3, 0x84, 0x56, 0xea, 0xd7, 0x78, 0x5a }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x2e, 0xa9, 0xd0, 0xbd, 0x54, 0xc8, 0xa9, 0xb4, 0x12, 0x6c, 0x20, 0x8c, 0xe6, 0x12, 0x8f, 0x3a, 0xbd, 0xd4, 0x34, 0x9b, 0x9a, 0x28, 0x88, 0x6f, 0xbd, 0x3a, 0xbf, 0x93, 0x12, 0xa8, 0xa5, 0x40, 0x51, 0x75, 0xfd, 0xa5, 0x3e, 0xa7, 0x5a, 0x50, 0x13, 0x7a, 0xdc, 0xf9, 0x8d, 0x6b, 0x33, 0x4a, 0x3f, 0x67, 0xe6, 0xd9 };
Dynamic bridge
This NF is very similar to the static bridge, with a key difference: it learns new MAC addresses from incoming packets, and associates them with the port that received the packet.
Due to RSS limitations (specifically, the fact that it cannot be configured to hash on MAC addresses), Maestro deems this NF unsuitable for a shared-nothing parallel model and generates a lock-based implementation instead.
Shared-nothing solution: none.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration: none.
Policer
The policer rate-limits incoming traffic by limiting each user's download rate, identifying users by their IPv4 addresses. When Maestro analyzes this NF, it finds that state is indexed by the destination IP address, implying that packets with the same destination address must be sent to the same core. Because this constraint involves the destination IP address, the chosen RSS packet field option must include that field. Although DPDK allows RSS packet field options containing only IP addresses, our NICs do not support this option, so Maestro chooses a packet field option that includes both IP addresses and TCP/UDP ports.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0x37, 0xd4, 0x9c, 0xa3, 0x66, 0x85, 0x6a, 0x95, 0x1b, 0x0a, 0xd1, 0x7b, 0xe2, 0xd2, 0x92, 0x39, 0xe1, 0x8f, 0x59, 0xd6, 0x9e, 0x21, 0x20, 0x67, 0x2b, 0x6e, 0xdb, 0x15, 0xb3, 0x8b, 0xfa, 0xea, 0x5f, 0x97, 0x8d, 0xc5, 0x1c, 0xf8, 0x5b, 0x38, 0x02, 0x2c, 0xb3, 0xe4, 0xfe, 0x45, 0x1d, 0xdf, 0xd4, 0x77, 0xb6, 0x72 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x8b, 0x8d, 0x5f, 0xa6, 0xca, 0x43, 0x99, 0xc2, 0x32, 0xd2, 0x53, 0xc7, 0x5b, 0x94, 0xd5, 0xa6, 0x13, 0x66, 0x94, 0xbc, 0xa5, 0xc3, 0x21, 0xcc, 0x0c, 0x7c, 0x95, 0x10, 0xf9, 0x0a, 0xbb, 0x85, 0x97, 0x1a, 0x2b, 0x61 };
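To illustrate the per-user policing this NF performs, here is a minimal token-bucket sketch of the per-packet decision, with one bucket per destination IP. The structure, rates, and names are hypothetical stand-ins, not Maestro's or the original NF's actual code.

```c
#include <stdint.h>

/* Hypothetical policing parameters (made up for illustration). */
#define RATE_BPS   1000000ull /* refill rate: 1 Mbit/s, in bits */
#define BURST_BITS 100000ull  /* bucket capacity, in bits */

/* One bucket per user (destination IP) in the policer's map. */
struct bucket { uint64_t tokens; uint64_t last_ns; };

/* Returns 1 if the packet may be forwarded, 0 if the user is over rate. */
static int policer_allow(struct bucket *b, uint64_t now_ns, uint64_t pkt_bits) {
    /* Refill tokens for the time elapsed since the last packet. */
    uint64_t refill = (now_ns - b->last_ns) * RATE_BPS / 1000000000ull;
    b->tokens += refill;
    if (b->tokens > BURST_BITS) b->tokens = BURST_BITS;
    b->last_ns = now_ns;
    if (b->tokens < pkt_bits) return 0; /* over the rate: drop */
    b->tokens -= pkt_bits;
    return 1;
}
```

Since each bucket is touched only by packets sharing a destination IP, sharding the map by that address lets each core run this logic with no cross-core coordination.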
Firewall
Our firewall only forwards packets from the WAN that correspond to flows started in the LAN. To keep track of ongoing flows, it stores flow information in a map. Packets from the WAN look up flow information symmetrically relative to packets from the LAN, swapping source and destination fields.
The firewall indexes state with typical flow information on the LAN side (source and destination addresses and ports), and symmetrically on the WAN side. Maestro generates a shared-nothing implementation that shards state by flow, sending each WAN packet to the same core that handles its symmetric LAN session.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0xa1, 0x24, 0x00, 0x15, 0x00, 0x14, 0xa1, 0x24, 0xa1, 0x24, 0x00, 0x14, 0xa1, 0x24, 0x00, 0x15, 0xa7, 0xfa, 0x11, 0x22, 0x6f, 0xd3, 0xf0, 0x42, 0x1b, 0x6c, 0xeb, 0x14, 0x62, 0x02, 0xa3, 0x44, 0x24, 0x90, 0xf8, 0x1c, 0x43, 0x99, 0xe7, 0xaf, 0x80, 0x73, 0x15, 0xfe, 0x29, 0x5a, 0x73, 0xd0, 0x55, 0x85, 0xf2, 0xc4 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x00, 0x14, 0xa1, 0x24, 0xa1, 0x24, 0x00, 0x15, 0x00, 0x14, 0xa1, 0x24, 0x00, 0x14, 0xa1, 0x24, 0x6a, 0xe3, 0xac, 0x86, 0x3e, 0xcb, 0x7e, 0x73, 0x83, 0x15, 0xcb, 0x75, 0xc4, 0x73, 0x2c, 0xda, 0xdb, 0x05, 0x31, 0x46, 0xdb, 0xd4, 0x76, 0x5a, 0xa8, 0x20, 0x9d, 0x0a, 0x44, 0x7a, 0xc6, 0xae, 0x5d, 0x72, 0x34, 0x9c };
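The symmetric WAN lookup described above can be sketched as follows. The key layout and helper names are our own illustration (the protocol field is omitted for brevity), not the firewall's actual data structures.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flow key: a 4-tuple identifying a session. */
struct flow { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

/* WAN packets look up LAN-initiated flows with src/dst swapped. */
static struct flow symmetric(struct flow f) {
    struct flow r = { f.dst_ip, f.src_ip, f.dst_port, f.src_port };
    return r;
}

static bool flow_eq(struct flow a, struct flow b) {
    return a.src_ip == b.src_ip && a.dst_ip == b.dst_ip &&
           a.src_port == b.src_port && a.dst_port == b.dst_port;
}
```

A shared-nothing version then needs RSS keys that hash a WAN packet's tuple and its LAN mirror image to the same value, which is exactly the constraint Maestro encodes when deriving the LAN/WAN key pair above.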
NAT
A NAT translates addresses between a LAN and a WAN, allowing multiple clients in the LAN to share a single public IP in the WAN. It keeps track of flows initiated in the LAN and, to aid with translation, associates a unique external port with each flow. Reply packets from the WAN are checked to see whether their address and port match those on record before the NAT translates the destination address and port to match those of the client.
Maestro notices that the NAT associates flows with external ports using a map. However, it also finds that packets from the WAN are only translated if they target the hosts that started the session in the first place. This constraint allows for sharding based on the external server’s IP address and port.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x38, 0x0f, 0x46, 0xb6, 0x9a, 0x03, 0x52, 0xc8, 0xe4, 0xbe, 0x1a, 0xa9, 0x6f, 0xd8, 0xe8, 0x0b, 0x7b, 0xb7, 0x09, 0x0c, 0x1f, 0x13, 0xa8, 0xe1, 0xd7, 0x7a, 0x3b, 0x8c, 0xe3, 0x58, 0xd7, 0x1b, 0x67, 0x1d, 0xd1, 0x02 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7f, 0x0d, 0x25, 0x0c, 0x35, 0x55, 0x19, 0x60, 0x8a, 0xec, 0x67, 0xae, 0x3e, 0xc2, 0xba, 0xcc, 0x20, 0x48, 0x83, 0x6c, 0xfa, 0x6f, 0x63, 0x39, 0x50, 0xf2, 0x2c, 0x97, 0xd1, 0x17, 0x67, 0x50, 0x25, 0x8c, 0x5c, 0x5a };
CL
A Connection Limiter (CL) aims to limit how many connections any single client (source IP) can make to any single server (destination IP) over a wider time frame (e.g. several days). Given the longer time frames involved, this NF uses a memory-efficient count-min sketch to estimate the connection count from each client to each server. For new connections, the source and destination IPs are used to index the sketch, indexing a configurable number of entries based on different hashes (5 by default in our case). If all entries surpass the connection limit, the packet is dropped, preventing the new connection. Otherwise, each entry is incremented.
Maestro finds two different access patterns: the 5-tuple indexes a connection tracking map, while the source and destination IPs index the sketch. The latter constraint subsumes the former and Maestro shards based on source and destination IPs.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0xce, 0xc0, 0x01, 0x20, 0x00, 0xb0, 0x10, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0xd1, 0xa8, 0x02, 0xb2, 0x4b, 0x19, 0x40, 0x6c, 0x48, 0xd4, 0x5e, 0x2a, 0x6f, 0x34, 0x7b, 0x3d, 0xf6, 0x00, 0xee, 0xf5, 0xb9, 0x26, 0x6d, 0xdc, 0xa5, 0x96, 0x58, 0xfc, 0xd0, 0x94, 0x9f, 0xa2, 0x3c, 0xa1, 0x54, 0x87 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x16, 0x3a, 0x73, 0xfa, 0xc8, 0xc5, 0xb5, 0x64, 0xab, 0x81, 0xe0, 0x5f, 0xf3, 0x9d, 0x41, 0x96, 0xe9, 0xa1, 0x4d, 0x86, 0x13, 0xb6, 0x74, 0x86, 0x1e, 0x13, 0xe8, 0xbc, 0x86, 0x9e, 0xf1, 0x9c, 0xd9, 0x64, 0x96, 0xa1, 0x29, 0x4b, 0x06, 0xd5, 0xcc, 0xe6, 0x34, 0xbf, 0x83, 0x75, 0x55, 0x6d, 0x16, 0xa2, 0xf3, 0x2a };
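The count-min logic described above can be sketched as follows. The per-row hash, table sizes, and limit are illustrative stand-ins (only the default of 5 hash rows is taken from the text), not the CL's actual implementation.

```c
#include <stdint.h>

#define CMS_ROWS 5    /* number of hashes (the NF's default) */
#define CMS_COLS 1024 /* counters per row (illustrative) */
#define LIMIT    100  /* hypothetical connection limit */

static uint32_t counters[CMS_ROWS][CMS_COLS];

/* Simple FNV-style per-row mixing of (src_ip, dst_ip); illustrative only. */
static uint32_t cms_hash(int row, uint32_t src, uint32_t dst) {
    uint32_t h = 2166136261u ^ (uint32_t)row * 16777619u;
    h = (h ^ src) * 16777619u;
    h = (h ^ dst) * 16777619u;
    return h % CMS_COLS;
}

/* New connection from src to dst: returns 0 (drop) if every indexed
 * counter already surpasses the limit, else increments all and returns 1. */
static int cms_admit(uint32_t src, uint32_t dst) {
    uint32_t idx[CMS_ROWS];
    int over = 1;
    for (int r = 0; r < CMS_ROWS; r++) {
        idx[r] = cms_hash(r, src, dst);
        if (counters[r][idx[r]] < LIMIT) over = 0;
    }
    if (over) return 0;
    for (int r = 0; r < CMS_ROWS; r++) counters[r][idx[r]]++;
    return 1;
}
```

Because every sketch access is keyed only by the (source IP, destination IP) pair, sending all packets sharing that pair to one core gives each core a private sketch, which is the sharding Maestro derives.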
PSD
A Port Scan Detector (PSD) counts how many distinct destination TCP/UDP ports each host (source IP) has touched within a given time frame. Above a threshold, connections to new ports are blocked, preventing port scans.
Maestro analyzes the PSD and finds that it uses only the source IP to access one map, but also the source IP and destination port to access another. As such, the constraints for accessing the first map subsume those of the second and Maestro finds an RSS key that shards based only on source IPs.
Shared-nothing solution: code.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration:
uint8_t LAN[RSS_HASH_KEY_LENGTH] = { 0x30, 0x34, 0x10, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x5e, 0xe6, 0x01, 0x10, 0x26, 0x03, 0x93, 0x28, 0x86, 0xb8, 0x09, 0x3c, 0x98, 0xfa, 0xe2, 0xbc, 0x0e, 0xad, 0xe6, 0x06, 0xeb, 0xdf, 0x19, 0x2f, 0x19, 0x90, 0xae, 0x0f, 0x44, 0xac, 0x00, 0xa3, 0x92, 0x01, 0xb3, 0xb8 }; uint8_t WAN[RSS_HASH_KEY_LENGTH] = { 0x7b, 0x4a, 0x11, 0x21, 0x0b, 0x92, 0x62, 0x43, 0x65, 0xb4, 0x62, 0x94, 0x02, 0x3c, 0x12, 0x73, 0x83, 0x66, 0x5c, 0xba, 0x24, 0x29, 0x78, 0x7c, 0xee, 0x08, 0x67, 0x7a, 0x4d, 0xd6, 0xc2, 0xc9, 0x20, 0xd3, 0xea, 0x2b, 0x66, 0x4c, 0x6e, 0xcb, 0x01, 0xd0, 0x5f, 0x03, 0x0c, 0x71, 0x77, 0x8f, 0xd7, 0xd3, 0x49, 0xfb };
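The PSD's two access patterns can be sketched as below, with a tiny linear-scan table standing in for the NF's real hash maps; the threshold, sizes, and names are made up for illustration.

```c
#include <stdint.h>

#define MAX_ENTRIES    1024
#define PORT_THRESHOLD 64 /* hypothetical scan-detection threshold */

/* (src_ip, dst_port) pairs already seen; stands in for the second map. */
struct seen { uint32_t src_ip; uint16_t port; };
static struct seen seen_tbl[MAX_ENTRIES];
static int n_seen;

/* Distinct ports touched by src_ip; stands in for the first map. */
static int port_count(uint32_t src_ip) {
    int c = 0;
    for (int i = 0; i < n_seen; i++)
        if (seen_tbl[i].src_ip == src_ip) c++;
    return c;
}

/* Returns 1 to forward, 0 to drop (scan suspected). */
static int psd_process(uint32_t src_ip, uint16_t dst_port) {
    for (int i = 0; i < n_seen; i++)
        if (seen_tbl[i].src_ip == src_ip && seen_tbl[i].port == dst_port)
            return 1; /* already-seen port: forward */
    if (port_count(src_ip) >= PORT_THRESHOLD)
        return 0;     /* too many distinct ports: block new ones */
    seen_tbl[n_seen].src_ip = src_ip;
    seen_tbl[n_seen].port = dst_port;
    n_seen++;
    return 1;
}
```

Both lookups are keyed by the source IP (the second additionally by port), so sharding on source IP alone keeps every access local to one core.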
LB
LB is a Maglev-like load balancer. Its main goal is to distribute traffic coming from the WAN to a series of identical servers on the LAN. LB registers new servers when it receives their packets coming from the LAN, and matches packets coming from the WAN with previously registered servers, keeping track of flows to ensure the same server handles packets from the same flow.
To maintain semantic equivalence between a shared-nothing parallel implementation and the sequential one, packets that find an available server in the sequential implementation must also find it available in the parallel one. This ultimately means that all cores would need to have all backends registered in their local state. In a shared-nothing implementation, however, a packet arriving from the LAN can register its server on only a single core, preventing packets that arrive at other cores from seeing it.
With this limitation in mind, multiple cores cannot hold an identical set of backend servers without coordination, precluding a shared-nothing model. Maestro's analysis detects this issue and, lacking a better alternative, issues a warning and opts for a read-write lock based approach.
Shared-nothing solution: none.
Lock-based solution: code.
TM-based solution: code.
Shared-nothing RSS configuration: none.
We evaluated Maestro extensively, subjecting its automatically generated NFs to different traffic patterns, packet sizes, and churn intensities. We also evaluated Maestro against the widely used VPP framework.
We present here the scalability study under uniform traffic with minimum-sized packets. For details regarding the other experiments, please refer to our paper. Our goal is to understand how Maestro's parallel NFs scale with the number of cores.

Key takeaways:
- All NFs amenable to shared-nothing parallelization scale linearly until bottlenecked by the PCIe bus, an ideal outcome.
- The lock-based implementations still scale fairly well but more slowly than their shared-nothing counterparts, not always reaching the PCIe bottleneck with 16 cores.
- Lock-based solutions are a great alternative when shared-nothing implementations are not available (notice the dynamic bridge and the load balancer).
- For simpler NFs, parallel implementations using transactional memory perform quite well, scaling linearly with the number of cores, though still operating more slowly than both shared-nothing and lock-based alternatives.
- For more complex NFs, TM performs abysmally, as the likelihood of a transaction aborting increases.
Francisco Pereira, Fernando M. V. Ramos, and Luis Pedrosa. 2024. Automatic Parallelization of Software Network Functions. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association, USA.
@inproceedings{pereira2024maestro,
  author    = {Francisco Pereira and Fernando M. V. Ramos and Luis Pedrosa},
  title     = {Automatic Parallelization of Software Network Functions},
  booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
  year      = {2024},
  address   = {Santa Clara, CA},
  url       = {https://www.usenix.org/conference/nsdi24/presentation/pereira},
  publisher = {USENIX Association},
  month     = apr
}
Watch our talk at NSDI '24, on April 18!
All code and benchmarks used in the paper are available on GitHub. This includes:
- Instructions on how to build and run Maestro (check the README).
- The original source code of each NF (here).
- All of Maestro's automatically generated NFs, both sequential and all the different parallel implementations for each NF (here).
- Scripts to generate the pcaps we used to evaluate Maestro.
- Scripts to generate the plots of the paper.
- Scripts to replicate our testing methodology.
The scripts we used to generate our pcaps can be found here.
Maestro builds on top of the Vigor framework, using the KLEE symbolic execution engine and the Z3 theorem prover.
Maestro development was partially supported by the European Union (ACES project, 101093126), INESC-ID (via UIDB/50021/2020), and the SALAD-Nets CMU-Portugal/FCT project (2022.15622.CMU). Francisco Pereira was supported by the FCT scholarship PRT/BD/152195/2021.



© 2024, Maestro Authors