pf: Limit the maximum number of fragments per packet
Similar to the network stack issue fixed in r337782 pf did not limit the number
of fragments per packet, which could be exploited to generate high CPU loads
with a crafted series of packets.
Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.
This addresses the issue for both IPv4 and IPv6.
Security: CVE-2018-5391
Sponsored by: Klara Systems
pf: Fix deadlock with route-to
If a locally generated packet is routed (with route-to/reply-to/dup-to) out of
a different interface it's passed through the firewall again. This meant we
lost the inp pointer and if we required the pointer (e.g. for user ID matching)
we'd deadlock trying to acquire an inp lock we've already got.
Pass the inp pointer along with pf_route()/pf_route6().
PR: 228782
pf: Improve ioctl validation
Ensure that multiplications for memory allocations cannot overflow, and
that we'll not try to allocate M_WAITOK for potentially overly large
allocations.
pf: Improve ioctl validation for DIOCRGETTABLES, DIOCRGETTSTATS, DIOCRCLRTSTATS and DIOCRSETTFLAGS
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.
Limit the allocation to required size (or the user allocation, if that's
smaller). That does mean we need to do the allocation with the rules
lock held (so the number doesn't change while we're doing this), so it
can't M_WAITOK.
pf: Improve ioctl validation for DIOCIGETIFACES and DIOCXCOMMIT
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.
There's no obvious limit to the request size for these, so we limit the
requests to something which won't overflow. Change the memory allocation
to M_NOWAIT so excessive requests will fail rather than stall forever.
pf: Improve ioctl validation for DIOCRADDTABLES and DIOCRDELTABLES
The DIOCRADDTABLES and DIOCRDELTABLES ioctls can process a number of
tables at a time, and as such try to allocate <number of tables> *
sizeof(struct pfr_table). This multiplication can overflow. Thanks to
mallocarray() this is not exploitable, but an overflow does panic the
system.
Arbitrarily limit this to 65535 tables. pfctl only ever processes one
table at a time, so it presents no issues there.
pf: Fix memory leak in DIOCRADDTABLES
If a user attempts to add two tables with the same name the duplicate table
will not be added, but we forgot to free the duplicate table, leaking memory.
Ensure we free the duplicate table in the error path.
Reported by: Coverity
CID: 1382111
Do not try to reassemble IPv6 fragments in "reass" rule.
ip_reass() expects IPv4 packet and will just corrupt any IPv6 packets
that it gets. Until proper IPv6 fragments handling function will be
implemented, pass IPv6 packets to next rule.
PR: 170604
pf: Cope with overly large net.pf.states_hashsize
If the user configures a states_hashsize or source_nodes_hashsize value we may
not have enough memory to allocate this. This used to lock up pf, because these
allocations used M_WAITOK.
Cope with this by attempting the allocation with M_NOWAIT and falling back to
the default sizes (with M_WAITOK) if these fail.
PR: 209475
Submitted by: Fehmi Noyan Isi <fnoyanisi AT yahoo.com>
pf: Avoid integer overflow issues by using mallocarray() iso. malloc()
pfioctl() handles several ioctl that takes variable length input, these
include:
- DIOCRADDTABLES
- DIOCRDELTABLES
- DIOCRGETTABLES
- DIOCRGETTSTATS
- DIOCRCLRTSTATS
- DIOCRSETTFLAGS
All of them take a pfioc_table struct as input from userland. One of
its elements (pfrio_size) is used in a buffer length calculation.
The calculation contains an integer overflow which if triggered can lead
to out of bound reads and writes later on.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Fix Dummynet AQM packet marking function ecn_mark() and fq_codel /
fq_pie schedulers packet classification functions in layer2 (bridge mode).
Dummynet AQM packet marking function ecn_mark() and fq_codel/fq_pie
schedulers packet classification functions (fq_codel_classify_flow()
and fq_pie_classify_flow()) assume mbuf is pointing at L3 (IP)
packet. However, this assumption is incorrect if ipfw/dummynet is
used to manage layer2 traffic (bridge mode) since mbuf will point
at L2 frame. This patch solves this problem by identifying the
source of the frame/packet (L2 or L3) and adding ETHER_HDR_LEN
offset when converting an mbuf pointer to ip pointer if the traffic
is from layer2. More specifically, in dummynet packet tagging
function, tag_mbuf(), iphdr_off is set to ETHER_HDR_LEN if the
traffic is from layer2 and set to zero otherwise. Whenever an access
to IP header is required, mtodo(m, dn_tag_get(m)->iphdr_off) is
used instead of mtod(m, struct ip *) to correctly convert mbuf
pointer to ip pointer in both L2 and L3 traffic.
Submitted by: lstewart
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12506
Previously, GRE packets in IPv6 tunnels would be dropped by IPFW (unless
net.inet6.ip6.fw.deny_unknown_exthdrs was unset).
PR: 220640
Submitted by: Kun Xie <kxie@xiplink.com>
Fix the queue delay estimation in PIE/FQ-PIE when the timestamp
(TS) method is used. When packet timestamp is used, the "current_qdelay"
keeps storing the last queue delay value calculated in the dequeue
function. Therefore, when a burst of packets arrives followed by
a pause, the "current_qdelay" will store a high value caused by the
burst and stick to that value during the pause because the queue
delay measurement is done inside the dequeue function. This causes
the drop probability calculation function to calculate high drop
probability value instead of zero and prevents the burst allowance
mechanism from working properly. Fix this problem by resetting
"current_qdelay" inside the drop probability calculation function
when the queue length is zero and TS option is used.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
The result of right shifting a negative signed value is implementation
defined. On machines without arithmetic shift instructions, zero bits
may be shifted in from the left, giving a large positive result instead
of the desired divide-by power-of-2. Fix this by operating on the
absolute value and compensating for the possible negation later.
Reverse the order of the underflow/overflow tests and the exponential
decay calculation to avoid the possibility of an erroneous overflow
detection if p is a sufficiently small non-negative value. Also
check for negative values of prob before doing the exponential decay
to avoid another instance of of right shifting a negative value.
Tested by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
In dummynet(4), random chunks of memory are casted to struct dn_*,
potentially leading to fatal unaligned accesses on architectures with
strict alignment requirements. This change fixes dummynet(4) as far
as accesses to 64-bit members of struct dn_* are concerned, tripping
up on sparc64 with accesses to 32-bit members happening to be correctly
aligned there. In other words, this only fixes the tip of the iceberg;
larger parts of dummynet(4) still need to be rewritten in order to
properly work on all of !x86.
In principle, considering the amount of code in dummynet(4) that needs
this erroneous pattern corrected, an acceptable workaround would be to
declare all struct dn_* packed, forcing compilers to do byte-accesses
as a side-effect. However, given that the structs in question aren't
laid out well either, this would break ABI/KBI.
While at it, replace all existing bcopy(9) calls with memcpy(9) for
performance reasons, as there is no need to check for overlap in these
cases.
PR: 189219
dummynet: Use strlcpy to appease static checkers
Some dummynet modules used strcpy() to copy from a larger buffer
(dn_aqm->name) to a smaller buffer (dn_extra_parms->name). It happens that
the lengths of the strings in the dn_aqm buffers were always hardcoded to be
smaller than the dn_extra_parms buffer ("CODEL", "PIE").
Use strlcpy() instead, to appease static checkers. No functional change.
Reported by: Coverity
CIDs: 1356163, 1356165
Sponsored by: Dell EMC Isilon
pf: Fix possible incorrect IPv6 fragmentation
When forwarding pf tracks the size of the largest fragment in a fragmented
packet, and refragments based on this size.
It failed to ensure that this size was a multiple of 8 (as is required for all
but the last fragment), so it could end up generating incorrect fragments.
For example, if we received an 8 byte and 12 byte fragment pf would emit a first
fragment with 12 bytes of payload and the final fragment would claim to be at
offset 8 (not 12).
We now assert that the fragment size is a multiple of 8 in ip6_fragment(), so
other users won't make the same mistake.
Reported by: Antonios Atlasis <aatlasis at secfu net>
pf: Fix leak of pf_state_keys
If we hit the state limit we returned from pf_create_state() without cleaning
up.
PR: 217997
Submitted by: Max <maximos@als.nnov.ru>
Change several constants used by the PIE algorithm from unsigned to signed.
- PIE_MAX_PROB is compared to variable of int64_t and the type promotion
rules can cause the value of that variable to be treated as unsigned.
If the value is actually negative, then the result of the comparsion
is incorrect, causing the algorithm to perform poorly in some
situations. Changing the constant to be signed cause the comparision
to work correctly.
- PIE_SCALE is also compared to signed values. Fortunately they are
also compared to zero and negative values are discarded so this is
more of a cosmetic fix.
- PIE_DQ_THRESHOLD is only compared to unsigned values, but it is small
enough that the automatic promotion to unsigned is harmless.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
pf: Fix rule evaluation after inet6 route-to
In pf_route6() we re-run the ruleset with PF_FWD if the packet goes out
of a different interface. pf_test6() needs to know that the packet was
forwarded (in case it needs to refragment so it knows whether to call
ip6_output() or ip6_forward()).
This lead pf_test6() to try to evaluate rules against the PF_FWD
direction, which isn't supported, so it needs to treat PF_FWD as PF_OUT.
Once fwdir is set correctly the correct output/forward function will be
called.
PR: 217883
Submitted by: Kajetan Staszkiewicz
Sponsored by: InnoGames GmbH
pf: use inet_ntoa_r() instead of inet_ntoa(); maybe fix IPv6 OS fingerprinting
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.
This code had an INET6 conditional before this commit, but opt_inet6.h
was not included, so INET6 was never defined. Apparently, pf's OS
fingerprinting hasn't worked with IPv6 for quite some time.
This commit might fix it, but I didn't test that.
Relnotes: yes (if I/someone can test pf OS fingerprinting with IPv6)
Sponsored by: Dell EMC
pf: Fix a crash in low-memory situations
If the call to pf_state_key_clone() in pf_get_translation() fails (i.e. there's
no more memory for it) it frees skp. This is wrong, because skp is a
pf_state_key **, so we need to free *skp, as is done later in the function.
Getting it wrong means we try to free a stack variable of the calling
pf_test_rule() function, and we panic.
subrulenr is considered unset if it's set to -1, not if it's set to 1.
See contrib/tcpdump/print-pflog.c pflog_print() for a user.
This caused incorrect pflog output (tcpdump -n -e -ttt -i pflog0):
rule 0..16777216(match)
instead of the correct output of
rule 0/0(match)
PR: 214832
Submitted by: andywhite@gmail.com
pf: Add missing byte-order swap to pf_match_addr_range
Without this, rules using address ranges (e.g. "10.1.1.1 - 10.1.1.5") did not
match addresses correctly on little-endian systems.
PR: 211796
Obtained from: OpenBSD (sthen)
pf: Map hook returns onto the correct error values
pf returns PF_PASS, PF_DROP, ... in the netpfil hooks, but the hook callers
expect to get E<foo> error codes.
Map the returns values. A pass is 0 (everything is OK), anything else means
pf ate the packet, so return EACCES, which tells the stack not to emit an ICMP
error message.
PR: 207598
pf: Fix broken rule skip calculation
r289932 accidentally broke the rule skip calculation. The address family
argument to PF_ANEQ() is now important, and because it was set to 0 the macro
always evaluated to false.
This resulted in incorrect skip values, which in turn broke the rule
evaluations.
Fix problems in the FQ-PIE AQM cleanup code that could leak memory or
cause a crash.
Because dummynet calls pie_cleanup() while holding a mutex, pie_cleanup()
is not able to use callout_drain() to make sure that all callouts are
finished before it returns, and callout_stop() is not sufficient to make
that guarantee. After pie_cleanup() returns, dummynet will free a
structure that any remaining callouts will want to access.
Fix these problems by allocating a separate structure to contain the
data used by the callouts. In pie_cleanup(), call callout_reset_sbt()
to replace the normal callout with a cleanup callout that does the cleanup
work for each sub-queue. The instance of the cleanup callout that
destroys the last flow will also free the extra allocated block of memory.
Protect the reference count manipulation in the cleanup callout with
DN_BH_WLOCK() to be consistent with all of the other usage of the reference
count where this lock is held by the dummynet code.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
Differential Revision: https://reviews.freebsd.org/D7174
Fix a race condition between the main thread in aqm_pie_cleanup() and the
callout thread that can cause a kernel panic. Always do the final cleanup
in the callout thread by passing a separate callout function for that task
to callout_reset_sbt().
Protect the ref_count decrement in the callout with DN_BH_WLOCK(). All
other ref_count manipulation is protected with this lock.
There is still a tiny window between ref_count reaching zero and the end
of the callout function where it is unsafe to unload the module. Fixing
this would require the use of callout_drain(), but this can't be done
because dummynet holds a mutex and callout_drain() might sleep.
Remove the callout_pending(), callout_active(), and callout_deactivate()
calls from calculate_drop_prob(). They are not needed because this callout
uses callout_init_mtx().
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
Differential Revision: https://reviews.freebsd.org/D6928
r300779 | truckman | 2016-05-26 14:40:13 -0700 (Thu, 26 May 2016) | 64 lines
Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures
Implementing AQM in FreeBSD
* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>
* Articles, Papers and Presentations
<http://caia.swin.edu.au/freebsd/aqm/papers.html>
* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>
Overview
Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.
The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.
The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.
Project Goals
This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
sufficient to enable independent, functionally equivalent implementations
* Provide a broader suite of AQM options for sections the networking
community that rely on FreeBSD platforms
Program Members:
* Rasool Al Saadi (developer)
* Grenville Armitage (project lead)
Acknowledgements:
This project has been made possible in part by a gift from the
Comcast Innovation Fund.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection: core
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6388
[Remove some code that was added to the mq_append() inline function in
HEAD by r258457, which was not merged to stable/10. The AQM patch
moved mq_append() from ip_dn_io.c to the new file ip_dn_private.h, so
we need to remove that copy of the r258457 changes.]
------------------------------------------------------------------------
r300781 | truckman | 2016-05-26 14:44:52 -0700 (Thu, 26 May 2016) | 7 lines
Modify BOUND_VAR() macro to wrap all of its arguments in () and tweak
its expression to work on powerpc and sparc64 (gcc compatibility).
Correct a typo in a nearby comment.
MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r300783 | truckman | 2016-05-26 15:03:28 -0700 (Thu, 26 May 2016) | 4 lines
Correct a typo in a comment.
MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r300784 | truckman | 2016-05-26 15:07:09 -0700 (Thu, 26 May 2016) | 5 lines
Include the new AQM files when compiling a kernel with options DUMMYNET.
Reported by: Nikolay Denev <nike_d AT cytexbg DOT com>
MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r300949 | truckman | 2016-05-29 00:23:56 -0700 (Sun, 29 May 2016) | 10 lines
Cast some expressions that multiply a long long constant by a
floating point constant to int64_t. This avoids the runtime
conversion of the the other operand in a set of comparisons from
int64_t to floating point and doing the comparisions in floating
point.
Suggested by: lidl
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after: 2 weeks (with r300779)
------------------------------------------------------------------------
r301162 | truckman | 2016-06-01 13:04:24 -0700 (Wed, 01 Jun 2016) | 9 lines
Replace constant expressions that contain multiplications by
fractional floating point values with integer divides. This will
eliminate any chance that the compiler will generate code to evaluate
the expression using floating point at runtime.
Suggested by: bde
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after: 8 days (with r300779 and r300949)
------------------------------------------------------------------------
r301180 | truckman | 2016-06-01 17:42:15 -0700 (Wed, 01 Jun 2016) | 2 lines
Belatedly bump .Dd date for Dummynet AQM import in r300779.
Relnotes: yes
Needed for anticipated dummynet AQM MFC next week.
r266941 | hiren | 2014-06-01 00:28:24 -0700 (Sun, 01 Jun 2014) | 9 lines
ECN marking implenetation for dummynet.
Changes include both DCTCP and RFC 3168 ECN marking methodology.
DCTCP draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Submitted by: Midori Kato (aoimidori27@gmail.com)
Worked with: Lars Eggert (lars@netapp.com)
Reviewed by: luigi, hiren
r266955 | hiren | 2014-06-01 13:19:17 -0700 (Sun, 01 Jun 2014) | 5 lines
DNOLD_IS_ECN introduced by r266941 is not required.
DNOLD_* flags are for compat with old binaries.
Suggested by: luigi
Discussed with: hiren
Relnotes: yes
pf: Fix ICMP translation
Fix ICMP source address rewriting in rdr scenarios.
pf: Fix more ICMP mistranslation
In the default case fix the substitution of the destination address.
PR: 201519
Submitted by: Max <maximos@als.nnov.ru>
pf: Fix fragment timeout
We were inconsistent about the use of time_second vs. time_uptime.
Always use time_uptime so the value can be meaningfully compared.
Submitted by: "Max" <maximos@als.nnov.ru>
Add ALTQ(9) support for the CoDel algorithm.
CoDel is a parameterless queue discipline that handles variable bandwidth
and RTT.
It can be used as the single queue discipline on an interface or as a sub
discipline of existing queue disciplines such as PRIQ, CBQ, HFSC, FAIRQ.
Obtained from: pfSense
Sponsored by: Rubicon Communications (Netgate)
pf: Improve forwarding detection
When we guess the nature of the outbound packet (output vs. forwarding) we need
to take bridges into account. When bridging the input interface does not match
the output interface, but we're not forwarding. Similarly, it's possible for the
interface to actually be the bridge interface itself (and not a member interface).
Properly drain callouts in the IPFW subsystem to avoid use after free
panics when unloading the dummynet and IPFW modules:
- The callout drain function can sleep and should not be called having
a non-sleepable lock locked. Remove locks around "ipfw_dyn_uninit(0)".
- Add a new "dn_gone" variable to prevent asynchronous restart of
dummynet callouts when unloading the dummynet kernel module.
- Call "dn_reschedule()" locked so that "dn_gone" can be set and
checked atomically with regard to starting a new callout.
PR: 208171
Requested by: Franco Fichtner (opnsense.org)
Differential Revision: https://reviews.freebsd.org/D3855
pf: Fix possible out-of-bounds write
In the DIOCRSETADDRS ioctl() handler we allocate a table for struct pfr_addrs,
which is processed in pfr_set_addrs(). At the users request we also provide
feedback on the deleted addresses, by storing them after the new list
('bcopy(&ad, addr + size + i, sizeof(ad));' in pfr_set_addrs()).
This means we write outside the bounds of the buffer we've just allocated.
We need to look at pfrio_size2 instead (i.e. the size the user reserved for our
feedback). That'd allow a malicious user to specify a smaller pfrio_size2 than
pfrio_size though, in which case we'd still read outside of the allocated
buffer. Instead we allocate the largest of the two values.
Reported By: Paul J Murphy <paul@inetstat.net>
PR: 207463
Approved by: re (marius)
pf: Fix IPv6 checksums with route-to.
When using route-to (or reply-to) pf sends the packet directly to the output
interface. If that interface doesn't support checksum offloading the checksum
has to be calculated in software.
That was already done in the IPv4 case, but not for the IPv6 case. As a result
we'd emit packets with pseudo-header checksums (i.e. incorrect checksums).
This issue was exposed by the changes in r289316 when pf stopped performing full
checksum calculations for all packets.
Submitted by: Luoqi Chen
pf: Fix TSO issues
In certain configurations (mostly but not exclusively as a VM on Xen) pf
produced packets with an invalid TCP checksum.
The problem was that pf could only handle packets with a full checksum. The
FreeBSD IP stack produces TCP packets with a pseudo-header checksum (only
addresses, length and protocol).
Certain network interfaces expect to see the pseudo-header checksum, so they
end up producing packets with invalid checksums.
To fix this stop calculating the full checksum and teach pf to only update TCP
checksums if TSO is disabled or the change affects the pseudo-header checksum.
PR: 154428, 193579, 198868
Relnotes: yes
Sponsored by: RootBSD
Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().
Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.
PR: bin/189471,kern/200169
pf: Fix misdetection of forwarding when net.link.bridge.pfil_bridge is set
If net.link.bridge.pfil_bridge is set we can end up thinking we're forwarding
in pf_test6() because the rcvif and the ifp (output interface) are different.
In that case we're bridging though, and the rcvif the the bridge member on
which the packet was received and ifp is the bridge itself.
If we'd set dir to PF_FWD we'd end up calling ip6_forward() which is
incorrect.
Instead check if the rcvif is a member of the ifp bridge. (In other words, the
if_bridge is the ifp's softc). If that's the case we're not forwarding but
bridging.
PR: 202351
Reapply r196551 which was accidentally reverted by r223637 (update to
OpenBSD pf 4.5).
Fix argument ordering to memcpy as well as the size of the copy in the
(theoretical) case that pfi_buffer_cnt should be greater than ~_max.
This fix the failure when you hit the self table size and force it to be
resized.
Sponsored by: Rubicon Communications (Netgate)