It's possible to take a signal after pselect/ppoll have set their return
value, but before we actually return to userland. This results in
taking a signal without reflecting it in the return value, which weakens
the guarantees provided by these functions.
Switch both to restore the signal mask before we would deliver signals
on return to userland. If a signal was received after the wait was
over, then we'll just have the signal queued up for the next time it
comes unblocked. The modified signal mask is retained if we were
interrupted so that ast() actually handles the signal, at which point
the signal mask is restored.
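
A minimal userland sketch of the guarantee at stake, assuming a POSIX
system. The function name pselect_reports_signal() and the choice of
SIGUSR1 are illustrative and not part of this change. It leaves a signal
pending while blocked, then lets pselect() unblock it, and checks that a
handler run is always paired with an EINTR return.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sys/select.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

static void
on_usr1(int sig)
{
	(void)sig;
	got_signal = 1;
}

/*
 * Hypothetical helper: returns 1 if the handler ran and pselect()
 * reported EINTR, 0 if the handler did not run, and -1 if the handler
 * ran without pselect() reflecting it in its return value.
 */
int
pselect_reports_signal(void)
{
	struct sigaction sa;
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	sigset_t block, orig, waitmask;
	int error, rv;

	sa.sa_handler = on_usr1;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	rv = sigaction(SIGUSR1, &sa, NULL);
	assert(rv == 0);

	/* Block SIGUSR1 and leave one instance pending. */
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block, &orig);
	got_signal = 0;
	raise(SIGUSR1);

	/* pselect() unblocks SIGUSR1 only for the duration of the wait. */
	waitmask = orig;
	sigdelset(&waitmask, SIGUSR1);
	rv = pselect(0, NULL, NULL, NULL, &ts, &waitmask);
	error = errno;
	sigprocmask(SIG_SETMASK, &orig, NULL);

	if (rv == -1 && error == EINTR && got_signal)
		return (1);
	return (got_signal ? -1 : 0);
}
```

A return of -1 from this helper would correspond to the behavior
described above: a signal taken without being reflected in the return
value.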
des@ has a test case demonstrating the issue in D47738 which will
follow.
Note for MFC: TDA_PSELECT is a KBI break, so we should just inline
ast_sigsuspend() in pselect/ppoll for stable branches. It's not exactly
the same, but it will be close enough.
Reported by: des
Reviewed by: des (earlier version), kib
Sponsored by: Klara, Inc.
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47741
It may be the case that we want to avoid delivering signals that are
normally blocked by the thread's signal mask, in which case the syscall
should schedule this one instead to restore the mask prior to delivery.
This will be used by pselect/ppoll to avoid delivering signals that were
supposed to be blocked after the timeout has elapsed. The name was
chosen as this is the expected behavior of pselect/ppoll, while late
restoration of the mask is exceptional behavior for these specific
calls.
__FreeBSD_version bump as later TDA_* values have changed; third-party
modules that may be using MOD3/MOD4 need to be rebuilt.
Reviewed by: kib
Sponsored by: Klara, Inc.
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47741
Segment base registers are at 8-byte intervals, while the register
write helper takes a byte-aligned offset. This fixes
DEV_TAB_HARDWARE_ERROR events and associated peripheral I/O failures
on an Epyc-based system with 8-segment device tables.
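
A small sketch of the offset arithmetic, under stated assumptions: the
base offset 0x100 and the names DEVTAB_SEG_BASE_OFF, DEVTAB_SEG_STRIDE
and devtab_seg_reg_offset() are hypothetical and only illustrate scaling
a segment index by the 8-byte register stride before handing it to a
helper that expects a byte offset.

```c
#include <assert.h>
#include <stdint.h>

#define DEVTAB_SEG_BASE_OFF	0x100u	/* hypothetical first segment register */
#define DEVTAB_SEG_STRIDE	8u	/* registers sit at 8-byte intervals */
#define DEVTAB_SEG_COUNT	8u	/* assumed upper bound for this sketch */

/* Convert a segment index into the byte offset of its base register. */
static inline uint32_t
devtab_seg_reg_offset(unsigned int seg)
{
	assert(seg < DEVTAB_SEG_COUNT);
	/* Using "base + seg" here would address the wrong register. */
	return (DEVTAB_SEG_BASE_OFF + seg * DEVTAB_SEG_STRIDE);
}
```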
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D47752
The livedumper triggers reports from both of these sanitizers since it
necessarily accesses uninitialized or freed memory. Add a flag to
silence reports from both sanitizers.
Reviewed by: mhorne, khng
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D47714
The T-HEAD custom PTE bits are defined in such a way that the
default/normal memory type is a non-zero value. This _unthoughtful_ choice
means that, unlike the Svpbmt and non-Svpbmt cases, this field cannot be
left bare in our bootstrap PTEs, or the hardware will fail to proceed
far enough in boot (cache strangeness). On the other hand, we cannot
unconditionally apply the PTE_THEAD_MA_NONE attributes, as this is not
compatible with spec-compliant RISC-V hardware, and will result in a
fatal exception.
Therefore, in order to handle this erratum, we are forced to perform a
check of the CPU type at the first moment possible. Do so, and fix up
the PTEs with the correct memory attribute bits in the T-HEAD case.
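
The fix-up can be pictured with the following C sketch. It is not the
locore assembly: the names are hypothetical, the attribute field is
assumed to occupy PTE bits [63:59], and THEAD_MA_NORMAL is a placeholder
value rather than the real PTE_THEAD_MA_NONE encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_V		0x1ull			/* RISC-V PTE valid bit */
#define THEAD_MA_SHIFT	59
#define THEAD_MA_MASK	(0x1full << THEAD_MA_SHIFT)	/* bits [63:59] */
#define THEAD_MA_NORMAL	(0x0eull << THEAD_MA_SHIFT)	/* placeholder value */

/*
 * Apply the non-zero "normal memory" attribute to every valid bootstrap
 * PTE, but only when running on a T-HEAD CPU. On spec-compliant
 * hardware the table is left untouched.
 */
static void
bootstrap_pte_fixup(uint64_t *ptes, size_t n, bool is_thead)
{
	size_t i;

	assert(ptes != NULL || n == 0);
	if (!is_thead)
		return;
	for (i = 0; i < n; i++) {
		if ((ptes[i] & PTE_V) != 0)
			ptes[i] = (ptes[i] & ~THEAD_MA_MASK) | THEAD_MA_NORMAL;
	}
}
```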
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D47458
Switch the boot argument registers to the unused s3 and s4. This ensures
the values will not be clobbered by SBI or function calls; they are
consumed late in the assembly routine.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D47457
T-HEAD CPUs provide a spec-violating implementation of page-based memory
types, using PTE bits [63:59]. Add basic support for this "errata",
referred to in some places as an "extension".
Note that this change is not enough on its own, but a workaround is
needed for the bootstrap (locore) page tables as well.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45472
This is the first major quirk we need to support in order to run on
current T-HEAD/XuanTie CPUs, e.g. the C906 or C910, found in several
existing RISC-V SBCs. With these custom dcache routines installed,
busdma can reliably communicate with devices which are not coherent
w.r.t. the CPU's data caches.
This patch introduces the first quirk/errata handling functions to
identcpu.c, and thus is forced to make some decisions about how this
code is structured. It will be amended with the changes that follow in
the series, yet I feel the final result is (unavoidably) somewhat
clumsy. I expect the CPU identification code will continue to evolve as
more CPUs and their quirks are eventually supported.
Discussed with: jrtc27
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D47455
Cache management operations were, for a long time, unspecified by the
RISC-V ISA, and thus these functions have been no-ops. To cope, hardware
with non-coherent I/O has implemented custom cache flush mechanisms,
either in the form of custom instructions or special device registers.
Additionally, the RISC-V CMO extension is ratified and these official
instructions will start to show up in hardware eventually. Therefore, a
method is needed to select the dcache management routines at runtime.
Add a simple set of function hooks, as well as a routine to install them
and specify the minimum dcache line size. The first consumer will be the
non-standard cache management instructions for T-HEAD CPUs.
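
A minimal sketch of what such hooks could look like, assuming
hypothetical names (struct dcache_ops, dcache_install_hooks(),
dcache_wb_range()); the in-tree interface may differ. The defaults are
no-ops, and the install routine tracks the smallest line size it has
been given. demo_count() stands in for a real cache maintenance routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*dcache_op_t)(uintptr_t va, size_t len);

struct dcache_ops {
	dcache_op_t wb;		/* write back */
	dcache_op_t inv;	/* invalidate */
	dcache_op_t wbinv;	/* write back and invalidate */
};

static void
dcache_nop(uintptr_t va, size_t len)
{
	(void)va;
	(void)len;
}

/* Until something is installed, every operation is a no-op. */
static struct dcache_ops cur_ops = { dcache_nop, dcache_nop, dcache_nop };
static unsigned int dcache_line_size;	/* 0 means "not yet known" */

void
dcache_install_hooks(const struct dcache_ops *ops, unsigned int line_size)
{
	assert(ops != NULL);
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	if (ops->wb != NULL)
		cur_ops.wb = ops->wb;
	if (ops->inv != NULL)
		cur_ops.inv = ops->inv;
	if (ops->wbinv != NULL)
		cur_ops.wbinv = ops->wbinv;
	/* Keep the minimum line size seen so far. */
	if (dcache_line_size == 0 || line_size < dcache_line_size)
		dcache_line_size = line_size;
}

void
dcache_wb_range(uintptr_t va, size_t len)
{
	cur_ops.wb(va, len);
}

/* Stand-in for a CPU-specific routine; it only counts invocations. */
static unsigned int demo_calls;

static void
demo_count(uintptr_t va, size_t len)
{
	(void)va;
	(void)len;
	demo_calls++;
}
```

Callers such as busdma would only ever use the wrapper, so selecting a
different set of routines at runtime needs no changes on their side.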
The unused I-cache variables and macros are removed.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D47454
Properly initialize setdf variable in ipsec_encap().
It is used for the AF_INET6 case when an IPv6 datagram is going to be
encapsulated into an IPv4 datagram.
PR: 282535
Fixes: 4046178557
MFC after: 1 week
instead of constructing the transient pte itself. This pre-sets the PG_A and
PG_M bits, avoiding an atomic pte update on access and modification. It also
sets the nx bit, since the mapping is not supposed to be used for execution.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D47717
Pass the to-be-freed page to vm_page_iter_free as a parameter, rather
than computing it from the iterator parameter, to improve performance.
Sort declarations of page_iter functions in vm_page.h.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D47727
Without this patch, an all upper case user domain name
(as specified by nfsuserd(8)) would not work.
I believe this was done so that Kerberos realms were
not confused with user domains.
Now, RFC8881 specifies that the user domain name is a
DNS name. As such, all upper case names should work.
This patch fixes this case so that it works. The custom
comparison function is no longer needed.
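
Since DNS names compare case-insensitively, the intended behavior can be
sketched in userland with strcasecmp(3). The helper name
user_domain_matches() is hypothetical, and the kernel code operates on
its own string representation rather than NUL-terminated C strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Compare a configured user domain against a received one, ignoring case. */
static bool
user_domain_matches(const char *configured, const char *received)
{
	assert(configured != NULL && received != NULL);
	return (strcasecmp(configured, received) == 0);
}
```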
PR: 282620
Tested by: jmmv
MFC after: 2 weeks
Notable upstream pull request merges:
#16643 -multiple Change rangelock handling in FreeBSD's zfs_getpages()
#16697 46c4f2ce0 dsl_dataset: put IO-inducing frees on the pool deadlist
#16740 -multiple BRT: Rework structures and locks to be per-vdev
#16743 a60ed3822 L2ARC: Move different stats updates earlier
#16758 8dc452d90 Fix some nits in zfs_getpages()
#16759 534688948 Remove hash_elements_max accounting from DBUF and ARC
#16766 9a81484e3 ZAP: Reduce leaf array and free chunks fragmentation
#16773 457f8b76e BRT: More optimizations after per-vdev splitting
#16782 0ca82c568 L2ARC: Stop rebuild before setting spa_final_txg
#16785 d76d79fd2 zio: Avoid sleeping in the I/O path
#16791 ae1d11882 BRT: Clear bv_entcount_dirty on destroy
#16796 b3b0ce64d FreeBSD: Lock vnode in zfs_ioctl()
#16797 d0a91b9f8 FreeBSD: Reduce copy_file_range() source lock to shared
Obtained from: OpenZFS
OpenZFS commit: d0a91b9f88
This is a retread of https://reviews.freebsd.org/D34449 which I think
will fix the issue for the remote side not supporting autoneg. We now
attempt an autoneg, and if that fails fall back to the current code
that forces the link speed/duplex.
The original intent of this patch is to inform the remote switch of
duplex settings when we (the client) are specifying a fixed 10 or 100
speed. Otherwise it may get the duplex setting wrong.
The tricky case is when the remote (switch) side is fixing its
speed AND duplex while disabling autoneg and we (client) need to do
the same, which still seems to be common enough at some ISPs.
Original commit message follows:
Currently if an e1000 interface is set to a fixed media configuration,
for gigabit, it will participate in auto-negotiation as required by
IEEE 802.3-2018 Clause 37. However, if set to fixed media configuration
for 100 or 10, it does NOT participate in auto-negotiation.
By my reading of Clauses 28 and 37, while auto-negotiation is optional
for 100 and 10, it is not prohibited and is, in fact, "highly
recommended".
This patch enables auto-negotiation for fixed 100 and 10 media
configuration, in a similar manner to that already performed for 1000.
I.e., the patch enables advertising of just the manually configured
settings with the goal of allowing the remote end to match the manually
configured settings if it has them available.
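
The idea of advertising only the manually configured setting can be
sketched as follows. The ADV_* bit values and the function name
advertise_for_fixed_media() are hypothetical and are not taken from the
driver; the point is that exactly one ability bit is offered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ability bits for this sketch. */
#define ADV_10_HALF	0x01u
#define ADV_10_FULL	0x02u
#define ADV_100_HALF	0x04u
#define ADV_100_FULL	0x08u
#define ADV_1000_FULL	0x20u

/*
 * Map a fixed media configuration to the single ability that should be
 * advertised during auto-negotiation. Returns 0 for unsupported speeds.
 */
static uint16_t
advertise_for_fixed_media(unsigned int mbps, bool full_duplex)
{
	uint16_t adv;

	switch (mbps) {
	case 10:
		adv = full_duplex ? ADV_10_FULL : ADV_10_HALF;
		break;
	case 100:
		adv = full_duplex ? ADV_100_FULL : ADV_100_HALF;
		break;
	case 1000:
		adv = full_duplex ? ADV_1000_FULL : 0;
		break;
	default:
		adv = 0;
		break;
	}
	/* At most one bit may be set: nothing else is ever offered. */
	assert((adv & (adv - 1)) == 0);
	return (adv);
}
```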
To be clear, this patch does NOT allow an em(4) interface that has been
manually configured with specific media settings to respond to
auto-negotiation by then configuring different parameters to those that
were manually configured. The intent of this patch is to fully comply
with the requirements of Clause 37, but for 100 and 10.
The need for this has arisen on an em(4) link where the other end is
under a different administrative control and is set to full
auto-negotiation. Due to the cable length GigE is not working well. It
is desired to set the em(4) end to "media 100baseTX mediatype
full-duplex" which does work when both ends are configured that way.
Currently, because em(4) does not participate in autoneg for this
setting, the remote defaults to half-duplex - i.e., there's a duplex
mismatch and things don't work. With this patch, em(4) would inform the
remote that it has only 100baseTX full, the remote would match that and
it will work.
Tested by: Natalino Picone <natalino.picone@nozominetworks.com>
Tested by: Franco Fichtner <franco@opnsense.org>
Tested by: J.R. Oldroyd <fbsd@opal.com> (previous version)
Sponsored by: Nozomi Networks
Sponsored by: BBOX.io
Differential Revision: https://reviews.freebsd.org/D47336
A device can be disabled via a hint after it is probed (but before it
is attached). The initial version of this marked the device disabled,
but left the device "alive" meaning that dev->driver and dev->desc
were untouched and still pointed into the driver that probed the
device. If that driver lives in a kernel module that is later
unloaded, device_detach() called from devclass_delete_driver() doesn't
do anything (the device's state is DS_ALIVE). In particular, it
doesn't call device_set_driver(dev, NULL) to disassociate the device
from the driver that is being unloaded.
There are several places where these stale pointers can be tripped
over. After kldunload, invoking the sysctl to fetch device info can
dereference dev->desc and dev->driver causing panics. Even without
kldunload, a system suspend request will call the device_suspend and
device_resume DEVMETHODs of the driver in question even though the
device is not attached which can cause some excitement.
To clean this up, more fully detach a device that is disabled by a
hint by clearing the driver and setting the state to DS_NOTPRESENT.
However, to keep the device name+unit combination reserved, leave the
device attached to its devclass.
This requires a change to 'devctl enable' handling to deal with this
updated state. It now checks for a non-NULL devclass to determine if
a disabled device is in this state and if so it clears the hint.
However, it also now clears the devclass before attaching the device.
This gives all drivers an opportunity to attach to the now-enabled
device.
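
The state changes described above can be modeled in a few lines of
userland C. This is a simplified model with hypothetical names
(struct model_dev, model_disable_after_probe(), model_devctl_enable()),
not the bus code itself; only DS_NOTPRESENT and DS_ALIVE are taken from
the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { DS_NOTPRESENT, DS_ALIVE };

struct model_dev {
	const char *driver;	/* driver that probed the device */
	const char *desc;	/* description set by that driver */
	const char *devclass;	/* reserves the name+unit combination */
	enum model_state state;
	bool hint_disabled;
};

/* A hint disables the device after probe but before attach. */
static void
model_disable_after_probe(struct model_dev *d)
{
	assert(d != NULL && d->state == DS_ALIVE);
	d->driver = NULL;		/* no stale pointer into a module */
	d->desc = NULL;
	d->state = DS_NOTPRESENT;
	d->hint_disabled = true;
	/* d->devclass is deliberately kept to reserve name+unit. */
}

/*
 * 'devctl enable': a non-NULL devclass on a not-present device marks
 * the hint-disabled state. Clear the hint and the devclass so that any
 * driver may probe the device again. Returns true if the caller should
 * go on to probe and attach.
 */
static bool
model_devctl_enable(struct model_dev *d)
{
	assert(d != NULL);
	if (d->state != DS_NOTPRESENT || d->devclass == NULL ||
	    !d->hint_disabled)
		return (false);
	d->hint_disabled = false;
	d->devclass = NULL;
	return (true);
}
```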
Reported by: adrian
Discussed with: imp
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D47691
On a laptop with no other console devices than the screen, things
scroll off the screen faster than eye or camera can capture it.
This tunable slows the console down and makes it update synchronously,
so console output continues when timers or interrupts do not.
Differential Revision: https://reviews.freebsd.org/D47710
Remove the array of port module status and instead save module status
and module number.
At boot, the driver for each PCI function gets an event from fw about
module status. The event contains the module number and module status,
and the driver stores both. When the user (ifconfig) asks for module
information, the driver for each PCI function first queries fw for the
module number of the current PCI function, then compares it to the
module number it stored earlier. If they match and the module status is
"plugged and enabled", the driver queries fw for the EPROM information
of that module number and returns it to the caller.
In principle, fw could determine the required module number of the
current PCI function itself, but fw is not implemented this way. The
current design of PRM/FW is that MCIA register handling is only aware of
modules, not the PCI function->module connections. FW is designed to
take the module number written to MCIA and write/read the content
to/from the associated module's EPROM.
So, based on the current FW design, we must supply the module number so
fw can find the corresponding I2C interface of the module to write/read.
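
The decision the driver makes before reading module data can be
sketched as below. All names are hypothetical, and the firmware query
is represented by a plain parameter (fw_module_num) rather than a real
command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum module_status {
	MODULE_UNKNOWN,
	MODULE_PLUGGED_ENABLED,
	MODULE_UNPLUGGED
};

/* Per PCI function: the last module number and status reported by fw. */
struct pf_module_state {
	uint8_t module_num;
	enum module_status status;
};

/* Record the contents of a module status event. */
static void
pf_module_event(struct pf_module_state *pf, uint8_t num,
    enum module_status status)
{
	assert(pf != NULL);
	pf->module_num = num;
	pf->status = status;
}

/*
 * fw_module_num is what fw currently reports for this PCI function.
 * Only read the module's EPROM if it matches the stored number and the
 * module is plugged and enabled.
 */
static bool
pf_should_read_eprom(const struct pf_module_state *pf, uint8_t fw_module_num)
{
	assert(pf != NULL);
	return (pf->module_num == fw_module_num &&
	    pf->status == MODULE_PLUGGED_ENABLED);
}
```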
Sponsored by: NVidia networking
MFC after: 1 week
Ensure all allocated tags have a hardware context associated.
The hardware context allocation is moved into the zone import
routine, as suggested by kib. This is safe because these zone
allocations are always done in a sleepable context.
I have removed the now pointless num_resources tracking,
and added sysctls / tunables to control UMA zone limits
for these tls tags, as well as a tunable to let the
driver pre-allocate tags at boot.
MFC after: 2 weeks
- Don't allow FBT and kinst to instrument the KMSAN runtime.
- When fetching data from the traced thread's stack, mark it as
initialized. It may well be uninitialized, but as dtrace permits
arbitrary inspection of kernel memory, it isn't very useful to raise
KMSAN reports.
- Mark data copied in from userspace as initialized, as we do for
copyin() etc. using interceptors.
MFC after: 2 weeks
Reverse the first if() in pf_dummynet_route() to avoid an unneeded level of
indentation.
No functional change.
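
The general shape of this kind of refactoring, shown on a made-up
function (route_nested() and route_flat() are hypothetical and unrelated
to the real pf logic): inverting the first condition turns it into an
early return and removes one level of nesting without changing results.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: the real work is nested inside the first if(). */
static int
route_nested(bool has_queue, int len)
{
	int rv = 0;

	assert(len >= 0);
	if (has_queue) {
		if (len > 0)
			rv = len * 2;
	}
	return (rv);
}

/* After: the inverted test returns early, and the body is flat. */
static int
route_flat(bool has_queue, int len)
{
	assert(len >= 0);
	if (!has_queue)
		return (0);
	if (len > 0)
		return (len * 2);
	return (0);
}
```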
Sponsored by: Rubicon Communications, LLC ("Netgate")
We were previously allocating MAXCPU structures for several purposes,
but this is generally unnecessary and is quite excessive, especially
after MAXCPU was bumped to 1024 on amd64 and arm64. We already are
careful to allocate only as many per-CPU tracing buffers as are needed;
extend this to other allocations.
For example, in a 2-vCPU VM, the size of a consumer state structure
drops from 64KB to 128B. The size of the per-consumer `dts_buffer` and
`dts_aggbuffer` arrays shrink similarly. Ditto for pre-allocations of
local and global D variable storage space.
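
The arithmetic behind those numbers, as a sketch: a 64-byte per-CPU
record is inferred here from 64KB / 1024, and struct percpu_state,
MODEL_MAXCPU, percpu_alloc_size() and percpu_alloc() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAXCPU	1024	/* MAXCPU on amd64 and arm64, per the text */

/* Hypothetical per-CPU record, sized so that 1024 of them take 64KB. */
struct percpu_state {
	uint8_t bytes[64];
};

/* Bytes needed when sizing by the number of CPUs actually present. */
static size_t
percpu_alloc_size(unsigned int ncpus)
{
	assert(ncpus > 0 && ncpus <= MODEL_MAXCPU);
	return ((size_t)ncpus * sizeof(struct percpu_state));
}

/* Allocate zeroed state for ncpus CPUs rather than MODEL_MAXCPU. */
static struct percpu_state *
percpu_alloc(unsigned int ncpus)
{
	return (calloc(ncpus, sizeof(struct percpu_state)));
}
```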
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D47667
This function is registered as an ifnet_link_event and so should have the
corresponding argument list.
PR: 282870
Reported by: nakayamakenjiro@gmail.com
MFC after: 1 week
Change cdev_mgtdev_pager_free_page to take an iterator, rather than an
object and page, so that removing the page from the object radix tree
can take advantage of locality with iterators. Define a
general-purpose function to free all pages, which can be used in
several places.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D47692