polling-based firmware commands to event based firmware commands.
This is a direct commit.
Linux commit:
a7e1f04905e5b2b90251974dddde781301b6be37
Sponsored by: Mellanox Technologies
Implement enable_irq() and disable_irq() in the LinuxKPI and add checks for
valid IRQ tag before setting up or tearing down an interrupt handler in the
LinuxKPI. This is needed when the interrupt handler is disabled
before freeing the interrupt.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Mellanox Technologies
Clear old MSIX IRQ numbers in the LinuxKPI.
When disabling the MSIX IRQ vectors for a PCI device through the
LinuxKPI, make sure any old MSIX IRQ numbers are no longer visible to
the linux_pci_find_irq_dev() function else IRQs can be requested from
the wrong PCI device.
Sponsored by: Mellanox Technologies
Improve copy-and-pasted versions of SIOCGIFADDR.
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.
Approved by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14873
in ibcore. This resolves an interopability issue when using both iWarp(T6) and
RDMA(CX-4 and CX-5) devices at the same time.
The problem is IPv4 based sockets cannot be bound to an IPv6 based address
causing sobind() to fail preventing all use of IPv6 based addresses with RDMA
when an iWarp device is present.
This is a direct commit.
Tested by: KrishnamRaju ErapaRaju <Krishna2@chelsio.com>
Sponsored by: Mellanox Technologies
Handle case of class being set, but not parent when calling
device_register() in the LinuxKPI.
Requested by: Chelsio
Sponsored by: Mellanox Technologies
Make sure the IPv6 scope ID gets zeroed when exchanging CMA messages in ibcore.
Else the IPv6 address matching might fail. This change adds support for both
embedded and non-embedded IPv6 scope IDs when passing a IPv6 link-local socket
address to RDMA. Prior to this change only global IPv6 addresses would work
with RDMA.
Sponsored by: Mellanox Technologies
Multiple fixes for using IPv6 link-local addresses with RDMA.
1) Fail to resolve RDMA address if rtalloc1() returns the loopback
device, lo0, as the gateway interface.
2) Use ip_dev_find() and ip6_dev_find() to lookup network interfaces
with matching IPv4 and IPv6 addresses, respectivly.
3) In addr_resolve() make sure the "ifa" pointer is always set, also when
the "ifp" is NULL. Else a NULL pointer access might happen trying to
read from the "ifa" pointer later on.
Sponsored by: Mellanox Technologies
Make the dma_alloc_coherent() function in the LinuxKPI NULL safe with regard
to the "dev" argument.
Submitted by: Krishnamraju Eraparaju @ Chelsio
Sponsored by: Chelsio Communications
Unconditionally include "opt_inet6.h" in the LinuxKPI.
This makes sure the INET6 macro gets properly defined,
also for kernel module builds.
Sponsored by: Mellanox Technologies
The remote DMA TCP portspace selector, RDMA_PS_TCP, is used for both
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device present. This resolves
interopability issues between iWarp and RoCE in ibcore
Reviewed by: np @
Differential Revision: https://reviews.freebsd.org/D12563
Sponsored by: Mellanox Technologies
Implement LinuxKPI module parameters as SYSCTLs.
The bool module parameter is no longer supported, because there is no
equivalent in FreeBSD 10-stable. These are converted into "int" type.
There are two macros available which control the behaviour of the
LinuxKPI module parameters:
- LINUXKPI_PARAM_PARENT allows the consumer to set the SYSCTL parent
where the modules parameters will be created.
- LINUXKPI_PARAM_PREFIX defines a parameter name prefix, which is
added to all created module parameters.
The LinuxKPI module parameters also have a permissions value.
If any write bits are set we are allowed to modify the module
parameter runtime. Reflect this when creating the static SYSCTL
nodes.
The module_param_call() function is no longer supported.
Sponsored by: Mellanox Technologies
Add basic support for VIMAGE to the LinuxKPI and ibcore.
Support is implemented by mapping Linux's "struct net" into FreeBSD's
"struct vnet". Currently only vnet0 is supported by ibcore.
Sponsored by: Mellanox Technologies
Add helper function similar to ip_dev_find() to the LinuxKPI to lookup
a network device by its IPv6 address in the given VNET.
Sponsored by: Mellanox Technologies
Fix for mlx4en(4) to properly call m_defrag().
The m_defrag() function can only defrag mbuf chains which have a valid
mbuf packet header. In r291699 when the mlx4en(4) driver was converted
into using BUSDMA(9), the call to m_defrag() was moved after the part
of the transmit routine which strips the header from the mbuf chain.
This effectivly disabled the mbuf defrag mechanism and such packets
simply got dropped.
This patch removes the stripping of mbufs from a chain and loads all
mbufs using busdma. If busdma finds there are no segments, unload
the DMA map and free the mbuf right away, because that means all
data in the mbuf has been inlined in the TX ring. Else proceed
as usual.
Add a per-ring rounter for the number of defrag attempts and
make sure the oversized_packets counter gets zeroed while at it.
The counters are per-ring to avoid excessive cache misses in the
TX path.
Approved by: re (kib)
Submitted by: mjoras@
Differential Revision: https://reviews.freebsd.org/D11683
Sponsored by: Mellanox Technologies
Print maximum MTU when trying to set invalid MTU in the mlx4en(4) driver.
Useful for debugging.
Approved by: re (marius, gjb)
Submitted by: Sepherosa Ziehau <sephe@dragonflybsd.org>
Sponsored by: Mellanox Technologies
Add support for RX and TX statistics when the mlx4en(4) PCI device
is in VF or SRIOV mode typically in a virtual machine environment.
Approved by: re (kib)
Submitted by: Sepherosa Ziehau <sephe@dragonflybsd.org>
Sponsored by: Mellanox Technologies
Add support for constant pointer constructs to READ_ONCE() in the
LinuxKPI. When the type of the argument is constant the temporary
variable cannot be assigned after the barrier. Instead assign the
temporary variable by initialization.
Approved by: re (kib)
Sponsored by: Mellanox Technologies
Remove some dead statistics related code and a structure field from the
mlx4en driver which is used by its Linux counterpart, but not under
FreeBSD.
Sponsored by: Mellanox Technologies
Fix broken usage of the mlx4_read_clock() function:
- return value has too small width
- cycle_t is unsigned and cannot be less than zero
Sponsored by: Mellanox Technologies
Change reject message type when destroying cm_id in ibore.
This patch fixes an interopability issue between FreeBSD and non-FreeBSD
systems when the connection establishment is aborted. Refer to the
initial commit in Linux, drivers/infiniband/core/cm.c,
for a more detailed description.
Obtained from: Linux
Sponsored by: Mellanox Technologies
Make sure the mlx4en RX DMA ring gets stamped with software ownership
in order to prevent the flow of QP to error in the firmware once
UPDATE_QP is called.
Sponsored by: Mellanox Technologies
Use static device numbering instead of dynamic one when creating
mlx4en network interfaces. This prevents infinite unit number growth
typically when the mlx4en driver is used inside virtual machines which
support runtime PCI attach and detach.
Sponsored by: Mellanox Technologies
Free hardware queue resource after port is stopped in the mlx4en(4)
driver. Else if the port is up the resource might still be busy and
the MTT free will fail.
PR: 216493
Sponsored by: Mellanox Technologies
Allow communication between functions on the same host when using the
mlx4en(4) driver in SRIOV mode.
Place a copy of the destination MAC address in the send WQE only under
SRIOV/eSwitch configuration or when the device is in selftest. This
allows communication between functions on the same host.
PR: 216493
Sponsored by: Mellanox Technologies
Avoid NULL dereference in a couple of sysctl handlers in ibcore.
iw_cxgbe sets ib_device->dma_device to NULL (since r311880).
Sponsored by: Chelsio Communications
mlx4: Use the CQ quota for SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF or for
VFs, we create enough EQE entries to handle completions for the max
number of CQs that can use that EQ.
When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource
tracker). Therefore, when creating an EQ, the number of EQE entries
that the VF should request for that EQ is the CQ quota value (and not
the total number of CQs available in the firmware).
Under SRIOV, the PF, also must use its CQ quota, because the resource
tracker also controls how many CQs the PF can obtain.
Using the firmware total CQs instead of the CQ quota when creating EQs
resulted wasting MTT entries, due to allocating more EQEs than were
needed.
Sponsored by: Mellanox Technologies
Don't free uninitialized sysctl contexts in the mlx4en driver. This
can cause NULL pointer panics during failed device attach.
Differential Revision: https://reviews.freebsd.org/D8876
Sponsored by: Mellanox Technologies
Flexible and asymmetric allocation of EQs and MSI-X vectors for PF/VFs.
Previously, the mlx4 driver queried the firmware in order to get the
number of supported EQs. Under SRIOV, since this was done before the
driver notified the firmware how many VFs it actually needs, the
firmware had to take into account a worst case scenario and always
allocated four EQs per VF, where one was used for events while the
others were used for completions. Now, when the firmware supports the
asymmetric allocation scheme, denoted by exposing num_sys_eqs > 0 (-->
MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the QUERY_FUNC command to query
the firmware before enabling SRIOV. Thus we can get more EQs and MSI-X
vectors per function. Moreover, when running in the new
firmware/driver mode, the limitation that the number of EQs should be
a power of two is lifted.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8867
Sponsored by: Mellanox Technologies
Change mlx4 QP allocation scheme.
When using Blue-Flame, BF, the QPN overrides the VLAN, CV, and SV
fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7
unset.
The current ethernet driver code reserves a TX QP range with 256b
alignment.
This is wrong because if there are more than 64 TX QPs in use, QPNs >=
base + 65 will have bits 6/7 set.
This problem is not specific for the Ethernet driver, any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.
The new mechanism introduced here will support reservation for "Eth
QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:
1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
and request "BF enabled QPs" if BF is supported for the function
2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0] - number of QPs
b. param1[31-24] - flags controlling QPs reservation
Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits
6 and 7 unset in order to be used in Ethernet.
Bits 24-30 of the flags are currently reserved.
When a function tries to allocate a QP, it states the required
attributes for this QP. Those attributes are considered "best-effort".
If an attribute, such as Ethernet BF enabled QP, is a must-have
attribute, the function has to check that attribute is supported
before trying to do the allocation.
In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those
attributes and masks out unsupported attributes as well. In order to
notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP
command. This command's mailbox is filled by the PF, which notifies
which QP allocation attributes it supports.
Obtained from: Linux (dual BSD/GPLv2 licensed)
Submitted by: Dexuan Cui @ microsoft . com
Differential Revision: https://reviews.freebsd.org/D8868
Sponsored by: Mellanox Technologies
After r310171, the kernel version of sscanf() has format string checking
enabled. This results in a -Werror warning in mlx4ib:
sys/dev/mlx4/mlx4_ib/mlx4_ib_sysfs.c:90:22: error: format specifies type 'unsigned long long *' but the argument has type 'u64 *' (aka 'unsigned long *') [-Werror,-Wformat]
sscanf(buf, "%llx", &sysadmin_ag_val);
~~~~ ^~~~~~~~~~~~~~~~
Change sysadmin_ag_val to unsigned long long to avoid the warning.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D8831
cxgbe/iw_cxgbe: fix various double-close panics with iWARP sockets.
Sockets representing the TCP endpoints for iWARP connections are
allocated by the ibcore module. Before this revision they were closed
either by the ibcore module or the iw_cxgbe hardware driver depending on
the state transitions during connection teardown. This is error prone
and there were cases where both iw_cxgbe and ibcore closed the socket
leading to double-free panics. The fix is to let ibcore close the
sockets it creates and never do it in the driver.
- Use sodisconnect instead of soclose (preceded by solinger = 0) in the
driver to tear down an RDMA connection abruptly. This does what's
intended without releasing the socket's fd reference.
- Close the socket in ibcore when the iWARP iw_cm_id is destroyed. This
works for all kinds of sockets: clients that initiate connections,
listeners, and sockets accepted off of listeners.
Sponsored by: Chelsio Communications
297406,300875,300888,301158,301896,301897,304838:
Pull in most of the Chelsio and iWARP related changes from stable/11 into
stable/10. A few changes from 278886 (OFED 1.2) were also included though
the full merge is not:
- The find_gid_port() function in infiband/core/cma.c.
- Addition of the 'ord' and 'ird' fields to 'struct iw_cm_event'.
273806:
Userspace library for Chelsio's Terminator 5 based iWARP RNICs (pretty
much every T5 card that does _not_ have "-SO" in its name is RDMA
capable).
This plugs into the OFED verbs framework and allows userspace RDMA
applications to work over T5 RNICs. Tested with rping.
289103:
iw_cxgbe: fix for page fault in cm_close_handler().
This is roughly the iw_cxgbe equivalent of
be13b2dff8
-----------------
RDMA/cxgb4: Connect_request_upcall fixes
When processing an MPA Start Request, if the listening endpoint is
DEAD, then abort the connection.
If the IWCM returns an error, then we must abort the connection and
release resources. Also abort_connection() should not post a CLOSE
event, so clean that up too.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
-----------------
289201:
iw_cxgbe: MPA v2 is always available.
289338:
iw_cxgbe: use correct RFC number.
289578:
Merge LinuxKPI changes from DragonflyBSD:
- Define the kref structure identical to the one found in Linux.
- Update clients referring inside the kref structure.
- Implement kref_sub() for FreeBSD.
293185:
iw_cxgbe: Shut down the socket but do not close the fd in case of error.
The fd is closed later in this case. This fixes a "SS_NOFDREF on enter"
panic.
294474:
iw_cxgbe: fix a couple of problems int the RDMA_TERMINATE handler.
a) Look for the CPL in the payload buffer instead of the descriptor.
b) Retrieve the socket associated with the tid with the inpcb lock held.
294610:
Fix for iWARP servers that listen on INADDR_ANY.
The iWARP Connection Manager (CM) on FreeBSD creates a TCP socket to
represent an iWARP endpoint when the connection is over TCP. For
servers the current approach is to invoke create_listen callback for
each iWARP RNIC registered with the CM. This doesn't work too well for
INADDR_ANY because a listen on any TCP socket already notifies all
hardware TOEs/RNICs of the new listener. This patch fixes the server
side of things for FreeBSD. We've tried to keep all these modifications
in the iWARP/TCP specific parts of the OFED infrastructure as much as
possible.
297124:
iw_cxgbe/libcxgb4: Pull in many applicable fixes from the upstream Linux
iWARP driver and userspace library to the FreeBSD iw_cxgbe and libcxgb4.
This commit includes internal changesets 6785 8111 8149 8478 8617 8648
8650 9110 9143 9440 9511 9894 10164 10261 10450 10980 10981 10982 11730
11792 12218 12220 12222 12223 12225 12226 12227 12228 12229 12654.
297368:
cxgbe/iw_cxgbe: Fix for stray "start_ep_timer timer already started!"
messages.
297406:
Remove unnecessary dequeue_mutex (added in r294610) from the iWARP
connection manager. Examining so_comp without synchronization with
iw_so_event_handler is a harmless race.
300875:
iw_cxgbe: Use vmem(9) to manage PBL and RQT allocations.
300888:
iw_cxgbe: Plug a lock leak in process_mpa_request().
If the parent is DEAD or connect_request_upcall() fails, the parent
mutex is left locked. This leads to a hang when process_mpa_request()
is called again for another child of the listening endpoint.
301158:
iw_cxgbe: Fix panic that occurs when c4iw_ev_handler tries to acquire
comp_handler_lock but c4iw_destroy_cq has already freed the CQ memory
(which is where the lock resides).
301896:
Fix bug in iwcm that caused a panic in iw_cm_wq when krping is run
repeatedly in a tight loop.
301897:
iw_cxgbe: Make sure that send_abort results in a TCP RST and not a FIN.
Release the hold on ep->com immediately after sending the RST. This
fixes a bug that sometimes leaves userspace iWARP tools hung when the
user presses ^C.
304838:
Do not free an uninitialized pointer on soaccept failure in the iWARP
connection manager.
Submitted by: Krishnamraju Eraparaju @ Chelsio (original patch)
Sponsored by: Chelsio Communications
linuxkpi: Fix PCI BAR lazy allocation support.
FreeBSD supports lazy allocation of PCI BAR, that is, when a device
driver's attach method is invoked, even if the device's PCI BAR
address wasn't initialized, the invocation of bus_alloc_resource_any()
(the call chain: pci_alloc_resource() -> pci_alloc_multi_resource() ->
pci_reserve_map() -> pci_write_bar()) would allocate a proper address
for the PCI BAR and write this 'lazy allocated' address into the PCI
BAR.
This model works fine for native FreeBSD device drivers, but _not_ for
device drivers shared with Linux (e.g. dev/mlx5/mlx5_core/mlx5_main.c
and ofed/drivers/net/mlx4/main.c. Both of them use
pci_request_regions(), which doesn't work properly with the PCI BAR
lazy allocation, because pci_resource_type() -> _pci_get_rle() always
returns NULL, so pci_request_regions() doesn't have the opportunity to
invoke bus_alloc_resource_any(). We now use pci_find_bar() in
pci_resource_type(), which is able to locate all available PCI BARs
even if some of them will be lazy allocated.
Submitted by: Dexuan Cui <decui microsoft com>
Reviewed by: hps
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8071
Set hardware stats flag to avoid double counting the number of incoming bytes.
Found by: Ben RUBSON <ben.rubson@gmail.com>
Sponsored by: Mellanox Technologies
The IORESOURCE_XXX defines should resemble a bitmask while SYS_RES_XXX
are not bitmasks. Fix return value of pci_resource_flags() to reflect
this change.
Sponsored by: Mellanox Technologies
Add support for setting blocking and non-blocking mode on /dev/rdma_cm
by returning success on FIONBIO and FIOASYNC IOCTLs. The actual flags
handling is done by the kern_ioctl() function.
Reported by: Alex Bowden <alex.bowden@outlook.com>
Sponsored by: Mellanox Technologies