af366d35 mjg Feb. 8, 2021, 7:15 p.m.
The C variant in libkern performs excessive branching to find the
non-zero byte instead of using the bsfq instruction. The same code
patched to use it is still slower than the routine implemented here
as the compiler keeps neglecting to perform certain optimizations
(like using leaq).

On top of that the routine can is a starting point for copyinstr
which operates on words instead of bytes.

Tested with glibc test suite.

Sample results (calls/s):

$(perl -e "print 'A' x 3"):
stock:	211198039
asm:	465609618

$(perl -e "print 'A' x 100"):
stock:	 83151997
patched: 98285919
asm:	120719888

$(perl -e "print 'A' x 3"):
stock:	282523617
asm:	491498172

$(perl -e "print 'A' x 100"):
stock:	114857172
asm:	112082057
3acea07c mjg Feb. 8, 2021, 7:15 p.m.
... lost in revert
81e074d5 mjg Feb. 8, 2021, 7:15 p.m.
3e2d96ac kevans Feb. 8, 2021, 6:41 p.m.
The basic issue here is that grep, when given -m 1, would stop all
line processing once it hit the match count and exit immediately.  The
problem with exiting immediately is that -A processing only happens when
subsequent lines are processed and do not match.

The fix here is relatively easy; when bsdgrep matches a line, it resets
the 'tail' of the matching context to the value supplied to -A and
dumps anything that's been queued up for -B. After the current line has
been printed and tail is reset, we check our mcount and do what's
needed. Therefore, at the time that we decide we're doing nothing, we
know that 'tail' of the context is correct and we can simply continue
on if there's still more to pick up.

With this change, we still bail out immediately if there's been no -A
flag. If -A was supplied, we signal that we should continue on. However,
subsequent lines will not even bothere to try and process the line.  We
have reached the match count, so even if the next line would match then
we must process it if it hadn't. Thus, the loop in procfile() can
short-circuit and just process the line as a non-match until
procmatches() indicates that it's safe to stop.

A test has been added to reflect both that we should be picking up the
next line and that the next line should be considered a non-match even
if it should have been.

PR:		253350
MFC-after:	3 days
f8ce8aed noreply Feb. 8, 2021, 5:15 p.m.
3d40b65 refactored zfs_vnops.c, which shared much code verbatim between
Linux and BSD.  After a successful write, the suid/sgid bits are reset,
and the mode to be written is stored in newmode.  On Linux, this was
propagated to both the in-memory inode and znode, which is then updated
with sa_update.

3d40b65 accidentally removed the initialization of newmode, which
happened to occur on the same line as the inode update (which has been
moved out of the function).

The uninitialized newmode can be saved to disk, leading to a crash on
stat() of that file, in addition to a merely incorrect file mode.

Reviewed-by: Ryan Moeller <>
Reviewed-by: Brian Behlendorf <>
Signed-off-by: Antonio Russo <>
Closes #11474 
Closes #11576
32bf05ad tsoome Feb. 8, 2021, 4 p.m.
vt is using static buffers for on screen data, the buffer size is
calculated based on maximum supported screen size and 8x16 font.

When using hi-res graphics and very smaller than 8x16 font, we
need to be careful not to overflow static buffers in vt.

Testing: I did test by building smaller buffers than vt currently is using,
royger was testing on actual 4k capable hardware.

MFC after: 1 week
Tested by: royger
c7fcb36f markj Feb. 8, 2021, 2:20 p.m.
MFC after:	1 week
db6b5644 markj Feb. 8, 2021, 2:19 p.m.
When performing encryption in software, the KTLS crypto callback always
locks the session to deliver a wakeup.  But, if we're handling the
operation synchronously this is wasted effort and can result in
sleepqueue lock contention on large systems.

Use CRYPTO_SESS_SYNC() to determine whether the operation will be
completed asynchronously or not, and select a callback appropriately.
Avoid locking the session to check for completion if the session handles
requests synchronously.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	2 weeks
Differential Revision:
68f6800c markj Feb. 8, 2021, 2:19 p.m.
Currently, OpenCrypto consumers can request asynchronous dispatch by
setting a flag in the cryptop.  (Currently only IPSec may do this.)   I
think this is a bit confusing: we (conditionally) set cryptop flags to
request async dispatch, and then crypto_dispatch() immediately examines
those flags to see if the consumer wants async dispatch. The flag names
are also confusing since they don't specify what "async" applies to:
dispatch or completion.

Add a new KPI, crypto_dispatch_async(), rather than encoding the
requested dispatch type in each cryptop. crypto_dispatch_async() falls
back to crypto_dispatch() if the session's driver provides asynchronous

Similarly, add crypto_dispatch_batch() to request processing of a tailq
of cryptops, rather than encoding the scheduling policy using cryptop
flags.  Convert GELI, the only user of this interface (disabled by
default) to use the new interface.

Add CRYPTO_SESS_SYNC(), which can be used by consumers to determine
whether crypto requests will be dispatched synchronously. This is just
a helper macro. Use it instead of looking at cap flags directly.

Fix style in crypto_done(). Also get rid of CRYPTO_RETW_EMPTY() and
just check the relevant queues directly. This could result in some
unnecessary wakeups but I think it's very uncommon to be using more than
one queue per worker in a given workload, so checking all three queues
is a waste of cycles.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	2 weeks
Differential Revision:
7509b677 markj Feb. 8, 2021, 2:19 p.m.
This makes it easier to refactor the GCM code to operate on
crypto_buffer_cursors rather than plain contiguous buffers, with the aim
of minimizing the amount of copying and zeroing done today.

No functional change intended.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
Differential Revision:
0dc70760 markj Feb. 8, 2021, 2:19 p.m.
- We were only hashing up to the first 16 bytes of the AAD.
- When computing the digest during decryption, handle the case where
  len == trailer, i.e., len < AES_BLOCK_LEN, properly.

While here:

- trailer is always smaller than AES_BLOCK_LEN, so remove a pair of
  unnecessary modulus operations.
- Replace some byte-by-byte loops with memcpy() and memset() calls.
  In particular, zero the full block before copying a partial block into
  it since we do that elsewhere and it means that the memset() length is
  known at compile time.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	3 days
Differential Revision:
b5aa9ad4 markj Feb. 8, 2021, 2:19 p.m.
Reviewed by:	gallatin, jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:
1755b2b9 markj Feb. 8, 2021, 2:18 p.m.
This makes it a bit more straightforward to add new counters when
debugging.  No functional change intended.

Reviewed by:	jhb
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
MFC after:	1 week
Differential Revision:
45d75e3a donner Feb. 8, 2021, 1:31 p.m.
Allocate the necessary memory for the conversion dynamically starting
with a value which is sufficient for almost all normal cases.

PR:		187835
Reviewed by:	kp
Differential Revision:
fb8c2f74 trasz Feb. 8, 2021, 10:46 a.m.
Microoptimize set_syscall_retval() for arm64 by predicting
the return value to be zero.  This is similar to what has
been done for other architectures

Reviewed By:	emaste, mhorne
Differential Revision: