Instead of always creating a new MonitoredPackage every time
PackageWatchdog#startObservingHealth is called, just update
the duration of an existing MonitoredPackage if one exists. This
means that the failure history will be preserved.
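
A minimal sketch of the update-in-place behavior described above. The classes SketchMonitoredPackage and SketchObserverState, and their fields, are hypothetical stand-ins for illustration, not the actual PackageWatchdog code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for MonitoredPackage.
class SketchMonitoredPackage {
    final String name;
    long durationMs;
    int failureCount; // failure history that should survive re-observation

    SketchMonitoredPackage(String name, long durationMs) {
        this.name = name;
        this.durationMs = durationMs;
    }
}

// Hypothetical holder for the packages one observer is watching.
class SketchObserverState {
    final Map<String, SketchMonitoredPackage> packages = new HashMap<>();

    // Reuse the existing entry if there is one, so failureCount is preserved.
    void startObserving(String packageName, long durationMs) {
        SketchMonitoredPackage existing = packages.get(packageName);
        if (existing != null) {
            existing.durationMs = durationMs;
        } else {
            packages.put(packageName, new SketchMonitoredPackage(packageName, durationMs));
        }
    }
}
```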
Test: atest PackageWatchdogTest
Bug: 150114865
Change-Id: I6d6e3e0e893a603fda50df833bc5b6ce1757b6ec
Instead of periodically syncing requests with the same information,
only call into the ExplicitHealthCheckController when the set
of packages with pending health checks has changed, or a new observer
has been registered. Add tests to verify that duplicate calls are not made.
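
One way to avoid the duplicate calls, sketched under the assumption that the pending packages can be compared as a set of names. SketchHealthCheckSyncer and its methods are hypothetical, and syncCalls only counts the calls that would reach the controller:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: forwards a request only when the pending set changed.
class SketchHealthCheckSyncer {
    private Set<String> lastSynced = null; // null forces the next sync through
    int syncCalls = 0; // number of calls that would reach the controller

    void syncIfChanged(Set<String> pendingPackages) {
        if (pendingPackages.equals(lastSynced)) {
            return; // same information as last time, skip the duplicate call
        }
        lastSynced = new HashSet<>(pendingPackages); // defensive copy
        syncCalls++;
    }

    // A newly registered observer invalidates the cached set.
    void onObserverRegistered() {
        lastSynced = null;
    }
}
```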
Test: atest PackageWatchdogTest#testSyncHealthCheckRequests
Test: atest NetworkStagedRollbackTest
Bug: 150114865
Bug: 146767850
Change-Id: I2926e9c7689e0ac9c4a142263ffd50a4747d016f
It is possible for null to be returned by
ProcessRecord.getPackageListWithVersionCode on package failure. This
can cause an NPE in Package Watchdog. Ensure that the list of failing
packages is not null.
Test: atest PackageWatchdogTest
Bug: 151113966
Change-Id: Iab23cd6b4b8ae6b787df5f0b831b51e0ac8b3d31
Test the notifyHealthCheckPassed method to ensure that the expected
information is sent when an explicit health check passes.
Bug: 150638807
Test: atest ExplicitHealthCheckServiceTest
Change-Id: I98c1c3bf018a82ea769846b4212c295518814a18
Make Package Watchdog the component that receives calls
about boot events, and decides on whether or not to
perform mitigation action for a perceived boot loop.
The logic for selecting an observer to handle boot loops
is similar to how package failure is handled. The threshold
logic is the same as it was in Rescue Party (5 system server
boots in 10 minutes). Rescue Party maintains its own rescue
levels internally, which map to user impact levels.
Add optional onBootLoop() and executeBootLoopMitigation() methods
to PackageHealthObserver.
Add tests to handle the new cases handled by Package Watchdog.
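
The threshold named above (5 boots in 10 minutes) could be tracked roughly as follows. This is a sketch only: SketchBootThreshold, its constants, and the choice of a fixed window that restarts once it has elapsed are assumptions, not the actual implementation:

```java
// Hypothetical sketch of a boot counter using the thresholds named above.
class SketchBootThreshold {
    static final int BOOT_TRIGGER_COUNT = 5;
    static final long BOOT_TRIGGER_WINDOW_MS = 10 * 60 * 1000L;

    private int count;
    private long windowStartMs;

    // Records one boot at nowMs and returns true once the threshold is reached.
    boolean incrementAndTest(long nowMs) {
        if (count == 0 || nowMs - windowStartMs >= BOOT_TRIGGER_WINDOW_MS) {
            // Start a fresh window with this boot as its first event.
            windowStartMs = nowMs;
            count = 1;
        } else {
            count++;
        }
        return count >= BOOT_TRIGGER_COUNT;
    }
}
```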
Test: atest RescuePartyTest
Test: atest PackageWatchdogTest
Bug: 136135457
Change-Id: Ic435e60318e369509975c19a9888741e047803de
Integrate Rescue Party as an observer for Package
Watchdog, for managing package failures. Rescue Party
will be a persistent observer, meaning it may receive
failure calls for packages it has not explicitly asked
to observe.
Remove app failure calls and thresholding logic from
Rescue Party. Remove obsolete Rescue Party tests
and add persistent observer tests to
PackageWatchdogTest.
Test: atest PackageWatchdogTest
Test: atest RescuePartyTest
Test: atest StagedRollbackTest
Bug: 136135457
Change-Id: I55ec0de48acd5434255811feba758d38c9304478
For the sake of consolidating various error detection mechanisms,
move native crash detection to Package Watchdog. Add a method
to allow the traditional threshold logic to be bypassed in this
case. This method will be used in the future for prioritizing
explicit health check failures.
Test: atest StagedRollbackTest#testNativeWatchdogTriggersRollback
Bug: 145584672
Change-Id: I98eb9f45a6f4a6d15001650e31ba9c596905663a
This is a prerequisite for adding additional logging of
the Watchdog-triggered rollback reason. Add flags which
indicate the failure observed (native, crash, ANR, explicit
health check). These will be used in the future by
RollbackPackageHealthObserver to map the failure type to the
(new) set of available logging metrics.
Test: atest PackageWatchdogTest
Bug: 138782888
Change-Id: I7e7c5e5399011e2761dada2b989a95c2013307e9
Use factory method to create MonitoredPackage which will return null
when version code can't be resolved.
Bug: 141155222
Test: atest PackageWatchdogTest
Change-Id: I6c983872cbdfd02940d76f7307aa4a6a1062d438
The code doesn't work as intended. What we should do is:
1. set up so that health check duration is shorter than observation duration
2. move time forward so we fail the health check
3. check observer.mMitigatedPackages contains only APP_A
4. move time forward again to expire the observation duration
5. check APP_A is not notified again as a failed package
Also add a similar test where the observation duration is shorter than
the health check duration.
Bug: 141518951
Test: atest PackageWatchdogTest
Change-Id: Iba1cdc4fab8608982b416cdb463ed4b38d355c9f
Since startObservingHealth is called during boot, an uncaught exception
there could cause a boot loop. Instead of throwing, we fall back to
DEFAULT_OBSERVING_DURATION_MS when an invalid durationMs is passed.
See b/140780361 for more details about the design decision.
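
A minimal sketch of the fallback, assuming "invalid" means zero or negative. SketchObservingDuration and sanitize() are hypothetical, and the constant's value here is a placeholder:

```java
// Hypothetical helper illustrating the fallback; not the actual PackageWatchdog code.
class SketchObservingDuration {
    // Placeholder value for illustration; the real constant is defined in PackageWatchdog.
    static final long DEFAULT_OBSERVING_DURATION_MS = 2L * 24 * 60 * 60 * 1000;

    // Assumption: an invalid duration is one that is zero or negative.
    static long sanitize(long durationMs) {
        if (durationMs <= 0) {
            // Falling back instead of throwing keeps a bad caller from crashing boot.
            return DEFAULT_OBSERVING_DURATION_MS;
        }
        return durationMs;
    }
}
```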
Bug: 140780361
Test: atest PackageWatchdogTest
Change-Id: I2bcbecb2dc4c2448ef697001dd93aea5f50f9dbf
Use the sliding window algorithm to detect if there exists a window
containing failures equal to or above the trigger threshold.
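
A sketch of one sliding-window formulation, assuming failure timestamps arrive in non-decreasing order. SketchFailureWindow and its parameters are hypothetical names chosen for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of sliding-window failure detection.
class SketchFailureWindow {
    private final Deque<Long> failureTimesMs = new ArrayDeque<>();
    private final int triggerCount;
    private final long windowMs;

    SketchFailureWindow(int triggerCount, long windowMs) {
        this.triggerCount = triggerCount;
        this.windowMs = windowMs;
    }

    // Records a failure at nowMs and reports whether the window ending at
    // nowMs contains at least triggerCount failures.
    boolean onFailure(long nowMs) {
        failureTimesMs.addLast(nowMs);
        // Drop failures that are older than windowMs relative to nowMs.
        // The deque is never empty here because nowMs itself was just added.
        while (nowMs - failureTimesMs.peekFirst() > windowMs) {
            failureTimesMs.removeFirst();
        }
        return failureTimesMs.size() >= triggerCount;
    }
}
```

Unlike a counter that resets at fixed intervals, this keeps every failure timestamp that is still inside the window, so a burst that straddles an interval boundary is still detected.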
Bug: 140841942
Test: atest PackageWatchdogTest
Change-Id: I34a20e4d3b98a093dffa05fc7c7c026905834b53
Since calls to raiseFatalFailure are always followed by
TestLooper#dispatchAll, we can combine them to reduce boilerplate code.
Bug: 140691154
Test: atest PackageWatchdogTest
Change-Id: I0ea23dc132f2ad26ced1119bc5278bc5d876949c
Following go/unit-test-practices, we split testRegistration into smaller
ones so each test focuses on one behavior at a time.
Note we will remove testRegistration in a later CL.
Bug: 140472424
Test: atest PackageWatchdogTest
Change-Id: I88e00a8fc43b953d575ee047979b7fe1d5fbd3ba
TestObserver#mHealthCheckFailedPackages is added to collect packages
when TestObserver#onHealthCheckFailed is called. It will be used to test
if registration/unregistration is done successfully.
TestController#mFailedPackages is also renamed to be distinguished from
mHealthCheckFailedPackages.
Bug: 140472424
Test: atest PackageWatchdogTest
Change-Id: I791e0a1b8e5d59ae766502b54a0782d509b209b5
TestLooper.moveTimeForward() changes the target delivery time of the messages
in the queue to simulate elapsed time. This allows tests to run faster in a
more deterministic way without incurring the indeterminism caused by Thread.sleep()
which is usually a source of flakiness and should be avoided when possible.
Bug: 140208026
Test: atest PackageWatchdogTest
Change-Id: I3365093838ec9fa2de5742359f6947379add7703
This change is motivated by bug 140208026 where we want to replace
Thread.sleep() with TestLooper.moveTimeForward() in PackageWatchdogTest.java.
However, it turns out that PackageWatchdog uses SystemClock.uptimeMillis()
internally. The tests will fail if we don't forward PackageWatchdog's internal
clock accordingly.
We add a wrapper around SystemClock.uptimeMillis() so it is customizable
by the test case.
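
A sketch of the wrapper idea. SketchClock, SketchWatchdog and setClockForTest() are hypothetical names, and System.nanoTime() stands in for SystemClock.uptimeMillis() so the example runs outside Android:

```java
// Hypothetical sketch of an injectable uptime clock.
interface SketchClock {
    long uptimeMillis();
}

class SketchWatchdog {
    // Default clock; on Android this would delegate to SystemClock.uptimeMillis().
    private SketchClock clock = () -> System.nanoTime() / 1_000_000L;
    private long lastScheduleMs;

    // Lets a test substitute a clock it can move forward manually.
    void setClockForTest(SketchClock testClock) {
        clock = testClock;
    }

    void markScheduled() {
        lastScheduleMs = clock.uptimeMillis();
    }

    long elapsedSinceScheduleMs() {
        return clock.uptimeMillis() - lastScheduleMs;
    }
}
```

A test can then advance its fake clock by the same amount it passes to TestLooper.moveTimeForward(), keeping both notions of time in step.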
Bug: 140358475
Test: atest PackageWatchdogTest
Change-Id: Id26325a93dc4050c6468502347b0e7852ed1263f
Refactor NetworkStackClient class to move the module service binding &
network stack process death monitoring to a separate class. This class
will only be instantiated in the SystemServer process.
The new class |SystemServerToNetworkStackConnector| will be used from
the client classes corresponding to each module running on the network
stack process (NetworkStackClient, WifiStackClient, etc)
This has 2 main advantages:
a) Reduces code duplication (Otherwise the various Client classes need
to replicate the service binding & process death monitoring).
b) Central crash recovery for the network stack process (Otherwise the
various Client classes will trigger multiple recovery for a single
network stack process crash).
Bug: 135679762
Test: Device boots up & connects to wifi networks.
Change-Id: I673581b0067b9a3f72dd68a3ab622c18183ebd2e
Merged-In: I673581b0067b9a3f72dd68a3ab622c18183ebd2e
The test adds a dependency on mockito extended to be able to mock the
Context, PackageManager etc.
Test: atest PackageWatchdogTest#testNetworkStackFailure (+rest of class)
Bug: 133725814
Change-Id: Iba8a47f5e94b5dba49d6d395085e77285305ee7c
In addition to the NetworkStack app monitoring, have PackageWatchdog
register an observer to NetworkStackClient to receive severe failure
notifications, and attempt a rollback if available.
The callback is registered in onPackagesReady(), which is called in the
boot sequence just before starting the NetworkStack.
Test: installed new networkstack, killed it twice, observe rollback
Test: unit test in change on top
Bug: 133725814
Change-Id: I2cb4200b78c2482cacc4bfe2ace1581b869be512
Make PackageWatchdogTest compatible with the changes that added
DeviceConfig flags to PackageWatchdog. This includes:
* Make PackageWatchdog#setExplicitHealthCheckEnabled private and
use DeviceConfig mechanism for changing that value instead
* Disable TestLooper#startAutoDispatch
* Other minor refinements that solve compatibility issues
Bug: 129335707
Test: atest com.android.server.PackageWatchdogTest
Merged-In: I7323dc65ec2957aeab128224864441bdf63c6f81
Change-Id: I7323dc65ec2957aeab128224864441bdf63c6f81
1. Receiving List<PackageInfo>:
Since I29e2d619a5296716c29893ab3aa2f35f69bfb4d7, we now receive a
List of PackageInfo instead of Strings for packages supporting
explicit health checks. Now, we parse this List<PackageInfo> from
ExtServices instead of trying to parse List<String> and we use the
health check timeout in the PackageInfo as the health check expiry
deadline instead of using the total package expiry time.
2. Updating health check durations onSupportedPackages:
Before, we always updated the health check duration for a
package if the package was supported and the health check state was
not PASSED. This caused the health check duration for a package to
never decrease as long as we kept getting onSupportedPackages. Now, we
improved the readability of the state transitions onSupportedPackages
and correctly update the health check duration only for supported
packages in the INACTIVE state.
3. FAILED state:
Before, we only had INACTIVE, ACTIVE and PASSED states. When a package
had failed the health check, we could notify the observer multiple
times in quick succession and get into a bad internal state with
negative health check durations. Now we added a check to ensure we
don't try to schedule with a Handler with a negative duration, and we
defined a negative health check duration to be a new FAILED state if the
health check is not passed. This clearly defines the state transitions
as seen below:
+----------+     +---------+    +------+
|          |     |         |    |      |
| INACTIVE +---->+ ACTIVE  +--->+PASSED|
|          |     |         |    |      |
+-----+----+     +----+----+    +------+
      |               |
      |               |
      |               |
      |               |
      |          +----v----+
      |          |         |
      +----------> FAILED  |
                 |         |
                 +---------+
4. Uptime state:
Every time we pruned observers, we scheduled the next prune and stored
the current SystemClock#uptimeMillis. This allowed us to determine how
much time had elapsed for the next prune. The uptime was not correctly
updated when starting to observe already observed packages. With the
following sequence of events:
-monitor package A for 1hr
-30mins elapsed
-monitor package A again for 1hr
A would expire 30mins from the last event instead of 1hr.
This was because the second time around, we saved the new state to disk
but did not reschedule, so we did not update the uptime at last schedule.
1hr from the first event, we would then prune packages using the original
uptime and incorrectly expire A early. Now we update all internal state,
which fixes this, and we added a test for this case.
5. Readability
Improved method variable names, logging and comments.
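
The state transitions in item 3 can be sketched as a pure function of two values. SketchHealthCheck, the NOT_STARTED sentinel, and the choice to treat a zero remaining duration as expired are assumptions made for illustration (the description above only says that a negative duration means FAILED):

```java
// Hypothetical sketch of deriving the health check state.
enum SketchHealthCheckState { INACTIVE, ACTIVE, PASSED, FAILED }

class SketchHealthCheck {
    // Hypothetical encoding: MAX_VALUE means no health check deadline is set yet.
    static final long NOT_STARTED = Long.MAX_VALUE;

    static SketchHealthCheckState stateOf(boolean hasPassed, long remainingMs) {
        if (hasPassed) {
            return SketchHealthCheckState.PASSED;
        }
        if (remainingMs == NOT_STARTED) {
            return SketchHealthCheckState.INACTIVE;
        }
        if (remainingMs <= 0) {
            // Assumption: zero is treated like a negative duration, i.e. expired.
            return SketchHealthCheckState.FAILED;
        }
        return SketchHealthCheckState.ACTIVE;
    }

    // Never hand a negative delay to a scheduler.
    static long safeDelayMs(long remainingMs) {
        return Math.max(0L, remainingMs);
    }
}
```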
Bug: 120598832
Test: Manual testing && atest PackageWatchdogTest
Change-Id: I1512d5938848ad26b668636405fe9b0db50d3a2e
We have always evaluated the explicit health check results on package
expiry. Since I29e2d619a5296716c29893ab3aa2f35f69bfb4d7 we now receive
explicit health check timeouts from ExtServices. This cl doesn't yet
use the timeout but it treats explicit health check timeouts as
different events from package expiry. This is in preparation to use
the timeouts from the cl mentioned above.
Improved readability: Logging, comments, variable and function names
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I8030dae1fef5b8fee42095c1eaf16861cc33ac59
Improvements:
1. Queuing PackageWatchdog requests to startObserving packages:
When observing packages with the watchdog, we needed to get
the packages supporting explicit health checks so we can decide if a
package should be passing or not. This prevented us from receiving
requests to monitor packages during early boot, before third party
packages are ready. In this change we don't depend on ExtServices
being up to startObserving. We initially treat all packages as failing a
health check and lazily syncRequests to request or cancel explicit
health checks based on the currently observed packages. When we receive
onSupportedPackages, we mark the packages that don't support health
checks as passing.
2. Lazy binding to the explicit health check service:
We were always bound to the explicit health check
service regardless of whether we were expecting requests or not. We need
to be able to bind and unbind dynamically to improve device resource
usage. In this change, we bind as soon as we make a request and are
expecting results, and we unbind otherwise.
3. Fixed Races:
There were a couple of potential races that could lead to exceptions
that could bring the system server down, e.g. when the service is
transitioning between disconnected and connected state (maybe it
crashed), when ExtServices is being updated and is down, or when early
boot requests arrive before third party apps are ready. This change
fixes these races.
4. Logging:
We improved the logging wording and order and made it more consistent.
Bug: 120598832
Test: Manual tests. Stress tested behavior by killing extservices and
making requests simultaneously
function killproc {
    while true
    do
        local pid=$(adb shell pidof $1)
        if [[ ! -z $pid ]]
        then
            echo $pid
            adb shell kill $pid
        fi
    done;
}
adb install-multi-package -i com.android.shell --enable-rollback \
    NetworkStack.apk ModuleMetadataGoogle.apk
Also switched between enabled and disabled states to verify packages
are handled correctly. Will automate these tests in later cl
atest PackageWatchdogTest
Change-Id: Iafaef553e95d107f700109f9a8328950a5e2bf71
PackageWatchdog now uses the ExplicitHealthCheckController introduced
in Ia030671c99699bd8d8273f32a97a1d3b7b015d3b when observing packages.
Bug: 120598832
Test: Manually tested that after an APEX update, the network stack
does not pass the explicit health check until WiFi is connected
successfully. If Wi-Fi is never connected and the network stack
monitoring duration is exceeded, the update is rolled back.
Change-Id: I75d3cc909cabb4a4eb34df1d5022d1afc629dac3
As part of extending PackageWatchdog with explicit health check support
in Ib4322c327bcb00ca9a3fbdc83579e7b5f2fd633b, trigger the observer's #execute
method if a package never passed the explicit health check on expiry.
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I8e916a6ca115d3883fe29f66456da36cd0ed09fb
Allow PackageWatchdog to monitor packages with explicit health checks
enabled. In this case, at the end of a monitoring duration if a
passed-health-check callback is not triggered, the package would be
regarded as failed (in a later cl) and the observer is notified.
If monitoring without explicit health checks, the behavior is the same
as before, packages expire silently.
TODO: Implement the package failure trigger on expiry with failed
explicit checks and enable added tests
Removed username from TODO comments
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: Ib4322c327bcb00ca9a3fbdc83579e7b5f2fd633b
We now pass a VersionedPackage argument instead of passing separate
method arguments for packageName and versionCode.
Test: atest PackageWatchdogTest
Bug: 120598832
Change-Id: I8dd7e6d1e144251830108c58f4a752c411d7295b
PackageHealthObservers may need to verify that the package failure
notification they receive matches the expected package version code.
We now pass the version code along with the package name when notifying
observers.
Test: atest com.android.server.PackageWatchdogTest
Bug: 120598832
Change-Id: I272965d08a07240f3bde358039b52187ff2dd3cf
When a package fails health check, observers will report the impact of their
action on the user. Only the observer with the least user impact will be
allowed to take action.
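
A sketch of the selection rule. SketchImpactObserver and SketchObserverPicker are hypothetical, and the convention that an impact of 0 means "this observer cannot help" is an assumption made for this example:

```java
import java.util.List;

// Hypothetical observer that reports a fixed user impact for any package.
class SketchImpactObserver {
    final String name;
    final int userImpact; // lower means less disruptive; 0 is assumed to mean "no action"

    SketchImpactObserver(String name, int userImpact) {
        this.name = name;
        this.userImpact = userImpact;
    }

    int onHealthCheckFailed(String packageName) {
        return userImpact;
    }
}

class SketchObserverPicker {
    // Returns the observer with the smallest non-zero impact, or null if none can act.
    static SketchImpactObserver pickLeastImpact(List<SketchImpactObserver> observers,
            String packageName) {
        SketchImpactObserver best = null;
        int bestImpact = Integer.MAX_VALUE;
        for (SketchImpactObserver observer : observers) {
            int impact = observer.onHealthCheckFailed(packageName);
            if (impact > 0 && impact < bestImpact) {
                best = observer;
                bestImpact = impact;
            }
        }
        return best;
    }
}
```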
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I15f358cd599431e1d7ea211aea5b1391f4aa33ab
Fixes:
1. Remove registered observer when removed from persisted file
2. Only call external observers after threshold is exceeded
3. Handle edge case where we reschedule package cleanup and elapsed time
is longer than scheduled duration
4. Modify code to allow easier testing
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I92181136fb5994a4d8ebe976be3138f210e853a5