gcp-doctor - Diagnostics for Google Cloud Platform

Overview

gcp-doctor is a command-line diagnostics tool for GCP customers. It finds and helps fix common issues in Google Cloud Platform projects, testing them against a wide range of best practices and frequent mistakes based on the troubleshooting experience of the Google Cloud Support team.

gcp-doctor is open-source and contributions are welcome! Note that this is not an officially supported Google product, but a community effort. The Google Cloud Support team maintains this code and we do our best to avoid causing any problems in your projects, but we give no guarantees to that end.

Installation

You can run gcp-doctor using a shell wrapper that starts gcp-doctor in a Docker container. This should work on any machine with Docker installed, including Cloud Shell.

curl https://storage.googleapis.com/gcp-doctor/gcp-doctor.sh >gcp-doctor
chmod +x gcp-doctor
gcloud auth login --update-adc
./gcp-doctor lint --auth-adc --project=MYPROJECT

Note: the gcloud auth step is not required in Cloud Shell.

Usage

Currently gcp-doctor supports mainly one subcommand: lint, which runs diagnostics on one or more GCP projects.

usage: gcp-doctor lint --project P [OPTIONS]

Run diagnostics in GCP projects.

optional arguments:
  -h, --help           show this help message and exit
  --auth-adc           Authenticate using Application Default Credentials
  --auth-key FILE      Authenticate using a service account private key file
  --project P          Project ID of project that should be inspected (can be specified multiple times)
  --billing-project P  Project used for billing/quota of API calls done by gcp-doctor
                       (default is the inspected project, requires 'serviceusage.services.use' permission)
  --show-skipped       Show skipped rules
  --hide-ok            Hide rules with result OK
  -v, --verbose        Increase log verbosity
  --within-days D      How far back to search logs and metrics (default: 3)

Authentication

gcp-doctor supports authentication using multiple mechanisms:

  1. OAuth user consent flow

    By default gcp-doctor uses an OAuth user authentication flow, similar to what gcloud does. It prints a URL that you need to open in a browser, and asks you to enter the token that you receive after authenticating there. Note that this currently doesn't work for people outside of google.com, because gcp-doctor is not yet approved for external OAuth authentication.

    The credentials are cached on disk, so you can keep running gcp-doctor for 1 hour without re-authenticating. To remove the cached credentials, delete the $HOME/.cache/gcp-doctor directory.

  2. Application Default Credentials

    If you supply --auth-adc, gcp-doctor uses Application Default Credentials to authenticate. This works out of the box in Cloud Shell, where you don't need to re-authenticate; elsewhere you can use gcloud auth login --update-adc to refresh the credentials.

  3. Service account key

    You can also use the --auth-key parameter to specify the private key of a service account.

The authenticated user needs, at minimum, both of the following roles:

  • Viewer on the inspected project
  • Service Usage Consumer on the project used for billing/quota enforcement, which by default is the project being inspected, but can be explicitly set using the --billing-project option

The Editor and Owner roles include all the required permissions, but if you use service account authentication (--auth-key), we recommend granting that service account only the Viewer and Service Usage Consumer roles.

Test Products, Classes, and IDs

Tests are organized by product, class, and ID.

The product is the GCP service that is being tested. Examples: GKE or GCE.

The class indicates what kind of test it is; currently we have:

  • BP: Best practices, opinionated recommendations
  • WARN: Warnings, things that are possibly wrong
  • ERR: Errors, things that are very likely to be wrong
  • SEC: Potential security issues

The ID is currently formatted as YYYY_NNN, where YYYY is the year the test was written, and NNN is a counter. The ID must be unique per product/class combination.
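As a sketch, the ID convention can be expressed as a small validator (a hypothetical helper, not part of the gcp-doctor codebase):

```python
import re

# gcp-doctor rule IDs have the form YYYY_NNN: a 4-digit year, an
# underscore, and a 3-digit counter. Illustrative helper only.
RULE_ID_RE = re.compile(r"^(?P<year>\d{4})_(?P<counter>\d{3})$")

def parse_rule_id(rule_id: str):
    """Return (year, counter) for a valid YYYY_NNN ID, or None otherwise."""
    m = RULE_ID_RE.match(rule_id)
    if m is None:
        return None
    return int(m.group("year")), int(m.group("counter"))
```

The full rule name is then product/CLASS/ID (for example gke/ERR/2021_003), and uniqueness only needs to hold within each product/class combination.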

Each test also has a short_description and a long_description. The short description is a statement about the good state that is being verified to be true (i.e., we don't test for errors, we test for compliance: that a problem is not present).

Available Rules

Product Class ID Short description Long description
gce BP 2021_001 Serial port logging is enabled. Serial port output can often be useful for troubleshooting, and enabling serial logging makes sure that you don't lose the information when the VM is restarted. Additionally, serial port logs are timestamped, which is useful to determine when a particular serial output line was printed. Reference: https://cloud.google.com/compute/docs/instances/viewing-serial-port-output
gce ERR 2021_001 Managed instance groups are not reporting scaleup failures. Suggested Cloud Logging query: resource.type="gce_instance" AND log_id(cloudaudit.googleapis.com/activity) AND severity=ERROR AND protoPayload.methodName="v1.compute.instances.insert" AND protoPayload.requestMetadata.callerSuppliedUserAgent="GCE Managed Instance Group"
gce WARN 2021_001 GCE instance service account permissions for logging. The service account used by a GCE instance should have the logging.logWriter permission; otherwise, if you install the logging agent, it won't be able to send the logs to Cloud Logging.
gce WARN 2021_002 GCE nodes have good disk performance. Verify that the persistent disks used by the GCE instances provide a "good" performance, where good is defined to be less than 100ms IO queue time. If it's more than that, it probably means that the instance would benefit from a faster disk (changing the type or making it larger).
gke BP 2021_001 GKE system logging and monitoring enabled. Disabling system logging and monitoring (aka "GKE Cloud Operations") severely impacts the ability of Google Cloud Support to troubleshoot any issues that you might have.
gke ERR 2021_001 GKE nodes service account permissions for logging. The service account used by GKE nodes should have the logging.logWriter role, otherwise ingestion of logs won't work.
gke ERR 2021_002 GKE nodes service account permissions for monitoring. The service account used by GKE nodes should have the monitoring.metricWriter role, otherwise ingestion of metrics won't work.
gke ERR 2021_003 App-layer secrets encryption is activated and Cloud KMS key is enabled. GKE's default service account cannot use a disabled Cloud KMS key for application-level secrets encryption.
gke ERR 2021_004 GKE nodes aren't reporting connection issues to apiserver. GKE nodes need to connect to the control plane to register and to report status regularly. If connection errors are found in the logs, possibly there is a connectivity issue, like a firewall rule blocking access. The following log line is searched: "Failed to connect to apiserver"
gke ERR 2021_005 GKE nodes aren't reporting connection issues to storage.google.com. GKE nodes need to download artifacts from storage.google.com:443 when booting. If a node reports that it can't connect to storage.google.com, it probably means that it can't boot correctly. The following log line is searched in the GCE serial logs: "Failed to connect to storage.googleapis.com"
gke ERR 2021_006 GKE Autoscaler isn't reporting scaleup failures. If the GKE autoscaler reported a problem when trying to add nodes to a cluster, it could mean that you don't have enough resources to accommodate new nodes. E.g. you might not have enough free IP addresses in the GKE cluster network. Suggested Cloud Logging query: resource.type="gce_instance" AND log_id(cloudaudit.googleapis.com/activity) AND severity=ERROR AND protoPayload.methodName="v1.compute.instances.insert" AND protoPayload.requestMetadata.callerSuppliedUserAgent="GCE Managed Instance Group for GKE"
gke ERR 2021_007 Service Account used by the cluster exists and is not disabled. Disabling or deleting the service account used by the node pool will render the node pool non-functional. To fix, restore the default compute service account or the service account that was specified when the node pool was created.
gke SEC 2021_001 GKE nodes don't use the GCE default service account. The GCE default service account has more permissions than are required to run your Kubernetes Engine cluster. You should either use GKE Workload Identity or create and use a minimally privileged service account. Reference: Hardening your cluster's security https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster#use_least_privilege_sa
gke WARN 2021_001 GKE master version available for new clusters. The GKE master version should be a version that is available for new clusters. If a version is not available it could mean that it is deprecated, or possibly retired due to issues with it.
gke WARN 2021_002 GKE nodes version available for new clusters. The GKE nodes version should be a version that is available for new clusters. If a version is not available it could mean that it is deprecated, or possibly retired due to issues with it.
gke WARN 2021_003 GKE cluster size close to maximum allowed by pod range. The maximum number of nodes in a GKE cluster is limited based on its pod CIDR range. This test checks if the cluster is approaching the maximum number of nodes allowed by the pod range. Users may end up blocked in production if they are not able to scale their cluster due to this hard limit imposed by the pod CIDR.
gke WARN 2021_004 GKE system workloads are running stable. GKE includes some system workloads running on the user-managed nodes which are essential for the correct operation of the cluster. We verify that the restart count of containers in the system namespaces (kube-system, istio-system, custom-metrics) stayed stable in the last 24 hours.
gke WARN 2021_005 GKE nodes have good disk performance. Disk performance is essential for the proper operation of GKE nodes. If too much IO is done and the disk latency gets too high, system components can start to misbehave. Often the boot disk is a bottleneck because it is used for multiple things: the operating system, docker images, container filesystems (usually including /tmp, etc.), and EmptyDir volumes.
gke WARN 2021_006 GKE nodes aren't reporting conntrack issues. The following string was found in the serial logs: nf_conntrack: table full See also: https://cloud.google.com/kubernetes-engine/docs/troubleshooting

Comments
  • BP_EXT/2022_001 - Google Groups are Enabled

    This pull request includes three changes.

    1. Adds has_authenticator_group_enabled to the GKE query class. This checks whether the ['authenticatorGroupsConfig']['enabled'] key exists. If authenticator groups were previously enabled, the previously used gke-security-groups email will persist but the enabled key will disappear.
    2. Added in the lint check for BP/2021_002.
    3. Provides the documentation and justification for why Google Groups for RBAC should be considered a best practice.
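
    As a rough sketch, the key-existence check described in point 1 could look like this (hypothetical code; the actual implementation in the GKE query class may differ):

```python
# If authenticator groups were enabled and later disabled, the
# gke-security-groups email may persist in the config while the
# 'enabled' key disappears, so only that key's presence and value count.
def has_authenticator_group_enabled(resource_data: dict) -> bool:
    config = resource_data.get('authenticatorGroupsConfig', {})
    return bool(config.get('enabled'))
```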
    cla: yes ready to pull 
    opened by taylorjstacey 5
  • Output format issue

    Hello!

    I have an issue with the output format. In the documentation I noticed that gcpdiag has support for multiple output formats (json, csv, terminal). But whenever I issue gcpdiag lint --output json --project PROJECT, I receive the following error: gcpdiag lint: error: unrecognized arguments: --output json. Could you please explain what the issue might be?

    Regards

    opened by rvidxr666 4
  • Error: sqlite3.OperationalError: unable to open database file

    Hi, tried using this for the first time on ubuntu 20.04, got this error.

    $ curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag
    $ chmod +x gcpdiag
    $ ./gcpdiag lint --project=MYPROJECT
    Unable to find image 'us-docker.pkg.dev/gcpdiag-dist/release/gcpdiag:0.55' locally
    0.55: Pulling from gcpdiag-dist/release/gcpdiag
    <DOCKER PULL STUFF>
    Digest: sha256:0b5fcc0fd3e2f1b822cec492b0128f7e1df5173c19990570ee072c80cf6164c4
    Status: Downloaded newer image for us-docker.pkg.dev/gcpdiag-dist/release/gcpdiag:0.55
    gcpdiag 🩺 0.55
    
    Starting lint inspection (project: MYPROJECT)...
    
    Traceback (most recent call last):
      File "/opt/gcpdiag/bin/gcpdiag", line 64, in <module>
        main(sys.argv)
      File "/opt/gcpdiag/bin/gcpdiag", line 42, in main
        lint_command.run(argv)
      File "/opt/gcpdiag/gcpdiag/lint/command.py", line 237, in run
        apis.verify_access(context.project_id)
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 253, in verify_access
        if not is_enabled(project_id, 'cloudresourcemanager'):
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 246, in is_enabled
        return f'{service_name}.googleapis.com' in _list_apis(project_id)
      File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
        return lru_cached_func(*args, **kwargs)
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 230, in _list_apis
        serviceusage = get_api('serviceusage', 'v1', project_id)
      File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
        return lru_cached_func(*args, **kwargs)
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 197, in get_api
        credentials = _get_credentials()
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 160, in _get_credentials
        return _get_credentials_oauth()
      File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 119, in _get_credentials_oauth
        with caching.get_cache() as diskcache:
      File "/opt/gcpdiag/gcpdiag/caching.py", line 60, in get_cache
        _cache = diskcache.Cache(config.CACHE_DIR, tag_index=True)
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 456, in __init__
        sql = self._sql_retry
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 652, in _sql_retry
        sql = self._sql
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 648, in _sql
        return self._con.execute
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 623, in _con
        con = self._local.con = sqlite3.connect(
    sqlite3.OperationalError: unable to open database file
    
    opened by pbiggar 3
  • "docker: invalid reference format." during Github Actions usage

    I tried to use it with Github Actions - minimalistic setup to reproduce (I know configs like credentials are missing, but they're not required for the reproduction):

    name: GCP Diag
    
    on:
      push:
    
    jobs:
      build:
        runs-on: ubuntu-latest
    
        steps:
        - name: Run GCP diag
          run: |
            curl https://gcpdiag.dev/gcpdiag.sh > gcpdiag
            chmod +x gcpdiag
            ./gcpdiag lint --project=dummy-non-existing-project
    

    Output:

    Run
      curl https://gcpdiag.dev/gcpdiag.sh > gcpdiag
      chmod +x gcpdiag
      ./gcpdiag lint --project=dummy-non-existing-project
      shell: /usr/bin/bash -e {0}
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    100  4382  100  4382    0     0  20381      0 --:--:-- --:--:-- --:--:-- 20381
    docker: invalid reference format.
    See 'docker run --help'.
    Error: Process completed with exit code 125.
    

    I think "docker: invalid reference format." isn't an intended error message.

    opened by nbali 2
  • don't warn about GKE for projects not using it

    I get

    🔎  gke/ERR/2021_007: GKE service account permissions.
       - xyz                                                        [FAIL]
         service account: [email protected]
         missing role: roles/container.serviceAgent
    
       Verify that the Google Kubernetes Engine service account exists and has the
       Kubernetes Engine Service Agent role on the project.
    
       https://gcpdiag.dev/rules/gke/ERR/2021_007
    

    even for projects that don't use GKE. It'd be nice if the tool checked whether the corresponding API was enabled and changed the applied rules accordingly.

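    The requested behavior could be sketched like this (PRODUCT_APIS and applicable_rules are hypothetical names, not gcpdiag internals):

```python
# Map each product to the API whose enablement signals that the
# product is actually in use in the inspected project.
PRODUCT_APIS = {
    'gke': 'container.googleapis.com',
    'gce': 'compute.googleapis.com',
}

def applicable_rules(rules, enabled_apis):
    """Keep (product, rule_id) pairs whose product API is enabled.

    Products without a known API mapping are kept unconditionally.
    """
    return [
        (product, rule_id) for product, rule_id in rules
        if product not in PRODUCT_APIS or PRODUCT_APIS[product] in enabled_apis
    ]
```
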
    opened by black-snow 1
  • [gcs/BP/2022_001] KeyError: 'iamConfiguration'

    gcpdiag version: 0.55

    It seems that some buckets do not have an iamConfiguration field, which makes gcpdiag crash with an exception (this seems to happen only on old buckets).

    Ideally, gcpdiag should test whether the iamConfiguration key is defined and, if it is not, return a default value.

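    A minimal sketch of that defensive lookup (a hypothetical rewrite, not the actual gcpdiag patch):

```python
# Treat a missing 'iamConfiguration' key as "uniform access disabled"
# instead of raising KeyError, as can happen with old buckets.
def is_uniform_access(resource_data: dict) -> bool:
    iam_config = resource_data.get('iamConfiguration', {})
    return bool(iam_config.get('uniformBucketLevelAccess', {}).get('enabled', False))
```
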
    python exception:

    Traceback (most recent call last):
      File "/opt/gcpdiag/bin/gcpdiag", line 64, in <module>
        main(sys.argv)
      File "/opt/gcpdiag/bin/gcpdiag", line 42, in main
        lint_command.run(argv)
      File "/opt/gcpdiag/gcpdiag/lint/command.py", line 240, in run
        exit_code = repo.run_rules(context, report, include_patterns,
      File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 367, in run_rules
        rule.run_rule_f(context, rule_report)
      File "/opt/gcpdiag/gcpdiag/lint/gcs/bp_2022_001_bucket_access_uniform.py", line 37, in run_rule
        elif b.is_uniform_access():
      File "/opt/gcpdiag/gcpdiag/queries/gcs.py", line 48, in is_uniform_access
        return self._resource_data['iamConfiguration']['uniformBucketLevelAccess'][
    KeyError: 'iamConfiguration'
    

    Bucket JSON returned by the API for the failing bucket:

    {
      "kind": "storage#bucket",
      "selfLink": "https://www.googleapis.com/storage/v1/b/mybucket",
      "id": "mybucket",
      "name": "mybucket",
      "projectNumber": "00000000000",
      "metageneration": "9",
      "location": "US",
      "storageClass": "STANDARD",
      "etag": "CAk=",
      "timeCreated": "2016-07-12T15:05:45.473Z",
      "updated": "2022-06-22T10:25:28.219Z",
      "locationType": "multi-region",
      "rpo": "DEFAULT"
    }
    

    Bucket JSON returned by the API for a working bucket:

    {
      "kind": "storage#bucket",
      "selfLink": "https://www.googleapis.com/storage/v1/b/mybucket",
      "id": "mybucket",
      "name": "mybucket",
      "projectNumber": "00000000000",
      "metageneration": "3",
      "location": "EU",
      "storageClass": "STANDARD",
      "etag": "CAM=",
      "defaultEventBasedHold": false,
      "timeCreated": "2021-04-06T10:19:45.615Z",
      "updated": "2021-04-06T13:10:55.598Z",
      "iamConfiguration": {
        "bucketPolicyOnly": {
          "enabled": false
        },
        "uniformBucketLevelAccess": {
          "enabled": false
        },
        "publicAccessPrevention": "inherited"
      },
      "locationType": "multi-region",
      "satisfiesPZS": false,
      "rpo": "DEFAULT"
    }
    
    opened by alainknaebel 1
  • https://gcpdiag.dev/gcpdiag.sh not found

    The installation method mentioned in the docs does not work currently:

    $ curl https://gcpdiag.dev/gcpdiag.sh
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message></Error>%
    
    opened by bkw 1
  • Fix os version detection on err_2022_002_image_versions

    This PR fixes the following issue:

    Traceback (most recent call last):
      File "/gcpdiag/bin/gcpdiag", line 64, in <module>
        main(sys.argv)
      File "/gcpdiag/bin/gcpdiag", line 42, in main
        lint_command.run(argv)
      File "/gcpdiag/gcpdiag/lint/command.py", line 267, in run
        exit_code = repo.run_rules(context, report, include_patterns,
      File "/gcpdiag/gcpdiag/lint/__init__.py", line 430, in run_rules
        rule.run_rule_f(context, rule_report)
      File "/gcpdiag/gcpdiag/lint/dataproc/err_2022_002_image_versions.py", line 96, in run_rule
        if ImageVersion(cluster.image_version).is_deprecated():
      File "/gcpdiag/gcpdiag/lint/dataproc/err_2022_002_image_versions.py", line 78, in is_deprecated
        if self.version.os == 'debian' and self.os_ver < 10:
    AttributeError: 'ImageVersion' object has no attribute 'os_ver'
    
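    The kind of fix described could be sketched as follows (a hypothetical parser; the real err_2022_002_image_versions.py differs):

```python
import re

# Parse a Dataproc image version such as "1.5-debian10" into an OS name
# and OS version number, so a deprecation check can compare os_ver.
_IMAGE_RE = re.compile(r"^(?P<major>\d+)\.(?P<minor>\d+)-(?P<os>[a-z]+)(?P<os_ver>\d+)?$")

def parse_image_version(image_version: str):
    m = _IMAGE_RE.match(image_version)
    if m is None:
        return None  # unrecognized format; caller should handle gracefully
    os_ver = int(m.group('os_ver')) if m.group('os_ver') else None
    return m.group('os'), os_ver
```
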
    ready to pull 
    opened by DKbyo 1
  • Add shared VPC property to GKE model

    This also updates the private_google_access GKE test to take the shared VPC scenario into consideration.

    Currently this fails for a private cluster deployed in a shared VPC (if the VPC is not in the same project):

    🔎 gke/ERR/2022_002: GKE nodes of private clusters can access Google APIs and services.
       - foo-bar/europe-west4/foo                    [FAIL]
          subnet bar has Private Google Access disabled and Cloud NAT is not available
    
       Private GKE clusters must have Private Google Access enabled on the subnet
       where cluster is deployed.
    
       https://gcpdiag.dev/rules/gke/ERR/2022_002
    
    opened by eyalzek 1
  • fix: wrong project IDs for tests

    All rules were tested using the same project ID: 'gcpdiag-gke1-aaaa'. The tests were skipped, i.e. no resources were found. I modified the project IDs in the tests accordingly.

    ready to pull 
    opened by faripple 1
  • [WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"

    Crash when Identity and Access Management (IAM) API is not enabled.

     OS Config service account has the required permissions.
    [WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
       ... fetching IAM roles: projects/my-project
       ... executing monitoring query (project: my-project)
    Traceback (most recent call last):
      File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 221, in get_project_policy
        return ProjectPolicy(project_id)
      File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 212, in __init__
        self._custom_roles = _fetch_roles('projects/' + self._project_id,
      File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 58, in _fetch_roles
        response = request.execute(num_retries=config.API_RETRIES)
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
        return wrapped(*args, **kwargs)
      File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 937, in execute
        raise HttpError(resp, content, uri=self.uri)
    googleapiclient.errors.HttpError: <HttpError 403 when requesting https://iam.googleapis.com/v1/projects/my-project/roles?view=FULL&alt=json returned "Identity and Access Management (IAM) API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.". Details: "[{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Google developers console API activation', 'url': 'https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890'}]}, {'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'SERVICE_DISABLED', 'domain': 'googleapis.com', 'metadata': {'consumer': 'projects/1234567890', 'service': 'iam.googleapis.com'}}]">
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/opt/gcpdiag/bin/gcpdiag", line 65, in <module>
        main(sys.argv)
      File "/opt/gcpdiag/bin/gcpdiag", line 43, in main
        lint_command.run(argv)
      File "/opt/gcpdiag/gcpdiag/lint/command.py", line 194, in run
        exit_code = repo.run_rules(context, report, include_patterns,
      File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 351, in run_rules
        rule.prefetch_rule_future.result()
      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 445, in result
        return self.__get_result()
      File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/opt/gcpdiag/gcpdiag/lint/gce/err_2021_002_osconfig_perm.py", line 34, in prefetch_rule
        iam.get_project_policy(pid)
      File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
        return lru_cached_func(*args, **kwargs)
      File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 223, in get_project_policy
        raise utils.GcpApiError(err) from err
    gcpdiag.utils.GcpApiError: can't fetch data, reason: Identity and Access Management (IAM) API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
    [WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
    [WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
    [WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
       ... still fetching logs (project: my-project, resource type: k8s_node, max wait: 112s)%
    
    opened by kevin-shelaga 1
  • Update Architecture

    • Removed "(see: Codelab: Logs-based Rule)" as this was a broken link
    • Removed "(see also: test product classes and ids)." as this was a broken link
    • Updated example code to use current existing code that references gcpdiag and not gcp-doctor
    • Added a way to link to the rule in the documentation website
    • Added a link to the GitHub code, to allow users to explore further
    opened by 444B 0
  • Running gcpdiag - recommendations for resource limitations

    It would be good to have some info about the expected resource usage of running gcpdiag (how much memory the process may consume, or network bandwidth for example). This might scale in terms of the number of resources in the project being analyzed, so a few examples may be necessary.

    Also please include best practices for choosing an environment to run gcpdiag, considering

    • Cloud Shell would work for most cases
    • if the user needs to limit the resource usage (run within a container for example)
    • whether or not to run on a production machine vs a throw-away VM.
    opened by LukeStone 0
  • interactively/automatically enable required APIs

    Thanks for the tool, quite handy!

    It'd be nice if, instead of failing, it would just ask for permission and enable the required APIs by itself. I got prompted for three:

    gcloud services enable serviceusage.googleapis.com --project=
    gcloud services enable cloudresourcemanager.googleapis.com --project=
    gcloud services enable iam.googleapis.com --project=
    

    It's also missing from the README, I think.

    opened by black-snow 0
  • Even so, uniform bucket-level access is recommended.

    gcpdiag lint execution results are not as intended: it recommends using uniform bucket-level access even though the bucket configuration already uses uniform bucket-level access.

    Execution result of gcpdiag lint

    🔎  gcs/BP/2022_001: Buckets are using uniform access
       - BUCKET_NAME                               [FAIL]
         it is recommend to use uniform access on your bucket
    
       Google recommends using uniform access for a Cloud Storage bucket IAM policy
       https://cloud.google.com/storage/docs/access-
       control#choose_between_uniform_and_fine-grained_access
    
       https://gcpdiag.dev/rules/gcs/BP/2022_001
    

    Commands to check settings and output results

    gsutil uniformbucketlevelaccess get gs://BUCKET_NAME
    
    Uniform bucket-level access setting for gs://BUCKET_NAME:
      Enabled: True
      LockedTime: 2023-01-01 03:04:30.427000+00:00
    
    opened by erina-nakajima 0
  • added a check for Public GKE clusters

    GKE Private clusters are almost always the right answer. NAT gateways and GLBs can be used to cover almost any use case where a cluster needs to communicate with the outside world.

    opened by kevingair 1
  • Publish to PyPI

    Please publish to PyPI so gcpdiag can be installed with pip (or pipx), and so that the community can build distro packages out of it (homebrew, AUR, other linux distros).

    Reasoning: the current installation instructions suggest installing this tool with curl. A more sophisticated distribution method would be to enable distros to build their own packages, but a base requirement for that is publishing this tool to PyPI (so that it follows common Python packaging standards).

    Packaging tools (like Homebrew) make it easier to package up tools written in Python if they are installable through PyPI.

    opened by reegnz 0
Releases (v0.58)
  • v0.58(Nov 8, 2022)

    Deprecation

    • Python 3.9+ is required for gcpdiag. Support for Python 3.8 and older versions is deprecated.
    • Deprecated authentication using OAuth (--auth-oauth) has been removed.

    New rules

    • apigee/ERR/2022_002: Verify whether Cloud KMS key is enabled and could be accessed by Apigee Service Agent
    • datafusion/ERR/2022_003: Private Data Fusion instance is peered to the tenant project
    • datafusion/ERR/2022_004: Cloud Data Fusion Service Account has necessary permissions
    • datafusion/ERR/2022_005: Private Data Fusion instance has networking permissions
    • datafusion/ERR/2022_006: Private Google Access enabled for private Data Fusion instance subnetwork
    • datafusion/ERR/2022_007: Cloud Data Fusion Service Account exists at a Project
    • gke/BP/2022_004: GKE clusters should have HTTP load balancing enabled to use GKE ingress

    Enhancements

    • Python dependencies updated

    Fixes

    • gke/ERR/2021_002: skip if there are no GKE clusters
    Source code(tar.gz)
    Source code(zip)
  • v0.57(Sep 29, 2022)

    Deprecation

    • Default authentication using OAuth (--auth-oauth) is now deprecated and Application Default Credentials (--auth-adc) will be used instead. Alternatively you can use a service account private key (--auth-key FILE).

    New rules

    • apigee/WARN/2022_001: Verify whether all environments have been attached to Apigee X instances
    • apigee/WARN/2022_002: Environment groups are created in the Apigee runtime plane
    • cloudrun/ERR/2022_001: Cloud Run service agent has the run.serviceAgent role
    • datafusion/ERR/2022_001: Firewall rules allow for Data Fusion to communicate to Dataproc VMs
    • datafusion/ERR/2022_002: Private Data Fusion instance has valid host VPC IP range
    • dataproc/WARN/2022_001: Dataproc VM Service Account has necessary permissions
    • dataproc/WARN/2022_002: Job rate limit was not exceeded
    • gcf/ERR/2022_002: Cloud Function deployment failure due to Resource Location Constraint
    • gcf/ERR/2022_003: Function invocation interrupted due to memory limit exceeded
    • gke/WARN/2022_007: GKE nodes need Storage API access scope to retrieve build artifacts
    • gke/WARN/2022_008: GKE connectivity: possible dns timeout in some gke versions

    Enhancements

    • New product: Cloud Run
    • New product: Data Fusion

    Fixes

    • gcf/WARN/2021_002: Added check for MATCH_STR
    • gcs/BP/2022_001: KeyError: 'iamConfiguration'
    • gke/ERR/2022_003: unhandled exception
    • gke/WARN/2022_005: Incorrectly report missing "nvidia-driver-installer" daemonset
    • iam/SEC/2021_001: unhandled exception
  • v0.56 (Jul 18, 2022)

    New rules

    • bigquery/ERR/2022_001: BigQuery is not exceeding rate limits
    • bigquery/ERR/2022_001: BigQuery jobs not failing due to concurrent DML updates on the same table
    • bigquery/ERR/2022_002: BigQuery jobs are not failing due to results being larger than the maximum response size
    • bigquery/ERR/2022_003: BigQuery jobs are not failing while accessing data in Drive due to a permission issue
    • bigquery/ERR/2022_004: BigQuery jobs are not failing due to shuffle operation resources exceeded
    • bigquery/WARN/2022_002: BigQuery does not violate column level security
    • cloudsql/WARN/2022_001: Docker bridge network should be avoided
    • composer/WARN/2022_002: fluentd pods in Composer environments are not crashing
    • dataproc/ERR/2022_003: Dataproc Service Account permissions
    • dataproc/WARN/2022_001: Dataproc clusters are not failing to stop due to local SSDs
    • gae/WARN/2022_002: App Engine Flexible versions don't use deprecated runtimes
    • gcb/ERR/2022_002: Cloud Build service account registry permissions
    • gcb/ERR/2022_003: Builds don't fail because of retention policy set on logs bucket
    • gce/BP/2022_003: detect orphaned disks
    • gce/ERR/2022_001: Project limits were not exceeded
    • gce/WARN/2022_004: Cloud SQL Docker bridge network should be avoided
    • gce/WARN/2022_005: GCE CPU quota is not near the limit
    • gce/WARN/2022_006: GCE GPU quota is not near the limit
    • gce/WARN/2022_007: VM has the proper scope to connect using the Cloud SQL Admin API
    • gce/WARN/2022_008: GCE External IP addresses quota is not near the limit
    • gce/WARN/2022_009: GCE disk quota is not near the limit
    • gcf/ERR/2022_001: Cloud Functions service agent has the cloudfunctions.serviceAgent role
    • gcf/WARN/2021_002: Cloud Functions have no scale up issues
    • gke/BP_EXT/2022_001: Google Groups for RBAC enabled (github #12)
    • gke/WARN/2022_006: GKE NAP nodes use a containerd image
    • tpu/WARN/2022_001: Cloud TPU resource availability
    • vpc/WARN/2022_001: Cross Project Networking Service projects quota is not near the limit

    Updated rules

    • dataproc/ERR/2022_002: fix os version detection (github #26)
    • gke/BP/2022_003: update GKE EOL schedule
    • gke/ERR/2022_001: fix KeyError exception
    • gke/BP/2022_002: skip legacy VPC

    Enhancements

    • Add support for multiple output formats (--output=csv, --output=json)
    • Better handle CTRL-C signal
    • Org policy support
    • New product: CloudSQL
    • New product: VPC
    • Renamed product "GAES" to "GAE" (Google App Engine)
    • Publish internal API documentation on https://gcpdiag.dev/docs/development/api/
    • Update Python dependencies
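    The new --output=csv and --output=json modes make the lint report machine-readable. As a hedged sketch of post-processing such a report (the field names in the sample below are assumptions for illustration, not taken from the gcpdiag documentation), a small script might extract the failing rules like this:

```python
import json

# Illustrative only: the "rule" and "result" field names below are
# assumptions about the --output=json report format, not the documented
# gcpdiag schema.
sample_report = """
[
  {"rule": "gke/ERR/2021_002", "result": "failed"},
  {"rule": "gce/BP/2022_003", "result": "ok"}
]
"""

def failed_rules(report_json):
    """Return the IDs of rules whose result is not 'ok'."""
    return [r["rule"] for r in json.loads(report_json) if r["result"] != "ok"]

print(failed_rules(sample_report))  # ['gke/ERR/2021_002']
```

    The same approach works for the CSV mode with the csv module; check the actual report once against your gcpdiag version before relying on any field names.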
  • v0.54 (Apr 25, 2022)

    New rules

    • apigee/ERR/2022_001: Apigee Service Agent permissions

    Enhancements

    • dynamically load gcpdiag lint rules for all products
    • support IAM policy retrieval for Artifact Registry
    • move gcpdiag release buckets to new location

    Fixes

    • gke/ERR/2022_002: use correct network for shared VPC scenario (#24)
    • error out early if service accounts of inspected projects can't be retrieved
    • fix docker wrapper script for --config and --auth-key options
    • allow to create test projects in an org folder
    • ignore more system service accounts (ignore all accounts starting with gcp-sa)

    Note: gcpdiag 0.55 was also released with the same code. The release was used to facilitate the transition of binaries to another location.

  • v0.53 (Mar 31, 2022)

    New rules

    • composer/ERR/2022_001: Composer Service Agent permissions
    • composer/ERR/2022_002: Composer Environment Service Account permissions
    • composer/WARN/2022_001: Composer Service Agent permissions for Composer 2.x
    • gce/BP_EXT/2022_001: GCP project has VM Manager enabled
    • gce/WARN/2022_003: GCE VM instances quota is not near the limit
    • gke/BP/2022_002: GKE clusters are using unique subnets
    • gke/BP/2022_003: GKE cluster is not near to end of life
    • gke/WARN/2022_003: GKE service account permissions to manage project firewall rules
    • gke/WARN/2022_004: Cloud Logging API enabled when GKE logging is enabled
    • gke/WARN/2022_005: NVIDIA GPU device drivers are installed on GKE nodes with GPU

    Enhancements

    • Support IAM policies for service accounts and subnetworks
    • Skip rules using logs if Cloud Logging API is disabled
    • New option: --logs-query-timeout
    • Add support for configuration files (see https://gcpdiag.dev/docs/usage/#configuration-file)
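    As a hypothetical sketch of such a configuration file (the key names below simply mirror the CLI flags and are assumptions, not taken from the documentation; consult the linked page for the authoritative format), a gcpdiag config might look like:

```yaml
# Hypothetical gcpdiag configuration sketch. Key names mirror the CLI
# flags (--billing-project, --include, --exclude, --logs-query-timeout)
# and are assumptions -- see https://gcpdiag.dev/docs/usage/#configuration-file
# for the authoritative format.
billing_project: my-billing-project
include:
  - gke/*
exclude:
  - '*BP*'
logs_query_timeout: 300
```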

    Fixes

    • Fix various unhandled exceptions
  • v0.52 (Feb 11, 2022)

    New rules

    • dataproc/BP/2022_001: Cloud Monitoring agent is enabled.
    • dataproc/ERR/2022_002: Dataproc is not using deprecated images.
    • gce/WARN/2022_001: IAP service can connect to SSH/RDP port on instances.
    • gce/WARN/2022_002: Instance groups named ports are using unique names.
    • gke/ERR/2022_002: GKE nodes of private clusters can access Google APIs and services.
    • gke/ERR/2022_003: GKE connectivity: load balancer to node communication (ingress).

    Updated rules

    • gcb/ERR/2022_001: Fix false positive when no build is configured.
    • gke/WARN/2021_008: Improve Istio deprecation message

    Enhancements

    • Introduce "extended" rules (BP_EXT, ERR_EXT, etc.), disabled by default and which can be enabled with --include-extended.
    • Large IAM policy code refactorings in preparation for org-level IAM policy support.

    Fixes

    • More API retry fixes.
    • Fix --billing-project which had no effect before.
    • Fix exception related to GCE instance scopes.
  • v0.51 (Feb 11, 2022)

  • v0.50 (Jan 21, 2022)

    New rules

    • gcb/ERR/2022_001: The Cloud Build logs do not report permission issues
    • gce/BP/2021_002: GCE nodes have an up-to-date ops agent
    • gce/BP/2021_003: Secure Boot is enabled
    • gce/ERR/2021_004: Serial logs don’t contain Secure Boot errors
    • gce/ERR/2021_005: Serial logs don't contain mount error messages
    • gce/WARN/2021_005: Serial logs don't contain out-of-memory messages
    • gce/WARN/2021_006: Serial logs don't contain "Kernel panic" messages
    • gce/WARN/2021_007: Serial logs don't contain "BSOD" messages
    • gcs/BP/2022_001: Buckets are using uniform access
    • gke/BP/2022_001: GKE clusters are regional
    • gke/ERR/2022_016: GKE connectivity: pod to pod communication
    • gke/WARN/2022_001: GKE clusters with workload identity are regional
    • gke/WARN/2022_002: GKE metadata concealment is not in use

    Updated rules

    • gcf/WARN/2021_001: add one more deprecated runtime Nodejs6 (github #17)

    Enhancements

    • New product: App Engine Standard
    • New product: Cloud Build
    • New product: Cloud Pub/Sub
    • New product: Cloud Storage

    Fixes

    • Verify early that IAM API is enabled
    • Catch API errors in prefetch_rule
    • Disable italic in Cloud Shell
    • Implement retry logic for batch API failures
  • v0.49 (Dec 20, 2021)

    New / updated rules

    • dataproc/BP/2021_001: Dataproc Job driver logs are enabled
    • composer/WARN/2021_001: Composer environment status is running (b/207615409)
    • gke/ERR/2021_013: GKE cluster firewall rules are configured. (b/210407018)
    • gke/ERR/2021_014: GKE masters can reach the nodes. (b/210407018)
    • gke/ERR/2021_015: GKE connectivity: node to pod communication. (b/210407018)
    • gce/WARN/2021_001: verify logging access scopes (b/210711351)
    • gce/WARN/2021_003: verify monitoring access scopes (b/210711351)

    Enhancements

    • New product: Cloud Composer (b/207615409)
    • Simplify API testing by using ephemeral projects (b/207484323)
    • gcpdiag.sh wrapper script now verifies the minimum version of the current script
    • Add support for client-side firewall connectivity tests (b/210407018)
  • v0.48 (Nov 23, 2021)

    New rules

    • apigee/WARN/2021_001: Every env. group has at least one env. (b/193733957)
    • dataproc/WARN/2021_001: Dataproc cluster is in RUNNING state (b/204850980)

    Enhancements

    • Use OAuth authentication by default (b/195908593)
    • New product: Dataproc (b/204850980)
    • New product: Apigee (b/193733957)
  • v0.47 (Nov 1, 2021)

    New rules

    • gce/WARN/2021_004: check serial output for 'disk full' messages (b/193383069)

    Enhancements

    • Add podman support in wrapper script

    Fixes

    • Fix gcf KeyError when API enabled but no functions defined (b/204516746)
  • v0.46 (Oct 27, 2021)

    New rules

    • gce/WARN/2021_003: gce service account monitoring permissions (b/199277342)
    • gcf/WARN/2021_001: cloud functions deprecated runtimes
    • gke/WARN/2021_009: deprecated node image types (b/202405661)

    Enhancements

    • New website! https://gcpdiag.dev
    • Rule documentation permalinks added to lint output (b/191612825)
    • Added --include and --exclude arguments to filter rules to run (b/183490284)
  • v0.45 (Oct 8, 2021)

  • v0.44 (Oct 7, 2021)

    New rules

    • gke/ERR/2021_009: gke cluster and node pool version skew (b/200559114)
    • gke/ERR/2021_010: clusters are not facing ILB quota issues (b/193382041)
    • gke/ERR/2021_011: ip-masq-agent errors (b/199480284)
    • iam/SEC/2021_001: no service account has owner role (b/201526416)

    Enhancements

    • Improve error message for --auth-adc authentication errors (b/202091830)
    • Suggest gcloud command if CRM API is not enabled
    • Use --auth-adc by default in Cloud Shell (b/201996404)
    • Improve output with hidden items
    • Update docker image to python:3.9-slim

    Fixes

    • Make the docker wrapper macos-compatible (GH-10)
    • Exclude fleet workload identities from SA disabled check (b/201631248)