Architecting DevOps CI/CD in Telco Cloud

Building a telco cloud is the first step. The next “logical” step is to monetize that “huge, lengthy” investment by running and deploying telco services over that cloud. But remember something: the time you took to build the infrastructure (SDN, IP fabric, VIM and all MANO components), plus the time spent on VNF/CNF certification over it, shouldn’t be repeated every time you deliver a telco service to the end user/customer. Otherwise, your cloud will turn into a true pain with huge escalations, especially in day-to-day operations. Trust me, I was there 😦

So here’s the question: how do you ship a higher-quality telco service while reducing the time taken to roll out a new service (time to market, TTM)? The answer is to architect a good DevOps CI/CD strategy for telco!

So in this blog post series, we will deep dive into that topic. This post covers how to design and architect the pipeline, while the next post will be a demo with the steps used to build it.

What will the final result look like?

In the end, the CI/CD concept translates into building a pipeline. Our pipeline contains a series of stages. Each stage contains multiple jobs that may or may not be executed, depending on the final service requirements.

GitLab CI/CD Telco pipeline stages and jobs

The above pipeline allows us to select the following options for the service deployment:

  • Which SDN controller version will be used (based on container tags). Here we use Juniper Contrail Networking as the SDN controller, but you can extend the pipeline to select any other SDN controller
  • The deployment size (all-in-one or a full-fledged cluster)
  • The VIM type (OpenStack, Kubernetes, OpenShift or RHOSP)
  • What the fabric will look like: one EVPN/VXLAN DC, two DCs with a DC interconnect between them, segment routing, or no fabric at all

Based on the selections made at each stage, the pipeline triggers actions on the staging environment. At the end of the pipeline, we should have a successful deployment of a telco cloud that mimics our production cloud.

Deploying the pipeline over the Telco Cloud pod
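
To make the selections above a bit more concrete, here is a minimal Python sketch of how the choices could be validated before any job runs. The option identifiers, the `DeploymentRequest` class and the job names are all made up for illustration; they are not taken from the actual pipeline.

```python
from dataclasses import dataclass
from typing import List

# Allowed values mirror the options listed above; the identifiers are illustrative.
DEPLOYMENT_SIZES = {"all-in-one", "cluster"}
VIM_TYPES = {"openstack", "kubernetes", "openshift", "rhosp"}
FABRIC_TYPES = {"evpn-vxlan-1dc", "evpn-vxlan-2dc-dci", "segment-routing", "none"}


@dataclass(frozen=True)
class DeploymentRequest:
    sdn_version: str       # container tag of the SDN controller (hypothetical value)
    deployment_size: str
    vim_type: str
    fabric: str

    def validate(self) -> None:
        """Raise ValueError if any selection is outside the supported options."""
        if not self.sdn_version:
            raise ValueError("sdn_version (container tag) is required")
        if self.deployment_size not in DEPLOYMENT_SIZES:
            raise ValueError(f"unknown deployment size: {self.deployment_size}")
        if self.vim_type not in VIM_TYPES:
            raise ValueError(f"unknown VIM type: {self.vim_type}")
        if self.fabric not in FABRIC_TYPES:
            raise ValueError(f"unknown fabric type: {self.fabric}")


def select_jobs(request: DeploymentRequest) -> List[str]:
    """Return the (hypothetical) job names to run for this request."""
    request.validate()
    jobs = [f"deploy-sdn:{request.sdn_version}", f"deploy-vim:{request.vim_type}"]
    # The fabric job is skipped entirely when no fabric is requested.
    if request.fabric != "none":
        jobs.insert(0, f"configure-fabric:{request.fabric}")
    return jobs
```

Rejecting an unsupported combination up front is cheaper than discovering it halfway through a staging deployment.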

Then, at the end, the pipeline posts a message to our Slack channel as a notification to other team members (via ChatOps), reporting whether the deployment succeeded or failed along with other useful info.

GitLab sends a notification to Slack

Slack is just an example of a collaboration platform between the telco cloud teams. However, you can integrate your pipeline (via webhooks) with any other service used in your organization, such as Microsoft Teams, Trello, Jira…etc
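
As a rough illustration of the webhook idea, the sketch below builds a message and posts it as JSON to a Slack incoming webhook, which accepts a body with a `text` field. The function names and the message wording are my own assumptions, and the webhook URL is something you would supply yourself.

```python
import json
import urllib.request


def build_notification(pipeline_id: str, status: str, details: str = "") -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    label = "OK" if status == "success" else "FAILED"
    text = f"[{label}] Telco Cloud pipeline {pipeline_id} finished with status: {status}"
    if details:
        text += f"\n{details}"
    return {"text": text}


def send_notification(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keep the webhook URL in a protected CI/CD variable rather than in the repo, since anyone holding it can post to the channel.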

One of the things that can trigger this pipeline is another pipeline! This is common when the telco pod is just one part of a bigger pipeline whose other parts have their own developers/maintainers, different from the telco cloud team. Below are some screenshots from the Azure DevOps pipeline used to trigger the above GitLab CI pipeline. I will demonstrate that in the next post.

Tasks Executed in external pipeline
Azure DevOps Pipeline Dashboard
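
For reference, GitLab exposes a pipeline trigger endpoint (`POST /projects/:id/trigger/pipeline`) that an upstream pipeline can call with a trigger token. The sketch below only prepares such a request. The host, project ID, token and variable names are placeholders, and this is not necessarily how the Azure DevOps pipeline in the screenshots does it.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_trigger_request(
    base_url: str,
    project_id: int,
    token: str,
    ref: str,
    variables: Dict[str, str],
) -> Tuple[str, bytes]:
    """Return the URL and form-encoded body for a GitLab pipeline trigger call."""
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_id}/trigger/pipeline"
    fields = {"token": token, "ref": ref}
    # Pipeline variables are passed as variables[NAME]=value form fields.
    for name, value in variables.items():
        fields[f"variables[{name}]"] = value
    return url, urlencode(fields).encode("ascii")
```

The returned URL and body can then be sent with any HTTP client as a POST request, passing the selections (VIM type, fabric and so on) down as pipeline variables.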

CI/CD Design Principles in Telco Cloud

Below are some basic guidelines and principles you need for a successful implementation of the pipeline. They are generic, so you can apply them at any telco operator, and you can add more as you see fit for your case.

Step1: Understand the Processes in Telco

Unlike other environments, deploying a telco-grade service requires the involvement of many design and operation teams within the telco operator, so you should have a representative/SPOC from each team. But beware! Too much involvement without clear governance may slow down the rollout. Start with a small group of people rather than inviting everyone from the beginning. The goal of this group is to agree on a high-level view of the CI/CD pipeline stages that respects the telco operator’s general processes, and to get the blessing to “go on”.

Processes

Step2: Have a Broker, or Build One!

The telco cloud environment is multi-vendor by default. You might have an IP fabric and SDN controller from Juniper, an NFVO from Cisco, a VIM from Red Hat and many VNFM(s) depending on the onboarded VNFs. You need an automation framework acting as a broker between your business/operations support system(s) and the telco cloud components to abstract the instructions. The pipeline shouldn’t need to go down to a low level just to query some nodes, for example.

Below is an example of the framework that we built for our pipeline. It allows you to communicate with OpenStack, Juniper Contrail and the IP fabric (also Juniper) while providing a northbound interface (either CLI or REST) to any upstream nodes for configuring things.

Automation framework acting as a broker

One property you need to make sure exists in that framework is idempotency: executing the same operation repeatedly won’t give different results. Otherwise, you might end up with an unpredictable environment.
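
Here is a tiny sketch of what idempotent means in practice. It uses an in-memory dictionary as a stand-in for the real controller state; the `ensure_virtual_network` name and the data layout are hypothetical and not part of the framework shown above.

```python
from typing import Dict


def ensure_virtual_network(inventory: Dict[str, dict], name: str, subnet: str) -> bool:
    """Make sure the network exists with the given subnet.

    Returns True if something had to change, False if the inventory was
    already in the desired state. Calling it again with the same arguments
    never changes the outcome.
    """
    desired = {"subnet": subnet}
    if inventory.get(name) == desired:
        return False  # already converged, nothing to do
    inventory[name] = desired
    return True
```

The operation describes the desired end state ("this network exists with this subnet") rather than an action ("create a network"), so re-running a failed pipeline is safe.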

Step3: Design how to mimic production?

A successful pipeline requires the latest config from production so you can reproduce it in staging as closely as possible. You can do that job on a regular basis (at midnight, for example, or before/after planned network activities), or just before the pipeline starts. You can build something like “scanners” that query the nodes’ config and regularly update the “Golden Config” directory in each VNF repo.

Updating the golden config
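
A scanner of this kind could be sketched as below. It assumes the running config has already been fetched as text (how you fetch it depends on the node) and only rewrites the golden file when the content actually changed. The function name and the `.conf` naming are illustrative choices, not part of the real setup.

```python
from pathlib import Path


def update_golden_config(golden_dir: Path, node: str, running_config: str) -> bool:
    """Store the node's running config in the golden config directory.

    Returns True if the stored file was created or updated, False if it
    already matched, so the caller can commit only when something changed.
    """
    golden_dir.mkdir(parents=True, exist_ok=True)
    target = golden_dir / f"{node}.conf"
    if target.exists() and target.read_text(encoding="utf-8") == running_config:
        return False
    target.write_text(running_config, encoding="utf-8")
    return True
```

Returning a changed flag keeps the repo history clean: a nightly run with no production changes produces no commits.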

GitLab provides some useful options in its CI tool that let you make jobs depend on each other, so you won’t miss collecting the updated config before running the pipeline. We will see that in the next post.

Step4: Design how to manage Multi-site Telco Services?

Well, this is somewhat related to the previous step. Traditionally, a telco service is deployed as active in one site with a geo-redundant backup in another site. But this is not the case in the cloud, where services are usually distributed across multiple sites and deployed as close to the user as possible (this is essential in edge cloud computing). So you need to design the repo such that each VNF (for example vEPC) contains a directory for each site’s config (Site01, Site02, Site03…etc). That way you can easily reproduce any config on your staging servers and start testing.

Multi-site config Management
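
With one directory per site, the pipeline can discover sites and their config files with very little code. The sketch below assumes a VNF repo checked out locally with `Site01`, `Site02`… directories at its top level; the helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def list_sites(vnf_repo: Path) -> List[str]:
    """Return the site directory names found at the top of a VNF repo."""
    return sorted(
        entry.name
        for entry in vnf_repo.iterdir()
        if entry.is_dir() and not entry.name.startswith(".")  # skip .git and friends
    )


def site_config_files(vnf_repo: Path, site: str) -> List[Path]:
    """Return every config file stored for one site, in a stable order."""
    site_dir = vnf_repo / site
    if not site_dir.is_dir():
        raise FileNotFoundError(f"no config directory for site {site!r}")
    return sorted(path for path in site_dir.rglob("*") if path.is_file())
```

A staging job can then loop over `list_sites(...)` or take a single site name as a pipeline variable.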

Step5: Design where to host the CI Agents?

Nearly all CI/CD tools provide on-prem agents that deploy your config to the servers. In GitLab CI it’s called a runner, while Azure DevOps calls it an agent. In all cases, it’s just a tool that connects servers to the pipeline, executes its jobs and sends feedback to the pipeline. Here we need to design where those runners/agents will be located, and whether they will be deployed for every VNF in the telco cloud (vIMS, vBNG…etc) or only for a few of them. This depends on what you are trying to achieve with the system.

Runners/Agents connected to CI/CD pipeline

Step6: Design where to store your Artifacts?

After each stage runs in the pipeline, the configuration generated for the SDN, such as TripleO templates, instances.yml or command-servers.yml (which are used for deploying Contrail Networking), or even the fabric JunOS configuration, needs to be stored somewhere and associated with the commit and pipeline IDs.

Artifacts
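
One simple way to tie artifacts to a run is to encode both IDs in the storage path. The sketch below reads GitLab’s predefined `CI_PIPELINE_ID` and `CI_COMMIT_SHORT_SHA` variables; the directory layout itself is just an assumption for illustration.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping


def artifact_dir(stage: str, env: Mapping[str, str] = os.environ) -> PurePosixPath:
    """Build a storage path that ties a stage's output to pipeline and commit IDs."""
    pipeline_id = env["CI_PIPELINE_ID"]      # predefined by GitLab CI
    commit = env["CI_COMMIT_SHORT_SHA"]      # predefined by GitLab CI
    return PurePosixPath("artifacts") / f"pipeline-{pipeline_id}" / f"commit-{commit}" / stage
```

For example, with pipeline ID 42 and commit abc1234, the `sdn` stage would store its files under `artifacts/pipeline-42/commit-abc1234/sdn`.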

The network engineers will use these artifacts to compare the changes made by the pipeline execution with what is actually deployed in production, and hence decide whether to proceed based on the result. It’s like a set of snapshots of all the trials and errors we went through until we deployed the telco service.
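
The comparison itself can be as simple as a unified diff between the production config and the artifact generated in staging. This is a minimal sketch using Python’s standard `difflib`; the function name and the file labels are arbitrary.

```python
import difflib
from typing import List


def config_diff(production: str, staged: str) -> List[str]:
    """Return a unified diff between production config and a staged artifact."""
    return list(
        difflib.unified_diff(
            production.splitlines(),
            staged.splitlines(),
            fromfile="production",
            tofile="staging-artifact",
            lineterm="",
        )
    )
```

An empty list means the artifact matches production; anything else is what the engineer reviews before deciding whether to proceed.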

Step7: Design how to do Automated Acceptance Testing?

After the deployment, you need to run some tests to make sure the “critical” parts of your service are not broken by the config/code change. This allows you to discover potential bugs before rollout. You can read more about this concept in my previous blog post:

https://basimaly.wordpress.com/2019/05/17/automated-testing-in-telco-cloud-using-robot-framework/

Automated Acceptance Testing using Robot framework

Step8: Design how to rollback?

If you decide to allow the pipeline to promote the staging result into production, then you need a strategy for rolling back the config too! This should be easy for network operating systems (like JunOS or IOS-XR), for example, as they support config versioning by default and you just need to do a rollback. However, for VNFs and the service parameters inside them, you need to have a golden config available in the repo so it can be redeployed when things go south, or revert the VNFs to a specific snapshot.

Step9: Adopt a Blameless Postmortem Culture

This is not really a technical decision to make, but rather a way of communicating between telco teams, whether they are internal to the service provider’s departments (Design, Operations, Provisioning, Revenue Assurance…etc) or belong to different vendors, partners and PSI.

You can read more about this concept in the Google SRE book, but I will quote one important paragraph from it.
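
The golden config fallback can be sketched as a small wrapper around the deployment. Here `apply` and `verify` are placeholders for whatever pushes a config to the VNF and whatever acceptance check you run afterwards; neither is a real API.

```python
from typing import Callable


def deploy_with_rollback(
    apply: Callable[[str], None],
    verify: Callable[[], bool],
    new_config: str,
    golden_config: str,
) -> str:
    """Apply a new config and fall back to the golden config if checks fail."""
    apply(new_config)
    if verify():
        return "deployed"
    # Acceptance checks failed: restore the last known-good configuration.
    apply(golden_config)
    return "rolled-back"
```

In a real pipeline you would also want to verify again after the rollback and notify the team either way.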

Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.

Google SRE Book

Wrapping up

The list could go on, but these are the principles I find most useful for starting your second step after having an operational telco cloud. In the next post we will see a live demo of it.

Finally, I hope this will be informative for you, and I’d like to thank you for reading.

10 responses to “Architecting DevOps CI/CD in Telco Cloud”

  1. Very informative Basim,
    I appreciate your share.

    Wish you all the best.

    1. Thanks Abderrahmane :), Glad you like it!

  2. […] A very inspiring article from Bassem Aly was the reason I wanted to cover this topic too, so I created the “DevOps and Telco” software series. Bassem’s article talks about CI/CD and how it helps reduce the time spent rolling out a new service, i.e. Time To Market (TTM), while delivering a high-quality and better telecommunications service. […]

  3. […] A very inspirational article from Bassem Aly was the reason I wanted to cover this topic too, hence I created the “DevOps and Telco” softwarisation series. Bassem’s article talks about CI/CD and how it help to ship high quality and better telco service while reducing the time taken to rollout a new service i.e. Time To Market (TTM). […]

  4. Really awesome Bro! thanks for sharing
