Tag Archives: MySQL

End of an Era: Neither MC nor Continuent Attending Percona Live

Continuent has been a long-term sponsor of the Percona Live conference, and of the MySQL conference before it, for many years. We have attended both as a Diamond sponsor and with members of our staff presenting our products and experience at the conference.

The nature of these conferences changes over time, and over the last few years we have seen Percona Live move from being a pure MySQL conference to an open source database conference. Although Continuent continues to provide open source software and integrates with many open source databases, our core operation still revolves around clustering and replication for MySQL and Oracle.

Continuent is also evolving. We are increasingly moving towards pure cloud-based environments, building products that run in the cloud or explicitly leverage cloud computing technology, and we have a number of new products and initiatives specifically targeting these areas.

Over the course of the next year we will be releasing cloud editions of our clustering, replication, and new backup and proxy services, both directly and through our partners.

As such, this year we have made the difficult decision not to sponsor or attend the Percona Live conference, directing our energies to other conferences, webinars and meetups. We will be attending the AWS conference, for example, and we fully intend to be at some other select conferences this year dealing with analytics, in-memory computing, and cloud-based deployments.

To stay up to date with what Continuent is doing, keep reading the Continuent blog and follow us on Twitter and Facebook.

Analytical Replication Performance from MySQL to Vertica on MemCloud

I’ve recently been trying to improve the performance of the Vertica replicator, particularly in the form of the new single-schema replication. We’ve done a lot in the new Tungsten Replicator 5.3.0 release to improve (and ultimately support) the new single-schema model.

As part of that, I’ve also been looking at Kodiak MemCloud as a deployment platform. The people at Kodiak have been really helpful (disclaimer: I’ve worked with some of them in the past). MemCloud is a high-performance cloud platform built on hardware with high-speed (and high-volume) RAM, SSDs and fast Ethernet connections. This means that even without any adjustment and tuning you’ve got a fast platform to work on.

However, if you are willing to put in some extra time, you can tune things further. Once you have a super-quick environment, you can tweak and update the settings a little more, because you have more options available. Ultimately you can make use of that faster environment to stretch things a little further, and that’s exactly what I did when trying to determine how quickly I could get data into Vertica from MySQL.

In fact, the first time I ran my high-load test suite on MemCloud infrastructure, replicating data from MySQL into Vertica, I made this comment to my friend at Kodiak:

The whole thing went so quick I thought it hadn’t executed at all!

In general, there are two things you want to test when using replication to move data from a transactional environment into an analytical one:

  • Latency of moving the data
  • Apply rate for moving the data

The two are subtly different. The first measures how long it takes to get the data from the source to the target. The second measures how much data you can move in a set period of time.

Depending on your deployment and application, either or both can be critical. For example, if you are using analytics to perform real-time analysis and charging on your data, latency is the most important, because you want the information up to date as quickly as possible. If you are performing log analysis or looking at longer-term trends, the apply rate is probably more important: you may not worry about being a few seconds behind, but you want many thousands of transactions to be transferred. I concentrated on the apply rate rather than the latency, because latency in the batch applier is something you can control by setting the batch interval.

So what did I test?

At a basic level, I was replicating data from MySQL directly into Vertica: extracting data from the MySQL binary log and writing it into tables within an HPE Vertica cluster of three nodes. Each node runs in MemCloud with 64GB of RAM and 2TB of SSD disk space. I deliberately made no configuration changes to Vertica at this point.

The first thing I did was set up a basic replication pipeline between the two. Replication into Vertica works by batch-loading data through CSV files into Vertica staging tables and then ‘materialising’ the changes into the carbon-copy base tables. Because it’s done in batches, the latency is effectively governed by the batch apply settings, which were configured for 10,000 rows or 5 seconds.
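To make that concrete, here is a minimal sketch of the batch pattern. The staging table, column names and opcode values are illustrative rather than the exact ones the applier generates:

-- Load the batch CSV into the staging table (illustrative names throughout).
COPY stage_msg FROM '/tmp/batch-000123.csv' DELIMITER ',' DIRECT;

-- Materialise: first remove the old images of any rows this batch deletes or updates...
DELETE FROM msg WHERE id IN (SELECT id FROM stage_msg);

-- ...then insert the new row images for inserts and updates.
INSERT INTO msg (id, msg)
  SELECT id, msg FROM stage_msg WHERE opcode IN ('I', 'U');

-- Clear the staging table ready for the next batch.
TRUNCATE TABLE stage_msg;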

To generate the load, I’ve written a script that generates 100,000 rows of random data, then updates about 70% of those rows at random and deletes the other 30%, so each schema generates about 200,000 row changes per load. This is designed to test the specific batch replication scenario, and ultimately it does this across multiple schemas (all with the same structure). I use this particular workload because I can match it against the replication to get replication (rather than transaction) statistics; I need to be able to effectively monitor the apply rate into Vertica from MySQL.
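The script itself isn’t shown here, but the shape of the workload is easy to sketch. A minimal, illustrative version as a MySQL stored procedure, assuming the msg table used in the examples later in this post (an auto-increment id plus a random string column):

DELIMITER //
CREATE PROCEDURE gen_load()
BEGIN
  DECLARE i INT DEFAULT 0;
  -- Insert 100,000 rows of random data...
  WHILE i < 100000 DO
    INSERT INTO msg VALUES (0, SUBSTRING(MD5(RAND()), 1, 24));
    SET i = i + 1;
  END WHILE;
  -- ...update roughly 70% of the rows at random...
  UPDATE msg SET msg = SUBSTRING(MD5(RAND()), 1, 24) WHERE RAND() < 0.7;
  -- ...and delete roughly the remaining 30%, giving about 200,000
  -- row changes per schema per run.
  DELETE FROM msg WHERE RAND() < 0.3;
END //
DELIMITER ;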

The first time I ran the process of just generating the data and inserting it into MySQL, the command returned almost immediately. I seriously thought it had failed, because I couldn’t believe I’d just inserted 200,000 rows into MySQL that quickly. Meanwhile, over on the Vertica side, I was monitoring the apply process through the trepctl perf command, which provides live output from the replicator. For a second I saw the blip as the data was replicated and then applied; it was so quick that I assumed a single row (or even a failed transaction) had caused it.

The first time I ran the tests, I got some good results with 20 simultaneous schemas:

  • 460,000 rows/minute from a single MySQL source into Vertica. 

Then I doubled up the source MySQL servers, so two servers, 40 simultaneous schemas, and ultimately writing in about 8 million rows:

  • 986,000 rows/minute into Vertica across 40 schemas from 2 sources

In both cases, the latency was between 3 and 7 seconds for each batch write (remember, we are handling 10,000 rows or 5s per batch), and at this stage we are still writing into *different* target schemas. These are not bad figures.

I did some further tweaks, this time reconfiguring the batch writes to use larger blocks and longer intervals. This increases the potential latency (because there will be bigger gaps between writes into Vertica), but increases the overall row-apply performance. Now we are handling 100,000 rows or 10s intervals. The result? A small bump for a single source server:

  • 710,000 rows/minute into Vertica across 20 schemas from 1 source

Latency increased, though, topping out at around 11.5s per write when performing the very big blocks. Remember, this is single-source, so I know the potential is there to basically double that with a second MySQL source, since the scaling seems almost linear.
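For reference, those batch sizes map onto the replicator’s block-commit settings, which can be adjusted through tpm. The service and stage names below (alpha, q-to-dbms) are illustrative, so check them against your own pipeline before applying anything like this:

tools/tpm update alpha \
  --property=replicator.stage.q-to-dbms.blockCommitRowCount=100000 \
  --property=replicator.stage.q-to-dbms.blockCommitInterval=10s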

Now I wanted to move on to test a specific scenario I added to the applier: the ability to replicate from multiple source schemas into a single target schema. Each source is identical, and to ensure that the ‘materialise’ step works correctly, and that we can still analyse the data, a filter is inserted into the replication that adds the source schema name to each row.

The sample data inserts look like this:

insert into msg values (0,"RWSAjXaQEtCf8nf5xhQqbeta");
insert into msg values (0,"4kSmbikgaeJfoZ6gLnkNbeta");
insert into msg values (0,"YSG4yeG1RI6oDW0ohG6xbeta");

With the filter, what gets inserted is:

0, "RWSAjXaQEtCf8nf5xhQqbeta", "sales1"
0,"4kSmbikgaeJfoZ6gLnkNbeta", "sales1"

Etc., where ‘sales1’ is the source schema name, added as an extra column to each row.
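Enabling a filter like this is a small configuration change on the extractor side. As a sketch, using the rowadddbname filter mentioned at the end of this post (the service name alpha is illustrative, and the filter’s own options are covered in the 5.3 documentation):

tools/tpm update alpha --svc-extractor-filters=rowadddbname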

This introduces two things we need to handle on the Vertica side:

  1. We now have to merge taking into account the source schema (since the ID column of the data could be the same across multiple schemas). For example, whereas before we did ‘DELETE WHERE ID IN (xxxx)’, we now have to do ‘DELETE WHERE ID IN (xxxx) AND dbname = ‘sales1’’ (see the sketch after this list).
  2. It increases the contention ratio on the Vertica table because we now effectively write into only one table. This increases the locks and the extents processed by Vertica.
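As a sketch of point 1, continuing the illustrative names from the earlier merge example, the materialise queries now carry the schema column in every predicate:

-- Remove old row images for this batch, constrained to the source schema.
DELETE FROM msg
WHERE dbname = 'sales1'
  AND id IN (SELECT id FROM stage_msg WHERE dbname = 'sales1');

-- Re-insert the new row images, keeping the schema column.
INSERT INTO msg (id, msg, dbname)
  SELECT id, msg, dbname FROM stage_msg WHERE opcode IN ('I', 'U');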

The effect of this change is that the overall apply rate slows down slightly, due to the increased contention on the single table. Running the same tests, with 20 schemas from one MySQL source database, we get the following:

  • 760,000 rows/minute into Vertica with a single target table

This is actually not as bad as I was expecting when you consider that we are modifying every row of incoming data, and are no longer able to multi-thread the apply.

I then tried increasing that using the two sources and 40 schemas into the same single table. Initially, the performance was no longer linear, and I failed to get any improvement beyond about 10% above that 760K/min figure.

Now it was time to tune other things. First of all, I changed some of the queries on the Vertica side, tweaking the selection and query parameters for the DELETE operations. For batch loading, what we do is DELETE and then INSERT or, in some cases, DELETE and UPDATE if you’ve configured it that way. Tweaking the subquery being used increased the performance a little.

Changing the projections used also improved the performance of the single-schema apply. But the biggest gains, perhaps unsurprisingly, came from changing the way the tables were defined in the first place and using partitions in the table definition. To do this, I modified the original filter that was adding the schema name and instead had it add the schema hash, a unique Java ID for the string. I then created the staging and base tables in Vertica using the integer hash as the partition, and modified the queries to ensure that the partitions would be used effectively.
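For illustration, the shape of that change in Vertica DDL, with the filter now supplying an integer hash of the schema name (all names and the hash value are illustrative):

-- Base (and staging) tables are partitioned by the integer schema hash
-- added by the modified filter.
CREATE TABLE msg (
    id     INT,
    msg    VARCHAR(48),
    dbhash INT NOT NULL
)
PARTITION BY dbhash;

-- Constraining every merge query on the hash lets Vertica prune
-- partitions instead of scanning the whole table.
DELETE FROM msg
WHERE dbhash = 1534968421   -- illustrative hash value for 'sales1'
  AND id IN (SELECT id FROM stage_msg WHERE dbhash = 1534968421);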

The result was that the 760k/min rate was now scalable. I couldn’t get any faster when writing into a single schema, but the rate remains relatively constant whether I am replicating five schemas into a single one, or 40 schemas from two or three sources into the same target. In fact, it does ultimately start to dip slightly as you add more source schemas.

Even better, the changes I’d made to the queries and the overall Vertica batch applier also improved the speed of the standard (i.e. multi-schema) applier. I also added partitions on the ID field to improve the general apply rate. After testing for a couple of days, the average rate was:

  • 1,900,000 rows/minute from a single MySQL source into Vertica.

This was also scalable: five MySQL sources elicited a rate of 8.8 million rows/minute into Vertica, making the applier rate linear with about a 1% penalty for each additional MySQL source. The latency stayed the same as before, hovering at around 11s most of the time. Occasionally you’d get a spike because Vertica was having trouble keeping up, but it always caught up again quickly.

Essentially, we are replicating data from MySQL into Vertica for analytics at a rate I simply called ‘outstandingly staggering’. I still do.

The new single-schema applier, the database name filter (rowadddbname) and the performance improvements are all incorporated into the new Tungsten Replicator 5.3.0 release.

Keynote and Session at Percona Live Dublin 2017

On Sunday I will travel over to Dublin for Percona Live 2017.

I have two sessions: a keynote on Wednesday morning, where I’ll be talking about all the fun new stuff we have planned at Continuent and some new directions we’re working on.

I also have a more detailed session on our new appliers for Kafka, Elasticsearch and Cassandra; that’s on Tuesday morning.


If you haven’t already booked to come along, feel free to use the discount code SeeMeSpeakPLE17 which will get you 15% off!

Upcoming Webinar, 19th July, What is New in Tungsten Replicator 5.2 and Tungsten Clustering 5.2?

Continuent Tungsten 5.2 is just around the corner. This is one of our most exciting Tungsten product releases for some time!

In this webinar we’re going to have a look at a host of new features in the new release, including:

  • Three new replication applier targets (Kafka, Cassandra and Elasticsearch)
  • Improvements to our core command-line tools trepctl and thl
  • New foundations for our filtering services
  • Improvements to the compatibility between replication and clustering

This webinar is going to be a packed session, and we’ll show all the exciting stuff with more in-depth follow-up sessions in the coming weeks.

You’ll also learn about some more exciting changes coming in the upcoming Tungsten releases (5.2.1 and 5.3), and our major Tungsten 6.0 release due out by the end of the year.

So come and join us to get the lowdown on everything related to Tungsten Replicator 5.2 and Tungsten Clustering 5.2 on Wednesday, July 19, 2017, 9:00 AM – 9:30 AM PDT. Register at https://attendee.gotowebinar.com/register/4108437731342545667

New Continuent Webinar Wednesdays and Training Tuesdays

We are just starting to get into the swing of setting up our new training and webinar schedule.

Initially, there will be one webinar session (typically on a Wednesday) and one training session (on a Tuesday) every week from now on. We’ll be covering a variety of different topics at each.

Typically, our webinars will be about products and features, and comparisons to other products, mixed in with product news (new releases, new features) and individual sessions based on what is going on at Continuent and in the market in general.

Our training, by comparison, is going to be a hands-on, step-by-step sequence covering all of the different areas of our products. So we’ll cover everything from the basics of how the products work, to how to deploy them, typical functionality (switching, start/stop, etc.), and troubleshooting.

All of the sessions are going to be recorded, and we’ll produce a suitable archive page so that you can go and view the past sessions. Need a refresher on re-provisioning a node in your cluster? There’s going to be a video for it, and documentation to back it up.

Our first webinar is actually next Thursday (the Wednesday rule wouldn’t be a good one without an exception) and is all about MySQL Multi-Site/Multi-Master Done Right:

In this webinar, we discuss what makes Tungsten Clustering better than other alternatives (AWS RDS, Galera, MySQL InnoDB Cluster, and Percona XtraDB Cluster), especially for geographically distributed multi-site deployments, both for disaster recovery and multi-site/multi-master needs.

If you want to attend, please go ahead and register using this link: http://go.continuent.com/n0JI04Q03EAV0RD0i000Vo4

Keep your eyes peeled for the other upcoming sessions. More details soon.

Hadoop BoF Session at OSCON

I have a BoF session at OSCON next week:

Migrating Data from MySQL and Oracle into Hadoop

The session is at 7pm Tuesday night – look for rooms D135 and/or D137/138.

Correction: we are now in E144 on Tuesday, with the Hadoop get-together first at 7pm and the data migration discussion to follow at 8pm.

I’m actually going to be joined by Gwen Shapira from Cloudera, who has a BoF session on Hadoop next door at the same time, along with Eric Herman from Booking.com. We’ll use the opportunity to talk all things Hadoop, but particularly the ingestion of data from MySQL and other databases into the Hadoop datastore.

As always, it’d be great to meet anybody interested in Hadoop at the BoF, so please come along and introduce yourselves. Hopefully I’ll see you next week!

Replicating Oracle Webinar Question Follow-up

We had a really great webinar on replicating to/from Oracle earlier this month, and you can view the recording of that webinar here.

A good sign of how great a webinar was is the number of questions that come afterwards, and we didn’t get through them all during the session. So here are all the questions and answers for the entire webinar.

Q: What is the overhead of Replicator on source database with asynchronous CDC?

A: With asynchronous operation there is no substantial CPU overhead (as there is with synchronous operation), but the amount of generated redo log becomes bigger, requiring more disk space and better log management to ensure that the space is used effectively.

Q: Do you support migration from Solaris/Oracle to Linux/Oracle?

A: The replication is not certified for use on Solaris; however, it is possible to configure the replicator to operate remotely and extract from a remote Oracle instance. This is achieved by installing Tungsten Replicator on Linux and then extracting from the remote Oracle instance.

Q: Are there issues in supporting tables without Primary Keys on Oracle to Oracle replication?

A: Tables without primary keys will work, but they are not recommended for production, as they imply a significant overhead when applying changes to the target database.

Q: On Oracle->Oracle replication, if there are triggers on source tables, how is this handled?

A: Tungsten Replicator does not automatically disable triggers. The best solution is to remove triggers on the slaves, or to rewrite the triggers so that they identify whether they are executing on the master or on a slave and skip their logic accordingly.
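One hedged illustration of the rewrite approach: have the trigger check which user is applying the row, and skip its logic for the replicator’s own connection. This assumes the replicator applies changes as a dedicated database user; the user, table and column names here are made up for the example:

CREATE OR REPLACE TRIGGER orders_audit
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
  -- Skip the trigger body when the replicator (user TUNGSTEN here,
  -- illustrative) is applying the row on the slave.
  IF SYS_CONTEXT('USERENV', 'SESSION_USER') <> 'TUNGSTEN' THEN
    INSERT INTO orders_log (order_id, changed_at)
    VALUES (:NEW.order_id, SYSDATE);
  END IF;
END;
/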

Q: How is your offering different/better than Oracle Streams replication?

A: We like to think of ourselves as GoldenGate without the price tag. The main difference is the way we extract the information from Oracle, otherwise, the products offer similar functionality. For Tungsten Replicator in particular, one advantage is the open and flexible nature, since Tungsten Replicator is open source, released under a GPL V2 license, and available at https://code.google.com/p/tungsten-replicator/.

Q: How is the integrity of the replica maintained/verified?

A: Replicator has built-in real-time consistency checks: if an UPDATE or DELETE doesn’t update any rows, Replicator will go OFFLINE:ERROR, as this indicates an inconsistent dataset.

Q: Can configuration file based passwords be specified using some form of encrypted value for security purposes to keep them out of the clear?

A: We support an INI file format so that you do not have to use the command-line installation process. There is currently no supported option for an encrypted version of these values, but the INI file can be secured so it is only readable by the Tungsten user.
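As a sketch, the INI file holds the same options you would otherwise pass to tpm on the command line; the values below are purely illustrative:

# /etc/tungsten/tungsten.ini
[defaults]
user=tungsten
replication-user=tungsten
replication-password=secret

Securing it is then a straightforward permissions change, e.g. chmod 600 /etc/tungsten/tungsten.ini, so that only the tungsten system user can read it.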

Q: Our source DB is Oracle RAC with ~10 instances. Is coherency maintained in the replication from activity in the various instances?

A: We do not monitor the information that has been replicated, but CDC replicates row-based data, not statements, so the typical sequence-insertion issues that might occur with statement-based replication should not apply.

Q: Is there any maintenance of Oracle sequence values between Oracle and replicas?

A: Sequence values are recorded in the row data as extracted by Tungsten Replicator. Because the inserted values, not the sequence itself, are replicated, there is no need to maintain sequences between hosts.

Q: How timely is the replication? Particularly for hot source tables receiving millions of rows per day?

A: CDC is based on extracting the data at an interval, but the interval can be configured. In practice, assuming there are regular inserts and updates on the Oracle side, the data is replicated in real-time. See https://docs.continuent.com/tungsten-replicator-3.0/deployment-oracle-cdctuning.html for more information on how this figure can be tuned.

Q: Can parallel extractor instances be spread across servers rather than through threads on the same server (which would be constrained by network or HBA)?

A: Yes. We can install multiple replicators and tune the extraction of the parallel extractor accordingly. That division of work would need to be managed manually, but it is certainly possible.

Q: Do you need the CSV file (to select individual tables with the setupCDC.sh configuration) on the master setup if you want all tables?

A: No.

Q: If you lose your slave down the road, do you need to re-provision from the initial SCN number or is there a way to start from a later point?

A: This is the reason for the THL sequence number introduced by the extractor. If you lose your slave, you can install a new one and have it start at the transaction number where the failed slave stopped, if you know it, since that information will be in the THL. If not, you can usually determine it by examining the THL directly. There should be no need to re-provision – just restart from the relevant transaction in the THL on the master.
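The mechanics are roughly as follows; the service name and sequence number are illustrative, and the -from-event option should be checked against the trepctl documentation for your release:

# Bring the new slave online starting from a known THL sequence number.
trepctl -service alpha online -from-event 4356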

Q: Regarding a failed slave: what if it failed such that we don’t have a backup, or we wanted to provision a second slave that had no initial data?

A: If you had no backups or data, yes, you would need to re-provision with the parallel extractor in order to seed the target database.

Q: Would you do that with the original SCN? If it had been a month or two, is there a way to start at a more recent SCN (e.g. you have to re-run the setupCDC process)?

A: The best case is to have two MySQL slaves, and when one fails, re-provision it from the healthy one. This avoids the setupCDC stage.

However, the replication can always be started from a specific event (SCN) provided that SCN is available in the Oracle undo log space.

Q: How does Tungsten handle Oracle’s CLOB and BLOB data types?

A: Provided you are using asynchronous CDC, these types are supported; with synchronous CDC, these types are not supported by Oracle.

Q: Can different schemas in Oracle be replicated at different times?

A: Each schema is extracted by a separate service in Replicator, so they are independent.

Q: What is the size limit for BLOB or CLOB column data types?

A: This depends on the CDC capabilities in Oracle, and is not limited within Tungsten Replicator. You may want to refer to the Oracle Docs for more information on CDC: http://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm

Q: Would different editions of Oracle, e.g. Enterprise Edition and Standard Edition One, be considered heterogeneous environments?

A: Essentially yes, although the nomenclature is really only a categorization; it does not affect the operation, deployment or functionality of the replicator. All these features are part of the open source product.

Q: Can a 10g database (master) send data to an 11g database (slave), for use in an upgrade?

A: Yes.

Q: Does the Oracle replicator require the Oracle database to be in archive mode?

A: Yes. This is a requirement for Oracle’s CDC implementation.

Q: How will we be able to revisit this recorded webinar?

A: Slides and a recording from today’s webinar will be available at http://www.slideshare.net/Continuent_Tungsten