PostgreSQL source code (5) buffer management
2022-07-18 17:22:00 【mingjie】
Learning notes: https://www.interdb.jp/pg/pgsql08.html
Preface
Imagine the functions a simple buffer manager should provide:
- Write
- Request a free PAGE to write into, mapped to a PAGE on disk
- When the pool is full, automatically flush and evict a PAGE
- After writing, choose whether to flush the page to disk immediately or lazily
- Read
- Read a PAGE that is already cached
- Read a PAGE that is on disk
- When the pool is full, automatically flush and evict a PAGE, then read the one needed
Below is a minimal sketch of what such an interface might look like; after that, let's see how PG actually implements it.
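As a rough mental model, such an interface might look like the following minimal sketch. The type and function names here are invented for illustration; they are not PostgreSQL's actual API, which the rest of this article walks through.
/* Hypothetical minimal buffer manager interface -- illustration only */
typedef struct PageTag { int rel; int blockNum; } PageTag;
typedef int BufId;

BufId buf_read(PageTag tag);      /* return the cached page, or load it from disk,
                                   * evicting a victim if the pool is full        */
BufId buf_alloc(PageTag tag);     /* grab a free page slot for writing            */
void  buf_mark_dirty(BufId id);   /* flush to disk immediately or lazily later    */
void  buf_release(BufId id);      /* drop the caller's reference                  */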
PG Implementation
1 TAG
{(16821, 16384, 37721), 1, 3}
means:
- tablespace = 16821
- db = 16384
- table = 37721
- fork 1, i.e. the free space map
- block 3 in that file
typedef struct buftag
{
RelFileNode rnode; /* physical relation identifier */
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;
ForkNumber forkNum; // the main fork, free space map and visibility map are fork numbers 0, 1 and 2
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
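As a small illustration (not from the original article), the example tag {(16821, 16384, 37721), 1, 3} above could be built with the INIT_BUFFERTAG macro that BufferAlloc uses later in section 5; the OIDs are just the example values:
/* sketch: building the example tag; fork 1 is the free space map */
RelFileNode rnode = { .spcNode = 16821, .dbNode = 16384, .relNode = 37721 };
BufferTag   tag;

INIT_BUFFERTAG(tag, rnode, FSM_FORKNUM, 3);   /* FSM_FORKNUM == 1, block 3 */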
2 Structure
The buffer manager is a three-tier structure (the second tier is logical):
- The buffer table maps a tag directly to a buffer id via hashing
- Before a buffer slot is used, the buffer id is used to find the desc and check the slot's meta information to decide whether the slot can be used
- So logically there is a desc array layer in the middle
2.1 Buffer Table
Note that this is the core entry data structure: it maps a tag to a buffer id.
- key
- BufferTag
- RelFileNode rnode
- ForkNumber forkNum
- BlockNumber blockNum
- value
- BufferLookupEnt
- BufferTag key
- int id
It is a partitioned hash table, so the locking granularity is finer.
void
InitBufTable(int size)
{
HASHCTL info;
/* assume no locking is needed yet */
/* BufferTag maps to Buffer */
info.keysize = sizeof(BufferTag);
info.entrysize = sizeof(BufferLookupEnt);
info.num_partitions = NUM_BUFFER_PARTITIONS;
SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
size, size,
&info,
HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
}
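As a sketch of how this table is queried, the pattern below is the same one BufferAlloc uses in section 5; rnode and blockNum are assumed to be given by the caller:
/* sketch: look up a buffer id by tag under the mapping partition lock */
BufferTag tag;
uint32    hashcode;
int       buf_id;

INIT_BUFFERTAG(tag, rnode, MAIN_FORKNUM, blockNum);
hashcode = BufTableHashCode(&tag);

LWLockAcquire(BufMappingPartitionLock(hashcode), LW_SHARED);
buf_id = BufTableLookup(&tag, hashcode);      /* >= 0 means the page is already cached */
LWLockRelease(BufMappingPartitionLock(hashcode));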
2.2 Buffer Descriptor
This is also an array, initialized here:
void
InitBufferPool(void)
{
bool foundBufs,
foundDescs,
foundIOLocks,
foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *)
ShmemInitStruct("Buffer Descriptors",
NBuffers * sizeof(BufferDescPadded),
&foundDescs);
...
Each array element is a BufferDescPadded; one buffer corresponds to one desc:
typedef union BufferDescPadded
{
BufferDesc bufferdesc;
char pad[BUFFERDESC_PAD_TO_SIZE];
} BufferDescPadded;
The contents in detail:
/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
*
* Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
【The BM_LOCKED (buffer header lock) must be held to read or write the tag, state and wait_backend_pid fields】
* the tag, state or wait_backend_pid fields. In general, buffer header lock
* is a spinlock which is combined with flags, refcount and usagecount into
* single atomic variable. This layout allow us to do some operations in a
* single atomic operation, without actually acquiring and releasing spinlock;
* for instance, increase or decrease refcount. buf_id field never changes
* after initialization, so does not need locking. freeNext is protected by
* the buffer_strategy_lock not buffer header lock. The LWLock can take care
* of itself. The buffer header lock is *not* used to control access to the
* data in the buffer!
*
* It's assumed that nobody changes the state field while buffer header lock
* is held. Thus buffer header lock holder can do complex updates of the
* state variable in single write, simultaneously with lock release (cleaning
* BM_LOCKED flag). On the other hand, updating of state without holding
* buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
* is not set. Atomic increment/decrement, OR/AND etc. are not allowed.
*
【It is assumed that nobody updates state while BM_LOCKED (the buffer header lock) is held】
【So the lock holder can perform complex updates of state and release the lock in a single write】
* An exception is that if we have the buffer pinned, its tag can't change
* underneath us, so we can examine the tag without locking the buffer header.
* Also, in places we do one-time reads of the flags without bothering to
* lock the buffer header; this is generally for situations where we don't
* expect the flag bit being tested to be changing.
*
* We can't physically remove items from a disk page if another backend has
* the buffer pinned. Hence, a backend may need to wait for all other pins
* to go away. This is signaled by storing its own PID into
* wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER. At present,
* there can be only one such waiter per buffer.
*
* We use this same struct for local buffer headers, but the locks are not
* used and not all of the flag bits are useful either. To avoid unnecessary
* overhead, manipulations of the state field should be done without actual
* atomic operations (i.e. only pg_atomic_read_u32() and
* pg_atomic_unlocked_write_u32()).
*
* Be careful to avoid increasing the size of the struct when adding or
* reordering members. Keeping it below 64 bytes (the most common CPU
* cache line size) is fairly important for performance.
*/
typedef struct BufferDesc
{
BufferTag tag; /* ID of page contained in buffer */
int buf_id; /* buffer's index number (from 0) */
/* state of the tag, containing flags, refcount and usagecount */
pg_atomic_uint32 state;
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
- tag/buf_id: described above
- state
- flags
- dirty bit: indicates whether the stored page is dirty.
- valid bit: whether the page can be read. (1) valid: the slot holds data and the corresponding desc holds data, so the page can be read. (2) invalid: the desc holds no data, or a page replacement is in progress.
- io_in_progress bit: whether the buffer manager is reading/writing the associated page from/to storage. In other words, this bit indicates whether a process holds the descriptor's io_in_progress_lock.
- refcount
- Records the number of processes accessing the current page, also called the pin count. A process must increment the pin count before accessing the page and decrement it after use. When the pin count is 0 the page is "unpinned"; when it is nonzero the page is "pinned".
- usagecount: the number of times the page has been accessed since it was loaded; used by the clock sweep algorithm.
- freeNext: the next free buffer; this chains a free list over the array
【A desc describes a page in one of three states】 (a small sketch of reading these states off the packed state word follows this list)
- Empty
- When the corresponding buffer pool slot does not store a page (i.e. refcount and usage_count are 0), the descriptor is empty.
- Pinned
- When the corresponding buffer pool slot stores a page and some PostgreSQL process is accessing it (i.e. refcount and usage_count are greater than or equal to 1), the descriptor is pinned.
- Unpinned
- When the corresponding buffer pool slot stores a page but no PostgreSQL process is accessing it (i.e. usage_count is greater than or equal to 1 but refcount is 0), the descriptor is unpinned.
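A minimal sketch (assuming the state-accessor macros shown later in this article) of how those three states can be read off a descriptor's packed state word:
/* sketch: classify a descriptor from its packed state word */
uint32 state = pg_atomic_read_u32(&buf->state);

if (BUF_STATE_GET_REFCOUNT(state) > 0)
    ;   /* pinned: at least one process has the page pinned */
else if (BUF_STATE_GET_USAGECOUNT(state) > 0)
    ;   /* unpinned: the page is present and was recently used, but nobody pins it */
else
    ;   /* empty / reclaimable: refcount and usage_count are both 0 */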
2.3 Buffer Descriptor logical layer
The BufferDescriptors array is initialized, and the free list is chained with buf->freeNext = i + 1:
...
BufferDescPadded *BufferDescriptors;
...
void
InitBufferPool(void)
{
bool foundBufs,
foundDescs,
foundIOLocks,
foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
BufferDescriptors = (BufferDescPadded *)
ShmemInitStruct("Buffer Descriptors",
NBuffers * sizeof(BufferDescPadded),
&foundDescs);
...
...
/*
* Initialize all the buffer headers.
*/
for (i = 0; i < NBuffers; i++)
{
BufferDesc *buf = GetBufferDescriptor(i);
CLEAR_BUFFERTAG(buf->tag);
pg_atomic_init_u32(&buf->state, 0);
buf->wait_backend_pid = 0;
buf->buf_id = i;
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
*/
buf->freeNext = i + 1;
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
LWLockInitialize(BufferDescriptorGetIOLock(buf),
LWTRANCHE_BUFFER_IO_IN_PROGRESS);
}
...
...
The process of loading the first page:
- Take a free desc from the freelist and pin it (refcount++, usage_count++)
- Add a new entry to the buffer table, recording tag : buffer_id
- Read the page contents from storage into memory
- Update the meta information in the desc
After use the desc is not returned to the freelist, unless:
- the table or index is deleted
- the database is deleted
- the table or index is emptied by VACUUM FULL
2.4 Buffer Pool
A contiguous block of shared memory whose size is 8K * NBuffers:
BufferBlocks = (char *)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
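For orientation, the address of a buffer's 8 KB block inside this region is plain pointer arithmetic. The real BufferGetBlock macro in bufmgr.h looks roughly like the simplified sketch below; the actual macro additionally validates the buffer number and handles local buffers.
/* simplified sketch of the shared-buffer case; buffer numbers are 1-based */
#define BufferGetBlockSketch(buffer) \
    ((Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ))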
3 Locks
These locks all live in shared memory.
3.1 Buffer Table Locks
BufMappingLock
The partition locks of the hash table, taken in shared (S) or exclusive (E) mode.
3.2 Desc lock
content_lock
A lightweight lock guarding reads and writes of the PAGE contents, taken in shared (S) or exclusive (E) mode.
- Exclusive (E) mode is used in the following cases:
- inserting rows into a page, or modifying the t_xmin/t_xmax fields of tuples
- physically deleting tuples or compacting the free space of a page (vacuum)
- freezing tuples in the page
io_in_progress_lock
Used to wait for a PAGE's IO to complete. While a process loads page data from storage or writes it out, it holds the corresponding descriptor's io_in_progress lock exclusively.
spinlock (now the BM_LOCKED flag bit)
The desc's flags and other fields are modified under this spinlock.
for example PIN:
LockBufHdr(bufferdesc); /* Acquire a spinlock */
bufferdesc->refcont++;
bufferdesc->usage_count++;
UnlockBufHdr(bufferdesc); /* Release the spinlock */
For example, setting the dirty bit to '1':
#define BM_DIRTY (1 << 0) /* data needs writing */
#define BM_VALID (1 << 1) /* data is valid */
#define BM_TAG_VALID (1 << 2) /* tag is assigned */
#define BM_IO_IN_PROGRESS (1 << 3) /* read or write in progress */
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
LockBufHdr(bufferdesc);
bufferdesc->flags |= BM_DIRTY;
UnlockBufHdr(bufferdesc);
4 Eviction strategy
There are four strategies:
typedef enum BufferAccessStrategyType
{
BAS_NORMAL, /* Normal random access */
BAS_BULKREAD, /* Large read-only scan (hint bit updates are
* ok) */
BAS_BULKWRITE, /* Large multi-block write (e.g. COPY IN) */
BAS_VACUUM /* VACUUM */
} BufferAccessStrategyType;
| BufferAccessStrategyType | Usage scenario | Replacement algorithm |
|---|---|---|
| BAS_NORMAL | Normal random reads and writes | clock sweep algorithm |
| BAS_BULKREAD | Bulk reads | ring algorithm, ring size 256 * 1024 / BLCKSZ |
| BAS_BULKWRITE | Bulk writes | ring algorithm, ring size 16 * 1024 * 1024 / BLCKSZ |
| BAS_VACUUM | VACUUM process | ring algorithm, ring size 256 * 1024 / BLCKSZ |
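With the default BLCKSZ of 8192, these ring sizes work out to 32 buffers (256 KB) for BAS_BULKREAD and BAS_VACUUM, and 2048 buffers (16 MB) for BAS_BULKWRITE.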
clock sweep Algorithm
/*
* The shared freelist control information.
*/
typedef struct
{
/* Spinlock: protects the values below */
// spinlock protecting the members below
slock_t buffer_strategy_lock;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
*/
// next position to sweep (the clock hand)
pg_atomic_uint32 nextVictimBuffer;
// head of the free buffer list
int firstFreeBuffer; /* Head of list of unused buffers */
// tail of the free buffer list
int lastFreeBuffer; /* Tail of list of unused buffers */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
* when the list is empty)
*/
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
// number of complete passes over the array
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
*/
int bgwprocno;
} BufferStrategyControl;
Each sweep resumes from the last position and checks the buffer's reference count (refcount) and access count (usagecount):
- If refcount and usagecount are both zero, return the buffer directly.
- If refcount is zero but usagecount is not, decrement usagecount by 1 and move to the next buffer.
- If refcount is not zero, move to the next buffer.
The clock sweep algorithm keeps looping until it finds a buffer whose refcount and usagecount are both zero, as in the sketch below.
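A condensed sketch of that loop is shown below (the function name is made up; the real implementation, StrategyGetBuffer, also handles the freelist, the bgwriter latch and a try counter, and is walked through in section 5):
/* condensed clock-sweep sketch, using the descriptor helpers shown in this article */
static BufferDesc *
clock_sweep_sketch(void)
{
    for (;;)
    {
        BufferDesc *buf = GetBufferDescriptor(ClockSweepTick());  /* advance the hand */
        uint32      state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(state) == 0 &&
            BUF_STATE_GET_USAGECOUNT(state) == 0)
            return buf;                      /* victim found; header spinlock still held,
                                              * as in the real code */

        if (BUF_STATE_GET_REFCOUNT(state) == 0)
            state -= BUF_USAGECOUNT_ONE;     /* cool the page down */

        UnlockBufHdr(buf, state);            /* keep sweeping */
    }
}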
Free list
- To speed up the search for free buffers, postgresql keeps them on a linked list.
- The head and tail of the list are given by the firstFreeBuffer and lastFreeBuffer members of the BufferStrategyControl structure.
- Each list node is represented by a BufferDesc structure; its freeNext member points to the next node.
- When a buffer becomes free it is appended to the tail of the list; when free space is needed, the head of the list is returned directly.
Reference counter
The low 18 bits of state record the reference count kept in shared memory.
typedef struct BufferDesc
{
BufferTag tag; /* ID of page contained in buffer */
int buf_id; /* buffer's index number (from 0) */
/* state of the tag, containing flags, refcount and usagecount */
pg_atomic_uint32 state;
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
/*
* Buffer state is a single 32-bit variable where following data is combined.
*
* - 18 bits refcount
【Each backend also tracks its own pins in a private array + hash table cache; the shared refcount here counts each pinning backend only once】
* - 4 bits usage count
* - 10 bits of flags
*
* Combining these values allows to perform some operations without locking
* the buffer header, by modifying them together with a CAS loop.
*
* The definition of buffer state components is below.
*/
#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U
/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)
The actual per-backend reference counter lives here, one entry per pinned buffer:
typedef struct PrivateRefCountEntry
{
Buffer buffer;
int32 refcount;
} PrivateRefCountEntry;
To find a given buffer's reference count quickly, a PrivateRefCountEntry array serves as the first-level cache and a hash table as the second-level cache.
Note: this cache is private to each backend process!
/*
* Backend-Private refcount management:
*
* Each buffer also has a private refcount that keeps track of the number of
* times the buffer is pinned in the current process. This is so that the
* shared refcount needs to be modified only once if a buffer is pinned more
* than once by an individual backend. It's also used to check that no buffers
* are still pinned at the end of transactions and when exiting.
【Private memory records how many times the current process has pinned each buffer】
【Purpose 1: if the current process pins a buffer many times, the refcount in shared memory only needs to be bumped once】
【Purpose 2: also used to check, at transaction end and at process exit, that no buffers remain pinned】
*
* To avoid - as we used to - requiring an array with NBuffers entries to keep
* track of local buffers, we use a small sequentially searched array
* (PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
* keep track of backend local pins.
*
【To avoid tracking local pins with a large NBuffers-element array, an 8-element array plus an overflow hash table is used to record them】
* Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
* refcounts are kept track of in the array; after that, new array entries
* displace old ones into the hash table. That way a frequently used entry
* can't get "stuck" in the hashtable while infrequent ones clog the array.
*
* Note that in most scenarios the number of pinned buffers will not exceed
* REFCOUNT_ARRAY_ENTRIES.
【As long as no more than 8 buffers are pinned at once, all refcounts are tracked in the array; beyond that, new array entries displace old ones into the hash table】
【This way frequently used entries stay in the array, while the less frequently used ones end up in the hash table】
*
*
* To enter a buffer into the refcount tracking mechanism first reserve a free
* entry using ReservePrivateRefCountEntry() and then later, if necessary,
* fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
* memory allocations in NewPrivateRefCountEntry() which can be important
* because in some scenarios it's called with a spinlock held...
【To use this tracking mechanism, first reserve a free array slot with ReservePrivateRefCountEntry】
【Then fill that slot with NewPrivateRefCountEntry when it is actually needed】
【Why the split? So that NewPrivateRefCountEntry does not have to allocate memory, because it is sometimes called while a spinlock is held】
*/
// 【First-level cache】: 8 elements
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
// round-robin counter used to pick a victim array slot
static uint32 PrivateRefCountClock = 0;
// points to the reserved free slot in the array
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
// 【Second-level cache】: key = buffer_id, value = PrivateRefCountEntry
static HTAB *PrivateRefCountHash = NULL;
// number of entries that have overflowed into the hash table
static int32 PrivateRefCountOverflowed = 0;
...
void
InitBufferPoolAccess(void)
{
HASHCTL hash_ctl;
memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
MemSet(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(int32);
hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
HASH_ELEM | HASH_BLOBS);
}
ReservePrivateRefCountEntry
(1) Reserve one entry in the first-level cache array; the reserved ReservedRefCountEntry starts out empty (buffer_id = 0, ref_count = 0).
(2) If the array is full, the entry at position PrivateRefCountClock is moved into the hash table, then cleared and reused.
(3) Note that this function returns nothing; it only maintains ReservedRefCountEntry so that the pointer ends up pointing at a free entry that can hold a local ref_count.
static void
ReservePrivateRefCountEntry(void)
{
/* Already reserved (or freed), nothing to do */
if (ReservedRefCountEntry != NULL)
return;
/*
* First search for a free entry the array, that'll be sufficient in the
* majority of cases.
*/
{
int i;
for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
{
PrivateRefCountEntry *res;
res = &PrivateRefCountArray[i];
if (res->buffer == InvalidBuffer)
{
ReservedRefCountEntry = res;
return;
}
}
}
/*
* No luck. All array entries are full. Move one array entry into the hash
* table.
*/
{
/*
* Move entry from the current clock position in the array into the
* hashtable. Use that slot.
*/
PrivateRefCountEntry *hashent;
bool found;
/* select victim slot */
ReservedRefCountEntry =
&PrivateRefCountArray[PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES];
/* Better be used, otherwise we shouldn't get here. */
Assert(ReservedRefCountEntry->buffer != InvalidBuffer);
/* enter victim array entry into hashtable */
hashent = hash_search(PrivateRefCountHash,
(void *) &(ReservedRefCountEntry->buffer),
HASH_ENTER,
&found);
Assert(!found);
hashent->refcount = ReservedRefCountEntry->refcount;
/* clear the now free array slot */
ReservedRefCountEntry->buffer = InvalidBuffer;
ReservedRefCountEntry->refcount = 0;
PrivateRefCountOverflowed++;
}
}
NewPrivateRefCountEntry
(1) Fills in the buffer_id
(2) Returns a PrivateRefCountEntry with a fresh ref_count = 0
static PrivateRefCountEntry *
NewPrivateRefCountEntry(Buffer buffer)
{
PrivateRefCountEntry *res;
/* only allowed to be called when a reservation has been made */
Assert(ReservedRefCountEntry != NULL);
/* use up the reserved entry */
res = ReservedRefCountEntry;
ReservedRefCountEntry = NULL;
/* and fill it */
res->buffer = buffer;
res->refcount = 0;
return res;
}
GetPrivateRefCountEntry
(1) Look up the PrivateRefCountEntry for the given buffer_id
(2) If the buffer_id is already in the array, return a pointer to that array element (buffer_id, ref_count) directly
(3) If it is not in the array, look it up in the hash table; if it is not there either, return NULL
(4) If it is found in the hash table: when do_move == true, free an array slot and move the hash-table record into the array; otherwise return the found (buffer_id, ref_count) as is
static PrivateRefCountEntry *
GetPrivateRefCountEntry(Buffer buffer, bool do_move)
{
PrivateRefCountEntry *res;
int i;
Assert(BufferIsValid(buffer));
Assert(!BufferIsLocal(buffer));
/*
* First search for references in the array, that'll be sufficient in the
* majority of cases.
*/
for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
{
res = &PrivateRefCountArray[i];
if (res->buffer == buffer)
return res;
}
/*
* By here we know that the buffer, if already pinned, isn't residing in
* the array.
*
* Only look up the buffer in the hashtable if we've previously overflowed
* into it.
*/
if (PrivateRefCountOverflowed == 0)
return NULL;
res = hash_search(PrivateRefCountHash,
(void *) &buffer,
HASH_FIND,
NULL);
if (res == NULL)
return NULL;
else if (!do_move)
{
/* caller doesn't want us to move the hash entry into the array */
return res;
}
else
{
/* move buffer from hashtable into the free array slot */
bool found;
PrivateRefCountEntry *free;
/* Ensure there's a free array slot */
ReservePrivateRefCountEntry();
/* Use up the reserved slot */
Assert(ReservedRefCountEntry != NULL);
free = ReservedRefCountEntry;
ReservedRefCountEntry = NULL;
Assert(free->buffer == InvalidBuffer);
/* and fill it */
free->buffer = buffer;
free->refcount = res->refcount;
/* delete from hashtable */
hash_search(PrivateRefCountHash,
(void *) &buffer,
HASH_REMOVE,
&found);
Assert(found);
Assert(PrivateRefCountOverflowed > 0);
PrivateRefCountOverflowed--;
return free;
}
}
5 SRC
ReadBufferExtended
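Before walking through the source, here is a minimal caller sketch (not from the original article); rel is assumed to be an already opened Relation and block 3 is an arbitrary example:
/* hypothetical caller: read a block, examine it under a shared content lock, unpin it */
static void
read_block_sketch(Relation rel)
{
    Buffer buf;
    Page   page;

    buf = ReadBufferExtended(rel, MAIN_FORKNUM, 3, RBM_NORMAL, NULL);
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* takes the descriptor's content_lock */
    page = BufferGetPage(buf);
    /* ... examine tuples on "page" with the Page APIs ... */
    (void) page;                            /* silence unused-variable warning in this sketch */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);                     /* drops the pin taken by ReadBufferExtended */
}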
/*
* ReadBufferExtended -- returns a buffer containing the requested
* block of the requested relation. If the blknum
* requested is P_NEW, extend the relation file and
* allocate a new block. (Caller is responsible for
* ensuring that only one backend tries to extend a
* relation at the same time!)
【Returns the requested PAGE; if blknum == P_NEW, the relation file is extended and a new block is allocated and read into memory】
*
* Returns: the buffer number for the buffer containing
* the block read. The returned buffer has been pinned.
* Does not return on error --- elog's instead.
*
【Returns the requested, ready-to-use page; note that the page is already pinned】
* Assume when this function is called, that reln has been opened already.
*
* In RBM_NORMAL mode, the page is read from disk, and the page header is
* validated. An error is thrown if the page header is not valid. (But
* note that an all-zero page is considered "valid"; see PageIsVerified().)
【RBM_NORMAL mode: the page is read from disk and the page header is validated】
* RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
* valid, the page is zeroed instead of throwing an error. This is intended
* for non-critical data, where the caller is prepared to repair errors.
*
【RBM_ZERO_ON_ERROR mode: if page header validation fails, the page is zeroed instead of raising an error; intended for non-critical data】
* In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
* filled with zeros instead of reading it from disk. Useful when the caller
* is going to fill the page from scratch, since this saves I/O and avoids
* unnecessary failure if the page-on-disk has corrupt page headers.
* The page is returned locked to ensure that the caller has a chance to
* initialize the page before it's made visible to others.
* Caution: do not use this mode to read a page that is beyond the relation's
* current physical EOF; that is likely to cause problems in md.c when
* the page is modified and written out. P_NEW is OK, though.
【RBM_ZERO_AND_LOCK, a faster mode: if the page is not already in the buffer, it is not read from disk but simply filled with zeros. The page is returned locked so that nobody can read it before the caller initializes it】
* RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
* a cleanup-strength lock on the page.
*
* RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
*
* If strategy is not NULL, a nondefault buffer access strategy is used.
* See buffer/README for details.
*/
Buffer
ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy)
{
bool hit;
Buffer buf;
/* Open it at the smgr level if not already done */
RelationOpenSmgr(reln);
This eventually uses mdopen to open the physical file and records the opened vfd in reln's md_seg_fds:
1. Take the recorded file from reln and return &reln->md_seg_fds[forknum][0]; if it is not there yet, the file has to be opened through the VFD layer
2. Open the file with PathNameOpenFile and record the VFD into md_seg_fds
/*
* Reject attempts to read non-local temporary relations; we would be
* likely to get wrong data since we have no visibility into the owning
* session's local buffers.
*/
if (RELATION_IS_OTHER_TEMP(reln))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot access temporary tables of other sessions")));
/*
* Read the buffer, and update pgstat counters to reflect a cache hit or
* miss.
*/
pgstat_count_buffer_read(reln);
buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
forkNum, blockNum, mode, strategy, &hit);
if (hit)
pgstat_count_buffer_hit(reln);
return buf;
}
ReadBuffer_common
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
*
* *hit is set to true if the request was satisfied from shared buffer cache.
*/
static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum, ReadBufferMode mode,
BufferAccessStrategy strategy, bool *hit)
{
...
...
bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
strategy, &found);
...
...
}
BufferAlloc
/*
* BufferAlloc -- subroutine for ReadBuffer. Handles lookup of a shared
* buffer. If no buffer exists already, selects a replacement
* victim and evicts the old page, but does NOT read in new page.
*
【Find a buffer for the page; if it is not already cached, a victim buffer is evicted】
* "strategy" can be a buffer replacement strategy object, or NULL for
* the default strategy. The selected buffer's usage_count is advanced when
* using the default strategy, but otherwise possibly not (see PinBuffer).
*
* The returned buffer is pinned and is already marked as holding the
* desired page. If it already did have the desired page, *foundPtr is
* set TRUE. Otherwise, *foundPtr is set FALSE and the buffer is marked
* as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
*
【The returned buffer is pinned and already marked as holding the desired page】
【If the page is already in the buffer pool, it is returned directly with *foundPtr = TRUE】
【Otherwise *foundPtr is set to false and the buffer is marked IO_IN_PROGRESS; the caller now has to do the I/O to fill it】
* *foundPtr is actually redundant with the buffer's BM_VALID flag, but
* we keep it for simplicity in ReadBuffer.
*
* No locks are held either at entry or exit.
*/
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
BufferTag oldTag; /* previous identity of selected buffer */
uint32 oldHash; /* hash value for oldTag */
LWLock *oldPartitionLock; /* buffer partition lock for it */
uint32 oldFlags;
int buf_id;
BufferDesc *buf;
bool valid;
uint32 buf_state;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
The hash value % NUM_BUFFER_PARTITIONS (128) selects the partition, whose lock is then taken from the MainLWLockArray:
#define BufTableHashPartition(hashcode) ((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + BufTableHashPartition(hashcode)].lock)
newPartitionLock = BufMappingPartitionLock(newHash);
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
buf_id = BufTableLookup(&newTag, newHash);
【The hash table maps tag --> buf_id; finding an entry means the page is already in a buffer】
if (buf_id >= 0)
{
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
【Found it! Pin the buffer directly】
buf = GetBufferDescriptor(buf_id);
【This function is expanded below; it increments ref_count both in the shared-memory desc and in the local cache】
valid = PinBuffer(buf, strategy);
/* Can release the mapping lock as soon as we've pinned it */
【Once the buffer is pinned, the mapping lock can be released】
LWLockRelease(newPartitionLock);
*foundPtr = TRUE;
【Found in the hash table, but after pinning the page turned out to be invalid; it has to be read again】
if (!valid)
{
/*
* We can only get here if (a) someone else is still reading in
* the page, or (b) a previous read attempt failed. We have to
* wait for any active read attempt to finish, and then set up our
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
if (StartBufferIO(buf, true))
{
/*
* If we get here, previous attempts to read the buffer must
* have failed ... but we shall bravely try again.
*/
*foundPtr = FALSE;
}
}
return buf;
}
【Reaching here means the tag was not found in the hash table, i.e. the page is not cached】
/*
* Didn't find it in the buffer pool. We'll have to initialize a new
* buffer. Remember to unlock the mapping lock while doing the work.
*/
【A new buffer has to be set up; the mapping lock is released first while doing that work】
LWLockRelease(newPartitionLock);
/* Loop here in case we have to try another victim buffer */
for (;;)
{
/*
* Ensure, while the spinlock's not yet held, that there's a free
* refcount entry.
*/
【Reserve a (buffer_id, ref_count) slot in the private first-level cache array】
【If there is no room, one entry is swapped out into the second-level hash table to free up a slot】
ReservePrivateRefCountEntry();
/*
* Select a victim buffer. The buffer is returned with its header
* spinlock still held!
*/
【Pick a victim buffer and return its descriptor and buf_state; the freelist is tried first, otherwise clock sweep evicts one】
【This function is expanded further below】
buf = StrategyGetBuffer(strategy, &buf_state);
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf_state & BUF_FLAG_MASK;
/* Pin the buffer and then release the buffer spinlock */
【Both the shared and the local ref counts are incremented】
PinBuffer_Locked(buf);
/*
* If the buffer was dirty, try to write it out. There is a race
* condition here, in that someone might dirty it after we released it
* above, or even while we are writing it out (since our share-lock
* won't prevent hint-bit updates). We will recheck the dirty bit
* after re-locking the buffer header.
*/
【The victim is dirty and needs to be written out】
if (oldFlags & BM_DIRTY)
{
/*
* We need a share-lock on the buffer contents to write it out
* (else we might write invalid data, eg because someone else is
* compacting the page contents while we write). We must use a
* conditional lock acquisition here to avoid deadlock. Even
* though the buffer was not pinned (and therefore surely not
* locked) when StrategyGetBuffer returned it, someone else could
* have pinned and exclusive-locked it by the time we get here. If
* we try to get the lock unconditionally, we'd block waiting for
* them; if they later block waiting for us, deadlock ensues.
* (This has been observed to happen when two backends are both
* trying to split btree index pages, and the second one just
* happens to be trying to split the page the first one got from
* StrategyGetBuffer.)
*/
if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
LW_SHARED))
{
/*
* If using a nondefault strategy, and writing the buffer
* would require a WAL flush, let the strategy decide whether
* to go ahead and write/reuse the buffer or to choose another
* victim. We need lock to inspect the page LSN, so this
* can't be done inside StrategyGetBuffer.
*/
if (strategy != NULL)
{
XLogRecPtr lsn;
/* Read the LSN while holding buffer header lock */
buf_state = LockBufHdr(buf);
lsn = BufferGetLSN(buf);
UnlockBufHdr(buf, buf_state);
if (XLogNeedsFlush(lsn) &&
StrategyRejectBuffer(strategy, buf))
{
/* Drop lock/pin and loop around for another buffer */
LWLockRelease(BufferDescriptorGetContentLock(buf));
UnpinBuffer(buf, true);
continue;
}
}
/* OK, do the I/O */
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
FlushBuffer(buf, NULL);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf->tag);
TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
smgr->smgr_rnode.node.spcNode,
smgr->smgr_rnode.node.dbNode,
smgr->smgr_rnode.node.relNode);
}
else
{
/*
* Someone else has locked the buffer, so give it up and loop
* back to get another one.
*/
UnpinBuffer(buf, true);
continue;
}
}
【At this point the victim buffer either was not dirty or has just been flushed above】
【If the victim's old TAG is still valid (BM_TAG_VALID), its hash table entry will need to be replaced, so both the old and the new partitions must be locked】
/*
* To change the association of a valid buffer, we'll need to have
* exclusive lock on both the old and new mapping partitions.
*/
if (oldFlags & BM_TAG_VALID)
{
/*
* Need to compute the old tag's hashcode and partition lock ID.
* XXX is it worth storing the hashcode in BufferDesc so we need
* not recompute it here? Probably not.
*/
oldTag = buf->tag;
oldHash = BufTableHashCode(&oldTag);
oldPartitionLock = BufMappingPartitionLock(oldHash);
/*
* Must lock the lower-numbered partition first to avoid
* deadlocks.
*/
if (oldPartitionLock < newPartitionLock)
{
LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
}
else if (oldPartitionLock > newPartitionLock)
{
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
}
else
{
/* only one partition, only one lock */
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
}
}
else
【Otherwise the old TAG is invalid; there is nothing to clean up, just lock the new partition】
{
/* if it wasn't valid, we need only the new partition */
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
/* remember we have no old-partition lock or tag */
oldPartitionLock = NULL;
/* this just keeps the compiler quiet about uninit variables */
oldHash = 0;
}
/*
* Try to make a hashtable entry for the buffer under its new tag.
* This could fail because while we were writing someone else
* allocated another buffer for the same block we want to read in.
* Note that we have not yet removed the hashtable entry for the old
* tag.
*/
【Insert the new TAG into the hash table】
buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
if (buf_id >= 0)
{
/*
* Got a collision. Someone has already done what we were about to
* do. We'll just handle this as if it were found in the buffer
* pool in the first place. First, give up the buffer we were
* planning to use.
*/
UnpinBuffer(buf, true);
/* Can give up that buffer's mapping partition lock now */
if (oldPartitionLock != NULL &&
oldPartitionLock != newPartitionLock)
LWLockRelease(oldPartitionLock);
/* remaining code should match code at top of routine */
buf = GetBufferDescriptor(buf_id);
valid = PinBuffer(buf, strategy);
/* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
*foundPtr = TRUE;
if (!valid)
{
/*
* We can only get here if (a) someone else is still reading
* in the page, or (b) a previous read attempt failed. We
* have to wait for any active read attempt to finish, and
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
if (StartBufferIO(buf, true))
{
/*
* If we get here, previous attempts to read the buffer
* must have failed ... but we shall bravely try again.
*/
*foundPtr = FALSE;
}
}
return buf;
}
/*
* Need to lock the buffer header too in order to change its tag.
*/
buf_state = LockBufHdr(buf);
/*
* Somebody could have pinned or re-dirtied the buffer while we were
* doing the I/O and making the new hashtable entry. If so, we can't
* recycle this buffer; we must undo everything we've done and start
* over with a new victim buffer.
*/
oldFlags = buf_state & BUF_FLAG_MASK;
if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
break;
UnlockBufHdr(buf, buf_state);
BufTableDelete(&newTag, newHash);
if (oldPartitionLock != NULL &&
oldPartitionLock != newPartitionLock)
LWLockRelease(oldPartitionLock);
LWLockRelease(newPartitionLock);
UnpinBuffer(buf, true);
}
/*
* Okay, it's finally safe to rename the buffer.
*
* Clearing BM_VALID here is necessary, clearing the dirtybits is just
* paranoia. We also reset the usage_count since any recency of use of
* the old content is no longer relevant. (The usage_count starts out at
* 1 so that the buffer can survive one clock-sweep pass.)
*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
buf->tag = newTag;
buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
BUF_USAGECOUNT_MASK);
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
UnlockBufHdr(buf, buf_state);
【If the old TAG was valid, its entry must be deleted from the hash table】
if (oldPartitionLock != NULL)
{
BufTableDelete(&oldTag, oldHash);
if (oldPartitionLock != newPartitionLock)
LWLockRelease(oldPartitionLock);
}
LWLockRelease(newPartitionLock);
/*
* Buffer contents are currently invalid. Try to get the io_in_progress
* lock. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
if (StartBufferIO(buf, true))
*foundPtr = FALSE;
else
*foundPtr = TRUE;
return buf;
}
PinBuffer
Summary:
- Use buf_id to check whether the buffer is already pinned in the local cache
- If it is already pinned, just increment the local ref_count
- If it is not pinned locally, update state in the shared-memory desc (ref_count++, usage_count++ up to a maximum of 5), then increment the local ref_count
- Note: a pinned page's data is not necessarily valid; the return value is:
(buf_state & BM_VALID) != 0
static bool
PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
{
Buffer b = BufferDescriptorGetBuffer(buf);
Here b is the desc's buf_id plus 1, because buf_id is counted from 0:
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
bool result;
PrivateRefCountEntry *ref;
【Look up (buffer_id, ref_count) in the two-level cache of array + hash table】
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
{
uint32 buf_state;
uint32 old_buf_state;
【Reserve a slot in the array; if all 8 array entries are in use, one is kicked out into the hash table】
ReservePrivateRefCountEntry();
【Fill b into the reserved array slot】
ref = NewPrivateRefCountEntry(b);
【Update buf->state】
old_buf_state = pg_atomic_read_u32(&buf->state);
for (;;)
{
【If the header is locked, spin until it is unlocked and fetch the fresh state】
if (old_buf_state & BM_LOCKED)
old_buf_state = WaitBufHdrUnlocked(buf);
buf_state = old_buf_state;
/* increase refcount */
buf_state += BUF_REFCOUNT_ONE;
【strategy == NULL means the normal clock sweep eviction path; otherwise a ring buffer is used for bulk reads/writes】
if (strategy == NULL)
{
/* Default case: increase usagecount unless already max. */
if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
buf_state += BUF_USAGECOUNT_ONE;
}
else
{
/*
* Ring buffers shouldn't evict others from pool. Thus we
* don't make usagecount more than 1.
*/
if (BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
buf_state += BUF_USAGECOUNT_ONE;
}
if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
buf_state))
{
result = (buf_state & BM_VALID) != 0;
break;
}
}
}
else
【If the buffer was already pinned, only the local refcount needs to be incremented; as far as the shared-memory desc is concerned, a process counts as one pin no matter how many times it pins】
{
/* If we previously pinned the buffer, it must surely be valid */
result = true;
}
ref->refcount++;
Assert(ref->refcount > 0);
ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
return result;
}
StrategyGetBuffer
/*
* StrategyGetBuffer
*
* Called by the bufmgr to get the next candidate buffer to use in
* BufferAlloc(). The only hard requirement BufferAlloc() has is that
* the selected buffer must not currently be pinned by anyone.
*
* strategy is a BufferAccessStrategy object, or NULL for default strategy.
*
* To ensure that no one else can pin the buffer before we do, we must
* return the buffer with the buffer header spinlock still held.
*/
BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
BufferDesc *buf;
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
*/
【If strategy is not NULL, the ring-buffer path is taken; the strategy structure itself carries the ring buffer】
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy, buf_state);
if (buf != NULL)
return buf;
}
/*
* If asked, we need to waken the bgwriter. Since we don't want to rely on
* a spinlock for this we force a read from shared memory once, and then
* set the latch based on that value. We need to go through that length
* because otherwise bgprocno might be reset while/after we check because
* the compiler might just reread from memory.
*
* This can possibly set the latch of the wrong process if the bgwriter
* dies in the wrong moment. But since PGPROC->procLatch is never
* deallocated the worst consequence of that is that we set the latch of
* some arbitrary process.
*/
【StrategyControl is the core data structure of the clock sweep algorithm, introduced above】
// (gdb) p *StrategyControl
// $6 = {buffer_strategy_lock = 0 '\000', nextVictimBuffer = {value = 0}, firstFreeBuffer = 324, lastFreeBuffer = 16383, completePasses = 0, numBufferAllocs = {value = 0}, bgwprocno = 113}
bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
if (bgwprocno != -1)
{
/* reset bgwprocno first, before setting the latch */
StrategyControl->bgwprocno = -1;
/*
* Not acquiring ProcArrayLock here which is slightly icky. It's
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
}
/*
* We count buffer allocation requests so that the bgwriter can estimate
* the rate of buffer consumption. Note that buffers recycled by a
* strategy object are intentionally not counted here.
*/
pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
/*
* First check, without acquiring the lock, whether there's buffers in the
* freelist. Since we otherwise don't require the spinlock in every
* StrategyGetBuffer() invocation, it'd be sad to acquire it here -
* uselessly in most cases. That obviously leaves a race where a buffer is
* put on the freelist but we don't see the store yet - but that's pretty
* harmless, it'll just get used during the next buffer acquisition.
*
* If there's buffers on the freelist, acquire the spinlock to pop one
* buffer of the freelist. Then check whether that buffer is usable and
* repeat if not.
*
* Note that the freeNext fields are considered to be protected by the
* buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
* manipulate them without holding the spinlock.
*/
if (StrategyControl->firstFreeBuffer >= 0)
{
while (true)
{
/* Acquire the spinlock to remove element from the freelist */
SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
【A race scenario: firstFreeBuffer was >= 0 above, but the freelist may have been emptied before we acquired the lock】
if (StrategyControl->firstFreeBuffer < 0)
{
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
break;
}
buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
【Taking the first desc off the freelist: 1. advance the head pointer; 2. invalidate the removed desc's freeNext】
/* Unconditionally remove buffer from freelist */
StrategyControl->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
*/
SpinLockRelease(&StrategyControl->buffer_strategy_lock);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot
* use it; discard it and retry. (This can only happen if VACUUM
* put a valid buffer in the freelist and then someone else used
* it before we got to it. It's probably impossible altogether as
* of 8.3, but we'd better check anyway.)
*/
【Fetch state and lock the buffer header (BM_LOCKED)】
local_buf_state = LockBufHdr(buf);
【This check is unlikely to fail; it is essentially a sanity check】
【It can only fail if VACUUM put a valid buffer on the freelist and someone else grabbed and used it before we got to it】
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
&& BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
{
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
return buf;
}
UnlockBufHdr(buf, local_buf_state);
}
}
【Reaching here means the freelist has nothing usable, so a buffer must be evicted】
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
for (;;)
{
【ClockSweepTick advances and returns StrategyControl->nextVictimBuffer】
【It is a fairly long function because it has to guarantee atomicity】
【The loop logic below is simple: walk over the buffers, skipping pinned ones】
【Return a buffer whose usage_count == 0; every buffer passed over has its usage_count decremented, so even if the first pass finds nothing, within a few passes one will】
【usage_count is capped at 5, which keeps the sweep bounded】
buf = GetBufferDescriptor(ClockSweepTick());
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
* it; decrement the usage_count (unless pinned) and keep scanning.
*/
local_buf_state = LockBufHdr(buf);
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
{
if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
{
local_buf_state -= BUF_USAGECOUNT_ONE;
trycounter = NBuffers;
}
else
{
/* Found a usable buffer */
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
return buf;
}
}
else if (--trycounter == 0)
{
/*
* We've scanned all the buffers without making any state changes,
* so all the buffers are pinned (or were when we looked at them).
* We could hope that someone will free one eventually, but it's
* probably better to fail than to risk getting stuck in an
* infinite loop.
*/
UnlockBufHdr(buf, local_buf_state);
elog(ERROR, "no unpinned buffers available");
}
UnlockBufHdr(buf, local_buf_state);
}